Subset Sum without Cursor

  • Still on my quest to remove all the cursors from my database... or at least be capable of removing them if it's detrimental to performance.

    Subset Sum is a pattern we're using on a few procedures and we're doing it with cursors. None of us can think of how you would solve this sort of issue in general with just set based logic.

    If anyone knows of this or knows where to find the answer, that would be great. My google was weak today looking this up.

  • Did you look into SUM(Field) OVER (PARTITION BY ...)?

    For more detailed information to we need more information from you.

    Please see http://www.sqlservercentral.com/articles/Best+Practices/61537/ on how to post sample data.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Shawn Therrien (5/8/2009)


    Subset Sum is a pattern we're using on a few procedures and we're doing it with cursors. None of us can think of how you would solve this sort of issue in general with just set based logic.

    !!! ??? !!!

    Are you serious?!? That's an NP-complete decision problem and you're solving it on SQL Server? This I gotta see... 🙂

    Could you please post the SQL code you are currently using and the table definitions (Create Table statements). If you could give us some sample data to, and/or an example that would be great. I will give this very serious attention if so.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Yes, I'm planning on doing that tomorrow. It'll likely take a few hours to scrub through the 300+ lines of code and make it self contained and generic.

    If thats what it takes, thats what it takes.

    Here is the gist without the code sample (it will follow likely tomorrow):

    A student can only study one subject a day, but often is studying 3-4 subjects over the course of a month (defined as 20 school days) while receiving help/tutoring from an instructor. A student might be transferred to another instructor midmonth and only one instructor can claim days for a student and course in a month. We have information showing which instructors helped the student over the month and for how many days. We have information for which courses a student studied for how many days. We know how many days each instructor claimed for the month per student, just know which subjects. The actual correct subject doesn't matter, just that the days are correct.

    So you might have an easy one:

    Student studies: Math, Science and English for 5 days each. Student only has a single instructor for a full 20 days. Since all the days fit in the instructors bank of possible days and there are no other teachers involved, the instructor can claim all three subjects. This isn't even an issue for the Subset Sum.

    A simple Subset Sum breakdown would be:

    Student Studies Math (5), Science (2), English (3) and has two Instructors. Instructor 1 has 5 days and instructor 2 has 5 days. We have to run through and give both teachers 5 days worth of subjects.

    A more complex Subset Sum would be:

    Student Studies Math (3), Science (4), English (1), History (2) and Poetry (1). Two instructors, one has 5 days, the other has 6. The way we solved this was looping through and adding values until you hit an even match.

    Instructor 1 (5): Math, English, Poetry

    Instructor 2 (6): Science, History

    or

    Instructor 1 (5): Math, History

    Instructor 2 (6): Science, English, Poetry

    or

    Instructor 1 (5): Science, English

    Instructor 2 (6): Math, History, Poetry

    Easy to pick out on sight, but coming up with logic to deduce is tough. We ended up using loops and tables in the pattern of a Subset Sum.

    I will boil this down into readable SQL, just our working version is about 300 lines. So it's going to take a bit to distill it down. I know this reads like a crazy word problem in a math class...

    Just what happens when developers do the bare minimum because nobody would ever care to know what instructors tutored which course... until someone does care and then it's some complex work to weed out some details if thats even possible.

  • RBarryYoung (5/8/2009)


    Shawn Therrien (5/8/2009)


    Subset Sum is a pattern we're using on a few procedures and we're doing it with cursors. None of us can think of how you would solve this sort of issue in general with just set based logic.

    !!! ??? !!!

    Are you serious?!? That's an NP-complete decision problem and you're solving it on SQL Server? This I gotta see... 🙂

    Could you please post the SQL code you are currently using and the table definitions (Create Table statements). If you could give us some sample data to, and/or an example that would be great. I will give this very serious attention if so.

    I'll get a runnable version of the code up in the next day.

    It's code thats being used in ETL's from an older systems database that was sort of data lossy at times into our new system where we don't want to throw out data. We know that this old data isn't accurate to the point of the correct subject/instructor, but it's required that we have those subjects mapped to someone and that the days fit.

  • OK, a couple of questions:

    1) Are you using an exact solution? (sounds like you are)

    2) Do you just need a yes/no answer (possible, not possible) or do you need a valid partitioning (or all valid partitionings)?

    3) Whats the maximum number of days? (20?)

    4) Whats the maximum # of instructors to take into account? (2?)

    5) Whats the maximum number of subjects? (5?)

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (5/8/2009)


    OK, a couple of questions:

    1) Are you using an exact solution? (sounds like you are)

    2) Do you just need a yes/no answer (possible, not possible) or do you need a valid partitioning (or all valid partitionings)?

    3) Whats the maximum number of days? (20?)

    4) Whats the maximum # of instructors to take into account? (2?)

    5) Whats the maximum number of subjects? (5?)

    1. Yes, exact.

    2. Just the first valid partitioning. Whats the first exact match we come across.

    3. Max days per month is 20.

    4. 3

    5. 5 is about right.

  • First, I need to apologize for not paying enough attention that there is a link which refers to complexity theory. :angry:

    I was under the impression that the question is regarding to the trivial part of subset sum (given set with given deterministic criteria to sum over).

    In addition to Barry's question I'd like to know if there is one criteria missing: which instructor is qualified for what course?

    With the scenario you provided it must be assumed that each instructor is able to tutor on every single course.

    Given the following scenario:

    Student A: Math (4), Science (4), English (2), History (2).

    Student B: Math (2)

    Instructor 1: 12 Hours

    Instructor 2: 2 Hours

    Out of the range of possible scenarios I'd like to pick two practical solutions:

    I) Instructor 1 tutors Student A full time, covering 4 courses. Instructor 2 tutors Student B.

    II) Instructor 1 tutors Math for Student A and Student B + Science and English for Student A. Instructor 2 tutors B on History.

    What would be the more likely scenario from the "practical" point of view? Do you focus on having the minimum number of tutors per student assuming the tutors are capable of teaching all subjects or do you have specialized tutors per subject?

    If either the first or the last scenario is applicable you could narrow the problem down dramatically, I think...

    What also needs to be answered:

    What happens if the total number of days for all students exceeds the total number of available instructor days? What should the solution be optimized for?: Trying to cover tutoring for as many students as possible leaving out one student almost untutored; split the gap over all students or reducing the tutoring for a special class?

    Another "trivial solution" would be applicable, if every instructor can tutor on every subject and it doesn't matter if a student get help from two instructors on the same class. In this case you could just do a running total on both, student classes and instructor days and match the two. By changing the order in the student table (per student or per class) you could change the focus for tutoring (student focused or calls focused).

    Using the sample you provided and keeping the order this would end up with

    Instructor 1: Math (3), Science(2)

    Instructor 2: Science (2), English (1), History (2) and Poetry (1)



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • lmu92 (5/9/2009)


    First, I need to apologize for not paying enough attention that there is a link which refers to complexity theory. :angry:

    No problem, just looked at that portion and the links on this forum don't seem very obvious. The text is the same color and slightly different sized until you mouse over. Live and learn, I'll call more attention to a link in the future.

    lmu92 (5/9/2009)


    With the scenario you provided it must be assumed that each instructor is able to tutor on every single course.

    Great question! Legally every instructor is a fully certified teacher able to tutor any course. I'm sure some are better/worse in areas and specialize, but for the history we know that we can't figure this out accurately. From this point forward in our new system it will be accurate.

    lmu92 (5/9/2009)


    Out of the range of possible scenarios I'd like to pick two practical solutions:

    I) Instructor 1 tutors Student A full time, covering 4 courses. Instructor 2 tutors Student B.

    II) Instructor 1 tutors Math for Student A and Student B + Science and English for Student A. Instructor 2 tutors B on History.

    What would be the more likely scenario from the "practical" point of view? Do you focus on having the minimum number of tutors per student assuming the tutors are capable of teaching all subjects or do you have specialized tutors per subject?

    Over a given time period (a month) we know how many days an instructor tutored each student.

    Probably about 80% of the time you have 1 student being tutored the whole month by one teacher. With this case, we can attach an instructor to each subject without the need of a Subset Sum. The Student has 15 days of being tutored and the instructor has 15 days of tutoring the student, that matches so we know the instructor tutored every class the Student worked on that month.

    In the other 20% we're strictly dealing with a single student with 2+ instructors. So not having to deal with Many Students to Many Instructors, only One to Many - this really makes things by far more manageable.

    If either the first or the last scenario is applicable you could narrow the problem down dramatically, I think...

    lmu92 (5/9/2009)


    What happens if the total number of days for all students exceeds the total number of available instructor days? What should the solution be optimized for?: Trying to cover tutoring for as many students as possible leaving out one student almost untutored; split the gap over all students or reducing the tutoring for a special class?

    Well if that happened, there would be corruption somewhere in the data. This is actually possible due to clients demanding a dangerous features for deletion and the old developers taking a laymen request literally and actually deleting records rather than just giving the user the illusion of what they want, which is really what they want - the ability to remove records, but know what was removed. ie, just marking the records as "deleted" or moving them to another table. But these cases are known. We know to skip them and we have the users going through their paperwork to find the data they deleted in error to fix those data issues.

    Given that the data is valid, you have a student displays X days amongst all their subjects for a month, and all instructors who tutored the student also have that same X days between them. So with valid data there isn't a question of days not matching days. It'll fit, we just have to arrange the blocks to make them fit properly for each teacher.

    lmu92 (5/9/2009)


    Another "trivial solution" would be applicable, if every instructor can tutor on every subject and it doesn't matter if a student get help from two instructors on the same class. In this case you could just do a running total on both, student classes and instructor days and match the two. By changing the order in the student table (per student or per class) you could change the focus for tutoring (student focused or calls focused).

    Using the sample you provided and keeping the order this would end up with

    Instructor 1: Math (3), Science(2)

    Instructor 2: Science (2), English (1), History (2) and Poetry (1)

    To keep things simple (there is another level of granularity that we don't need to go into here), within a month, a single students subject can only be tutored by a single instructor as far as paperwork is concerned. So in that example, Science couldn't be split between the teachers.

  • Shawn Therrien (5/8/2009)

    I'll get a runnable version of the code up in the next day.

    What happened today? I had time to work on this today, shame it didn't work out.

    I still might be able to get time for it tomorrow... ?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • I do apologize. I didn't realize that VPN software didn't work with a 64bit OS and was stranded away from my work materials.

    I'm making a VERY simplistic version of the subset sum. Mine, with temp tables and throwing data in is easily 500+ lines.

    I don't need someone to do my work, just need help with the technique to handle the subset sum without cursors.

    And I understand that you may not have time right now. This isn't an urgent request, I'm just spending my training time to get this sorted out to learn these new techniques.

  • It's OK, this problem interests me. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Shawn Therrien (5/11/2009)


    I do apologize. I didn't realize that VPN software didn't work with a 64bit OS and was stranded away from my work materials.

    Not sure what VPN software you use, but if you are looking for an IPSec solution on x64 that can replace a Cisco client, you can go to:

    http://www.ncp-e.com/en.html

    I had the same issue with our Cisco client and use this now. Works very well - in fact, I think it works better. However, it is not inexpensive - but not too bad.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Here is the first part, which is the data side. This takes data being tracked for each student/subject with instructor information and sumarizes this by Instructor in another table. Then it removes the initial instructor value from the first table.

    This is exactly what happens in the old system. The data is recorded then an ETL transforms and sumarizes it and then the old data is destroyed. So this data set provides a solution as well as the means for getting to that solution.

    IF object_id('tempdb..#Assignment') IS NOT NULL

    DROP TABLE #Assignment

    CREATE TABLE #Assignment (

    Id INT PRIMARY KEY IDENTITY (1,1)

    ,StudentId INT

    ,Course VARCHAR(50)

    ,Days INT

    ,InstructorId INT

    ,MonthId INT

    )

    -- Populating the work done during the month as it actually happens.

    INSERT INTO #Assignment

    (StudentId,Course,Days,InstructorId,MonthId)

    SELECT 101 ,'English 1A',3 ,501 ,201

    UNION SELECT 101 ,'Algebra 1A',1 ,501 ,201

    UNION SELECT 101 ,'Reading 2B',1 ,501 ,201

    UNION SELECT 101 ,'Weaving',2 ,502 ,201

    UNION SELECT 101 ,'PE-Running' ,3 ,502 ,201

    UNION SELECT 102 ,'English 2B',5 ,505 ,201

    UNION SELECT 102 ,'MusicTheory',5 ,505 ,201

    UNION SELECT 102 ,'Geography 3',5 ,505 ,201

    UNION SELECT 102 ,'Geometry 1A',5 ,505 ,201

    UNION SELECT 103 ,'Algebra 1B',5 ,505 ,201

    UNION SELECT 103 ,'Cryptology',2 ,505 ,201

    UNION SELECT 103 ,'Constriction',3 ,502 ,201

    UNION SELECT 103 ,'Woodshop',2 ,502 ,201

    UNION SELECT 103 ,'Geometry 1C',3 ,502 ,201

    UNION SELECT 103 ,'Reading 1A',2 ,501 ,201

    UNION SELECT 103 ,'PE-Baseball',1 ,501 ,201

    UNION SELECT 103 ,'Weaving',1 ,501 ,201

    UNION SELECT 103 ,'MusicTheory',1 ,501 ,201

    IF object_id('tempdb..#InstructorSummary') IS NOT NULL

    DROP TABLE #InstructorSummary

    CREATE TABLE #InstructorSummary (

    Id INT PRIMARY KEY IDENTITY (1,1)

    ,InstructorId INT

    ,StudentIdINT

    ,DaysTotal INT

    ,MonthId INT

    )

    -- The old application keeps track of this information by teacher/student within a month as the data changes, so there are always these SUMS.

    INSERT INTO #InstructorSummary (MonthId, InstructorId, StudentId, DaysTotal)

    SELECT MonthId, InstructorId, StudentId, SUM(Days)

    FROM #Assignment

    GROUP BY MonthId, InstructorId, StudentId

    -- During a reporting ELT, old data is removed and we lose the information to link an Instructor to a students work/tutoring.

    UPDATE #Assignment SET InstructorId = NULL[/Code]

    After this point, there is code to check for any students who only have one instructor and those are filled in, then comes a woozy.

    A Cursor that loops for every student.

    Within is another cursor that checks for exact matches (a subject = 5 days, an instructor has 5 days for the same student)

    Next after this cursor is a loop (runs through for every Assignment and every teacher a student has, and another loop that further handles things.

    I understand the concept of a subset sum, and can write one, but in making this example easier (from 10 tables down to 2), I'm still having a problem with the logic. Once I get that right, I'll post it as well. I might just have to start over and rewrite this thing.

    This is the data we started with (basically what it looks like after a few joins and is simplified down) and build the Subset Sum to solve a deaggregation of this aggregated data

  • Jeffrey Williams (5/11/2009)


    Shawn Therrien (5/11/2009)


    I do apologize. I didn't realize that VPN software didn't work with a 64bit OS and was stranded away from my work materials.

    Not sure what VPN software you use, but if you are looking for an IPSec solution on x64 that can replace a Cisco client, you can go to:

    http://www.ncp-e.com/en.html

    I had the same issue with our Cisco client and use this now. Works very well - in fact, I think it works better. However, it is not inexpensive - but not too bad.

    Thanks, I'll look into it.

    My IT dept told me to install Virtual PC on my computer and throw a 32bit OS onto it. :rolleyes: I guess it's a solution to get me remoting into my work PC. I guess a little better than just installing a 32bit OS on my home box and ignore the extra ram.

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply