Eliminating Cursors

  • Yes, all true.  Dodgy data is dodgy data, and it would be brilliant if SQL Server 2008 had a "SET DODGY_DATA OFF" switch.

    However, you can only cleanse bad stuff you know about.  In this situation (and the other) it's user data that's being moved "real time", every few minutes or so.  That's why we can't allow 1 record to hold up several thousand.  And if an error does occur it sends an email, so each minute (or whatever) 1 error spams everyone on the support list.  (Not ideal, but we're workin' on it. )

    But my question has been answered - there is no way to pick out erroring rows with set logic, and we need to RBAR to find them.

    Simon.

    "Danger, danger!  Cursors, Will Robinson!"

  • Here's the thing...

    You say the data is "dodgy", and errors out the process at times.  I assume by "dodgy" you mean that the data is probably violating some constraints on the target table, the data is being truncated, or there are type casting errors (that's just a WAG on my part though).

    Dealing with these types of errors in a set-based fashion is a fairly well-defined problem, as is validating user-supplied data and pulling dodgy data out of a table using set-based logic.  Without knowing more about what makes your data "dodgy" and what actual errors you encounter, it's difficult to give a specific fix.  There is something wrong about the rows that error out -- SQL doesn't use a random number generator to pick and choose which rows of your source data to insert or not.

    You are right, though, there is no set-based solution to trap rows that throw an error, *while they are throwing an error*, on an individual basis.  It's entirely possible (and preferred) to preemptively validate data and make sure it meets the "non-dodgy" criteria as opposed to trying to blindly slam potentially dodgy data into a table.  That applies whether you're using a set-based solution or those Wünderful cürsors.

    I mean, how hard is it to prevalidate the data?  You know the datatypes and you know the lengths (or ranges of values) of the target table's columns.

    Len(ColName) will validate the length of all text fields

    NOT LIKE '%[^0-9]%' will validate that all characters are digits (tinyint, smallint, int, bigint)

    NOT LIKE '%[^0-9]%' can also be modified to validate numeric, decimal, [small]money and float datatypes.

    If that test passes, then you can check the datatype range, e.g. tinyint 0-255, etc.

    You know the foreign keys, you just need to check for existence of the master record with exists or with a join.

    Then you can check all columns that do not allow nulls and don't have defaults, so that every row has a value set.

    Then you only have to revalidate the check constraints and triggers (copy paste).

     

    It's a lot of work, but once it's done that puppy can run unattended for years (as long as you update the check list when the main table is modified).  Please note that if you can use the same columns then you have a lot of checks to skip.  All that's left are the triggers, check constraints and foreign keys.
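
    Something like this gives the flavour of it (Staging_Orders, Customers and the column names are all made up for illustration, and I'm assuming the staging columns are plain varchars):

    -- Flag rows that would break the target's rules instead of letting the INSERT error out.
    -- The first failing check wins; a NULL DodgyReason means the row is safe to load as a set.
    SELECT  s.*,
            CASE
                WHEN LEN(s.CustName) > 50 THEN 'CustName longer than the target column'
                WHEN s.Qty IS NULL THEN 'Qty is required'
                WHEN s.Qty LIKE '%[^0-9]%' OR LEN(s.Qty) > 3 THEN 'Qty is not a valid number'
                WHEN CAST(s.Qty AS INT) NOT BETWEEN 0 AND 255 THEN 'Qty outside the tinyint range'
                WHEN s.OrderDate IS NULL THEN 'OrderDate is required and has no default'
                WHEN NOT EXISTS (SELECT 1 FROM dbo.Customers c WHERE c.CustID = s.CustID) THEN 'no matching Customers row'
            END AS DodgyReason
    FROM    dbo.Staging_Orders s

    Rows where DodgyReason is NULL go into the target with one set-based INSERT; the rest get parked for investigation.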

  • May I weigh in here, and forgive my inexperience, but I keep running into this one.

    Seems weird to me that one would have all these criteria for valid data stored in rules in the database and yet there is no set-based way to exercise them.

    Way back when in mainframe days I remember some sort of translate function that would take all your valid data and transform it into another table, but it would also create an errors table loaded with the invalid rows and add a column specifying the complaint.

    Even Microsoft Access will load data on a set basis and reject invalid rows (but unfortunately does not place the invalid rows in another table).

    So why is it that SQL Server has no set-based function to do an INSERT and place invalid rows in another place for investigation? Instead we have to go back to row-by-row investigation at a programming level. Duh.

    And why can't you?  Triggers are a well-defined and pretty flexible solution.  Here's a very simple example that constrains the value of an INT column on a table to values between 1 and 100, inclusive; and a CHAR column to strings that begin with "CAT".  All rows that meet the criteria are inserted into the Good table.  Rows that fail either check are inserted into the Dodgy table.  The trigger can easily be extended to capture a wide range of potentially "dodgy" data and automatically divert that dodgy data to the Dodgy table.  If you wanted to get *really* fancy, you could even place an indicator in a column of the Dodgy table to specify which columns are causing the heartache and grief.  This example adds a "reason" indicator column to the Dodgy table to tell you which column caused the problem.  And if you're on SQL 2K5, you can even use ENABLE TRIGGER Dodgy_Data ON Good to simulate a "SET Dodgy_Data ON" statement.

    Of course client side pre-validation works better for me, especially since a lot of the data I import are strings and many client-side programming languages are much more efficient than T-SQL at string manipulation.

    CREATE TABLE Good (i INT, c CHAR(30))
    CREATE TABLE Dodgy (i INT, c CHAR(30), reason VARCHAR(100))
    GO

    CREATE TRIGGER Dodgy_Data
    ON Good
    INSTEAD OF INSERT
    AS
    BEGIN
        -- Rows that pass both checks go straight into Good
        INSERT INTO Good (i, c)
        SELECT i, c
        FROM INSERTED
        WHERE i >= 1 AND i <= 100
            AND c LIKE 'CAT%'

        -- Everything else lands in Dodgy, with a reason for the rejection
        INSERT INTO Dodgy (i, c, reason)
        SELECT i, c, CASE WHEN i < 1 OR i > 100 THEN 'number out of range in i'
                          WHEN c NOT LIKE 'CAT%' OR c IS NULL THEN 'c is the problem'
                     END
        FROM INSERTED
        WHERE NOT ((i >= 1 AND i <= 100)
            AND c LIKE 'CAT%')
    END
    GO

    INSERT INTO Good (i, c)
    SELECT -100, 'Catwoman'
    UNION SELECT -10, 'Cat burglar'
    UNION SELECT 0, 'Cat food'
    UNION SELECT 1, 'Kitty Litter'
    UNION SELECT 10, 'Catch me'
    UNION SELECT 50, 'Catheter'
    UNION SELECT 100, 'Whoops'
    UNION SELECT 1000, 'Fish'
    GO

    SELECT * FROM Good
    SELECT * FROM Dodgy
    GO
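
    With the default case-insensitive collation, the final two SELECTs should return (10, 'Catch me') and (50, 'Catheter') from Good, and the other six test rows from Dodgy, each with its reason filled in.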

  • I like that idea, because it gets us cleaning the data on a set basis.

    But, not to be too persistent I hope, my point is that in a database where you have all the constraints defined so that the db will only hold 'clean' data (data type, relationships, a unique key on some column(s), anything else you can define as a constraint), you still have to check all the data yourself, either by coding a trigger as you describe, in some client-end thing, or in some row-by-row iteration.

    Since the database already knows what is valid, and won't accept invalid data, why not just have it tell us what it's kicking out and why?

    Then we wouldn't have to write anything but something to deal with the data that gets kicked out.

    I'm a lazy developer I guess.

     

     

  • I prefer to do the client-side scrubbing and add the check constraints just as a safety net (mostly for developers who like to try to shove crap data into tables via QA).  You can use insert and update triggers in place of check constraints to do what you are asking.  Then you don't need the actual check constraints on the columns, since the Triggers effectively replace them as more flexible/powerful check constraints.  In fact, this is a common usage of triggers - to fill in the gaps left by other types of constraints (like cross-database referential integrity, complex constraining criteria that might incorporate some business logic, etc.).

    Alternatively it might be possible to use the existing check constraints in a trigger with dynamic SQL to perform these checks (via INFORMATION_SCHEMA views or system tables/views) and kick out or keep the results, although I haven't tried it and I imagine performance would suffer.

    I understand your point, and, triggers aside, it might be closer to possible with some of the new features of SQL 2K5.  The OUTPUT clause is the first thing that comes to mind, although I don't have 2K5 here at work to run any tests with...
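
    From what I've read, OUTPUT won't catch rows that a constraint rejects (the whole statement still fails), but it can record exactly what a filtered, set-based insert let through.  A rough sketch, reusing the Good table from my earlier example plus a made-up #SourceData staging table:

    DECLARE @loaded TABLE (i INT, c CHAR(30))

    -- Pre-filter with set logic, and let OUTPUT keep an audit trail of the accepted rows
    INSERT INTO Good (i, c)
    OUTPUT inserted.i, inserted.c INTO @loaded
    SELECT s.i, s.c
    FROM #SourceData s
    WHERE s.i BETWEEN 1 AND 100
        AND s.c LIKE 'CAT%'

    -- Anything in #SourceData that didn't land in @loaded can then be routed to Dodgy

    Again, that's untested on my end until I can get at a 2K5 instance.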

  • SQL does provide some assistance as to what is erroring out, and why.

    Given the topic, I probably already wrote something about this routine some time ago. However, this is a slightly different take, so....

    I had to load huge amounts of data. The rows I was given might include both updates to existing data, and new records. More to the point, I might receive both a row to insert, and an update to that row, in the same batch.

    I used SET ROWCOUNT and a while loop to handle the situation. I'd select a large set of rows (say 10000) that didn't already exist (per the primary key), and try to insert them. As long as I didn't have both the first and second occurrence of the row in my data, this would work fine; however, sometimes that situation would happen. I then deleted (from the work table) the rows I had successfully inserted (into the permanent table), and ran through again.

    If the insert failed, I'd catch the error (simply by pulling the error info from the relevant @@ variables), cut the rowcount in half, and try again. Repeat as needed. Once I was successful, I'd start bumping the rowcount back up to its maximum value by doubling it.

    It may not have been the world's most elegant solution, but it did work.

    Its relevance to your situation is that you could do much the same thing.  It's possible to catch and handle an error even before 2K5. How well this method would work depends on the amount of data you have to insert, the time limits you have to work with, and the number of bad rows you have.
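
    In case it's useful, here's roughly the shape of that loop (WorkTable, TargetTable and the column names are made up, and the real routine handled the updates as well, which I've left out):

    DECLARE @batch INT, @maxbatch INT, @err INT
    SET @maxbatch = 10000
    SET @batch = @maxbatch

    WHILE EXISTS (SELECT 1 FROM dbo.WorkTable)
    BEGIN
        SET ROWCOUNT @batch

        -- Try a batch of rows whose keys aren't already in the target
        INSERT INTO dbo.TargetTable (KeyCol, DataCol)
        SELECT w.KeyCol, w.DataCol
        FROM dbo.WorkTable w
        WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable t WHERE t.KeyCol = w.KeyCol)

        SET @err = @@ERROR
        SET ROWCOUNT 0

        IF @err <> 0
            -- Batch failed: log the error details, halve the batch size and retry
            SET @batch = CASE WHEN @batch > 1 THEN @batch / 2 ELSE 1 END
        ELSE
        BEGIN
            -- Batch worked: clear the inserted rows out of the work table,
            -- then start doubling the batch size back up toward its maximum
            DELETE w
            FROM dbo.WorkTable w
            WHERE EXISTS (SELECT 1 FROM dbo.TargetTable t WHERE t.KeyCol = w.KeyCol)

            SET @batch = CASE WHEN @batch * 2 > @maxbatch THEN @maxbatch ELSE @batch * 2 END
        END
    END

    (A row that still fails at a batch size of 1 would need to be moved aside somewhere, or the loop will just spin on it.)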


    R David Francis

    That's a novel solution.  Way to think outside the box!

    The only issue I would raise with this method is that the efficiency would seem to have an inverse relationship to the amount of bad data you start with.  Large amounts of bad data could make this a very inefficient process.  A load that takes 15 minutes with all good data could take significantly longer with large quantities of bad data.  With a method like using triggers, you will still take some CPU time to figure out what's bad and what's not, but the efficiency should remain relatively flat; so whether you have 5% bad data or 95% bad data wouldn't affect performance.

    In fact, you have me very curious now -- I think I'll run some tests on this when I get home.  I'll post the results here later for those interested.

    There's a raft of reasons why the process tries to insert dodgy data.  Not to bore your socks off, but I'll try to cover them briefly, without doing them justice:

    We have completely different systems that are not necessarily SQL Server; there are about 10 databases (currently) that are involved with numerous non-identical tables in each; everything is continually being altered by developers; and the systems have a squat-load of users that complain about performance at the best of times.

    I did consider the "reducing rowcount" method (am still considering) and two issues stall me: i) the process is "real time" so we should avoid delays; ii) we want to avoid spam error emails.  If you have 10000 rows, and row 1 is dodgy....

    So yes, I do prefer the data pre-cleansed by the system concerned, but where it's dodgy I can't have it cause issues, so I need a "catch all" situation.  This meant either the if-set-failed-then-loop method or the reducing-rowcount method (which is really just another loop. )

    Simon.

  • Sorry in advance for the long post:

    Question - who's in charge of all these disparate systems?  I.e., do you have different DBA's and developers who work on DBMS A and another team that works with DBMS B?  Or is it just one group of DBA's/developers for all?

    My first thought is that you should probably decide on a standard data file format for all systems.  Maybe something as simple as a standard flat file that all systems need to create, with all the data required by your "aggregating" (for lack of a better word) SQL Server DB.  Could even be an XML file if you're so inclined, but the point being that all systems would need to export data in the same format every time.  (I know, sounds good in theory, but getting everyone on the same page...  ugh...)  Or create a set of Views that reflect a common schema, regardless of the source, if you're using linked servers or something.

    I do know what you're going through (honest!)  Part of my current job is writing import and export programs for SQL Server, and I support about 4,000+ SQL Servers spread around the country.  There are about a gazillion different import and export applications in place, most of them just slight variations of one another.  The database structures, input file formats, and other requirements are constantly changing.

    The input source files and the target database schemas are very "fluid" (nice word I use to keep from cursing).  So my solution was to create a basic set of "generic" applications that use a set of XML "mapping files" to map flat file fields to corresponding database tables and columns.  It also uses the Bulk Import APIs to import data to SQL Server as quickly as possible.  In those regards it's very much like bcp or any of the other bulk load tools that come with SQL Server.  One generic app easily eliminates about 1/2 of those gazillion apps in place.

    In fact, the only reason I created these custom apps instead of using bcp was that I wanted to add specific validations on the client side before committing the data to the database; that and I have to deal with files in an old-style Header + Sub-Header + Detail Line + Sub-Footer + Footer format.  The validations are defined in the XML file, and are specified by our business rules (so they're not hard-coded into scripts, triggers, or the client-side apps).

    Some of the validations are things like EAN/UPC/ISBN validation including check digit computation; date/time validations that go deeper than standard format validations, like a new overdue date cannot be dated "in the past", etc.
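
    Just to illustrate the check-digit piece (this is only a T-SQL rendering for the sake of the thread; the real validation lives in the client-side apps; the UPC-A rule is to weight the odd positions by 3, and the check digit brings the total up to a multiple of 10):

    DECLARE @upc CHAR(12)
    SET @upc = '036000291452'   -- sample UPC-A

    SELECT CASE
               WHEN @upc LIKE '%[^0-9]%' THEN 'not numeric'
               WHEN (10 - (
                        (   3 * ( CAST(SUBSTRING(@upc,  1, 1) AS INT) + CAST(SUBSTRING(@upc,  3, 1) AS INT)
                                + CAST(SUBSTRING(@upc,  5, 1) AS INT) + CAST(SUBSTRING(@upc,  7, 1) AS INT)
                                + CAST(SUBSTRING(@upc,  9, 1) AS INT) + CAST(SUBSTRING(@upc, 11, 1) AS INT) )
                          +     ( CAST(SUBSTRING(@upc,  2, 1) AS INT) + CAST(SUBSTRING(@upc,  4, 1) AS INT)
                                + CAST(SUBSTRING(@upc,  6, 1) AS INT) + CAST(SUBSTRING(@upc,  8, 1) AS INT)
                                + CAST(SUBSTRING(@upc, 10, 1) AS INT) )
                        ) % 10
                    )) % 10 = CAST(SUBSTRING(@upc, 12, 1) AS INT) THEN 'check digit OK'
               ELSE 'check digit bad'
           END AS UpcCheck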

    Because the apps are written in client-side languages that have more efficient string manipulation routines, I get very good performance and have a huge amount of control over the error handling and reporting.  And a change in the database and/or flat file structure is just a simple edit to an XML file.  I still have check constraints on the tables to prevent other developers from inserting bad data via QA, but I find that the client-side validations are just much more efficient, more easily extensible, more scalable, and easier to modify than check constraints, triggers, or even cursors.

    Mike C - sounds like you have a lot more scope to work with than I do, but then it sounds like your task is considerably larger than mine.

    For your questions, we have a DBA team to look after server instances, backups and other admin tasks; while we have a bunch of teams with dedicated SQL Server guys (like me) to do the development.  There is no real clear demarcation between *this* bit and *that* bit for the purposes of authority.

    Otherwise, I did get a prototype reducing-rowcount method worked out, though in terms of efficiency and code-readability there was not much to gain over the RBAR method.  After all, for each failed attempt it's a pass through the table to select N rows, then N/2, etc., then back up via N*2 again once the dodgy data is found.  For the row counts I'm expecting, the gain would be negligible (say, 20-50 rows per 5 mins).  Mind you, a row count of 10,000 might offer a better situation for this method.

    So, yes - at the end of the day there is a set-based method for pulling dodgy needles out of the data haystack.  Just need to optimise it when time permits.

    Simon.

    P.S.  Have a Happy Xmas Guys!

    I started some testing of my own, but haven't been able to devote enough time to get any solid results yet.  Based on what I have so far, the "reducing-rowcount" method appears to be slightly better than a cursor with low percentages of bad data, but worse with high percentages.  Of course there are a lot of other factors that I haven't considered yet, and I haven't optimized any of the tests yet, so my final results could be much different. I also need to generate a much larger set of test data and I would like to test with much larger rows than I am now.

    Merry X-Mas, Happy Hannukah, and Cool Kwanzaa!

    Have been using the same method for quite a few years now, as I very rarely use cursors; but then maybe that's why: he is new to SQL Server.

  • How much of the speed up was due to the shift to a table variable and how much due to removing cursors? It would be interesting if you would retry the experiment but generate the table variable and then do the cursor from that.

    (The query plans and everything else is thus wildly different between these two examples. For all I know, the 'cursor code' locks a lot more tables than the non-cursor, table-variable using code.)
