Difficult data transformation problem..

  • Hey guys,

    I have a difficult data transformation issue that is in serious need of an overhaul. The table structure, which I have no control over, is:

    Table: IM_BarCodes

      Item_no char(15),

      Barcode char(20),

      Color_Index char(2),

      Size_Index char(2)

      Table: IM_Quantities [quantities for each item / color / size]

      Item_No char(15),

      Color_Index char(2),

      Size_Index char(2),

      QuantityOnHand int

      Table: IM_Grid [sizes available for each store / item / color / size]

      Item_no char(15),

      Store_No char(15),

      color_1 char(9),

      color_2 char(9),

      color_3 char(9),

      color_4 char(9), (there are actually 40 of these columns)

      Size_1 char(5),

      Size_2 char(5),

      Size_3 char(5),

      Size_4 char(5) (there are actually 40 of these columns)

      The above tables feed into the one below:

      Table: INV_Barcodes

      Store_no char(15),

      item_no char(15),

      Barcode char(20),

      Color_Index char(9),

      Size_Index char(5),

      QuantityOnHand int

      First off, the first two tables (IM_Barcodes and IM_Quantities) makes total sense and is easy to join together. However, due to memory and security constraints, I can only send the relevant data to each store... So, if a particular store doesn't carry the specific color / size combination then they don't get the prices. Which leads to the tie in on the IM_Grid table and my problem.

      I have a stored procedure executing nightly that uses a combination of a cursor and dynamic sql (yes, I know, both are bad) to pull out which stores get the pricing / barcode data for a particular item; however, the process takes about 8 hours to run.

      Does anyone see a better, or even just a different, way?

      Thanks,

      Chris

    1. OK, after re-reading your post I decided to post this.. I'm not sure whether it helps, but let's hope so.

      You didn't mention number of rows in table IM_Grid, but since it is taking such a long time, it will be a huge table probably. Did you ever try to write a procedure that normalizes data from this table, and inserts the result into a "normal" table"? You could run this procedure every night before the one that prepares pricing, and then you could rewrite your pricing procedure to use JOIN with the new table. I am not sure whether this will improve the time considerably or not, but it could be worth trying.

      Of course the best way would be to replace the horrible Grid table with something more useful... maybe there could be a way to put pressure on those who CAN change the tables? I know it is hard to achieve such things, but you might just get lucky :-).

    2. By the way, I'm not sure I understand everything.. in the table IM_Grid, color is CHAR(9) and size is CHAR(5). In the first two tables, you have color_index and size_index, both with CHAR(2)... and in your output, the columns have again 5 and 9 respectively. Are there "hidden" tables for Color and Size, which translate the respective index into a value?

    3. Step 1, I think, (normally not so adamant) would be to make sure that whoever designed this mess needs to be identified, captured, disemboweled in public, and have his/her pulsating remains shot butt first out of a cannon into a stone wall.

      There… I feel much better now…

      This is a classic problem of taking someone else’s denormalized garbage and normalizing it. Obviously, I’ve not tested it, but the following code should be very close to what you need and it should be light years faster than the 8 hour cursor method that you identified because of the "divide and conquer" methods used. You are correct… Cursors pretty much stink up the house especially if someone tried to join these three tables to create the fourth using, what I suspect is, nested cursors or at least a cursor nested in a WHILE loop. Dynamic SQL isn’t nearly as bad if it is properly formed and contains few, if any, joins.

      Here’s the code… lemme know how it works… read the comments… this pit-wuffie drops a table...

      --===== If the build table exists, drop it
           IF OBJECT_ID('dbo.INV_Barcodes_Work') IS NOT NULL
              DROP TABLE dbo.INV_Barcodes_Work
      --===== Create the build table schema
       CREATE TABLE dbo.INV_Barcodes_Work
              Store_No       CHAR(15),
              Item_No        CHAR(15),
              Barcode        CHAR(20),
              Color_Index    CHAR(9),
              Size_Index     CHAR(5),
              QuantityOnHand INT
      --===== Declare the local variables
      DECLARE @SQL VARCHAR(8000)
      DECLARE @Counter = INT
      DECLARE @strCounter VARCHAR(2)
      --===== Build the basic data by normalizing the data in the IM_Grid table
          SET @Counter = 1
        WHILE @Counter <= 40
        BEGIN
                  SET @strCounter = CAST(@Counter AS VARCHAR(2))
                  SET @SQL = ' INSERT INTO dbo.INV_Barcodes_Work (Store_No,Item_No,Color_Index,Size_Index)' + CHAR(13)
                  SET @SQL = @SQL + 'SELECT Store_No,Item_No,Color_'+@strCounter+',Size_'+@strCounter+' FROM dbo.IM_Grid WITH (NOLOCK)'+CHAR(13)
                  SET @SQL = @SQL + 'WHERE Color_'+@strCounter+' IS NOT NULL AND Size_'+@strCounter+' IS NOT NULL'
                 EXEC @SQL
                  SET @Counter = @Counter + 1
          END
      --===== Update the Barcode column in the build table
       UPDATE w
          SET Barcode = bc.Barcode
         FROM dbo.INV_Barcodes_Work w, --Cannot NOLOCK object of update
              dbo.IM_BarCodes bc WITH (NOLOCK)
        WHERE w.Item_No     = bc.Item_No
          AND w.Color_Index = bc.Color_Index,
          AND w.Size_Index  = bc.SizeIndex
      --===== Update the QuantityOnHand column
       UPDATE w
          SET QuantityOnHand = q.QuantityOnHand
         FROM dbo.INV_Barcodes_Work w, --Cannot NOLOCK object of update
              dbo.IM_Quantities q WITH (NOLOCK)
        WHERE w.Item_No     = q.Item_No
          AND w.Color_Index = q.Color_Index,
          AND w.Size_Index  = q.SizeIndex
      --====== Cascade rename the tables.... (takes about 65 milliseconds)
      TRUNCATE TABLE dbo.INV_Barcodes_Work --Saves log space because is not logged
          DROP TABLE dbo.INV_Barcodes_Work
          EXEC dbo.sp_Rename 'INV_Barcodes_Work','INV_Barcodes'
      --====== Add any indexes you want here but at least add a primary key...
      

       {Edit} By the way, the result table is actually denormalized because it contains data from the first 2 tables.  But, like someone said, "Normalize 'till it hurts... denormalized 'till it works".  Since the table is rebuilt every night and you only use 1 column from each of the tables, it's "useful" denormalization that should give you some pretty good performance for whatever the 4th table is used for.

      --Jeff Moden


      RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
      First step towards the paradigm shift of writing Set Based code:
      ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

      Change is inevitable... Change for the better is not.


      Helpful Links:
      How to post code problems
      How to Post Performance Problems
      Create a Tally Function (fnTally)

    4. Hey Jeff,

      I think you're method is pretty close to what I need. The only difference I see off the top of my head is that the WHILE loop really needs to have another WHILE loop in it to capture (color_1, size_1), (color_1,size_2), etc. And it looks like you drop the work table prior to renaming it, I'm guessing that was just a typo.

      Believe me, I tried step 1. However, the IM_ tables are copied from a popular Point of Sale system. It's backend database is actually Pervasive SQL 7, which is really just a front for BTrieve... I tried to convince my client that they were probably better off switching POS vendors, but that didn't work.

    5. Yep... copy and paste error... the cascade rename should look like...

      --====== Cascade rename the tables.... (takes about 65 milliseconds)
      TRUNCATE TABLE dbo.INV_Barcodes --Saves log space because is not logged
          DROP TABLE dbo.INV_Barcodes
          EXEC dbo.sp_Rename 'INV_Barcodes_Work','INV_Barcodes'

      It didn't make sense to me to do a Nested WHILE loop because you said there were 40 colors and 40 sizes...  I surmized that Color_1 went with Size_1... it just didn't make sense to have 1600 color and size combinations per item and THAT might be part of why it takes your cursor so long... are you sure you need to cross-join colors and sizes like that?

      --Jeff Moden


      RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
      First step towards the paradigm shift of writing Set Based code:
      ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

      Change is inevitable... Change for the better is not.


      Helpful Links:
      How to post code problems
      How to Post Performance Problems
      Create a Tally Function (fnTally)

    6. Just out of curiousity, how many rows are you dealing with?

      Sven

    7. Better yet......

      1) How many rows are you dealing with

      2) What indexes, if any, exist on the source data

      3) Are you able to recommend the creation of any indexes on the source data

      4) Are you able to dictate the format and design of the destination table(s)

      Sounds like an interesting problem to resolve. 

      Sven

       

    Viewing 8 posts - 1 through 7 (of 7 total)

    You must be logged in to reply to this topic. Login to reply