Working with Vertical Datasets

  • I have a specific problem that can be approached in a number of ways, but I would like feedback from others to determine the best way to handle my situation.

    We get fixed-width flat files that are 185 columns wide. The first two columns are the key columns, and the other 183 columns are months. So the header row looks like this: Product Scenario Jan1999 Feb1999 Mar1999 Apr1999 May1999 Jun1999, etc.

    My first approach was to use DTS. I used a Bulk Insert task and an ActiveX transformation to normalize the data by first loading it into a working table and then creating another table keyed by Product, Scenario, and Date. In other words, I flipped the Date from the column axis to the row axis, turning a 4,000-row flat file into a 4,000 * 183 row relational table. I need to do this for approximately 32 files each month, and it took DTS about 6-8 minutes to transform all of the files.
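
    To make the shape of the transformation concrete (placeholder table and column names, and only three of the 183 month columns shown), the end result is essentially this:

    -- Rough sketch of the flip; placeholder names, only 3 of the 183 month columns shown
    CREATE TABLE dbo.MonthlyFact
    (
        Product    CHAR(20)      NOT NULL,
        Scenario   CHAR(20)      NOT NULL,
        PeriodDate SMALLDATETIME NOT NULL,
        Amount     MONEY         NOT NULL,
        CONSTRAINT PK_MonthlyFact PRIMARY KEY (Product, Scenario, PeriodDate)
    )

    -- Flip each month column from the wide working table onto the row axis
    INSERT INTO dbo.MonthlyFact (Product, Scenario, PeriodDate, Amount)
    SELECT Product, Scenario, '19990101', Jan1999 FROM dbo.WideWorking
    UNION ALL
    SELECT Product, Scenario, '19990201', Feb1999 FROM dbo.WideWorking
    UNION ALL
    SELECT Product, Scenario, '19990301', Mar1999 FROM dbo.WideWorking
    -- ...one SELECT per month column (generated rather than typed by hand)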

    Because we port a lot of our SQL Server code to Oracle, I hesitate to be dependent on DTS, so I wanted to write a stored procedure to accomplish this instead. The stored procedure process took only 3 minutes to load and transform all 32 files, which performed better than the DTS solution.
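
    The load half of that procedure is roughly this shape (hypothetical path and file names; the real thing loops over the 32 files and uses a format file describing the fixed-width layout):

    -- Rough shape of the loading step inside the stored procedure (placeholder names/paths)
    BULK INSERT dbo.WideWorking
    FROM 'D:\feeds\product_scenario_01.txt'            -- hypothetical file path
    WITH
    (
        FORMATFILE = 'D:\feeds\product_scenario.fmt',  -- describes the 185 fixed-width columns
        FIRSTROW   = 2                                 -- skip the header row
    )
    -- ...then the UNION ALL flip into the keyed table, then TRUNCATE TABLE dbo.WideWorking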

    Here comes the trickier part: I need to perform date-specific calculations on this data to create an output table for the analysts. The calculations use both the current period and the previous period, so I do this in a view by joining the table to itself with a one-month date offset. And because the analysts require the months to be on the column axis in the output, I have to create a view that flips the Date field back up to the column axis, like a cross-tab report. I create this view dynamically with a stored procedure because the beginning and ending months in the flat files change every month. Creating and storing the calculations (to speed up queries) plus creating and running the cross-tab view takes about half an hour, so the entire process takes approximately 45 minutes each month.
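
    The date-offset part is a plain self-join on the normalized table, something like this (placeholder names; the real financial calculations are more involved than the simple delta shown):

    -- Sketch of the current-period / previous-period join (placeholder names)
    CREATE VIEW dbo.PeriodCalc
    AS
    SELECT  cur.Product,
            cur.Scenario,
            cur.PeriodDate,
            cur.Amount               AS CurrentAmount,
            prev.Amount              AS PreviousAmount,
            cur.Amount - prev.Amount AS Delta        -- stand-in for the real calcs
    FROM    dbo.MonthlyFact AS cur
    JOIN    dbo.MonthlyFact AS prev
            ON  prev.Product    = cur.Product
            AND prev.Scenario   = cur.Scenario
            AND prev.PeriodDate = DATEADD(mm, -1, cur.PeriodDate)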

    Ok, now everyone has the background. I think there must be a better way to accomplish this, because I am doing a lot of flipping of the data on input only to flip it back on output. Since BULK INSERT imports all 32 files in less than a minute, I think there should be a quick way to do the date-offset calculation without normalizing the data first. That way I could load all 32 tables, run the calculations, and store the output in another table in theoretically less than 5 minutes by cutting out the normalization. Could I use a table-valued UDF in this process to accomplish this?
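
    I was picturing something along these lines against the wide working table, so the calculation never has to leave the column axis (made-up names, and only a couple of month pairs shown):

    -- Sketch only: an inline table-valued UDF doing the one-month-offset calc
    -- directly against the wide working table, with no normalization step
    CREATE FUNCTION dbo.fn_WideDeltas ()
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT  Product,
                Scenario,
                Feb1999 - Jan1999 AS Delta_Feb1999,
                Mar1999 - Feb1999 AS Delta_Mar1999
                -- ...one expression per adjacent pair of month columns; the catch is
                -- that the column list changes every month, so the function would have
                -- to be regenerated (dynamic SQL) as part of each load
        FROM    dbo.WideWorking
    )

    The output could then go straight into the analysts' table with an INSERT ... SELECT * FROM dbo.fn_WideDeltas().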

  • Very interesting situation, and congrats on getting this far! A couple of questions: I am assuming that you are not reloading the same data month after month just because you get an additional month's data for a product/scenario, right? Meaning, if you get a flat file for a product that you loaded last month, you only take the data from the file that represents the current month, right? If that isn't what you're doing, you might be able to trim some time by doing so.

    --

    Secondly, I think you are right on track with your thought to include the date-offset calculation in the loading procedure. Storing the calculated fields (I am assuming some sort of median/delta from the previous period?) in a denormalized table structure will give a hefty performance boost to analysis queries and eliminate the need for your cross-tab view, which I imagine is a mess (a bunch of CASE statements?). I would recommend a solution where you make one table (or a vertically partitioned indexed view) for each of the years (or fiscal years) that you deal with, and, in the loading process, add something to your loading stored procedure that builds a dynamic SQL string to populate the relevant table:

    
    
    -- Something like:
    CREATE TABLE ProdScenarioAnalysis1999
    (
    Product CHAR(20) NOT NULL -- Just a guess
    , Scenario CHAR(20) NOT NULL
    , JanRawData MONEY NOT NULL
    , FebRawData MONEY NOT NULL
    -- ... more months
    )
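
    The dynamic string that populates it from the working table doesn't need to be fancy; something like this (again just a sketch, with guessed names):

    -- Sketch of the dynamic population step (placeholder names)
    DECLARE @Year CHAR(4), @sql VARCHAR(8000)
    SET @Year = '1999'   -- derived from the header of the incoming file
    SET @sql = 'INSERT INTO ProdScenarioAnalysis' + @Year
             + ' (Product, Scenario, JanRawData, FebRawData /* ...more months */)'
             + ' SELECT Product, Scenario, Jan' + @Year + ', Feb' + @Year + ' /* ...more months */'
             + ' FROM WideWorking'
    EXEC (@sql)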

    You could store the calculated deltas/statistics either as regular columns or as computed columns...
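
    A computed-column version of, say, a month-over-month delta could be declared right on the table (column names are guesses):

    -- Sketch: month-over-month delta as a computed column on the yearly table
    ALTER TABLE ProdScenarioAnalysis1999
        ADD FebDelta AS (FebRawData - JanRawData)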

    HTH,

    Jay

  • The reason I recommend using one table for each year, BTW, is that it ensures a fixed number of columns in the table, which means that in some of the SQL used in reports and ad hoc work, you don't have to look up the number or names of the columns...

  • Well, I'm not sure I understand everything said above, but a cross-tab does not necessarily require many CASE structures. By far the best explanation of how to do a crosstab query can be found here: http://www.stephenforte.net/owdasblog/PermaLink.aspx?guid=2b0532fc-4318-4ac0-a405-15d6d813eeb8

    You mentioned 'analysts'?

    What about providing the raw data as an Excel file and having an Excel macro pivot it and do some nice-looking formatting?

    Frank

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • Frank,

    Nice link, neat technique. Actually, this goes to the last point I was making... the technique shown is only easily accomplished if the tables have a fixed set of 12 monthly columns; otherwise, how would one calculate the offset the technique uses to compute the sums? (The offset is the month-number literal subtracted inside each DATEPART expression):

    
    
    SUM((UnitPrice*Quantity)*(1-ABS(SIGN(DatePart(mm,OrderDate)-1)))) AS Jan,
    SUM((UnitPrice*Quantity)*(1-ABS(SIGN(DatePart(mm,OrderDate)-2)))) AS Feb...

    Cheers,

    Jay

  • Thanks for the replies so far.

    To answer some of your questions: We do use Excel pivot tables extensively, and that was one of the reasons I chose to normalize the data; by letting Excel pivot the data, I could cut out the last step in my process. Unfortunately, I ran into an artificial memory barrier in Excel XP. It has to read approximately 2 million rows into its cache before display, but it failed above 1.6 million rows. After some internet research, I discovered that although Microsoft states Excel is limited only by the memory on your machine, it actually manages its own memory and has its own limits (I was only using 50% of the memory on my machine). I tested the Excel 2003 beta, and it was successful, but that isn't an option in my company.

    Yes, I am using a lot of CASE statements for my T-SQL cross-tab view. I build it dynamically, so I don't have to type all of the code by hand.
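
    Roughly, the generator is along these lines (heavily simplified, with placeholder names; in practice, with 183 months, the string would have to be split across several variables to get past the varchar(8000) limit):

    -- Simplified sketch of how a CASE-based cross-tab view gets generated
    DECLARE @sql VARCHAR(8000), @d SMALLDATETIME, @last SMALLDATETIME

    SELECT @d = MIN(PeriodDate), @last = MAX(PeriodDate) FROM dbo.MonthlyFact
    SET @sql = 'CREATE VIEW dbo.CrossTab AS SELECT Product, Scenario'

    WHILE @d <= @last
    BEGIN
        SET @sql = @sql
            + ', SUM(CASE WHEN PeriodDate = ''' + CONVERT(CHAR(8), @d, 112) + ''''
            + ' THEN Amount ELSE 0 END) AS ['
            + LEFT(DATENAME(mm, @d), 3) + DATENAME(yy, @d) + ']'
        SET @d = DATEADD(mm, 1, @d)
    END

    SET @sql = @sql + ' FROM dbo.MonthlyFact GROUP BY Product, Scenario'
    -- drop any existing version of the view first, then:
    EXEC (@sql)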

    Yes, the data is completely different month to month, so I cannot just append a new month's data.

    Kory

  • It actually sounds like you are working on a data warehouse solution. Have you used Analysis Services for this kind of thing?

  • It may sound like a DW solution, but actually it isn't; it will end up being more transactional in nature. The 180 months are projections into the future based on current market data. The calculations I am performing are financial in nature, with these projections as inputs. Our intention is to precalculate these forecasts and store them. Then actual business transactions will be matched to individual forecasts based on their characteristics. So if we have 10,000 new transactions, the process will match each one with a "similar" forecast projection (out of the 4,000) and return a 180-month projection matrix for each transaction.

    For fun, I tried creating a cube in AS to do the calculations, etc. Processing was fast, and because the number of dimensions was very small, I aggregated 100%. Unfortunately, this didn't make retrieving the data any faster than the other methods I've tried.

  • Well, it was just a thought... what about the original suggestion of splitting into yearly tables?

  • Hi Kory,

    quote:


    To answer some of your questions: We do use Excel pivot tables extensively. This was one of the reasons I chose to normalize the data. By allowing Excel to pivot the data, I could cut out the last step in my process. Unfortunately, I ran into an artificial memory barrier in Excel XP. It has to read approx. 2 million rows into the cache before display, but failed above 1.6 million rows. After doing some internet research, I discovered that although Microsoft states Excel is only limited by the memory on your machine, it actually manages it's own memory and has limitations (I was only using 50% of the memory on my machine). I tested Excel 2003 beta, and it was successful, but that isn't an option in my company.


    Although Excel is the bread and butter of my daily work, I have never dealt with that many rows in it, and I strongly believe Excel wasn't designed to handle numbers like that. After all, it's a spreadsheet application.

    But Access also has some nice crosstab features.

    I haven't tried it with that number of rows either, though.

    I guess I would go with Jay's original suggestion.

    Frank

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • quote:


    It may sound like a DW solution, but actually it isn't. It will end up being more transactional in nature. The 180 months are projections into the future based on current market data. The calculations I am performing are financial in nature based on these projections as inputs. Our intention is to precalculate these forecasts, and store them.


    Hm, if you are able to predict precisely where the stock markets will end this year, let me know. I'll organize the money, and we'll meet in Nassau or on the Caymans.

    Frank

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/
