Sort CSV with uneven columns

  • I have an unusual requirement and am hoping somebody here can suggest something to take the pain out of it - please bear with the painful explanation.

    I've been asked to supply a CSV file to represent hierachical data - please don't point out the folly of this as it's only too obvious to me that we should be using XML, but anyway....

    To work around being able to express a hierarchy in CSV we have a "record types" flag - A, B or C. A is for the customer, B and C are items of clothing the customer bought where C is footwear and B is anything else. An A record has three fields - record type, customer id, name. B and C records have 5 fields - record type, customer id, clothing type, colour, price. Currently I have a SSIS package which outputs the record types seperately - all the A records, then B then C, so that if I concatenate the results I get e.g:

    A, 1, Tom

    A, 2, Bill

    B, 1, Shirt, Blue, 19.99

    B, 1, Jeans, Grey, 24.99

    B, 2, Hat, Black, 9.99

    C, 2, Socks, Grey, 3.99

    But the format requested is that all of the same customers records are sorted together so that the CSV output should be:

    A, 1, Tom

    B, 1, Shirt, Blue, 19.99

    B, 1, Jeans, Grey, 24.99

    A, 2, Bill

    B, 2, Hat, Black, 9.99

    C, 2, Socks, Grey, 3.99

    I'm looking for ideas on the simplest way of doing this, bearing in mind that this is very simplified example data and that really I'm dealing with thousands of records. Any help or suggestions greatly appreciated.

  • Simply foolish choice.

    Anyway, you could pre-process the files splitting them into three files depending on the A/B/C flag.

    -- Gianluca Sartori

  • I would opt for using 3 files as suggested, but if this not possible then you could put the results of your current package into a single staging table and then select fromt that ordering by CustomerId, Flag

    What is the purpose of this file? i can not really see this format being useful for any data processing systems..

  • Crazy as it seems apparently this is the format the destination system wants. Let me explain further - at the moment I already have the three files produced by my package, each the result of a separate query. The last step of my package is to literally concatenate the three, but this obviously produces a file ordered according to record type whereas the request is to have records grouped by customer.

    I already thought of using a staging table and maybe there is no better way - of course this nearly doubles the processing time which is also a factor.

    I was just hoping somebody out there might have some inspiring new angle.

  • You might find that using a staging table actually improves the performance of the package since SQL server is more efficent at joining/concatenating large sets of data

  • Taking the idea of a staging table to the next level, how does this work out for you?

    declare @test-2 table (data varchar(100), cust_id int null)

    insert into @test-2 (data)

    select 'A, 1, Tom' UNION ALL

    SELECT 'A, 2, Bill' UNION ALL

    SELECT 'B, 1, Shirt, Blue, 19.99' UNION ALL

    SELECT 'B, 1, Jeans, Grey, 24.99' UNION ALL

    SELECT 'B, 2, Hat, Black, 9.99' UNION ALL

    SELECT 'C, 2, Socks, Grey, 3.99'

    ;with CTE as

    (

    select cust_id, data, pos = CHARINDEX(' ', data)

    from @test-2

    )

    update cte

    set cust_id = SUBSTRING(data, pos+1, charindex(',', data, pos+1) - pos -1)

    select data

    from @test-2

    order by cust_id, data

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Thanks everyone and WayneS in particular for reminding me of that possibility - I may well base my solution around that idea.

  • I've seen this sort of format before. It seems to be common for transferring data to mainframe and midrange based systems that may not use a RDBMS. I've seen this in interfaces to SAP as well, where the record type field is crucial to proper processing.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply