Wow that was fast!

  • I am very pleased with myself today.

    I was tasked with reconciling data in a database with a CSV file.

    The CSV file is 53 MB in size and contains 42,471 rows of data.

    I needed to:

  • do a record count
  • sum up a field containing monetary values
  • sum up another field containing monetary values
  • find the earliest date in the 'ledger date' column
  • find the latest date in the 'ledger date' column
  • output a record count grouped by a char(2) column
  • find the minimum sequence number
  • find the maximum sequence number
  • find the sum of the sequence numbers
  • My options were (what sprang to mind, at least):

  • bring the file in via SSIS and process it with SQL
  • query the file directly using OPENQUERY
  • process the CSV with a PHP script and pipe the data into a table
  • write a compiled binary program to do all the counts etc., output them to the command line, and capture them with insert/exec via xp_cmdshell
  • I went for the compiled binary option because I wanted it to be fast.

    I have written a C++ program which loads the ENTIRE CSV file into a single char* array in RAM.

    It then loops through the file, processing the data. (I handled the date fields by converting them into an integer Julian date using nothing but pure maths!)
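    For anyone curious what the single-pass approach might look like, here's a minimal sketch. This is not Ben's actual code: the column layout (amount1,amount2,yyyy-mm-dd,code,seq) and all the names are assumptions; the "pure maths" date conversion shown is the Fliegel & Van Flandern Julian Day formula, one common way to do it with integer arithmetic only.

```cpp
#include <climits>
#include <cstdio>
#include <cstring>
#include <map>
#include <string>

// Date -> integer Julian Day Number using nothing but integer maths
// (Fliegel & Van Flandern), so dates min/max as plain longs.
long jdn(int y, int m, int d) {
    long a = (14 - m) / 12;
    long yy = y + 4800L - a;
    long mm = m + 12L * a - 3;
    return d + (153 * mm + 2) / 5 + 365 * yy + yy / 4 - yy / 100 + yy / 400 - 32045;
}

struct Stats {
    long count = 0, minJdn = LONG_MAX, maxJdn = LONG_MIN;
    long long minSeq = LLONG_MAX, maxSeq = LLONG_MIN, sumSeq = 0;
    double sum1 = 0, sum2 = 0;
    std::map<std::string, long> byCode;  // record count grouped by the char(2) column
};

// One pass over the whole file, already loaded into a single writable
// char buffer; strtok carves it into lines in place.
// Assumed row layout: amount1,amount2,yyyy-mm-dd,code,seq
Stats aggregate(char *buf) {
    Stats s;
    for (char *line = std::strtok(buf, "\r\n"); line; line = std::strtok(nullptr, "\r\n")) {
        double a1, a2;
        int y, m, d;
        char code[3];
        long long seq;
        if (std::sscanf(line, "%lf,%lf,%d-%d-%d,%2[^,],%lld",
                        &a1, &a2, &y, &m, &d, code, &seq) != 7)
            continue;  // header or malformed row
        ++s.count;
        s.sum1 += a1;
        s.sum2 += a2;
        long j = jdn(y, m, d);
        if (j < s.minJdn) s.minJdn = j;
        if (j > s.maxJdn) s.maxJdn = j;
        if (seq < s.minSeq) s.minSeq = seq;
        if (seq > s.maxSeq) s.maxSeq = seq;
        s.sumSeq += seq;
        ++s.byCode[code];
    }
    return s;
}
```

    Because every aggregate falls out of one loop over a buffer that's already in RAM, the 250 ms figure is believable for ~42k rows.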

    I was a little dubious about the insert/exec, thinking that might slow it down a bit, but...

    The entire process, including pulling the results into SQL Server variables, takes 250 ms. Yes, 250 milliseconds exactly!
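    The insert/exec step just captures the program's stdout into a table. A hedged sketch of the output side (the exe name, path, delimiter, and column order are my assumptions, not from the thread):

```cpp
#include <cstdio>
#include <string>

// On the SQL Server side, the capture would look roughly like:
//   CREATE TABLE #out (line varchar(4000));
//   INSERT #out EXEC master..xp_cmdshell 'C:\tools\csvstats.exe data.csv';
//   -- then split #out.line into @count, @sum1, ... variables
//
// Keeping every aggregate on one pipe-delimited line keeps the
// INSERT/EXEC result set to a single usable row.
std::string format_row(long count, double sum1, double sum2,
                       long minJdn, long maxJdn,
                       long long minSeq, long long maxSeq, long long sumSeq) {
    char buf[160];
    std::snprintf(buf, sizeof buf, "%ld|%.2f|%.2f|%ld|%ld|%lld|%lld|%lld",
                  count, sum1, sum2, minJdn, maxJdn, minSeq, maxSeq, sumSeq);
    return std::string(buf);
}
```

    The program would simply printf this line, and the T-SQL side parses it once after the INSERT/EXEC returns.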

    This makes me happy 🙂

    Ben

    ^ That's me!

    ----------------------------------------
    01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
    ----------------------------------------

  • Well done.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Lordy... I can't imagine taking the trouble to write a dedicated C++ program just to do all of those specific aggregations on a raw CSV file.

    Don't get me wrong, I love C++/C#, and it's fast for doing low-level processes quickly and repeatedly. But I'm confident it would take no more than a few minutes to import a CSV file into a table (just using the file import wizard in SSMS) and run a few queries on it to produce repeatable and reliable results. I'm certain it would take me far less time to do that than it would to write a C++/C# program to do the same thing on the raw file data.

    The execution time might be 250 ms, but how many hours did it take to write the program?

    The probability of survival is inversely proportional to the angle of arrival.

  • Heh heh, about 4 hours I think it took me.

    The idea behind it was that the fewer steps that had to take place, the less could fall over when processing in batch mode.

    In my estimation, a well-written 15 KB exe with a few simple functions is less likely to fall over than some overly complicated SSIS package and a bunch of aggregate queries.

    That, and it was more fun to do it this way 🙂

    Ben


  • BenWard (5/11/2011)


    heh heh about 4 hours I think it took me.

    ... that and it was more fun to do it this way 🙂

    Seems like you cost your employer/customer 3+ hours of your productivity, all because it "was more fun to do it this way". Having fun is great. I have fun at work all the time. It's a natural part of doing a good job.

    Does your employer pay you for correct answers, or for how much fun you had getting them? If it's the latter, are they hiring?

    [font="Verdana"]Please don't go. The drones need you. They look up to you.[/font]
    Connect to me on LinkedIn

  • lol, it's a program that will be run regularly as part of a big process. The fast execution time going forward is definitely worth 3 hours' initial development cost. 😉 And unfortunately no, we aren't hiring!

    Ben


  • If the process occurs every day on the same file, in the same format, I'd agree there's a valid argument for doing it in a dedicated process, and that the time taken to develop and test it will be amortized over a long period of time... given that the file format and/or data types never change.


  • A simple bulk insert into SQL Server and then a SQL query would probably have been my approach, but hey, sounds like you did it in a way that was fun for you.

  • I'm just happy you're happy. :smooooth:
