The struggle to renormalize data

  • xsevensinzx - Friday, September 15, 2017 6:00 PM

    Chris Harshman - Friday, September 15, 2017 7:39 AM

    Jeff Moden - Thursday, September 14, 2017 5:38 PM

    You mentioned what you don't like.  What method for large data transfers do you prefer?

    I actually prefer pipe-delimited text files. I've almost never seen the pipe used in the value of a column, as opposed to commas, which are much more frequent, and it makes the file more compact than a fixed-width text file. I do prefer fixed width to comma-delimited, though.

    Depends on the industry. In advertising, pipes, hyphens, and even underscores are commonly entered by end users on the front end, which means you still have to escape them in a pipe-delimited export or have the application reject them.

    This is why I use þorn over pipe.

    That þorn can be a thorn in one's side if the text is Icelandic 😉
    😎
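
    A quick sketch of the two options, in Python purely for illustration (the sample rows, file names, and the reliance on the csv module's quoting behaviour are my assumptions, not anything from the posts above):

        import csv

        rows = [
            (1, "Acme Advertising", "US"),   # made-up sample data
            (2, "Þórunn ehf", "IS"),         # Icelandic text containing a thorn
        ]

        # Pipe-delimited export: compact, but a user-entered "|" has to be quoted.
        with open("export_pipe.txt", "w", newline="", encoding="utf-8") as f:
            csv.writer(f, delimiter="|").writerows(rows)

        # Thorn-delimited export: same idea, just a character users rarely type,
        # unless, of course, the data is Icelandic.
        with open("export_thorn.txt", "w", newline="", encoding="utf-8") as f:
            csv.writer(f, delimiter="þ").writerows(rows)

    csv.writer quotes any field that happens to contain the chosen delimiter, which is exactly the escaping overhead being discussed.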

  • Eirikur Eiriksson - Saturday, September 16, 2017 1:02 AM

    That þorn can be a thorn in one's side if the text is Icelandic 😉
    😎

    Yep, thorns rock. I try to use it where I can.

  • Jeff Moden - Friday, September 15, 2017 11:31 PM

    If I had my preference, I'd use CHAR(31) for the delimiter and CHAR(30) for the "end-of-line".

    They'd certainly work. Trying to get people to use them would be the hard part.  I have one feed that comes to me every month that's both delimited (semicolon) and fixed-width.  It's needlessly large because of all the padding, but it still beats the bloated formats hands-down.
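
    As a rough illustration of that idea (Python used for neutrality; the sample rows and file name are made up), the ASCII unit separator (31) and record separator (30) exist precisely for this job and are characters no end user will ever type into a front end:

        # Build a file that uses CHAR(31) between fields and CHAR(30) between records.
        US, RS = chr(31), chr(30)   # unit separator, record separator

        rows = [
            ("1001", "2017-09-15", "pipes | and þorns are harmless in the data now"),
            ("1002", "2017-09-16", "no escaping or quoting required"),
        ]

        with open("export.dat", "w", encoding="utf-8", newline="") as f:
            f.write(RS.join(US.join(fields) for fields in rows))

    The consumer still has to be told about those terminators, of course; whether bcp, BULK INSERT, or any other loader accepts those control characters directly depends on the tool and version, so that part would need checking before committing to the format.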

  • If you have known columns represented in a fixed-width or delimited file, there is so little processing to do that it is always going to be fast on any platform.

    I can remember Buck Woody joking that you could upset a NoSQL fan by observing how much their brand-new format resembled ancient COBOL flat files. I'm probably committing heresy by wishing they really had used such a format. At least awk, grep, and bcp could make a decent fist of it.

    I've never had any particular problems with XML for less than 2 GB of data. That's mainly because I don't shred it using SQL. I use XSLT to get it into a format where bcp can be brought to bear.

    A format where random attributes can be missing is always going to be problematic. Apache Spark contains the means to generate a JSON schema from a JSON document. Obviously, if elements are missing then that schema has to be manually edited, but once you have the schema document you get a guaranteed contract. You can write ANSI SQL:2011 against the documents without needing strange constructs in the FROM clause.
    To give a crude performance guideline, 2 billion records BCP'd take <20 minutes. The same data as JSON on a small Spark cluster will take >24 hours.
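
    A minimal PySpark sketch of that infer-then-fix-the-schema workflow (the paths, view name, and column names are invented for the example; only the read/schema/SQL calls are the real API):

        import json
        from pyspark.sql import SparkSession
        from pyspark.sql.types import StructType

        spark = SparkSession.builder.appName("json-contract").getOrCreate()

        # 1) Let Spark infer a schema from a sample of the JSON documents.
        sample = spark.read.json("/data/feed/sample.json")
        schema_doc = sample.schema.json()          # serialise the inferred schema

        # 2) Persist and hand-edit schema_doc to add any attributes missing from
        #    the sample, then reload it as the fixed "contract".
        contract = StructType.fromJson(json.loads(schema_doc))

        # 3) Read the full feed against the explicit schema: no inference pass,
        #    and missing attributes simply come back as NULL.
        feed = spark.read.schema(contract).json("/data/feed/*.json")

        # 4) Plain SQL over the documents, nothing strange in the FROM clause.
        feed.createOrReplaceTempView("feed")
        spark.sql("SELECT some_id, COUNT(*) AS n FROM feed GROUP BY some_id").show()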

  • David.Poole - Monday, September 18, 2017 1:21 AM

    To give a crude performance guideline, 2 billion records BCP'd take <20 minutes. The same data as JSON on a small Spark cluster will take >24 hours.

    Understood on the "crude" performance comparison. I am curious, though: how big were the records, and how many columns of which data types were in each case? A SWAG will satisfy that curiosity. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @Jeff: 5 integers, a datetime, and 6 varchars, 2 of which were varchar(max), for the bcp version.

    Similar for the JSON version, but with some additional attributes and pointless nesting. I quite like Spark and, used for the right thing, it has a lot to offer. I'm just not convinced that something that can be achieved with bcp should be replaced with something that cannot.

  • David.Poole - Monday, September 18, 2017 8:15 AM

    I'm just not convinced that something that can be achieved with bcp should be replaced with something that cannot.

    I strongly second that!

    --Jeff Moden


