The struggle to renormalize data

  • xsevensinzx - Friday, September 15, 2017 6:00 PM

    Chris Harshman - Friday, September 15, 2017 7:39 AM

    Jeff Moden - Thursday, September 14, 2017 5:38 PM

    You mentioned what you don't like.  What method for large data transfers do you prefer?

    I actually prefer pipe-delimited text files. I've almost never seen the pipe used in the value of a column, as opposed to commas, which are much more frequent, and it makes the file more compact than a fixed-width text file. I do prefer fixed width to comma-delimited, though.

    Depends on the industry. In advertising, pipes, hyphens, and even underscores are commonly entered by end users on the front end, which means you still have to escape them in a pipe-delimited export or have the application reject them.

    This is why I use þorn over pipe.

    That þorn can be a thorn in one's side if the text is Icelandic 😉
    😎
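
    A quick sketch of the two options, in Python purely for illustration (the sample rows, file names, and the reliance on the csv module's quoting behaviour are my assumptions, not anything from the posts above):

        import csv

        rows = [
            (1, "Acme Advertising", "US"),   # made-up sample data
            (2, "Þórunn ehf", "IS"),         # Icelandic text containing a thorn
        ]

        # Pipe-delimited export: compact, but a user-entered "|" has to be quoted.
        with open("export_pipe.txt", "w", newline="", encoding="utf-8") as f:
            csv.writer(f, delimiter="|").writerows(rows)

        # Thorn-delimited export: same idea, just a character users rarely type,
        # unless, of course, the data is Icelandic.
        with open("export_thorn.txt", "w", newline="", encoding="utf-8") as f:
            csv.writer(f, delimiter="þ").writerows(rows)

    csv.writer quotes any field that happens to contain the chosen delimiter, which is exactly the escaping overhead being discussed.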

  • Eirikur Eiriksson - Saturday, September 16, 2017 1:02 AM

    That þorn can be a thorn in one's side if the text is Icelandic 😉
    😎

    Yep, thorns rock. I try to use it where I can.

  • Jeff Moden - Friday, September 15, 2017 11:31 PM

    If I had my preference, I'd use CHAR(31) for the delimiter and CHAR(30) for the "end-of-line".

    They'd certainly work. Trying to get people to use them would be the hard part.  I have one feed that comes to me every month that's both delimited (semicolon) and fixed-width.  It's needlessly large because of all the padding, but it still beats the bloated formats hands-down.
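
    As a rough illustration of that idea (Python used for neutrality; the sample rows and file name are made up), the ASCII unit separator (31) and record separator (30) exist precisely for this job and are characters no end user will ever type into a front end:

        # Build a file that uses CHAR(31) between fields and CHAR(30) between records.
        US, RS = chr(31), chr(30)   # unit separator, record separator

        rows = [
            ("1001", "2017-09-15", "pipes | and þorns are harmless in the data now"),
            ("1002", "2017-09-16", "no escaping or quoting required"),
        ]

        with open("export.dat", "w", encoding="utf-8", newline="") as f:
            f.write(RS.join(US.join(fields) for fields in rows))

    The consumer still has to be told about those terminators, of course; whether bcp, BULK INSERT, or any other loader accepts those control characters directly depends on the tool and version, so that part would need checking before committing to the format.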

  • If you have known columns represented in a fixed-width or delimited file, there is so little processing to do that it is always going to be fast on any platform.

    I can remember Buck Woody joking that you could upset a NoSQL fan by observing how much their brand-new format resembled ancient COBOL flat files. I'm probably committing heresy by wishing they really had used such a format. At least awk, grep, and bcp could make a decent fist of it.

    I've never had any particular problems with XML for less than 2 GB of data. That's mainly because I don't shred it using SQL. I use XSLT to get it into a format where bcp can be brought to bear.

    A format where random attributes can be missing is always going to be problematic. Apache Spark contains the means to generate a JSON schema from a JSON document. Obviously, if elements are missing then that schema has to be manually edited, but once you have the schema document you get a guaranteed contract. You can write ANSI SQL:2011 against the documents without needing strange constructs in the FROM clause.
    To give a crude performance guideline, 2 billion records BCP'd take <20 minutes. The same data as JSON on a small Spark cluster will take >24 hours.
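
    A minimal PySpark sketch of that infer-then-fix-the-schema workflow (the paths, view name, and column names are invented for the example; only the read/schema/SQL calls are the real API):

        import json
        from pyspark.sql import SparkSession
        from pyspark.sql.types import StructType

        spark = SparkSession.builder.appName("json-contract").getOrCreate()

        # 1) Let Spark infer a schema from a sample of the JSON documents.
        sample = spark.read.json("/data/feed/sample.json")
        schema_doc = sample.schema.json()          # serialise the inferred schema

        # 2) Persist and hand-edit schema_doc to add any attributes missing from
        #    the sample, then reload it as the fixed "contract".
        contract = StructType.fromJson(json.loads(schema_doc))

        # 3) Read the full feed against the explicit schema: no inference pass,
        #    and missing attributes simply come back as NULL.
        feed = spark.read.schema(contract).json("/data/feed/*.json")

        # 4) Plain SQL over the documents, nothing strange in the FROM clause.
        feed.createOrReplaceTempView("feed")
        spark.sql("SELECT some_id, COUNT(*) AS n FROM feed GROUP BY some_id").show()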

  • David.Poole - Monday, September 18, 2017 1:21 AM

    To give a crude performance guideline, 2 billion records BCP'd take <20 minutes. The same data as JSON on a small Spark cluster will take >24 hours.

    Understood on the "crude" performance comparison. I am curious, though: how big were the records, and how many columns of which data types were in each case? A SWAG will satisfy that curiosity. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @Jeff: 5 integers, a datetime, and 6 varchars, 2 of which were varchar(max), for the bcp version.

    Similar for the JSON version, but with some additional attributes and pointless nesting. I quite like Spark and, used for the right thing, it has a lot to offer. I'm just not convinced that something that can be achieved with bcp should be replaced with something that cannot.

  • David.Poole - Monday, September 18, 2017 8:15 AM

    I'm just not convinced that something that can be achieved with bcp should be replaced with something that cannot.

    I strongly second that!

    --Jeff Moden


