December 5, 2016 at 8:09 am
I have a situation where I need to import data from a Hadoop file that is in Parquet format. Preferably, I'd like to do this from within a stored procedure that can truncate the staging table, import the data, and then process the data as needed. I know the PolyBase features are in 2016, but our server will be running 2014. Any suggestions?
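To make it concrete, here's the shape of what I'm after (object and file names below are just placeholders; step 2 is the part I can't solve on 2014):

CREATE PROCEDURE dbo.LoadParquetData
AS
BEGIN
    SET NOCOUNT ON;

    -- 1. Clear the staging table.
    TRUNCATE TABLE dbo.ParquetStaging;

    -- 2. Import the data. This is the open question: there's no native
    --    Parquet reader in SQL Server 2014, so something upstream would
    --    have to convert the file into a format BULK INSERT understands.
    -- BULK INSERT dbo.ParquetStaging FROM '\\hadoop\export\data.txt' WITH (...);

    -- 3. Process the staged data as needed.
    EXEC dbo.ProcessStagedData;
END;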
LinkedIn: https://www.linkedin.com/in/sqlrv
Website: https://www.sqlrv.com
December 5, 2016 at 8:51 am
Hi Aaron,
I have no idea what the "parquet" format is. I also don't know what the source is. Would it be from a text file?
If it's from a text file and you can explain what the "parquet" format is and maybe even attach such a file (no proprietary or PII info, please), I could take a whack at it.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 5, 2016 at 8:58 am
Heh... I just looked up "Parquet Format" online. You would probably be better off writing a magic decoder ring for this in Java to expand the data into a CSV file and import that with SQL.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 5, 2016 at 9:17 am
That's the thing I like about you, Jeff. You're always full of optimism and hope! 😛
However, I've not given up hope yet. Who knows? Support for XML is here and JSON is nearly so, so maybe Parquet will follow. I've also considered a Linked Server approach, although I generally don't favor those due to performance issues.
LinkedIn: https://www.linkedin.com/in/sqlrv
Website: https://www.sqlrv.com
December 5, 2016 at 9:23 am
Just trying to use the right tool for the job. As with most things, shredding the Parquet format in SQL Server could be done but, as with even the built-in features for XML and JSON, SQL Server probably isn't the right place to do it.
Can't Hadoop do the data expansion into a nice, neat, high-performance TAB-delimited file? I'd be disappointed if it couldn't.
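If it can, the import side becomes trivial. Something like this would do it (file path and table name made up, of course):

BULK INSERT dbo.ParquetStaging
FROM '\\hadoopserver\export\data.tsv'
WITH (
    FIELDTERMINATOR = '\t',   -- TAB between columns
    ROWTERMINATOR   = '\n',
    TABLOCK                   -- allows minimal logging if the recovery model permits
);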
--Jeff Moden
Change is inevitable... Change for the better is not.
December 5, 2016 at 9:31 am
The team did identify a method of pushing the data to SQL Server, but this approach requires coordinating the stored procedure execution on SQL Server to occur AFTER the data push. I was hoping to avoid such a scheduling/dependency nightmare by having it all coordinated within the SQL Server stored procedure, but I may have to take a different approach. Of course, if this were 2016, we'd have PolyBase at our disposal...
LinkedIn: https://www.linkedin.com/in/sqlrv
Website: https://www.sqlrv.com
December 5, 2016 at 3:45 pm
Aaron N. Cutshall (12/5/2016)
The team did identify a method of pushing the data to SQL Server, but this approach requires coordinating the stored procedure execution on SQL Server to occur AFTER the data push. I was hoping to avoid such a scheduling/dependency nightmare by having it all coordinated within the SQL Server stored procedure, but I may have to take a different approach. Of course, if this were 2016, we'd have PolyBase at our disposal...
Why wouldn't a trigger do it for you? For that matter, why couldn't Hadoop call a batch file that fires off SQLCMD?
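For example (control table and proc names are made up), have the push finish by writing one row to a signal table and let a trigger take it from there:

CREATE TRIGGER dbo.trg_LoadControl_Process
ON dbo.LoadControl        -- hypothetical "load complete" signal table
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Kick off the downstream processing as soon as the push signals "done".
    EXEC dbo.ProcessStagedData;
END;

Just remember the proc then runs inside the inserting session's transaction, so if the processing is long, have the trigger start an Agent job (msdb.dbo.sp_start_job) instead. Or skip the trigger entirely and have Hadoop's batch file run something like sqlcmd -S YourServer -d YourDB -Q "EXEC dbo.ProcessStagedData;" as its last step (server and database names made up, of course).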
--Jeff Moden
Change is inevitable... Change for the better is not.