
Writing Parquet Files – #SQLNewBlogger


Recently I’ve been looking at archiving some data from SQL Saturday, possibly querying it, and perhaps building a data warehouse of sorts. The modern view of data warehousing seems to be built on a Lakehouse architecture, where data moves through different phases and much of it is stored in flat files, often parquet files.

As a start, I decided to try moving data to parquet. This post looks at writing parquet files.

Another simple post from me that hopefully serves as an example for people trying to start blogging as #SQLNewBloggers.

Writing Parquet Files

In a previous post I looked at reading in JSON data, which is how some of my data is archived. I also talked about importing modules. There is a module called pyarrow that allows me to work with various parts of Apache Arrow.

One of the submodules in pyarrow is the parquet module, which lets me read and write parquet files. So, let’s get those modules.

import pyarrow as pa
import pyarrow.parquet as pq

I am giving these short names (aliases) so I can refer to them in code. Now, let’s skip the code from the previous article and assume I’ve got a dataframe with my sessions in it.
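
If you don’t have that article handy, here’s a minimal stand-in dataframe for following along. The column names are placeholders I invented, not the real session schema:

import pandas as pd

# Placeholder columns invented for this sketch; the real dataframe
# comes from parsing the SQL Saturday JSON in the previous article
df = pd.DataFrame({
    'session_id': [1, 2],
    'title': ['Intro to Parquet', 'Indexing Deep Dive'],
    'speaker': ['A. Speaker', 'B. Presenter']
})

So, how do I get a parquet file from this?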

Fortunately, I don’t need to know anything about the physical structure of the format, as I can use the write_table() function from the parquet module to do that. I’ll also use the pyarrow.Table.from_pandas() function to get data from the dataframe into a pyarrow table. This code does that (with some setup for a filename).

    # f, outPath, join, and df come from the previous article's code
    outputFilename = f + '.parquet'
    outputFile = join(outPath, outputFilename)
    # Convert the pandas dataframe to an Arrow Table
    pqtable = pa.Table.from_pandas(df)
    # Write the Arrow Table to a parquet file
    pq.write_table(pqtable, outputFile)

Note: I don’t know the technical differences between how pandas dataframes and pyarrow tables work. From the notes I found online, it looks like pyarrow tables can handle more complex, nested data structures.
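
To give a sense of what “more complex” means, here’s a small sketch: pyarrow tables can hold nested values, such as a list per row, which a flat relational table can’t store directly. The column names here are made up for illustration:

import pyarrow as pa

# A column where each row holds a list of strings
nested = pa.table({
    'session_id': [1, 2],
    'tags': [['sql', 'parquet'], ['python']]
})
print(nested.schema)   # tags: list<item: string>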

Once this code is added to the code from the previous article (which is why it’s already indented), it will write .parquet files to the bronze folder underneath the location where it is run. In essence, this takes data from the raw folder and writes it to bronze in a new format.
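
Putting the pieces together, here’s a rough sketch of what the combined loop might look like. The raw and bronze folder names match what’s described above, but the 'sessions' key and the JSON parsing details are assumptions; the previous article’s actual code may differ:

import json
from os import listdir, makedirs
from os.path import join, splitext

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

inPath = 'raw'       # assumed input folder of archived JSON files
outPath = 'bronze'   # assumed output folder for parquet files
makedirs(outPath, exist_ok=True)

for jsonName in listdir(inPath):
    if not jsonName.endswith('.json'):
        continue
    with open(join(inPath, jsonName)) as jsonFile:
        data = json.load(jsonFile)
    # Assumption: the session data sits under a 'sessions' key
    df = pd.json_normalize(data['sessions'])
    f = splitext(jsonName)[0]
    outputFilename = f + '.parquet'
    outputFile = join(outPath, outputFilename)
    pqtable = pa.Table.from_pandas(df)
    pq.write_table(pqtable, outputFile)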

Summary

This post shows how to write parquet files out from JSON data. Take the previous article and this one and you can move data from JSON to parquet.
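
A quick sanity check is to read one of the new files back with pq.read_table() (the file name here is just a placeholder):

import pyarrow.parquet as pq

# Read one of the new files back to confirm the round trip worked
# (the file name here is just a placeholder)
table = pq.read_table('bronze/sessions.parquet')
print(table.num_rows)
print(table.schema)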

This code isn’t perfect. In fact, it needs work. I am only moving session data, so only a portion of the JSON data. This code should be enhanced, or the file names changed to reflect that, but for now, this is a quick example of producing parquet data.

SQL New Blogger

This post took about 10 minutes to write once I had the code working. In fact, adding these functions to the code from the last article only took a few minutes. I had to debug a few things to get the files into the correct folder, but it took longer to get these words down than to get the code working.

Not a lot longer, but longer.

You can do this. If you want to work in modern technologies, learn them. Learn how to work with parquet, which is being used a lot in data warehousing, and then write about it. Prove you can get things done and your current employer, or your next one, might give you a project to actually do this work.

