May 10, 2024 at 1:30 pm
First off, my apologies for what could potentially be a bad title!
I am looking for some general information.
I may have a potential project where I need to pull data from a SQL Server, populate a parquet file and then SFTP this parquet file to the end user. Seems simple enough. The caveat here is that some of the SQL server data needs to be encrypted, placed in the file and then sent via SFTP. Once that file is with the consumer, they will of course need to decrypt those data elements to make them usable within their system.
Does anyone have some good info on an overall process to complete this task? I have been googling and reading up on this, but I dont feel like I am reading up on a complete solution to this.
Any information will be greatly appreciated.
Thank you!
May 10, 2024 at 1:43 pm
I suspect that encrypting the entire file after it has been created would be an easier process. Is that a possibility?
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
May 10, 2024 at 2:04 pm
Phil,
At this point everything is on the table! Any info you can give on this would be appreciated. We would be using ADF for this work as well. It should be worth noting that we will be sending this file via an SFTP server, which could render all of this talk worthless.
Thank you
May 10, 2024 at 2:15 pm
Phil,
At this point everything is on the table! Any info you can give on this would be appreciated. We would be using ADF for this work as well. It should be worth noting that we will be sending this file via an SFTP server, which could render all of this talk worthless.
Thank you
I have not worked with Parquet files, so I cannot help with specifics, sorry.
If you were considering using straight FTP rather than SFTP, and that is the underlying reason for your question, I'd suggest you take FTP out of the equation. Don't even consider it as a possibility.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
May 12, 2024 at 1:52 am
i would suggest two things here. python can grab data into a dataframe, and then export out parquet files.
once the file is created, use GnuPG to create an encrypted version of the file.
then you give the client the public key so they can decrypt the file, and SFTP as expected.
below is some code i created when i was testing python to all sorts of formats, including export to parquet.
###################################################################################################
## Python SQL Server Table to Parquet File using Pandas
###################################################################################################
#####--prerequisites for this script--#############################################################
## # pip install --upgrade pip
## # pip install pandas
## # pip install pyodbc
## # pip install fastparquet
## # pip install openpyxl
## # pip install pyarrow
###################################################################################################
## Get-Data
###################################################################################################
import pyodbc
import pandas as pd
import os
import openpyxl
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
basedirectory = "D:\Bak\Data\GitHub\Workspace\Python\Output"
tablename = '[dbo].[vwServerInventoryPatches]'
query = "SELECT * FROM " + tablename + ";"
server = 'dbaserver.sqlservercentral.com,1433'
database = 'DBAResources'
UseSQLLogin = False
username = 'sql-lowell'
password = 'MyRealPasswordForTheDemo#312'
if (UseSQLLogin):
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE='+database+';UID='+username+';PWD='+ password)
else:
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + database + ';')
cursor = cnxn.cursor()
# select 26 rows from SQL table to insert in dataframe.
df = pd.read_sql(query, cnxn)
## print(df.head(26))
###################################################################################################
## save Data to Parquet
###################################################################################################
parquetfilename = os.path.join(basedirectory,tablename + ".parquet")
if os.path.exists(parquetfilename): os.remove(parquetfilename)
df.to_parquet(parquetfilename, engine='pyarrow')
# df.to_parquet(parquetfilename, engine='fastparquet')
###################################################################################################
## Save Data To excel
###################################################################################################
csvfilename = os.path.join(basedirectory,tablename + ".csv")
if os.path.exists(csvfilename): os.remove(csvfilename)
df.to_csv(csvfilename)
###################################################################################################
## Save Data To excel
###################################################################################################
excelfilename = os.path.join(basedirectory,tablename + ".xlsx")
if os.path.exists(excelfilename): os.remove(excelfilename)
df.to_excel(excelfilename,sheet_name='Results')
##append?
#with pd.ExcelWriter(excelfilename, mode='a') as writer: df.to_excel(writer, sheet_name='Results')
###################################################################################################
## save Data to Flat File
###################################################################################################
tabfilename = os.path.join(basedirectory,tablename + ".tsv")
if os.path.exists(tabfilename): os.remove(tabfilename)
df.to_csv(tabfilename, sep ='\t')
###################################################################################################
## Save Data To csv and then zip
###################################################################################################
zipfilename = os.path.join(basedirectory,tablename + ".zip")
if os.path.exists(zipfilename): os.remove(zipfilename)
compression_opts = dict(method='zip', archive_name=tablename + ".csv")
df.to_csv(zipfilename, index=False, compression=compression_opts)
Lowell
May 13, 2024 at 9:16 pm
Are you looking to encrypt the parquet file or just some data inside? meaning, do you trust everyone that might get the parquet file or only some?
If the former, then use what Lowell posted above. If you want to encrypt some of the data, I'd use some sort of data encryption. This is column level encryption in SQL Server, though you can encrypt some data in a query and pass along the key to users. I think you'd likely use a cert-based asymmetric encryption here. You could also use Python to query and encrypt certain data in a dataframe and not other data. Note, this is then putting binary data in your parquet file. I'm not sure how someone then decrypts that on the other side, but likely also with some app that reads the file and can apply a key and encryption function to the data.
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply