April 19, 2017 at 6:53 am
xsevensinzx - Wednesday, April 19, 2017 6:41 AM
Yeah, it's pretty easy to create a Python script that will traverse the directories on Windows and then read each file within a for loop, where you basically clean each file one by one. In this case, you could likely load each file into Pandas, which will let you load the data into a dataframe (like a virtual table) where you can then apply SQL-like methods to clean the data. It's not super fast on larger datasets, but you can easily rearrange or clean the data. Pandas then lets you easily export the dataframe to CSV, JSON or whatever format you want.
Without Pandas, you could likely get away with using string methods like MyRow.replace(";") or something to replace all the semicolons from the line as you read the file line by line, etc.
It's not quite that simple, I think. The requirement would be to change something like this:
"text1";"some data "quoted stuff" more data";"text2"
Into this
|text1|;|some data "quoted stuff" more data|;|text2|
That is, a selective replace. The algorithm would be something like this:
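Roughly, in Python (just a sketch, assuming every field is quoted and the only ";" delimiters sit between quoted fields):
def requalify(line):
    # Swap only the delimiting double quotes for pipes; embedded quotes are left alone
    line = line.rstrip('\r\n')
    if line.startswith('"'):
        line = '|' + line[1:]
    if line.endswith('"'):
        line = line[:-1] + '|'
    return line.replace('";"', '|;|')

print(requalify('"text1";"some data "quoted stuff" more data";"text2"'))
# |text1|;|some data "quoted stuff" more data|;|text2|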
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 19, 2017 at 7:01 am
Example of how easy it is to load data into Pandas with Python to analyze a single file.
import csv
import pandas as pd

header = ['Date', 'User', 'Type', 'Revenue']
# Read the raw file into a dataframe, skipping the first 100 rows
df = pd.read_csv('./Data/Sales2016.csv', names=header, skiprows=100, delimiter=',')
# Keep only the 'Conversion' rows, i.e. SELECT * FROM df WHERE Type = 'Conversion'
sales_df = df[df['Type'] == 'Conversion']
# Export with every field wrapped in double quotes
sales_df.to_csv('./Data/Clean/Sales2016.csv', header=header, index=False, quotechar='"', quoting=csv.QUOTE_ALL)
So, I am creating a standard list variable with my header info. Then I create a dataframe variable, df, using Pandas' read_csv() method to read a specific CSV file on my computer, with the header list I made and a comma as the delimiter. Then I create a new variable from df where I basically do something like an "INSERT INTO ... SELECT * FROM [df] WHERE Type = 'Conversion'" to create a new table. Lastly, I export the data to a new directory as a CSV file with one line of code, applying double quotes to every field.
For reading files, it's pretty easy.
import csv
import os
import pandas as pd

indir = './Data/'
header = ['Date', 'User', 'Type', 'Revenue']

# Walk the directory tree and clean every file found
for root, dirs, filenames in os.walk(indir):
    for f in filenames:
        df = pd.read_csv(os.path.join(root, f), names=header, skiprows=100, delimiter=',')
        # ...do stuff to your loaded file here...
        df.to_csv('./Data/Clean/' + f, header=header, index=False, quotechar='"', quoting=csv.QUOTE_ALL)
        print(f + " cleaned!")

print("All files cleaned and exported!")
Pretty easy. You can even use the chunksize option in read_csv to read the data in chunks and then iterate over them with a for loop, like for chunk in pd.read_csv('./Data/' + f, chunksize=100000), then do some stuff in 100,000-row chunks before you save the data as CSV.
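For example (just a sketch, with placeholder file and column names, and a simple filter standing in for the cleaning step):
import pandas as pd

header = ['Date', 'User', 'Type', 'Revenue']   # placeholder column names
first_chunk = True

# Read and clean the file 100,000 rows at a time instead of loading it all at once
for chunk in pd.read_csv('./Data/Sales2016.csv', names=header, chunksize=100000):
    clean = chunk[chunk['Type'] == 'Conversion']   # whatever cleaning you need, per chunk
    clean.to_csv('./Data/Clean/Sales2016.csv', mode='w' if first_chunk else 'a', header=first_chunk, index=False)
    first_chunk = False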
I <3 Python
April 19, 2017 at 7:05 am
xsevensinzx - Wednesday, April 19, 2017 6:28 AM
Just to follow an alternative method here while you try to dabble with the GUI to solve this problem. How many records/files are you talking about here that have similar issues? Are you talking about billions of records, a couple of million, or a couple of thousand? I mean, not to be that guy, but if you're not really talking about billions of records, why not look into leveraging another piece of technology to clean the data on disk before you try to import it? For example, you could easily write (or I could provide) a simple Python script that would go through every file and clean these records for you. Then you can start ingesting them in SSIS or whatever, with whatever format you want. But if you're talking about a lot of data, maybe it wouldn't be the easiest option.
Phil Parkin - Wednesday, April 19, 2017 6:37 AM
I'd be interested in seeing that script ... is it really that simple?
xsevensinzx - Wednesday, April 19, 2017 6:41 AM
Yeah, it's pretty easy to create a Python script that will traverse the directories on Windows and then read each file within a for loop, where you basically clean each file one by one. In this case, you could likely load each file into Pandas, which will let you load the data into a dataframe (like a virtual table) where you can then apply SQL-like methods to clean the data. It's not super fast on larger datasets, but you can easily rearrange or clean the data. Pandas then lets you easily export the dataframe to CSV, JSON or whatever format you want.
Without Pandas, you could likely get away with using string methods like MyRow.replace(";") or something to replace all the semicolons from the line as you read the file line by line, etc.
Yep, the truncation is not an issue, Phil. The column is already set to 255, so we've got that covered.
It basically boils down to what we feared we would have to do, as a last resort: change the file.
As the file passes through two delivery systems before it reaches us (first from the origin's file delivery, and then through a so-called integration platform in-house), we've reached out to the people responsible for the integration platform and asked whether they are able to filter out the column and remove the double quotes from this specific file. As the package that runs this file also runs numerous other "agreement" files (from other companies), which we do not want to alter, we'd rather the fix happen before our load.
If they cannot do this, we might have to take you up on your offer, xsevensinzx, and find out whether this is doable without altering the other double quotes within the file, i.e. the other text qualifiers.
April 19, 2017 at 7:16 am
Yeah, doesn't hurt to try. Python and the packages above are all free and open source. I work with SQL Server as well and run Python on Windows. So, it's all very compatible. Swing over to:
https://www.continuum.io/downloads
Snag a copy of Python Anaconda for Windows. It's basically a prebuilt package of Python with all the popular modules for data and statistics, like Pandas, Numpy, Matplotlib, Seaborn and so forth, contained within its own little Python shell. It will install Python 2.7 for you and add Python to your Windows environment, where you can start creating a simple Python script to run. It's all very safe, and I have used it for major ETL operations in production.
April 19, 2017 at 7:28 am
Shatter - Wednesday, April 19, 2017 7:05 AM
Yep, the truncation is not an issue, Phil. The column is already set to 255, so we've got that covered. It basically boils down to what we feared we would have to do, as a last resort: change the file.
As the file passes through two delivery systems before it reaches us (first from the origin's file delivery, and then through a so-called integration platform in-house), we've reached out to the people responsible for the integration platform and asked whether they are able to filter out the column and remove the double quotes from this specific file. As the package that runs this file also runs numerous other "agreement" files (from other companies), which we do not want to alter, we'd rather the fix happen before our load.
If they cannot do this, we might have to take you up on your offer, xsevensinzx, and find out whether this is doable without altering the other double quotes within the file, i.e. the other text qualifiers.
And if you decide that you'd rather not install extra software but stick to stock SSIS/C#, I am happy to help with the basic design of a script component to help you do the parsing.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 19, 2017 at 10:34 am
On the particular sample of data you supplied, it seems that every field is within double quotes, without exception.
That being the case, would it not be an easy fix to load all records into a generic table where all fields are defined as varchar(xxx), and then retrieve/process them so that the first and last character of each column is removed?
This can also be done easily in a C# script transform, avoiding an intermediary staging table and the extra data flow.
In the above you would not set the file as having a text qualifier.
April 19, 2017 at 10:42 am
frederico_fonseca - Wednesday, April 19, 2017 10:34 AM
On the particular sample of data you supplied, it seems that every field is within double quotes, without exception. That being the case, would it not be an easy fix to load all records into a generic table where all fields are defined as varchar(xxx), and then retrieve/process them so that the first and last character of each column is removed?
This can also be done easily in a C# script transform, avoiding an intermediary staging table and the extra data flow.
In the above you would not set the file as having a text qualifier.
Wouldn't the embedded semicolon throw this out?
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
April 19, 2017 at 10:57 am
frederico_fonseca - Wednesday, April 19, 2017 10:34 AM
On the particular sample of data you supplied, it seems that every field is within double quotes, without exception. That being the case, would it not be an easy fix to load all records into a generic table where all fields are defined as varchar(xxx), and then retrieve/process them so that the first and last character of each column is removed? This can also be done easily in a C# script transform, avoiding an intermediary staging table and the extra data flow. In the above you would not set the file as having a text qualifier.
Phil Parkin - Wednesday, April 19, 2017 10:42 AM
Wouldn't the embedded semicolon throw this out?
ahh... I knew I was missing something. Yes it would throw it out.
hum..
Read each row as a single column, with no delimiters.
Strip the first and last characters.
Replace ";" with |.
Split on the | delimiter.
That should work (roughly as in the sketch below)... until one of the files contains an extra LF within the data.
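A rough Python sketch of those steps (my illustration only, assuming every field really is quoted and there are no embedded line feeds; the same logic would translate to a C# script component):
def split_row(raw_line):
    line = raw_line.rstrip('\r\n')
    line = line[1:-1]                # strip the first and last character (the outer quotes)
    line = line.replace('";"', '|')  # turn the real ";" delimiters into pipes
    return line.split('|')           # embedded quotes and semicolons survive intact

print(split_row('"text1";"some data "quoted stuff" more; data";"text2"'))
# ['text1', 'some data "quoted stuff" more; data', 'text2']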