November 8, 2016 at 11:50 am
This has been discussed before but I couldn't find a case/answer that might apply to my situation:
From time to time I have raw text data to import into SQL Server, and in almost every case I have to make several attempts because the SSIS wizard doesn't know the max size of each field and defaults to 50 characters. Only after it fails can I tell from the error message which field (the first one) was truncated, and then I increase that field's size.
More than one field may need its size increased, but the SSIS wizard reports only one error each time it hits a truncation, so as you can see this is very tedious. I want a way to quickly inspect the data first to determine the max size of each field.
I came across an old post on http://stackoverflow.com/questions/30696776/calculate-max-length-of-each-field-in-csv-file
Unfortunately it might not work in my case: my raw data can have as many as 10 million rows (yes, in one single text file that is over a GB).
I don't really expect there to be a way to do this, but I still want to post my question here hoping to get some clues.
Thank you very much.
November 8, 2016 at 12:18 pm
Well, if you really don't know the lengths, just import everything as varchar(max) and then do some analysis from there.
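A minimal sketch of that approach (the staging table name, file path, and three-column layout are assumptions for illustration, not the OP's actual data):
-- Hypothetical staging table: every column varchar(max) so nothing truncates
CREATE TABLE dbo.RawStaging
(
 Col1 varchar(max) NULL
,Col2 varchar(max) NULL
,Col3 varchar(max) NULL
);
-- One way to load it outside of SSIS (path and delimiters are assumptions)
BULK INSERT dbo.RawStaging
FROM 'C:\Import\rawdata.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
-- Then measure the widest value in each column to size the real table
SELECT MaxCol1 = MAX(LEN(Col1))
      ,MaxCol2 = MAX(LEN(Col2))
      ,MaxCol3 = MAX(LEN(Col3))
FROM dbo.RawStaging;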
November 8, 2016 at 12:24 pm
this is the kind of annoying project that makes you reach for your nearest programming language;
SSIS is handicapped when it comes to dynamic data lengths.
I've got a generic CSV importer SSIS package featuring a fancy script task that does this for well-formed data;
in my case, I don't even try to find the shape of the CSV data up front.
I have a script task that optionally assumes column names are in the first row (or not).
It reads each line of the document as a string array, split on the delimiters and quoted identifiers I select, and sticks the data, row by row (RBAR style), into a DataTable.
After the DataTable is filled, I can find the max length of each column.
With that info, I build the destination table and bulk insert into it.
Since the DataTable has lengths available, I've used this to create a table, or even to issue ALTER TABLE ALTER COLUMN commands to make my varchars bigger prior to a bulk insert.
I've used this on the NPPES data file, which is at least a gig or two in size (all physicians/providers in the US, in a wide, ugly table with 100 columns or so).
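For example, the widening step looks something like this (the table, column, and new size here are hypothetical):
-- widen a column found to be too narrow before the bulk insert
ALTER TABLE dbo.ImportTarget ALTER COLUMN SomeColumn varchar(250) NULL;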
Lowell
November 8, 2016 at 12:25 pm
A way to get more accurate results is to use the Advanced tab of the flat file connection.
There's a button labeled "Suggest Types..." where you can specify the number of rows to analyze for an estimate of what your data might look like. It also includes an option to pad string columns, which helps prevent truncation in the rows that weren't analyzed.
November 8, 2016 at 1:22 pm
Thanks for all the input. Can anyone recommend a good ETL tool for tasks like this?
November 8, 2016 at 1:25 pm
You can accomplish this using NGrams2B, which can be downloaded from the link at the end of the article. I don't know how you are bringing in your CSV file data, so I'm using a variable for the examples below. Here's how you'd do this using NGrams2B:
DECLARE @sampleCSV varchar(max) = 'xx,yyyy,zzzzzzz,abc';
WITH
points AS
( -- the position of every delimiter, plus a sentinel at position 0
    SELECT position = 0, token = ',' UNION ALL
    SELECT position, token FROM dbo.NGrams2B(@sampleCSV, 1) WHERE token = ','
),
lengths(l) AS
( -- distance to the next delimiter (or end of string) = field length
    SELECT (ISNULL(LEAD(position, 1) OVER (ORDER BY position), LEN(@sampleCSV)+1) - position) - 1
    FROM points
)
SELECT MAX(l)
FROM lengths;
I don't know what your row and column delimiters are but, let's pretend your row delimiter is char(13)+char(10); you would replace those with your column delimiter (a comma in this example).
DECLARE @sampleCSV varchar(max) = 'xx,yyyy,zzzzzzz,abc'+char(13)+char(10)+'x,x,x,zzz';
WITH
points AS
( -- the position of every delimiter, plus a sentinel at position 0
    SELECT position = 0, token = ','  UNION ALL
    SELECT position, token
    FROM dbo.NGrams2B(REPLACE(@sampleCSV,char(13)+char(10),','), 1)
    WHERE token = ','
),
lengths(l) AS
( -- distance to the next delimiter (or end of string) = field length
    SELECT (ISNULL(LEAD(position, 1) OVER (ORDER BY position), LEN(REPLACE(@sampleCSV,char(13)+char(10),','))+1) - position) - 1
    FROM points
)
SELECT MAX(l)
FROM lengths;
It's worth noting that this is a purely set-based solution and I'll put its performance up against a CLR or external script any day. :-P
edit: got row and column mixed up. :hehe:
-- Itzik Ben-Gan 2001
November 8, 2016 at 2:41 pm
Alan.B (11/8/2016)
You can accomplish this using NGrams2B, which can be downloaded from the link at the end of the article. I don't know how you are bringing in your CSV file data, so I'm using a variable for the examples below.
edit: got row and column mixed up. :hehe:
Thank you very much, Alan. I will definitely look into it later.
November 8, 2016 at 11:10 pm
Alan.B (11/8/2016)
You can accomplish this using NGrams2B, which can be downloaded from the link at the end of the article.
I must be tired and I'm not seeing what you're talking about. Can you be more specific?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 8, 2016 at 11:21 pm
Sometimes I use this ad hoc method, which is very simple but works like a charm.
😎
First, a staging table:
IF OBJECT_ID(N'dbo.TBL_FLAT_FILE_IMPORT_STAGING') IS NULL
BEGIN
CREATE TABLE dbo.TBL_FLAT_FILE_IMPORT_STAGING
(
FFIS_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_FLAT_FILE_IMPORT_STAGING_FFIS_ID PRIMARY KEY CLUSTERED
,FFIS_DATA NVARCHAR(MAX) NULL
)
;
END
Next, an import of the file, with the lines left unsplit:
TRUNCATE TABLE dbo.TBL_FLAT_FILE_IMPORT_STAGING;
INSERT INTO dbo.TBL_FLAT_FILE_IMPORT_STAGING WITH (TABLOCK) (FFIS_DATA)
EXEC XP_CMDSHELL 'TYPE C:\EE_DATA\Incoming\IMPORT_TEST_FILE.csv';
Then a function to locate the delimiters:
CREATE FUNCTION [dbo].[ITVFN_DELIMITER_N_POS_TO_10_FULL_BIN_NULL]
(
@INPUT_STRING NVARCHAR(4000)
,@DELIM NCHAR(1)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH DELIMITER_COUNT(INP_LEN,DLMCNT,INP_STR) AS
(
SELECT
(LEN(NCHAR(2) + @INPUT_STRING COLLATE LATIN1_GENERAL_100_BIN2 + NCHAR(3)) - 2) AS INP_LEN
,LEN(NCHAR(2) + @INPUT_STRING COLLATE LATIN1_GENERAL_100_BIN2 + NCHAR(3)) - LEN(REPLACE(NCHAR(2) + @INPUT_STRING COLLATE LATIN1_GENERAL_100_BIN2 + NCHAR(3),@DELIM,N'')) AS DLMCNT
,@INPUT_STRING AS INP_STR
)
SELECT
DC.DLMCNT
,DC.INP_LEN
,DC.INP_STR
,CASE WHEN DC.DLMCNT > 0 THEN CI1.CI ELSE NULL END AS I1
,CASE WHEN DC.DLMCNT > 1 THEN CI2.CI ELSE NULL END AS I2
,CASE WHEN DC.DLMCNT > 2 THEN CI3.CI ELSE NULL END AS I3
,CASE WHEN DC.DLMCNT > 3 THEN CI4.CI ELSE NULL END AS I4
,CASE WHEN DC.DLMCNT > 4 THEN CI5.CI ELSE NULL END AS I5
,CASE WHEN DC.DLMCNT > 5 THEN CI6.CI ELSE NULL END AS I6
,CASE WHEN DC.DLMCNT > 6 THEN CI7.CI ELSE NULL END AS I7
,CASE WHEN DC.DLMCNT > 7 THEN CI8.CI ELSE NULL END AS I8
,CASE WHEN DC.DLMCNT > 8 THEN CI9.CI ELSE NULL END AS I9
,CASE WHEN DC.DLMCNT > 9 THEN CI10.CI ELSE NULL END AS I10
FROM DELIMITER_COUNT DC
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, 1) ) CI1 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI1.CI + 1) ) CI2 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI2.CI + 1) ) CI3 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI3.CI + 1) ) CI4 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI4.CI + 1) ) CI5 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI5.CI + 1) ) CI6 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI6.CI + 1) ) CI7 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI7.CI + 1) ) CI8 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI8.CI + 1) ) CI9 (CI)
CROSS APPLY (SELECT CHARINDEX(@DELIM COLLATE LATIN1_GENERAL_100_BIN2,DC.INP_STR COLLATE LATIN1_GENERAL_100_BIN2, CI9.CI + 1) ) CI10 (CI)
;
Finally, find the widest entries in each column:
SELECT
 MAX(CP.I1 - 1) MXL_I1 -- the span between delimiters includes the delimiter itself, hence the extra -1
,MAX(CP.I2 - CP.I1 - 1) MXL_I2
,MAX(CP.I3 - CP.I2 - 1) MXL_I3
,MAX(CP.I4 - CP.I3 - 1) MXL_I4
,MAX(CP.I5 - CP.I4 - 1) MXL_I5
,MAX(CP.I6 - CP.I5 - 1) MXL_I6
,MAX(CP.I7 - CP.I6 - 1) MXL_I7
,MAX(CP.I8 - CP.I7 - 1) MXL_I8
,MAX(CP.I9 - CP.I8 - 1) MXL_I9
,MAX(CP.I10 - CP.I9 - 1) MXL_I10
FROM dbo.TBL_FLAT_FILE_IMPORT_STAGING FFIS
CROSS APPLY dbo.ITVFN_DELIMITER_N_POS_TO_10_FULL_BIN_NULL(FFIS.FFIS_DATA,NCHAR(44)) CP;
November 8, 2016 at 11:22 pm
halifaxdal (11/8/2016)
This has been discussed before but I couldn't find a case/answer that might apply to my situation: From time to time I have raw text data to import into SQL Server, and in almost every case I have to make several attempts because the SSIS wizard doesn't know the max size of each field and defaults to 50 characters. Only after it fails can I tell from the error message which field (the first one) was truncated, and then I increase that field's size.
More than one field may need its size increased, but the SSIS wizard reports only one error each time it hits a truncation, so as you can see this is very tedious. I want a way to quickly inspect the data first to determine the max size of each field.
I came across an old post on http://stackoverflow.com/questions/30696776/calculate-max-length-of-each-field-in-csv-file
Unfortunately it might not work in my case: my raw data can have as many as 10 million rows (yes, in one single text file that is over a GB).
I don't really expect there to be a way to do this, but I still want to post my question here hoping to get some clues.
Thank you very much.
Personally, I wouldn't do this with SSIS.
I'd use BULK INSERT to read the first line of the file to get a count of the delimiters (number of delimiters + 1 is the number of columns). If it's a header row, you can also derive the column names from that same row by using something like the DelimitedSplit8K function, which would also give you a column count.
Use that information to dynamically create a table (with all columns VARCHAR(8000) or VARCHAR(MAX) {not my first choice, btw}) and import the data. Again, using the column names previously captured, write some dynamic SQL to calculate the MAX(LEN()) for each column, as sketched below. Then use THAT data along with the column names to create a final staging table for future use, and tweak the datatypes manually as a first form of data validation for future imports.
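As a rough sketch of the dynamic MAX(LEN()) step (dbo.RawStaging is a hypothetical staging table name, assumed to have been created and loaded as described above):
-- build one MAX(LEN(...)) expression per column from the catalog views
DECLARE @sql nvarchar(max);
SELECT @sql = STUFF((
    SELECT ', ' + QUOTENAME(name) + ' = MAX(LEN(' + QUOTENAME(name) + '))'
    FROM sys.columns
    WHERE object_id = OBJECT_ID('dbo.RawStaging') -- hypothetical staging table
    ORDER BY column_id
    FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, '');
SET @sql = N'SELECT ' + @sql + N' FROM dbo.RawStaging;';
EXEC sys.sp_executesql @sql; -- one row: the widest value in every column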
--Jeff Moden
Change is inevitable... Change for the better is not.
November 9, 2016 at 8:16 pm
Jeff Moden (11/8/2016)
Alan.B (11/8/2016)
You can accomplish this using NGrams2B, which can be downloaded from the link at the end of the article.
I must be tired and I'm not seeing what you're talking about. Can you be more specific?
I'm not ignoring your question - I'm sick and dealing with a sick kid :sick: :sick:. I'll have a more specific explanation of what I was talking about tomorrow.
-- Itzik Ben-Gan 2001
November 9, 2016 at 8:36 pm
Alan.B (11/9/2016)
Jeff Moden (11/8/2016)
Alan.B (11/8/2016)
You can accomplish this using NGrams2B, which can be downloaded from the link at the end of the article.
I must be tired and I'm not seeing what you're talking about. Can you be more specific?
I'm not ignoring your question - I'm sick and dealing with a sick kid :sick: :sick:. I'll have a more specific explanation of what I was talking about tomorrow.
Been there and done that. Glad it's not me. Hope you both feel better soon.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 16, 2016 at 9:15 am
Jeff Moden (11/9/2016)
Alan.B (11/9/2016)
Jeff Moden (11/8/2016)
Alan.B (11/8/2016)
You can accomplish this using NGrams2B, which can be downloaded from the link at the end of the article.
I must be tired and I'm not seeing what you're talking about. Can you be more specific?
I'm not ignoring your question - I'm sick and dealing with a sick kid :sick: :sick:. I'll have a more specific explanation of what I was talking about tomorrow.
Been there and done that. Glad it's not me. Hope you both feel better soon.
Ok, tough week but I'm back.
For context, what I posted was intended to be a SQL Server only alternative to the link that the OP posted: http://stackoverflow.com/questions/30696776/calculate-max-length-of-each-field-in-csv-file. Looking back, what I posted was not complete and I don't know if it's a great idea - it was just a thought...
Taking a .csv file that looks like this:
1,alan burstein,abc,123123123
2,fred jones,xxx,123123123123
3,sue smith,sssssss,51224123
You could pull it into a variable like this:
CREATE TABLE #test (string varchar(max));
BULK INSERT #test
FROM 'C:\Users\aburstein\Desktop\test.csv'
WITH (FIELDTERMINATOR = '|||', ROWTERMINATOR = '|||'); -- terminators that don't appear in the file, so the text comes in "as is" as a single row
DECLARE @string varchar(max) = (SELECT string FROM #test);
Then you could determine the max length for the values in each column like this:
-- below is what your variable would look like
DECLARE @string varchar(max);
SET @string =
'1,alan burstein,abc,123123123
2,fred jones,xxx,123123123123
3,sue smith,sssssss,51224123';
SET @string = REPLACE(@string,char(13)+char(10),',');
-- Ngrams
SELECT colNbr, MaxLen = MAX(gap)
FROM
(
    -- ISNULL treats the end of the string as a final delimiter so the last field isn't lost
    SELECT ISNULL(LEAD(position,1) OVER (ORDER BY x.position), LEN(@string)+1)-x.position-1, x.position, (rn%4)+1 -- %4 because the sample has 4 columns
    FROM
    (
        SELECT 0, 0 UNION ALL
        SELECT ROW_NUMBER() OVER (ORDER BY position), position
        FROM dbo.NGrams2B(@string,1)
        WHERE token = ','
    ) x(rn, position)
) x (gap, position, colNbr)
GROUP BY colNbr;
Alternatively, you could do the same thing with DelimitedSplit8K like this:
-- below is what your variable would look like
DECLARE @string varchar(max);
SET @string =
'1,alan burstein,abc,123123123
2,fred jones,xxx,123123123123
3,sue smith,sssssss,51224123';
SET @string = REPLACE(@string,char(13)+char(10),',');
-- delimitedsplit8K solution
SELECT
 ColNbr = (ItemNumber-1)%4+1 -- %4 because the sample has 4 columns
,MaxLen = MAX(LEN(Item))
FROM dbo.DelimitedSplit8K(@string,',')
GROUP BY (ItemNumber-1)%4;
Hopefully that makes more sense.
-- Itzik Ben-Gan 2001