February 9, 2012 at 6:57 am
Hi,
I need help with data-level error handling.
I have a CSV file with data like the following.
The first row is the column header.
--------------- CSV file Start
Name,Number,Cratdate
Ram,100,02/09/2012
Smith,101,01/12/2012
Abdul,102,A
Adam,103,05/30/2011
--------------CSV file end
Here we can observe that there is a data-level issue in the row with the name Abdul, i.e. the Cratdate is not convertible to a date.
So, as per my requirement, I need to skip the row that has the issue and import the remaining data into the main table. The row with the issue should also be taken care of, i.e. it may be imported into another table or something similar.
I need help with this, whatever the approach, a query or SSIS.
The only thing I have in mind is using a loop and ignoring that row.
But that doesn't work if the data is huge.
Please help
Ram..
🙂
February 9, 2012 at 7:20 am
Personally I'd set up an SSIS package for this.
Import the CSV file into a staging table that contains the relevant datatypes and non-NULLable columns, do any transformations and data cleansing in this table, and use the destination's error output path to redirect any errored rows (i.e. a Cratdate that is not in a date format) to an error table that is structured to accept bad data (or you could use a flat file destination, which may be easier).
In order to achieve the above you will not be able to use the Fast Load (Bulk Insert) into the staging table; it will have to be done row by row.
Then copy the cleansed data into your main table.
I wouldn't choose to just ignore the rows that contain bad data; that seems a little risky to me.
==========================================================================================================================
A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe
February 9, 2012 at 7:29 am
A few more options:
BULK INSERT (Transact-SQL)
http://msdn.microsoft.com/en-us/library/ms188365.aspx
bcp Utility
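For example, BULK INSERT can redirect rows that fail conversion to an error file instead of failing the whole batch. A minimal sketch, assuming a typed destination table and hypothetical file paths:
-- Sketch only: the table name, file paths and delimiters are assumptions.
-- Rows that fail datatype conversion land in the error file instead of
-- aborting the whole load.
BULK INSERT dbo.MainTable
FROM 'C:\External\data.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,           -- skip the header row
    MAXERRORS       = 100,         -- tolerate up to 100 bad rows
    ERRORFILE       = 'C:\External\data.errors'
);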
February 9, 2012 at 7:30 am
I am trying to work it out using SSIS. Let me get back to you if it works for me.
🙂
February 9, 2012 at 7:33 am
Like this:
select * from OPENROWSET('MSDASQL',
    'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\External;',
    'select top 5 * from Mytext.txt')
Or try this:
select * from OPENROWSET('Microsoft.Jet.OLEDB.4.0',
    'Text;Database=C:\External',
    'select * from TRANS.csv')
select top 50 * from OPENROWSET('Microsoft.ACE.OLEDB.12.0',
    'Text;Database=c:\temp\captell;HDR=Yes;FORMAT=Delimited(;)',
    'SELECT * FROM [file.TXT]')
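Note that ad hoc OPENROWSET queries are disabled by default; if the server complains about 'Ad Hoc Distributed Queries', enable the option first (requires appropriate permissions):
-- Enable ad hoc OPENROWSET/OPENDATASOURCE queries (off by default)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;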
____________________________________________________________________________
Rafo*
February 9, 2012 at 7:44 am
That wouldn't handle the bad data tho :ermm:
==========================================================================================================================
A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe
February 9, 2012 at 7:47 am
But he can insert the CSV file into a table variable and then manipulate the data.
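A minimal sketch of that idea (the driver string, file path and column names are assumptions based on the sample CSV):
DECLARE @Staging TABLE (Name varchar(255), Number varchar(255), Cratdate varchar(255));

-- Load the raw CSV rows into the table variable (path and driver are assumptions)
INSERT @Staging (Name, Number, Cratdate)
SELECT Name, Number, Cratdate
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                'Text;Database=C:\External;HDR=Yes;FORMAT=Delimited(,)',
                'SELECT * FROM [data.csv]');

-- Keep only the rows whose Cratdate actually converts to a date
SELECT Name, Number, CONVERT(date, Cratdate) AS Cratdate
FROM @Staging
WHERE ISDATE(Cratdate) = 1;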
____________________________________________________________________________
Rafo*
February 9, 2012 at 7:48 am
If you want to do it in hand-crafted T-SQL, you can do the following:
Import all data into a staging table with a reasonably wide, unrestricted varchar datatype for all columns. Then run validation queries and import only the valid rows into the final destination table. A small example setup:
CREATE TABLE StagingData
( ID INT IDENTITY(1,1)
,Name VARCHAR(255)
,Number VARCHAR(255)
,Cratdate VARCHAR(255)
)
CREATE TABLE StagingDataError (ID INT, Error VARCHAR(100))
-- Import data in bulk (e.g. SSIS)
INSERT StagingData SELECT 'Ram','100','02/09/2012'
INSERT StagingData SELECT 'Smith','101','01/12/2012'
INSERT StagingData SELECT 'Abdul','102','A'
INSERT StagingData SELECT 'Adam','103','05/30/2011'
INSERT StagingData SELECT 'Too-Long-Name','103','05/30/2011'
INSERT StagingData SELECT 'Bob','A','05/30/2011'
-- Example of Validation:
-- validate name length
insert StagingDataError
select ID, 'Invalid Name Length'
from StagingData
where len(Name) > 10
-- validate number
insert StagingDataError
select ID, 'Invalid Number'
from StagingData
where ISNUMERIC(Number) = 0 -- please note: this is not a complete validation for a number! just an example
-- validate date
insert StagingDataError
select ID, 'Invalid Cratdate'
from StagingData
where ISDATE(Cratdate) = 0 -- please note: this is not a complete validation for a date! just an example
-- Load valid records into final destination
--insert into FINALDESTINATIONTABLE
select sd.* -- select only the staging columns, so the rows can be inserted into the destination
from StagingData AS sd
left join StagingDataError AS e
on e.ID = sd.ID
where e.ID is null
-- show errors
select * from StagingDataError
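To complete the picture, the valid rows would then be converted to the final datatypes on the way into the destination. The destination table FinalData and its column types here are hypothetical:
-- FinalData and its column types are hypothetical
INSERT INTO FinalData (Name, Number, Cratdate)
SELECT sd.Name,
       CONVERT(int,  sd.Number),
       CONVERT(date, sd.Cratdate)
FROM StagingData AS sd
LEFT JOIN StagingDataError AS e
       ON e.ID = sd.ID
WHERE e.ID IS NULL;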
February 9, 2012 at 7:52 am
Eugene Elutin,
Your solution seems to work for me.
I still need to test it on huge data and will get back with the status.
🙂
February 9, 2012 at 7:52 am
@rafo That's true, I hadn't thought of throwing it into a table variable. I suppose how efficient that solution would be depends on the volume of data, but it could work 🙂
==========================================================================================================================
A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe
February 9, 2012 at 8:01 am
You shouldn’t use table variables for huge data... temporary tables or actual tables are better.
February 9, 2012 at 8:06 am
Hmm, can you prove it? I mean,
do you have an example that proves what you say?
____________________________________________________________________________
Rafo*
February 9, 2012 at 8:13 am
You can optimise the solution even further. A few tips (a sketch of tips 1 and 2 follows the list):
1. Create the Error table on the fly using SELECT INTO and UNION ALL
2. You may find that, before loading data into the final table, you need to pre-load the valid data into a temp (#) table (again using SELECT INTO)
3. Use CLR functions for validation, if needed, instead of user-defined ones
4. Yes, use #-tables, not table variables! You may need to create additional indexes on them...
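A minimal sketch of tips 1 and 2, reusing the table names from the earlier example (the #-table names are illustrative):
-- Tip 1: build the error table on the fly with SELECT ... INTO + UNION ALL
SELECT ID, 'Invalid Name Length' AS Error
INTO   #StagingDataError
FROM   StagingData
WHERE  LEN(Name) > 10
UNION ALL
SELECT ID, 'Invalid Number'
FROM   StagingData
WHERE  ISNUMERIC(Number) = 0
UNION ALL
SELECT ID, 'Invalid Cratdate'
FROM   StagingData
WHERE  ISDATE(Cratdate) = 0;

-- Tip 2: pre-load the valid rows into a #-table before the final insert
SELECT sd.*
INTO   #ValidRows
FROM   StagingData AS sd
LEFT JOIN #StagingDataError AS e
       ON e.ID = sd.ID
WHERE  e.ID IS NULL;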
February 9, 2012 at 8:19 am
First example from Google Search...
Temporary Tables vs. Table Variables and Their Effect on SQL Server Performance
http://www.codeproject.com/Articles/18972/Temporary-Tables-vs-Table-Variables-and-Their-Effe
February 9, 2012 at 9:35 am
xRafo (2/9/2012)
Hmm, can you prove it? I mean, do you have an example that proves what you say?
Table variables have no statistics. That alone should be sufficient to make the point.
-- Gianluca Sartori