Basics of loading data

  • I'm trying to get better at loading data (flat files and Excel files from the Apknite company) into their SQL Server tables.

    Here is my question; I have no one in real life to bounce this idea off of, so I'd like to know whether this is an accepted method/practice:

    Import the TXT file into a 'staging table' with everything defined as varchar.

    INSERT INTO the table it ultimately needs to go into.

    SELECT the columns from the staging table, with the necessary TRIM, CONVERT, CAST, etc. around each column, to put the data into the final destination in the correct format. (There's a rough sketch of what I mean at the end of this post.)

    I'm finding that a few of my data sources (insurance companies) have excessive white space at the end of address lines, names, etc., and I want my tables to be clean. I will have to redo a lot of tables I've already built because I didn't realize how bad the extra blank spaces were in some of the data sets I got from them.

    I presume there may be a way to do all of that with an SSIS package, but I'm just not there yet.
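    Here is a minimal sketch of that staging pattern in T-SQL. The table and column names are invented for illustration, and TRY_CONVERT (SQL Server 2012 and later) is one option for the conversion step; it returns NULL instead of raising an error when a value won't convert.

    -- Staging table: everything as varchar, exactly as it arrives.
    CREATE TABLE dbo.Staging_Policy
    (
        PolicyNumber  varchar(500),
        InsuredName   varchar(500),
        EffectiveDate varchar(500),
        Premium       varchar(500)
    );

    -- After the raw file is loaded, trim and convert on the way
    -- into the final, correctly typed table.
    INSERT INTO dbo.Policy (PolicyNumber, InsuredName, EffectiveDate, Premium)
    SELECT  LTRIM(RTRIM(PolicyNumber)),
            LTRIM(RTRIM(InsuredName)),
            TRY_CONVERT(date,  LTRIM(RTRIM(EffectiveDate))),
            TRY_CONVERT(money, LTRIM(RTRIM(Premium)))
    FROM    dbo.Staging_Policy;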

  • You usually don't need the likes of SSIS. Many will disagree with me on this, but I consider SSIS to be an unnecessary added layer (less politically correct: an unnecessary added complication) for most imports. If you're just using text files with delimiters, then look into BULK INSERT (a part of T-SQL). Yes, there are occasional problems with poorly formed text files that SSIS can sometimes solve, but BULK INSERT is normally faster, easier, and a whole lot less complicated (especially during upgrades and migrations) than SSIS packages, particularly with the new CSV format for BULK INSERT in SQL Server 2017 ('bout damned time they provided that after more than two decades!).
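    For example, something like this (the file path, table name, and delimiters are placeholders; FORMAT = 'CSV' requires SQL Server 2017 or later):

    -- Load a delimited text file straight into the staging table.
    BULK INSERT dbo.Staging_Policy
    FROM 'C:\Imports\policies.csv'
    WITH (
        FORMAT          = 'CSV',  -- handles quoted fields properly
        FIRSTROW        = 2,      -- skip the header row
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n'
    );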

    If you learn about the "ACE" drivers, you can use OPENROWSET to read directly from spreadsheets and, with a little magic, auto-magically normalize the data as well, all in T-SQL.
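    As a sketch (this assumes the ACE OLE DB provider is installed on the server and 'Ad Hoc Distributed Queries' is enabled; the file path and sheet name are placeholders):

    -- Read a worksheet directly from an Excel file via the ACE provider.
    SELECT *
    FROM OPENROWSET(
            'Microsoft.ACE.OLEDB.12.0',
            'Excel 12.0 Xml;HDR=YES;Database=C:\Imports\policies.xlsx',
            'SELECT * FROM [Sheet1$]'
         );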

    Since the files you're dealing with contain a lot of PII, you don't want to post actual lines from a file. But if you could produce some obfuscated data in a file that's representative of the issues you're having and attach the file to a post, I'm pretty sure that someone on this forum can help you solve your problems.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • p.s.  And welcome to SQLServerCentral.com!

    --Jeff Moden



  • IMHO, you are on the right path. I'm currently doing exactly what you described: importing flat, delimited files to a 'staging' table where all fields are defined as VARCHAR(500) (for some I had to use 1000). Then I use SQL to convert the data before loading it into my real table(s).

    I do use SSIS to pick up the file from the FTP site and move it into the 'staging' table, but I use SQL for the rest (a SQL task in SSIS).

    If I had the knowledge (and could get the buy-in from my superiors), I would like to move to more of what Jeff describes and do everything with SQL.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • Thanks so much, guys!

  • below86 wrote:

    IMHO, you are on the right path. I'm currently doing exactly what you described: importing flat, delimited files to a 'staging' table where all fields are defined as VARCHAR(500) (for some I had to use 1000). Then I use SQL to convert the data before loading it into my real table(s).

    I do use SSIS to pick up the file from the FTP site and move it into the 'staging' table, but I use SQL for the rest (a SQL task in SSIS).

    If I had the knowledge (and could get the buy-in from my superiors), I would like to move to more of what Jeff describes and do everything with SQL.

    I typically do something a wee bit different... I "right-size" and correctly datatype the columns in the staging table, which will automagically cause errors to occur during the initial import if there are any discrepancies. I also set the maximum number of errors to a "binary 2 billion" (2,147,483,647) to ensure that errors won't cause the process to stop, and I have the "error file" feature turned on to sequester the errors so that I don't have to hunt for them after the run. Then I can either fix the sequestered rows of data and add them to the staging table, or I can provide them to the people that sourced the file and tell them to fix their junk. 😀
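    A rough sketch of that approach, assuming BULK INSERT is the loader (the table, path, and delimiter are invented for illustration):

    -- Staging table with real datatypes, so bad values fail at import time.
    CREATE TABLE dbo.Staging_Policy_Typed
    (
        PolicyNumber  varchar(20)  NOT NULL,
        InsuredName   varchar(100) NOT NULL,
        EffectiveDate date         NOT NULL,
        Premium       money        NOT NULL
    );

    BULK INSERT dbo.Staging_Policy_Typed
    FROM 'C:\Imports\policies.txt'
    WITH (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,
        MAXERRORS       = 2147483647,               -- the "binary 2 billion": don't stop on errors
        ERRORFILE       = 'C:\Imports\policies.err' -- bad rows are sequestered here
    );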

    --Jeff Moden



  • below86 wrote:

    IMHO, you are on the right path. I'm currently doing exactly what you described: importing flat, delimited files to a 'staging' table where all fields are defined as VARCHAR(500) (for some I had to use 1000). Then I use SQL to convert the data before loading it into my real table(s).

    I do use SSIS to pick up the file from the FTP site and move it into the 'staging' table, but I use SQL for the rest (a SQL task in SSIS).

    If I had the knowledge (and could get the buy-in from my superiors), I would like to move to more of what Jeff describes and do everything with SQL.

    Once you get that FTP part working in T-SQL, please post back about how you did it.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin wrote:


    Once you get that FTP part working in T-SQL, please post back about how you did it.

    Batch file generated by T-SQL and then a call to xp_CmdShell. I've not had to download from FTP sites for quite a while, but I used to do it a whole lot.
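    Something along these lines, heavily abbreviated (xp_cmdshell must be enabled, and the FTP host, credentials, file names, and paths are all placeholders):

    -- Build a Windows FTP script file, then run ftp.exe against it.
    EXEC master.dbo.xp_cmdshell 'echo open ftp.example.com> C:\Imports\getfile.ftp';
    EXEC master.dbo.xp_cmdshell 'echo someuser>> C:\Imports\getfile.ftp';
    EXEC master.dbo.xp_cmdshell 'echo somepassword>> C:\Imports\getfile.ftp';
    EXEC master.dbo.xp_cmdshell 'echo get policies.txt C:\Imports\policies.txt>> C:\Imports\getfile.ftp';
    EXEC master.dbo.xp_cmdshell 'echo bye>> C:\Imports\getfile.ftp';
    EXEC master.dbo.xp_cmdshell 'ftp -i -s:C:\Imports\getfile.ftp';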

    --Jeff Moden



  • Jeff Moden wrote:


    Batch file generated by T-SQL and then a call to xp_CmdShell. I've not had to download from FTP sites for quite a while, but I used to do it a whole lot.

    I have done that in the past; now I use CLR on the rare occasions I need to.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • David Burrows wrote:


    I have done that in the past; now I use CLR on the rare occasions I need to.

    Cool info, David.  One of these days, I'm going to have to learn how to fire up .NET.

    --Jeff Moden



  • Jeff Moden wrote:


    I typically do something a wee bit different... I "right-size" and correctly datatype the columns in the staging table, which will automagically cause errors to occur during the initial import if there are any discrepancies. I also set the maximum number of errors to a "binary 2 billion" (2,147,483,647) to ensure that errors won't cause the process to stop, and I have the "error file" feature turned on to sequester the errors so that I don't have to hunt for them after the run. Then I can either fix the sequestered rows of data and add them to the staging table, or I can provide them to the people that sourced the file and tell them to fix their junk. 😀

    Since I'm using SSIS to map the data from the flat file to my staging table, I prefer to get all the data into the staging table and then try to push it to the correct data types. With thousands of rows, it is easier to see/find the bad data in the staging table than trying to find it in the text file. When you map it straight in using SSIS, it tends to blow up when you hit the bad data.
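    For instance, finding the rows that won't convert becomes a simple query against the staging table (the column names are invented; TRY_CONVERT needs SQL Server 2012 or later):

    -- Rows whose EffectiveDate won't convert to a date are the bad data.
    SELECT *
    FROM   dbo.Staging_Policy
    WHERE  TRY_CONVERT(date, LTRIM(RTRIM(EffectiveDate))) IS NULL
      AND  NULLIF(LTRIM(RTRIM(EffectiveDate)), '') IS NOT NULL; -- ignore genuinely empty values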


  • below86 wrote:

    Since I'm using SSIS to map the data from the flat file to my staging table, I prefer to get all the data into the staging table and then try to push it to the correct data types. With thousands of rows, it is easier to see/find the bad data in the staging table than trying to find it in the text file. When you map it straight in using SSIS, it tends to blow up when you hit the bad data.

    I'm with Jeff, at least in principle, on this one. I want to catch bad data as early as possible in the process.

    While the default behaviour of SSIS is as you suggest, it's not difficult to change that behaviour such that:

    a) Bad data gets redirected to an error table,

    b) A description of the error which caused the redirection is also passed to the error table, and

    c) The errors do not cause the job to fail.

    At the end of the import, producing a report containing a list of all the errors then becomes trivial.
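    A sketch of the kind of error table and end-of-import report being described here (the structure is invented for illustration; an SSIS error output would populate the ErrorCode/ErrorDescription columns):

    -- Landing table for rows redirected by the error path.
    CREATE TABLE dbo.ImportErrors
    (
        ImportErrorID    int IDENTITY(1,1) PRIMARY KEY,
        SourceFile       varchar(260),
        RawRow           varchar(4000),   -- the offending row, as text
        ErrorCode        int,
        ErrorDescription varchar(500),
        LoadedAtUtc      datetime2 DEFAULT SYSUTCDATETIME()
    );

    -- The end-of-import error report is then a trivial aggregate.
    SELECT   SourceFile, ErrorDescription, COUNT(*) AS ErrorCount
    FROM     dbo.ImportErrors
    GROUP BY SourceFile, ErrorDescription
    ORDER BY ErrorCount DESC;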



  • Phil Parkin wrote:


    I'm with Jeff, at least in principle, on this one. I want to catch bad data as early as possible in the process.

    While the default behaviour of SSIS is as you suggest, it's not difficult to change that behaviour such that:

    a) Bad data gets redirected to an error table,

    b) A description of the error which caused the redirection is also passed to the error table, and

    c) The errors do not cause the job to fail.

    At the end of the import, producing a report containing a list of all the errors then becomes trivial.

    I never said it couldn't be done within the SSIS data flow task. I've seen it done, and some of those packages look pretty nasty with all the lookups, splits, merges, and redirects. I was just saying that Jeff has mentioned before that he doesn't use SSIS, and that I would like to have the knowledge, and the support from the people at my work, to do more of it with just SQL. IMHO, the SQL is a lot easier to read and maintain. I've seen it become a huge pain when a new column was added to the input file that needed to be carried through; we even had to add new splits (and merges) because the new column meant that one of the existing lookups would fail, since that column would now be blank or NULL. Again, IMHO, this could have been done a lot easier and quicker in SQL.

    Since a lot of what I do is take in data from outside companies, I like to get the data stored in a SQL table exactly the way they sent it. This usually means dropping the data into VARCHAR(500) fields. That way no conversions are done, and when I do find an issue with the data, I can tell by looking at the staging table whether it's an issue with the data they sent or something on my end (a conversion gone bad). I'm sure you can accomplish the same thing your way by redirecting the bad records to an error table; I just prefer my method.

