loading and reformatting text file data into into SQL table

  • Hi and many thanks for your help in advance. I am looking for a way to convert the following format into a sql table. For those of you familar with the format it is Bib Tex.

    Essentially a new row in the table would be for each entry, denoted by an @ logo and each column is denoted by an =, as you can see from the example data no one contains all the possible columns and some fields can be over two lines long.

    To load this I was considering loading it into a table as each line being a row. Adding a row number, then a column counting the @ signs in order and essentially grouping each record, then for each group running through and looking for the column keywords 'author' , 'title' etc then splitting the data out into those constituent parts using substring and charindex.

    Does this sound like the right approach.

    Many Thanks for your help and suggestions,

    Oli

    @Book{hicks2001,

    author = "von Hicks, III, Michael",

    title = "Design of a Carbon Fiber Composite Grid Structure for the GLAST

    Spacecraft Using a Novel Manufacturing Technique",

    publisher = "Stanford Press",

    year = 2001,

    address = "Palo Alto",

    edition = "1st",

    }

    @Book{Torre2008,

    author = "Joe Torre and Tom Verducci",

    publisher = "Doubleday",

    title = "The Yankee Years",

    year = 2008,

    }

    @INPROCEEDINGS {author:06,

    title = {Some publication title},

    author = {First Author and Second Author},

    crossref = {conference:06},

    pages = {330--331},

    }

    @PROCEEDINGS {conference:06,

    editor = {First Editor and Second Editor},

    title = {Proceedings of the Xth Conference on XYZ},

    booktitle = {Proceedings of the Xth Conference on XYZ},

    year = 2006,

    month = oct,

    }

  • You could do the whole thing (parsing, splitting, outputting one row with multiple columns per @ group) using an asynchronous Script Component, but it would take a fair chunk of C# to implement.

    If that doesn't scare you, do some research to get yourself up to speed with what this might entail & then come back here with any follow-up questions.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Hi, thanks for the advice, I will try with Python as I am more familiar with python than c# and see how I get on.

    Many Thanks,

    Oliver

  • As far as I know, you cannot use Python in a Script Component, so I think you are limited by that.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • oliver.morris (5/29/2015)


    Hi and many thanks for your help in advance. I am looking for a way to convert the following format into a sql table. For those of you familar with the format it is Bib Tex.

    Essentially a new row in the table would be for each entry, denoted by an @ logo and each column is denoted by an =, as you can see from the example data no one contains all the possible columns and some fields can be over two lines long.

    To load this I was considering loading it into a table as each line being a row. Adding a row number, then a column counting the @ signs in order and essentially grouping each record, then for each group running through and looking for the column keywords 'author' , 'title' etc then splitting the data out into those constituent parts using substring and charindex.

    Does this sound like the right approach.

    Many Thanks for your help and suggestions,

    Oli

    @Book{hicks2001,

    author = "von Hicks, III, Michael",

    title = "Design of a Carbon Fiber Composite Grid Structure for the GLAST

    Spacecraft Using a Novel Manufacturing Technique",

    publisher = "Stanford Press",

    year = 2001,

    address = "Palo Alto",

    edition = "1st",

    }

    @Book{Torre2008,

    author = "Joe Torre and Tom Verducci",

    publisher = "Doubleday",

    title = "The Yankee Years",

    year = 2008,

    }

    @INPROCEEDINGS {author:06,

    title = {Some publication title},

    author = {First Author and Second Author},

    crossref = {conference:06},

    pages = {330--331},

    }

    @PROCEEDINGS {conference:06,

    editor = {First Editor and Second Editor},

    title = {Proceedings of the Xth Conference on XYZ},

    booktitle = {Proceedings of the Xth Conference on XYZ},

    year = 2006,

    month = oct,

    }

    It might seem obvious but I want to make sure...

    1. Do you want to capture the fact that the items after the @ symbol change?

    2. In the case of things like "hicks2001" "conference:06", what do you want done with THAT data?

    3. What do you want done with the braces and double quotes around the items?

    Maybe it would be easier to just ask you what the data from above should look like in a table?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • oliver.morris (5/29/2015)


    Hi and many thanks for your help in advance. I am looking for a way to convert the following format into a sql table. For those of you familar with the format it is Bib Tex.

    Essentially a new row in the table would be for each entry, denoted by an @ logo and each column is denoted by an =, as you can see from the example data no one contains all the possible columns and some fields can be over two lines long.

    To load this I was considering loading it into a table as each line being a row. Adding a row number, then a column counting the @ signs in order and essentially grouping each record, then for each group running through and looking for the column keywords 'author' , 'title' etc then splitting the data out into those constituent parts using substring and charindex.

    Does this sound like the right approach.

    Many Thanks for your help and suggestions,

    Oli

    I'm curious about the Data format.

    Who are you getting the data from?

    Can't you ask they to provided you a different format?

    Is this a homework assignment?

    @Book{hicks2001,

    author = "von Hicks, III, Michael",

    title = "Design of a Carbon Fiber Composite Grid Structure for the GLAST

    Spacecraft Using a Novel Manufacturing Technique",

    publisher = "Stanford Press",

    year = 2001,

    address = "Palo Alto",

    edition = "1st",

    }

    @Book{Torre2008,

    author = "Joe Torre and Tom Verducci",

    publisher = "Doubleday",

    title = "The Yankee Years",

    year = 2008,

    }

    @INPROCEEDINGS {author:06,

    title = {Some publication title},

    author = {First Author and Second Author},

    crossref = {conference:06},

    pages = {330--331},

    }

    @PROCEEDINGS {conference:06,

    editor = {First Editor and Second Editor},

    title = {Proceedings of the Xth Conference on XYZ},

    booktitle = {Proceedings of the Xth Conference on XYZ},

    year = 2006,

    month = oct,

    }

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • oliver.morris (5/29/2015)


    Hi and many thanks for your help in advance. I am looking for a way to convert the following format into a sql table. For those of you familar with the format it is Bib Tex.

    Essentially a new row in the table would be for each entry, denoted by an @ logo and each column is denoted by an =, as you can see from the example data no one contains all the possible columns and some fields can be over two lines long.

    To load this I was considering loading it into a table as each line being a row. Adding a row number, then a column counting the @ signs in order and essentially grouping each record, then for each group running through and looking for the column keywords 'author' , 'title' etc then splitting the data out into those constituent parts using substring and charindex.

    Does this sound like the right approach.

    Many Thanks for your help and suggestions,

    Oli

    This article bears strong resemblance to your issue:;-)

    http://en.wikipedia.org/wiki/BibTeX">

    http://en.wikipedia.org/wiki/BibTeX

    @Book{hicks2001,

    author = "von Hicks, III, Michael",

    title = "Design of a Carbon Fiber Composite Grid Structure for the GLAST

    Spacecraft Using a Novel Manufacturing Technique",

    publisher = "Stanford Press",

    year = 2001,

    address = "Palo Alto",

    edition = "1st",

    }

    @Book{Torre2008,

    author = "Joe Torre and Tom Verducci",

    publisher = "Doubleday",

    title = "The Yankee Years",

    year = 2008,

    }

    @INPROCEEDINGS {author:06,

    title = {Some publication title},

    author = {First Author and Second Author},

    crossref = {conference:06},

    pages = {330--331},

    }

    @PROCEEDINGS {conference:06,

    editor = {First Editor and Second Editor},

    title = {Proceedings of the Xth Conference on XYZ},

    booktitle = {Proceedings of the Xth Conference on XYZ},

    year = 2006,

    month = oct,

    }

    Perhaps I'm missing something?

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Well, at least it's not as noisy and bloated as XML. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • So, now that we've established what it is, my question remains... how do you want to see that in a table? In this case and based on the Wikipedia article, I'd recommend 1 table per unique @category.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi,

    Thanks for all your help, I took the example content from wikipedia but all the content I want to put into a table is in the format. Essentially I am trying to move the data into a single standardised table from this bib tex format.

    The way I have achieved this is to use python to change the text file so that each new line starts at an @ symbol. Then import row by row into an existing table using python pyodbc. Then using charindex and substring to pull the content from the large string into their relevant tables.

    It seems to work and at the moment I don't need to put it in a SSIS package.

    Thanks again for your help,

    Oliver

  • olibbhq (5/31/2015)


    Hi,

    Thanks for all your help, I took the example content from wikipedia but all the content I want to put into a table is in the format. Essentially I am trying to move the data into a single standardised table from this bib tex format.

    The way I have achieved this is to use python to change the text file so that each new line starts at an @ symbol. Then import row by row into an existing table using python pyodbc. Then using charindex and substring to pull the content from the large string into their relevant tables.

    It seems to work and at the moment I don't need to put it in a SSIS package.

    Thanks again for your help,

    Oliver

    My advice to you is to request the data feed in a specific format that is Conducive to an easy load.

    You need to tell them what you want.

    If this is a mandate by an instructor then unfortunately that is not going to work. Good luck. 🙂

    I would KISS.

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Why do you think you need that you do not need to use SSIS?

    If you get your date in a simplified format you can use the Import Wizard To load your data without any code

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • olibbhq (5/31/2015)


    Hi,

    Thanks for all your help, I took the example content from wikipedia but all the content I want to put into a table is in the format. Essentially I am trying to move the data into a single standardised table from this bib tex format.

    The way I have achieved this is to use python to change the text file so that each new line starts at an @ symbol. Then import row by row into an existing table using python pyodbc. Then using charindex and substring to pull the content from the large string into their relevant tables.

    It seems to work and at the moment I don't need to put it in a SSIS package.

    Thanks again for your help,

    Oliver

    I would not accept that file format.

    Specify the file format and make life easier on yourself.

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Thanks for the advice and your help have a good sunday afternoon!

  • olibbhq (5/31/2015)


    Hi,

    Thanks for all your help, I took the example content from wikipedia but all the content I want to put into a table is in the format. Essentially I am trying to move the data into a single standardised table from this bib tex format.

    The way I have achieved this is to use python to change the text file so that each new line starts at an @ symbol. Then import row by row into an existing table using python pyodbc. Then using charindex and substring to pull the content from the large string into their relevant tables.

    It seems to work and at the moment I don't need to put it in a SSIS package.

    Thanks again for your help,

    Oliver

    Congrats on that. I just hate to do such things because it lends to the proverbial "Tower of Babel".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply