Help with importing data

  • Hi I really appreciate your help, I am not being lazy and trying to get someone to sort it I really want to learn how to do this stuff.

    Basically I have data like so:

    @article{

    author={Carslaw,H. S. and Jaeger,J. C.},

    year={1959},

    journal={Conduction of Heat in Solids},

    note={Cited By (since 1996): 7498},

    url={www..com},

    }

    }

    @article{

    author={Kretz,R.},

    year={1983},

    title={Symbols for rock-forming minerals.},

    journal={American Mineralogist},

    volume={68},

    number={1-2},

    pages={277-279},

    note={Cited By (since 1996): 1967},

    url={www..com},

    }

    }

    etc etc

    So this data needs to be entered into SQL Server table so that each record is a row with the items in the sections as columns (i.e. columns would be author, year, title etc

    So the question is how do I translate from this format to a row format. Any pointers to other feeds, books etc would be invaluable

    Thanks for your help.

    Oliver

  • There are probably a lot more elegant (and faster) ways to do this but one simple and straight forward way would be to pull the whole thing into a temporary SQL table (each line is a row in the table) and then just scan through the table (one row at a time).

    When you get to a row that begins with @article, you empty out a set of variables (ex. @author, @year, @journal, etc.). Then you set these variables as you continue scanning down the rows. When you get to the end (two rows containing just right braces) you insert a row into your database and dump the values you stored in the variables into the columns. Then you start all over again.

    This continues until you get to the end of the file.

    Again, there are probably a zillion other and better ways to accomplish this. But at least using this approach you should be able to get something working today.

  • Thanks for the help.

    The only issue with this method is that in some cases a line e.g. author maybe longer than a line and roll over to the next line. In this case I dont think this method would work.

    Many thanks for your help though, its a good start.

  • data like this is tough; i always do this via TSQL and not through SSIS, just because I'm more familiar with TSQL.

    Here's how i would do it:

    I'd grab the entire file and remove all the CrLf.

    then i'd replace every }} with }} + CrLf; that would designate one record for each row of data, right?

    then i'd bulk insert into a table with a single varchar(max) column, like this:

    BULK INSERT BULKACT FROM 'C:\whateverfolder\filename.txt'

    WITH (

    DATAFILETYPE = 'char',

    FIELDTERMINATOR = '|',

    ROWTERMINATOR = '[slash n]', --forum does not like \ n with no space

    FIRSTROW = 1

    )

    with everything in a table, I'd find each element in a select, like this, updating each element separately:

    create table #BULKACT(RAWDATA varchar(max) )

    insert into #BULKACT(RAWDATA)

    SELECT

    '@article{author={Carslaw,H. S. and Jaeger,J. C.},year={1959},journal={Conduction of Heat in Solids},note={Cited By (since 1996): 7498},url={www..com},}}'

    UNION ALL SELECT

    '@article{author={Kretz,R.},year={1983},title={Symbols for rock-forming minerals.},journal={American Mineralogist},volume={68},number={1-2},pages={277-279},note={Cited By (since 1996): 1967},url={www..com},}}'

    SELECT

    SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8))

    FROM #BULKACT

    WHERE CHARINDEX('author={',RAWDATA) > 0

    --results

    Carslaw,H. S. and Jaeger,J. C.

    Kretz,R.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • The only issue with this method is that in some cases a line e.g. author maybe longer than a line and roll over to the next line. In this case I dont think this method would work.

    Can you provide an example of an article that contains lines that "roll over" to the next line? No matter how you try to skin this cat, the first step is going to be figuring out how to keep associated information together.

  • Thanks for the response I am really grateful and will give it a go over the weekend and let you know how it goes, an example of when it carrys on to the next line is like so

    @article{

    author={Merriman,R. J. and Roberts,B. and Peacor,D. R. and Hirons,S. R.},

    year={1995},

    title={Strain-related differences in the crystal growth of white mica and chlorite: a TEM and XRD study of the development of metapelitic microfabrics in the Southern Uplands thrust terrane, Scotland},

    journal={Journal of Metamorphic Geology},

    volume={13},

    number={5},

    pages={559-576},

    note={Cited By (since 1996): 55},

    url={www.scopus.com},

    so here the title carries over to the next line.

    I will try the advice above and remove the lines and go from there.

    Thanks again for the help.

    Oliver

  • Give it a try...

    I don't see the rolling over as a big problem. Every time you read in a row, check to see if it starts with "author=", "year=", "title=", etc. If it does, then you put the remaining portion of the line into the variable appropriate for that value. If not, append it to the variable that you just set on the prior row.

    Good luck

  • Thanks for the fast reply grasshopper

    can you give me some clues on what you wrote

    When you get to a row that begins with @article, you empty out a set of variables (ex. @author, @year, @journal, etc.). Then you set these variables as you continue scanning down the rows. When you get to the end (two rows containing just right braces) you insert a row into your database and dump the values you stored in the variables into the columns. Then you start all over again.

    This continues until you get to the end of the file.

    how would I empty out the variables ( is it by using the substring command or is there something better)

    How do I set is to have another row after I reach the double braces

    And sorry to be a pain finally how do I get it to loop through all the articles.

    Thanks very much. Have a good weekend.

  • One other thing, the main reason this has to be done is that there are non-standard characters in the text like letters with dashed, umlouts? etc and so I cant use varchar at the moment. will charindex still work?

    Many Thanks,

    Oliver

  • Hi SSCrazy,

    Thanks for the script had a go this weekend made some edits:

    SELECT rawdata,

    SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8)) as Author1,

    SUBSTRING(RAWDATA, CHARINDEX('year={',RAWDATA)+6, ((CHARINDEX('}',RAWDATA,(CHARINDEX('year={',RAWDATA) + 6)))- (CHARINDEX('year={',RAWDATA)+6))) as year1,

    SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, ((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))) as Title1

    FROM BULKACT

    This works fine, however if there is no title for example at the moment it will pull back the first 7 characters of the string after the first { is found! I basically would like it to be null if their is no title available for example.

    I has a go with nullif to fix this:

    SELECT rawdata,

    SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, nullif((((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))),CHARINDEX('}',RAWDATA)-7)) as Title1

    FROM BULKACT

    But this doesnt work, if the title is the same length as it is to the first { then it goes null!

    Do you have any ideas to fix this. Many Thanks,

  • Hi fixed it last night very happy. used case to sort it:

    SELECT rawdata,

    SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8)) as Author1,

    SUBSTRING(RAWDATA, CHARINDEX('year={',RAWDATA)+6, ((CHARINDEX('}',RAWDATA,(CHARINDEX('year={',RAWDATA) + 6)))- (CHARINDEX('year={',RAWDATA)+6))) as year1,

    CASE WHEN CHARINDEX('title={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, ((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))) end as Title1,

    CASE WHEN CHARINDEX('journal={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('journal={',RAWDATA)+9, ((CHARINDEX('}',RAWDATA,(CHARINDEX('journal={',RAWDATA) + 9)))- (CHARINDEX('journal={',RAWDATA)+9))) END as Journal1,

    CASE WHEN CHARINDEX('volume={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('volume={',RAWDATA)+8, ((CHARINDEX('}',RAWDATA,(CHARINDEX('volume={',RAWDATA) + 8)))- (CHARINDEX('volume={',RAWDATA)+8))) END as volume

    FROM BULKACT

    Thanks for the help.

    Oliver

  • Hi,

    Parsing of variable length columns text file could also be achieved using the Script Transform component in SSIS. Will involve some amount of VB.NET scripting.

    Thanks,

    Amol

    Amol Naik

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply