Help with importing data

Question

Help with importing data

olibbhq

SSCarpal Tunnel

Points: 4034
More actions
June 26, 2009 at 10:26 am

#133876

Hi I really appreciate your help, I am not being lazy and trying to get someone to sort it I really want to learn how to do this stuff.
Basically I have data like so:
@article{
author={Carslaw,H. S. and Jaeger,J. C.},
year={1959},
journal={Conduction of Heat in Solids},
note={Cited By (since 1996): 7498},
url={www..com},
}
}
@article{
author={Kretz,R.},
year={1983},
title={Symbols for rock-forming minerals.},
journal={American Mineralogist},
volume={68},
number={1-2},
pages={277-279},
note={Cited By (since 1996): 1967},
url={www..com},
}
}
etc etc
So this data needs to be entered into SQL Server table so that each record is a row with the items in the sections as columns (i.e. columns would be author, year, title etc
So the question is how do I translate from this format to a row format. Any pointers to other feeds, books etc would be invaluable
Thanks for your help.
Oliver

Viewing 12 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

DBA with OCD SSC Enthusiast Points: 199 More actions · Answer 1

There are probably a lot more elegant (and faster) ways to do this but one simple and straight forward way would be to pull the whole thing into a temporary SQL table (each line is a row in the table) and then just scan through the table (one row at a time).

When you get to a row that begins with @article, you empty out a set of variables (ex. @author, @year, @journal, etc.). Then you set these variables as you continue scanning down the rows. When you get to the end (two rows containing just right braces) you insert a row into your database and dump the values you stored in the variables into the columns. Then you start all over again.

This continues until you get to the end of the file.

Again, there are probably a zillion other and better ways to accomplish this. But at least using this approach you should be able to get something working today.

olibbhq SSCarpal Tunnel Points: 4034 More actions · Answer 2

Thanks for the help.

The only issue with this method is that in some cases a line e.g. author maybe longer than a line and roll over to the next line. In this case I dont think this method would work.

Many thanks for your help though, its a good start.

Lowell SSC Guru Points: 323755 More actions · Answer 3

data like this is tough; i always do this via TSQL and not through SSIS, just because I'm more familiar with TSQL.

Here's how i would do it:

I'd grab the entire file and remove all the CrLf.

then i'd replace every }} with }} + CrLf; that would designate one record for each row of data, right?

then i'd bulk insert into a table with a single varchar(max) column, like this:

BULK INSERT BULKACT FROM 'C:\whateverfolder\filename.txt'

WITH (

DATAFILETYPE = 'char',

FIELDTERMINATOR = '|',

ROWTERMINATOR = '[slash n]', --forum does not like \ n with no space

FIRSTROW = 1

)

with everything in a table, I'd find each element in a select, like this, updating each element separately:

create table #BULKACT(RAWDATA varchar(max) )

insert into #BULKACT(RAWDATA)

SELECT

'@article{author={Carslaw,H. S. and Jaeger,J. C.},year={1959},journal={Conduction of Heat in Solids},note={Cited By (since 1996): 7498},url={www..com},}}'

UNION ALL SELECT

'@article{author={Kretz,R.},year={1983},title={Symbols for rock-forming minerals.},journal={American Mineralogist},volume={68},number={1-2},pages={277-279},note={Cited By (since 1996): 1967},url={www..com},}}'

SELECT

SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8))

FROM #BULKACT

WHERE CHARINDEX('author={',RAWDATA) > 0

--results

Carslaw,H. S. and Jaeger,J. C.

Kretz,R.

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

DBA with OCD SSC Enthusiast Points: 199 More actions · Answer 4

The only issue with this method is that in some cases a line e.g. author maybe longer than a line and roll over to the next line. In this case I dont think this method would work.

Can you provide an example of an article that contains lines that "roll over" to the next line? No matter how you try to skin this cat, the first step is going to be figuring out how to keep associated information together.

olibbhq SSCarpal Tunnel Points: 4034 More actions · Answer 5

Thanks for the response I am really grateful and will give it a go over the weekend and let you know how it goes, an example of when it carrys on to the next line is like so

@article{

author={Merriman,R. J. and Roberts,B. and Peacor,D. R. and Hirons,S. R.},

year={1995},

title={Strain-related differences in the crystal growth of white mica and chlorite: a TEM and XRD study of the development of metapelitic microfabrics in the Southern Uplands thrust terrane, Scotland},

journal={Journal of Metamorphic Geology},

volume={13},

number={5},

pages={559-576},

note={Cited By (since 1996): 55},

url={www.scopus.com},

so here the title carries over to the next line.

I will try the advice above and remove the lines and go from there.

Thanks again for the help.

Oliver

DBA with OCD SSC Enthusiast Points: 199 More actions · Answer 6

Give it a try...

I don't see the rolling over as a big problem. Every time you read in a row, check to see if it starts with "author=", "year=", "title=", etc. If it does, then you put the remaining portion of the line into the variable appropriate for that value. If not, append it to the variable that you just set on the prior row.

Good luck

olibbhq SSCarpal Tunnel Points: 4034 More actions · Answer 7

Thanks for the fast reply grasshopper

can you give me some clues on what you wrote

When you get to a row that begins with @article, you empty out a set of variables (ex. @author, @year, @journal, etc.). Then you set these variables as you continue scanning down the rows. When you get to the end (two rows containing just right braces) you insert a row into your database and dump the values you stored in the variables into the columns. Then you start all over again.

This continues until you get to the end of the file.

how would I empty out the variables ( is it by using the substring command or is there something better)

How do I set is to have another row after I reach the double braces

And sorry to be a pain finally how do I get it to loop through all the articles.

Thanks very much. Have a good weekend.

olibbhq SSCarpal Tunnel Points: 4034 More actions · Answer 8

One other thing, the main reason this has to be done is that there are non-standard characters in the text like letters with dashed, umlouts? etc and so I cant use varchar at the moment. will charindex still work?

Many Thanks,

Oliver

olibbhq SSCarpal Tunnel Points: 4034 More actions · Answer 9

Hi SSCrazy,

Thanks for the script had a go this weekend made some edits:

SELECT rawdata,

SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8)) as Author1,

SUBSTRING(RAWDATA, CHARINDEX('year={',RAWDATA)+6, ((CHARINDEX('}',RAWDATA,(CHARINDEX('year={',RAWDATA) + 6)))- (CHARINDEX('year={',RAWDATA)+6))) as year1,

SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, ((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))) as Title1

FROM BULKACT

This works fine, however if there is no title for example at the moment it will pull back the first 7 characters of the string after the first { is found! I basically would like it to be null if their is no title available for example.

I has a go with nullif to fix this:

SELECT rawdata,

SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, nullif((((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))),CHARINDEX('}',RAWDATA)-7)) as Title1

FROM BULKACT

But this doesnt work, if the title is the same length as it is to the first { then it goes null!

Do you have any ideas to fix this. Many Thanks,

olibbhq SSCarpal Tunnel Points: 4034 More actions · Answer 10

Hi fixed it last night very happy. used case to sort it:

SELECT rawdata,

SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8)) as Author1,

SUBSTRING(RAWDATA, CHARINDEX('year={',RAWDATA)+6, ((CHARINDEX('}',RAWDATA,(CHARINDEX('year={',RAWDATA) + 6)))- (CHARINDEX('year={',RAWDATA)+6))) as year1,

CASE WHEN CHARINDEX('title={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, ((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))) end as Title1,

CASE WHEN CHARINDEX('journal={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('journal={',RAWDATA)+9, ((CHARINDEX('}',RAWDATA,(CHARINDEX('journal={',RAWDATA) + 9)))- (CHARINDEX('journal={',RAWDATA)+9))) END as Journal1,

CASE WHEN CHARINDEX('volume={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('volume={',RAWDATA)+8, ((CHARINDEX('}',RAWDATA,(CHARINDEX('volume={',RAWDATA) + 8)))- (CHARINDEX('volume={',RAWDATA)+8))) END as volume

FROM BULKACT

Thanks for the help.

Oliver

AmolNaik SSCarpal Tunnel Points: 4767 More actions · Answer 11

Hi,

Parsing of variable length columns text file could also be achieved using the Script Transform component in SSIS. Will involve some amount of VB.NET scripting.

Thanks,

Amol

Amol Naik