June 26, 2009 at 10:26 am
Hi I really appreciate your help, I am not being lazy and trying to get someone to sort it I really want to learn how to do this stuff.
Basically I have data like so:
author={Carslaw,H. S. and Jaeger,J. C.},
year={1959},
journal={Conduction of Heat in Solids},
note={Cited By (since 1996): 7498},
url={www..com},
}
}
author={Kretz,R.},
year={1983},
title={Symbols for rock-forming minerals.},
journal={American Mineralogist},
volume={68},
number={1-2},
pages={277-279},
note={Cited By (since 1996): 1967},
url={www..com},
}
}
etc etc
So this data needs to be entered into SQL Server table so that each record is a row with the items in the sections as columns (i.e. columns would be author, year, title etc
So the question is how do I translate from this format to a row format. Any pointers to other feeds, books etc would be invaluable
Thanks for your help.
Oliver
June 26, 2009 at 11:03 am
There are probably a lot more elegant (and faster) ways to do this but one simple and straight forward way would be to pull the whole thing into a temporary SQL table (each line is a row in the table) and then just scan through the table (one row at a time).
When you get to a row that begins with @article, you empty out a set of variables (ex. @author, @year, @journal, etc.). Then you set these variables as you continue scanning down the rows. When you get to the end (two rows containing just right braces) you insert a row into your database and dump the values you stored in the variables into the columns. Then you start all over again.
This continues until you get to the end of the file.
Again, there are probably a zillion other and better ways to accomplish this. But at least using this approach you should be able to get something working today.
June 26, 2009 at 11:06 am
Thanks for the help.
The only issue with this method is that in some cases a line e.g. author maybe longer than a line and roll over to the next line. In this case I dont think this method would work.
Many thanks for your help though, its a good start.
June 26, 2009 at 11:22 am
data like this is tough; i always do this via TSQL and not through SSIS, just because I'm more familiar with TSQL.
Here's how i would do it:
I'd grab the entire file and remove all the CrLf.
then i'd replace every }} with }} + CrLf; that would designate one record for each row of data, right?
then i'd bulk insert into a table with a single varchar(max) column, like this:
BULK INSERT BULKACT FROM 'C:\whateverfolder\filename.txt'
WITH (
DATAFILETYPE = 'char',
FIELDTERMINATOR = '|',
ROWTERMINATOR = '[slash n]', --forum does not like \ n with no space
FIRSTROW = 1
)
with everything in a table, I'd find each element in a select, like this, updating each element separately:
create table #BULKACT(RAWDATA varchar(max) )
insert into #BULKACT(RAWDATA)
SELECT
'@article{author={Carslaw,H. S. and Jaeger,J. C.},year={1959},journal={Conduction of Heat in Solids},note={Cited By (since 1996): 7498},url={www..com},}}'
UNION ALL SELECT
'@article{author={Kretz,R.},year={1983},title={Symbols for rock-forming minerals.},journal={American Mineralogist},volume={68},number={1-2},pages={277-279},note={Cited By (since 1996): 1967},url={www..com},}}'
SELECT
SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8))
FROM #BULKACT
WHERE CHARINDEX('author={',RAWDATA) > 0
--results
Carslaw,H. S. and Jaeger,J. C.
Kretz,R.
Lowell
June 26, 2009 at 11:39 am
The only issue with this method is that in some cases a line e.g. author maybe longer than a line and roll over to the next line. In this case I dont think this method would work.
Can you provide an example of an article that contains lines that "roll over" to the next line? No matter how you try to skin this cat, the first step is going to be figuring out how to keep associated information together.
June 26, 2009 at 4:03 pm
Thanks for the response I am really grateful and will give it a go over the weekend and let you know how it goes, an example of when it carrys on to the next line is like so
@article{
author={Merriman,R. J. and Roberts,B. and Peacor,D. R. and Hirons,S. R.},
year={1995},
title={Strain-related differences in the crystal growth of white mica and chlorite: a TEM and XRD study of the development of metapelitic microfabrics in the Southern Uplands thrust terrane, Scotland},
journal={Journal of Metamorphic Geology},
volume={13},
number={5},
pages={559-576},
note={Cited By (since 1996): 55},
url={www.scopus.com},
so here the title carries over to the next line.
I will try the advice above and remove the lines and go from there.
Thanks again for the help.
Oliver
June 26, 2009 at 4:20 pm
Give it a try...
I don't see the rolling over as a big problem. Every time you read in a row, check to see if it starts with "author=", "year=", "title=", etc. If it does, then you put the remaining portion of the line into the variable appropriate for that value. If not, append it to the variable that you just set on the prior row.
Good luck
June 26, 2009 at 4:59 pm
Thanks for the fast reply grasshopper
can you give me some clues on what you wrote
When you get to a row that begins with @article, you empty out a set of variables (ex. @author, @year, @journal, etc.). Then you set these variables as you continue scanning down the rows. When you get to the end (two rows containing just right braces) you insert a row into your database and dump the values you stored in the variables into the columns. Then you start all over again.
This continues until you get to the end of the file.
how would I empty out the variables ( is it by using the substring command or is there something better)
How do I set is to have another row after I reach the double braces
And sorry to be a pain finally how do I get it to loop through all the articles.
Thanks very much. Have a good weekend.
June 26, 2009 at 5:04 pm
One other thing, the main reason this has to be done is that there are non-standard characters in the text like letters with dashed, umlouts? etc and so I cant use varchar at the moment. will charindex still work?
Many Thanks,
Oliver
June 28, 2009 at 2:37 pm
Hi SSCrazy,
Thanks for the script had a go this weekend made some edits:
SELECT rawdata,
SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8)) as Author1,
SUBSTRING(RAWDATA, CHARINDEX('year={',RAWDATA)+6, ((CHARINDEX('}',RAWDATA,(CHARINDEX('year={',RAWDATA) + 6)))- (CHARINDEX('year={',RAWDATA)+6))) as year1,
SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, ((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))) as Title1
FROM BULKACT
This works fine, however if there is no title for example at the moment it will pull back the first 7 characters of the string after the first { is found! I basically would like it to be null if their is no title available for example.
I has a go with nullif to fix this:
SELECT rawdata,
SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, nullif((((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))),CHARINDEX('}',RAWDATA)-7)) as Title1
FROM BULKACT
But this doesnt work, if the title is the same length as it is to the first { then it goes null!
Do you have any ideas to fix this. Many Thanks,
June 29, 2009 at 2:17 am
Hi fixed it last night very happy. used case to sort it:
SELECT rawdata,
SUBSTRING(RAWDATA, CHARINDEX('author={',RAWDATA) + 8, CHARINDEX('}',RAWDATA) - (CHARINDEX('author={',RAWDATA) + 8)) as Author1,
SUBSTRING(RAWDATA, CHARINDEX('year={',RAWDATA)+6, ((CHARINDEX('}',RAWDATA,(CHARINDEX('year={',RAWDATA) + 6)))- (CHARINDEX('year={',RAWDATA)+6))) as year1,
CASE WHEN CHARINDEX('title={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('title={',RAWDATA)+7, ((CHARINDEX('}',RAWDATA,(CHARINDEX('title={',RAWDATA) + 7)))- (CHARINDEX('title={',RAWDATA)+7))) end as Title1,
CASE WHEN CHARINDEX('journal={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('journal={',RAWDATA)+9, ((CHARINDEX('}',RAWDATA,(CHARINDEX('journal={',RAWDATA) + 9)))- (CHARINDEX('journal={',RAWDATA)+9))) END as Journal1,
CASE WHEN CHARINDEX('volume={', rawdata) = 0 THEN '' ELSE SUBSTRING(RAWDATA, CHARINDEX('volume={',RAWDATA)+8, ((CHARINDEX('}',RAWDATA,(CHARINDEX('volume={',RAWDATA) + 8)))- (CHARINDEX('volume={',RAWDATA)+8))) END as volume
FROM BULKACT
Thanks for the help.
Oliver
June 29, 2009 at 4:02 pm
Hi,
Parsing of variable length columns text file could also be achieved using the Script Transform component in SSIS. Will involve some amount of VB.NET scripting.
Thanks,
Amol
Amol Naik
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply