Parsing a document - Is T-SQL a good tool for this ?

  • I have a SQL parsing routine for extracting information from text documents. It works ok for 100k records, about 20 minutes, but when i do 200k or more records, it goes a LOT slower.

    Thanks

  • No... T-SQL isn't a good parsing tool for such a thing. What are you trying to parse out of the Text documents and what's the average length of these documents?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This may be a perfect occasion to get started with Powershell or SSIS to parse your documents.

    SSC has some nice articles to get you started with both.

    For powershell this one may just do it: http://www.sqlservercentral.com/articles/powershell/72051/

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Name, Address, City, State, Zip, Phone Number

    I do have an SSIS procedure that works very well at extracting emails, but this is more complicated!

    eg. I need to search for 'Name' string, but get the data AFTER the 'Name' string.

    I'm just not sure how to do this, probably a combination of C# and regex in a script, which is how the SSIS procedure works.

  • ALZDBA (10/9/2011)


    This may be a perfect occasion to get started with Powershell or SSIS to parse your documents.

    SSC has some nice articles to get you started with both.

    For powershell this one may just do it: http://www.sqlservercentral.com/articles/powershell/72051/

    I'd love to see a Powershell script do the kind of parsing I think the OP is doing. Do you have a link or code example you can point to for such parsing, Johan?

    IMHO, SSIS sucks as bad at what I think of as the "parsing of text" documents as does T-SQL. In fact, I'm not sure such a capability exists in SSIS. I believe a script would need to be called. I could be wrong and look forward to any corrections you might want to make on my statements.

    The only thing that I know of that works reasonable well in SQL Server for the "parsing of text documents" is FULL TEXT INDEXING. Even that pales in comparison to some of the commercially made FULL TEXT search engines such as DtSearch and the like.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/9/2011)


    ALZDBA (10/9/2011)


    This may be a perfect occasion to get started with Powershell or SSIS to parse your documents.

    SSC has some nice articles to get you started with both.

    For powershell this one may just do it: http://www.sqlservercentral.com/articles/powershell/72051/

    I'd love to see a Powershell script do the kind of parsing I think the OP is doing. Do you have a link or code example you can point to for such parsing, Johan?

    IMHO, SSIS sucks as bad at what I think of as the "parsing of text" documents as does T-SQL. In fact, I'm not sure such a capability exists in SSIS. I believe a script would need to be called. I could be wrong and look forward to any corrections you might want to make on my statements.

    The only thing that I know of that works reasonable well in SQL Server for the "parsing of text documents" is FULL TEXT INDEXING. Even that pales in comparison to some of the commercially made FULL TEXT search engines such as DtSearch and the like.

    I was thinking about using split

    get-help about_split

    or a combination with

    Get-Help Select-String

    edited:

    Here's a nice ref: http://www.leeholmes.com/blog/2005/11/14/parse-textobject-%e2%80%93-awk-with-a-vengeance/

    http://blogs.msdn.com/b/powershell/archive/2008/03/23/select-string-and-grep.aspx

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Very interesting. Thanks, Johan. Have you tested it for performance at all? And, yes... I'll check out the link you provided.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply