Import data from html document Into SQL Table

  • Hello,

    I have someone who is sending me .htm documents, with a table in them, and I was wondering if there is a way to import the data from those tables into a SQL table, probably using an SSIS Package.

    Thanks,

    Sam

  • Quick thought, check out this excellent article How to Import Data from HTML pages[/url] by Phil Factor

    😎

  • 1. Phil's article is excellent; I never read it until today after I saw that Erikur posted it.

    2. Second, can you post and example of the HTML files that you are getting?

    3. There is an SSIS XSLT task that you could leverage but that is a little more complicated than we need to get...

    Lastly, I have to do this from time-to-time in SSIS and the solution varies depending on the condition of the HTML file you are getting. Keeping in mind that, beginning with HTML 4.01, HTML files were required to be well-formed. Though HTML existed before XML, you can now think of HTML files (4.01+) as XML files.

    This HTML code:

    <html>

    <header><title>Test</title></header>

    <body>

    <h1>blah blah</h1>

    <table border="1" cellspacing="0" style="border: black 1px solid">

    <tr><th>Date</th><th>Sales</th></tr>

    <tr><td>1/1/2003</td><td>12.09</td></tr>

    <tr><td>1/3/2003</td><td>85.09</td></tr>

    </table>

    <p>blah blah</p>

    </body>

    </html>

    Will produce a page that looks like this:

    Note the table, that's where the information is that we are looking for...

    Now we'll cast it as XML and query it like this:

    -- sample html data

    DECLARE @html xml =

    '

    <html>

    <header><title>Test</title></header>

    <body>

    <h1>blah blah</h1>

    <table border="1" cellspacing="0" style="border: black 1px solid">

    <tr><th>Date</th><th>Sales</th></tr>

    <tr><td>1/1/2003</td><td>12.09</td></tr>

    <tr><td>1/3/2003</td><td>85.09</td></tr>

    </table>

    <p>blah blah</p>

    </body>

    </html>

    '

    -- code to get extract data from the HTML

    SELECTsale_date = xx.value('(./td/text())[1]','varchar(100)'),

    sale_amt = xx.value('(./td/text())[2]','varchar(100)')

    FROM (values(@html)) t1(x)

    CROSS APPLY x.nodes('//table[1]/tr[position()>1]') t2(xx);

    Which produces this:

    sale_datesale_amt

    ------------------------

    1/1/200312.09

    1/3/200385.09

    You can use this as a template for understanding how to extract data from an HTML table.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Nice job Alan!

    😎

  • Eirikur Eiriksson (10/9/2014)


    Nice job Alan!

    😎

    Thanks!

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply