Extract Data and XML nodes from XML datatype

  • hey.

    i've been dabbling with xml for the past few days and from the vast searches i have made i almost have a solution to my issue, but not quite. I hope someone can help me go that extra mile.

    As a matter of course, I need to import MS Word documents to the SQL server on a daily basis. For reasons i cant specify we need to do this on an XML basis which Word can provide. I save the document data to an XML field in SQL. The data looks like this:

    <?mso-application progid="Word.Document"?>

    <w:wordDocument xmlns:aml="http://schemas.microsoft.com/aml/2001/core" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wsp="http://schemas.microsoft.com/office/word/2003/wordml/sp2" xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core" w:macrosPresent="no" w:embeddedObjPresent="no" w:ocxPresent="no" xml:space="preserve">

    <w:body>

    <wx:sect>

    <w:p wsp:rsidR="00DB06F9" wsp:rsidRPr="00E27331" wsp:rsidRDefault="00DB06F9" wsp:rsidP="00DB06F9">

    <w:pPr>

    <w:pStyle w:val="a3" />

    <w:jc w:val="center" />

    <w:rPr>

    <w:rFonts w:ascii="Arial" w:h-ansi="Arial" />

    <wx:font wx:val="Arial" />

    <w:b />

    <w:b-cs />

    <w:sz w:val="28" />

    <w:sz-cs w:val="28" />

    <w:u w:val="single" />

    <w:rtl />

    </w:rPr>

    </w:pPr>

    <w:r wsp:rsidRPr="00E27331">

    <w:rPr>

    <w:rFonts w:ascii="Arial" w:h-ansi="Arial" />

    <wx:font wx:val="Arial" />

    <w:b />

    <w:b-cs />

    <w:sz w:val="28" />

    <w:sz-cs w:val="28" />

    <w:u w:val="single" />

    <w:rtl />

    </w:rPr>

    <w:t>MY DATA</w:t>

    </w:r>

    </w:p>

    as you can see, the Word XML is quite complex. I have identified that the information i need is in sections that end in w:p/w:r/w:t

    Here is where it gets tricky. Since a Word document can change and is not a constant, there is a difference between documents as to how the data is sectioned. sometimes it's with tables:

    w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t

    and sometimes it has sections and sub sections

    w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:sectPr/w:ftr/w:p/w:r/w:t

    w:wordDocument/w:body/wx:sect/wx:pBdrGroup/w:p/w:r/w:t

    So I have 2 solutions that get me close, but not quite there.

    1: A code snippet i got in my searcehs for a routine to list all elements in the document. The problem with this one is that it doens't give me the namespaces and if i dont have those, i cant proceed.

    DECLARE @idoc INT,

    @xml XML SET @xml = (SELECT mytest from TestTable Where ID = 2)

    EXEC sp_xml_preparedocument @idoc OUTPUT, @xml;

    WITH E AS (SELECT * FROM OPENXML(@idoc,'/',3)), P AS ( -- anchor member

    SELECT id, parentid, localname AS [Path] FROM E WHERE parentid IS NULL

    UNION ALL -- recursive member

    SELECT E.id, E.parentid, P.[Path] + '/' + localname AS [Path]

    FROM P INNER JOIN E ON E.parentid = P.id ) SELECT [Path] FROM P

    where Path like '%p/r/t'

    group by path

    EXEC sp_xml_removedocument @idoc

    Results:

    wordDocument/body/sect/p/pPr/sectPr/ftr/p/r/t

    wordDocument/body/sect/p/r/t

    wordDocument/body/sect/pBdrGroup/p/r/t

    wordDocument/body/sect/sectPr/ftr/p/r/t

    2. another code snippet i found somewhere:

    declare@xml xml

    select@xml =

    '

    <person>

    <first_name>Sam</first_name>

    <last_name>Watson</last_name>

    <age>43</age>

    <gender>Male</gender>

    </person>

    '

    -------------------------------------------

    ----------------------------------------------------------------

    SELECT x.query('

    for $node in /descendant::node()[local-name() != ""]

    return <node>

    <namespace>{ namespace-uri($node) }</namespace>

    <localname>{ local-name($node) }</localname>

    <value>{ $node }</value>

    <parent>{ local-name($node/..) }</parent>

    </node>') AS nodes

    FROM @xml.nodes('/person') x(x)

    SELECT DISTINCT

    T.n.value('namespace[1]', 'varchar(100)') AS Namespace,

    T.n.value('localname[1]', 'varchar(100)') AS Localname,

    T.n.value('value[1]', 'VARCHAR(100)') AS val,

    T.n.value('parent[1]', 'VARCHAR(100)') AS parent

    FROM ( SELECT

    x.query('

    for $node in /descendant::node()[local-name() != ""]

    return <node>

    <namespace>{ namespace-uri($node) }</namespace>

    <localname>{ local-name($node) }</localname>

    <value>{ $node }</value>

    <parent>{ local-name($node/..) }</parent>

    </node>') AS nodes

    FROM @xml.nodes('/person') x(x)

    ) q1

    CROSS APPLY q1.nodes.nodes('/node') AS T ( n )

    this works unless i apply

    SET @xml = (SELECT mytest from TestTable Where ID = 2)

    in which case it returns nothing. Again, i believe it has a lot to do with the namespaces. Yes, i also see the reoccuring theme here.

    I need that last stretch in order to get my entire software to work. Any help here would be great.

    Thanx a lot.

    Adi

    p.s. I'm very new to XML and what i had managed to gather is through a lot of searching. I dont really know how most of this code works so i would appreciate it if in the solution you post there will be some kind of code element. Much obliged.

  • It may not be entirely doable with TSQL. I would probably opt for manipulating a word document via CLR in C# via the system.xml namespace. It gives you much finer granularity in dealing with XML documents, especially when it comes to namespaces.

    The probability of survival is inversely proportional to the angle of arrival.

  • unfortunatly that's not an option for me. I'm still looking for a solution in SQL...

    thx

  • Ok, found a kind of solution that helps me.

    as part of solution 1 per the above, I have found a brilliant site with all the XML knowledge i needed

    http://beyondrelational.com/blogs/jacob/archive/2008/06/26/xquery-labs-a-collection-of-xquery-sample-scripts.aspx

    and the solution was to add the TEXT column herewith to the output i already had of the sections i needed and presto i have all the information required.

    SET @xml = (SELECT mytest from TestTable Where ID = 2)

    EXEC sp_xml_preparedocument @idoc OUTPUT, @xml

    ;WITH cte AS (

    -- shreds XML to rowset

    SELECT * FROM OPENXML(@idoc,'/',3)

    ), rcte AS (

    -- anchor part of recursive query

    SELECT 0 AS Level, id, parentid, nodetype, localname, prev,

    CAST(text AS NVARCHAR(max)) AS Text,

    CAST(localname AS NVARCHAR(100)) AS path,

    CAST(id AS VARBINARY(128)) AS Sort

    FROM cte WHERE id = 0

    -- recursive part

    UNION ALL

    SELECT p.level + 1, c.id, c.parentid, c.nodetype, c.localname, c.prev,

    CAST(c.text AS NVARCHAR(max)),

    CAST(p.path + CASE WHEN c.nodetype = 3 THEN '' ELSE '\' END +

    CASE WHEN c.nodetype = 2 THEN '@' ELSE '' END +

    CASE WHEN c.nodetype = 3 THEN '' ELSE c.localname END AS NVARCHAR(100)),

    CAST(p.Sort + CAST(c.id AS VARBINARY(4)) AS VARBINARY(128))

    FROM cte c

    INNER JOIN rcte p ON p.id = c.parentid)

    -- the final query

    insert into #XmlData (Path,text)

    SELECT path, text FROM rcte

    WHERE text IS NOT NULL

    and path like 'worddocument\body\%\t'

    ORDER BY sort

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply