July 19, 2010 at 12:49 pm
I'm new to this XML stuff, so please excuse my lack of knowledge.
Below is part of the daily file provided by the US Government Printing Office. This is a large file and I'm only interested in a few small parts of it. So what I'm trying to do is pull out the relevant parts, and load it into a tables that I can then report on.
I have worked out how to find the relevant sections, but I can figure out how to iterate thru the entries within the section.
Attached is an example of the output I'm hopeing to achieve.
- <GPOTABLE CDEF="s25,r20,r20,r45,r75,r75" COLS="6" OPTS="L2,i1">
- <TTITLE>
<E T="04">I.—44 Premanufacture Notices Received From: 6/15/09 to 7/3/09</E>
</TTITLE>
- <BOXHD>
<CHED H="1">Case No.</CHED>
<CHED H="1">Received Date</CHED>
<CHED H="1">Projected Notice End Date</CHED>
<CHED H="1">Manufacturer/Importer</CHED>
<CHED H="1">Use</CHED>
<CHED H="1">Chemical</CHED>
</BOXHD>
- <ROW>
<ENT I="01" O="xl">P-09-0424</ENT>
<ENT O="xl">06/12/09</ENT>
<ENT O="xl">09/09/09</ENT>
<ENT O="xl">CBI</ENT>
<ENT O="xl">(G) Thermoset adhesive additive</ENT>
<ENT O="xl">(S) Siloxanes and silicones, di-me, bu group- and 3-[2-(3-carboxymethylene-1-oxopropoxy)ethoxy]propyl group-terminated</ENT>
</ROW>
- <ROW>
<ENT I="01" O="xl">P-09-0425</ENT>
<ENT O="xl">06/15/09</ENT>
<ENT O="xl">09/12/09</ENT>
<ENT O="xl">Grafil, Inc.</ENT>
<ENT O="xl">(S) Precursor for carbonization/production of carbon fiber</ENT>
<ENT O="xl">(S) 2-propenoic acid, 2-methyl-, polymer with 2-propenamide and 2-propenenitrile</ENT>
</ROW>
Any pointers would be much appreciated.
Tim
July 19, 2010 at 2:39 pm
Would the following help you to get closer to your expected result?
SELECT
row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2 AS row,
col.value('1+count(for $a in . return $a/../*[. << $a])','INT') AS col,
col.value('text()[1]','VARCHAR(10)') AS val
FROM @xml.nodes('GPOTABLE/ROW') AS x1(row)
CROSS APPLY row.nodes('ENT') AS x2(col)
I'm not sure how it will perform against a large xml file... If the code performance cannot be accepted we might need an intermediate table to assign at least the row number per ROW element.
Based on the result you could either use PIVOT or the CrossTab method to get the data in your expected result.
July 20, 2010 at 7:11 am
Lutz,
I won't pretend I understand how it works, but it does just what I asked for.
But can I now be a real nuisance and add an additional requirement, to include the "TITLE" value as part of the returned value. This way I can determine from which section of the original source, the details are referring to.
Many thanks for you help.
Tim
July 20, 2010 at 1:34 pm
You would need to add the reference to the TTITLE section:
SELECT
title.value('E[1]','VARCHAR(100)') AS title,
row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2 AS ROW,
col.value('1+count(for $a in . return $a/../*[. << $a])','INT') AS col,
col.value('text()[1]','VARCHAR(10)') AS val
FROM @xml.nodes('GPOTABLE/ROW') AS x1(ROW)
CROSS APPLY row.nodes('ENT') AS x2(col)
CROSS APPLY @xml.nodes('GPOTABLE/TTITLE') AS x3(title)
If you're new to XQuery I'd recommend Jacob Sebastians blog. It's a great resource for XML related questions that I refer to frequently.
July 21, 2010 at 10:37 am
Lutz,
Thank you once again for the great help, it worked beautifully.
However now reviewing the source data, I have found the a very few records are structured slightly differently, with an extra node (PRTPAGE) (see record 2 below) before the start of the ENT nodes, when this happens it throws the subsequent columns out for that specific record. Is there any way of telling the code to skip the PRTPAGE node, if it's present?
- <ROW>
<ENT I="01" O="xl">P-09-0396</ENT>
<ENT O="xl">06/01/09</ENT>
<ENT O="xl">08/29/09</ENT>
<ENT O="xl">CBI</ENT>
<ENT O="xl">(G) Papermaking process aid</ENT>
<ENT O="xl">(G) Polyamide epichlorohydrin resin salt</ENT>
</ROW>
- <ROW>
<PRTPAGE P="34570" />
<ENT I="01" O="xl">P-09-0397</ENT>
<ENT O="xl">06/02/09</ENT>
<ENT O="xl">08/30/09</ENT>
<ENT O="xl">CBI</ENT>
<ENT O="xl">(G) Site limited intermediate</ENT>
<ENT O="xl">(G) Aromatic diamine</ENT>
</ROW>
- <ROW>
<ENT I="01" O="xl">P-09-0398</ENT>
<ENT O="xl">06/02/09</ENT>
<ENT O="xl">08/30/09</ENT>
<ENT O="xl">CBI</ENT>
<ENT O="xl">(S) Anti-scaling agent for water treatment; co-builder for detergents; drilling mud additive; dispersant for solid materials in water based coatings</ENT>
<ENT O="xl">(G) Polyitaconic acid</ENT>
</ROW>
I had not anticipated needing any help with the Pivot part of the problem, but at present don't seem to be able to get this to work either. So any pointers anybody can give on this would also be appreciated.
I had already come across Jacob Sabastains blog and had found them helpful.
Many thanks
Tim
July 21, 2010 at 12:00 pm
I would wrap the XML shredding into a CTE or subquery and use ROW_NUMBER to get it sorted as required:
;WITH cte AS
(
SELECT
title.value('E[1]','VARCHAR(100)') AS title,
row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2 AS ROW,
col.value('1+count(for $a in . return $a/../*[. << $a])','INT') AS col,
col.value('text()[1]','VARCHAR(10)') AS val
FROM @xml.nodes('GPOTABLE/ROW') AS x1(ROW)
CROSS APPLY row.nodes('ENT') AS x2(col)
CROSS APPLY @xml.nodes('GPOTABLE/TTITLE') AS x3(title)
)
SELECT
title,
ROW,
ROW_NUMBER() OVER(PARTITION BY ROW ORDER BY col ) AS col,
val
FROM cte
Regarding pivoting the data I'd recommend you have a look at the CrossTab article referenced in my signature.
July 22, 2010 at 8:31 am
Lutz,
Making some good progress now with your great help.
But I have found 2 issues with the current code.
1) In the original specification of the problem I omited to explain that the source data can have a variable number of sections that I'm interested in. (In the test data I'm using there are 6 relevant sections).
In the code for the "row"
row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2)
this resets to 1 at the start of each new section. The existing "row" counter is useful for identifying the number of rows per section, but it's stopping the Pivot working. So is it possible to add another row counter, that does not reset on change of section
2) This is to do with the way that the "TTILE" node is added, currently this adds all 6 section TTITLES to each row (taking the total row count from 620 to 3720 (ie: 620 * 6 = 3720). If I remove the TTILE section (and manually fix the "row" issue above), then the data is correct and the Pivot works just fine. I did try to add the TTILE after selecting the rest of the data, by using an Update query, but I can't see any way of selecting the correct TTILE value based on the value of the ROW/ENT.
I'm sorry to be such a nuisance on this, but any further help you are able to give, would be very much appreciated.
Tim
July 22, 2010 at 12:10 pm
I think the samples provided should help you to resolve your issues.
To add another counter, look what ROW_NUMBER actually does and try to modify it to your needs.
Regarding the PIVOT: based on the code and sample data you provided I don't see any reason why the last code I provided won't work.
Please try to understand and modify the code. It's a much better learning experience than simply copy and paste. If you get stuck, post your code, provide related sample data and your expected result.
July 26, 2010 at 4:55 am
Lutz,
This is the code I currently have and it mostly works.
-- Use a real table so that I can check what is happening
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].') AND type in (N'U'))
DROP TABLE [dbo].
CREATE TABLE b
(
IID INT IDENTITY(1,1),
b_date varchar(25),
b_title varchar(100),
b_row INT,
b_row2 INT,
b_col INT,
b_val varchar(200)
)
-- Take out the "Titles" for the present as they screw thing up
INSERT INTO b(b_date, /*b_title,*/ b_row, b_col, b_val)
SELECT
(SELECT a.b.value('DATE[1]','varchar(25)') AS s_Date
FROM @xml.nodes('FEDREG') AS a(b)) AS R_Date,
--title.value('E[1]','VARCHAR(100)') AS title,
row.value('1+count(for $a in . return $a/../*[. << $a])','INT')-2 AS row,
col.value('1+count(for $a in . return $a/../*[. << $a])','INT') AS col,
col.value('text()[1]','VARCHAR(200)') AS val
FROM @xml.nodes('FEDREG/NOTICES/NOTICE/SUPLINF/GPOTABLE/ROW') AS x1(row)
CROSS APPLY row.nodes('ENT') AS x2(col)
--CROSS APPLY @xml.nodes('FEDREG/NOTICES/NOTICE/SUPLINF/GPOTABLE/TTITLE') AS x3(title)
DECLARE @row_int INT, @row_b INT, @col_b INT
SET @row_int = 1
-- Use a Cursor to allocate the Row numbers over the whole result set.
-- Messy, but there are only a few hundred rows in the set, so it runs fast
DECLARE curs CURSOR FOR
SELECT b_row, b_col FROM dbo.b;
OPEN curs
FETCH NEXT FROM curs INTO @row_b, @col_b
WHILE @@FETCH_STATUS = 0
BEGIN
IF @row_b <> @row_int AND @col_b = 1
BEGIN
SET @row_int = @row_int + 1
UPDATE b SET b_row2 = @row_int WHERE CURRENT OF curs
END
ELSE
BEGIN
UPDATE b SET b_row2 = @row_int WHERE CURRENT OF curs
END
FETCH NEXT FROM curs INTO @row_b, @col_b
END
CLOSE curs
DEALLOCATE curs
-- Pivot
SELECT b_row2,
[1] as [Case No.],
[2] as [Received],
[3] as [End],
[4] as [Manufacture],
[5] as [Use],
[6] as [Chemical]
FROM (SELECT b_row2, b_col, b_val FROM b) AS src
PIVOT (MIN(b_val) FOR b_col IN ([1],[2],[3],[4],[5],[6])) AS pvt
ORDER BY b_row2
The 2 areas that are still giving difficulty are -
1) The extra (PRTPAGE) nodes referred to in my post of the 21st, where I would like to be able to skip those extra nodes (where they are present)
2) The TTITLE section, referred to in my post of the 22nd. I tried this code to pick up the correct TTITLE, but not surprisingly it returns all TTILE's from the source data, rather than just the one I'm interested in.
SELECT Title.value('E[1]','varchar(100)') AS title
FROM @xml.nodes('FEDREG/NOTICES/NOTICE/SUPLINF/GPOTABLE/TTITLE') AS x3(title)
WHERE @xml.exist('FEDREG/NOTICES/NOTICE/SUPLINF/GPOTABLE/TTITLE/E[./text()="II. 63 Notices of Commencement From: 09/7/09 to 10/23/09"]') = 1
You can get a copy of the test source XML file that I'm using here http://www.gpo.gov/fdsys/bulkdata/FR/2009/11/FR-2009-11-12.xml
Any help or guidance you can give in solving these last 2 issues, would be very much appreciated.
Many thanks
Tim
July 26, 2010 at 5:44 am
Please post your expected result set based with reference to your sample data (either the first sample, the second or the data from the link you mentioned.
Regarding the c.u.r.s.o.r. *cough*: what's wrong with the ROW_NUMBER approach I posted on 7-21 ?? (I wouldn't have to ask if I had a result set (based on provided sample data) to compare to... 😉 )
July 26, 2010 at 7:09 am
Yes I know all about the very widely held views on cursors, but personally I'm not to worried about this aspect, as at least I have something that works! Your suggested solution, resets back to one on each change of (TTITLE) group.
I have attached a output.bmp that shows a fake up of the type of output I would like to be able to achieve. This is based on the data in the url mentioned earleir today.
The attachment output2.bmp show part of the pdf that is currently published by the US government. This runs to hundreds of pages, but we are only interested in a few, hence the desire to automatically extract only the parts we are interested in.
If you want a full fake up of the data based on the URL, just let me know.
Many thanks
Tim
July 26, 2010 at 7:36 am
Sorry, should have also said, that we are only interested in data from the source file, where the TTITLE contains the following text -
Premanufacture Notices Received From
Notices of Commencement From
Test Marketing Exemption Notices Received From
Tim
July 26, 2010 at 12:06 pm
I had a look at the bmp file you posted. That's not what I expected.
Please provide table structure (CREATE TABLE) and some INSERT INTO ... SELECT ... UNION ALL statements that will actually show your populated target table based on a source to either be provided or referenced.
Regarding resetting to one with the ROW_NUMBER approach: yes, that's intentional, since I assumed that's what you've been looking for. Instead of using some sort of a loop I strongly recommend you look up ROW_NUMBER in BOL (BooksOnLine, the SQL help system usually installed together with SQL Server) and make the changes you need.
July 27, 2010 at 2:14 am
I hope this is what you expected.
The table structure for storing the pivoted data will look like this -
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[c]') AND type in (N'U'))
DROP TABLE [dbo].[c]
GO
CREATE TABLE [dbo].[c](
[IID] [int] IDENTITY(1,1) NOT NULL,
[c_date] [varchar](25) NULL,
[Title] [varchar](100) NULL,
[CaseNo] [varchar](200) NULL,
[Received] [varchar](200) NULL,
[EndDate] [varchar](200) NULL,
[Manufacture] [varchar](200) NULL,
[Usage] [varchar](200) NULL,
[Chemical] [varchar](200) NULL
) ON [PRIMARY]
GO
And the insert would look something like this (but would need to include the Title part) -
INSERT INTO c(c_date, /*Title,*/ CaseNo, Received, EndDate, Manufacture, Usage, Chemical)
SELECT b_row2,
[1] as [CaseNo],
[2] as [Received],
[3] as [EndDate],
[4] as [Manufacture],
[5] as [Usage],
[6] as [Chemical]
FROM (SELECT b_row2, b_col, b_val FROM b) AS src
PIVOT (MIN(b_val) FOR b_col IN ([1],[2],[3],[4],[5],[6])) AS pvt
ORDER BY b_row2
I'm sorry for the confusion on the Row_number issue, and your assumption was very reasonable. But I don't see how ROW_NUMBER is going to help to solve the problem - BOL says "Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition", but I need a number that is repeated n times for each group. In theory you could use ROW_NUMBER together with PARTITION, but I don't see anything to Partition against in the current data.
Tim
September 13, 2010 at 6:29 am
Couldn't you use XSLT to transform the original large XML file to one containing just the XML elements you want and then load this into SQL Server?
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply