December 7, 2010 at 1:32 pm
Hello all –
Our company purchased a web application where the data was hosted remotely. The vendor is under no obligation to provide us with our data when the contract runs out (12/31/2010). We had a contractor come in an ran a scraping tool that automated opening the website, navigating to the main screen, scraped data from there, then went to the child screen and scraped that data. The contractor used this program to load 6 tables. The first 3 represent tabs on the main screen. The next 3 are all the various sections on the child screens of each tab. I get to parse out the scraped data into a DB. The main screen tables can stay as they are. The problem is the “child” tables...
His one table has columns:
Counter – a generated counter that links the parent to the child.
Cat – the section headers from the child screen
Field – field names for each category
Data – the actual data
I select distinct Ltrim(Rtrim([Cat])) and it returns 7 “categories” that should represent table names. I fixed up a create table and some test data that has only 2 of the “categories” in it.
I need a push in the right direction as to how to parse out the data into each of the different child tables. Maybe it’s an “unpivot”? or a “for xml”? I just can't get my head around it and would sure appreciate some starter code with explanation. Also, what’s the “TEXTIMAGE_ON [PRIMARY]” in the create Table?
Thanks in advance,
Kevin
CREATE TABLE [dbo].[report_test](
[counter] [nchar](100) NOT NULL,
[cat] [nchar](100) NULL,
[field] [nchar](100) NULL,
[data] [ntext] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
insert into [dbo].[report_test]
values
(92,'Vehicle Information', 'License Plate/Tail Number', '2XXX51'),
(2, 'Vehicle Information', 'License Plate State', 'TX'),
(2, 'Vehicle Information', 'Color', 'Beige'),
(93, 'Incident Information', 'Incident Date', '04/19/05'),
(93, 'Incident Information', 'Incident Hour', '10:30PM'),
(2, 'Incident Information', 'Incident Date',null),
(2, 'Vehicle Information', 'VIN', '4TASM92N2XXXXXX32' ),
(3, 'Vehicle Information', 'License Plate/Tail Number', 'XXXS51'),
(93, 'Incident Information', 'Incident Date', '05/16/05'),
(5, 'Incident Information', 'Incident Date', null),
(93, 'Incident Information', 'Incident Hour', '10:20AM'),
(93, 'Incident Information', 'State', 'TX' ),
(3, 'Incident Information', 'Incident Date',null ),
(4, 'Incident Information', 'Incident Date', null),
(92, 'Incident Information', 'Incident Date', '03/08/05'),
(3, 'Vehicle Information', 'License Plate State', 'TX'),
(3, 'Vehicle Information', 'Color', 'Beige'),
(1, 'Incident Information', 'Incident Date', '05/12/05'),
(93, 'Incident Information', 'State', 'TX'),
(3, 'Vehicle Information', 'VIN', '4TASM9XXXXX502732' ),
(1, 'Vehicle Information', 'License Plate/Tail Number', '2VXX51'),
(93, 'Incident Information', 'Incident Date', '05/16/05'),
(5, 'Incident Information', 'Incident Date', null),
(93, 'Incident Information', 'Incident Hour', '10:20AM'),
(1, 'Vehicle Information', 'License Plate State', 'TX'),
(1, 'Vehicle Information', 'Color','Beige'),
(1, 'Vehicle Information', 'VIN', '4TASXXXX2XZ502732'),
(93, 'Vehicle Information', 'License Plate/Tail Number', 'RXXX05'),
(93, 'Vehicle Information', 'License Plate State', 'TEXAS'),
(93, 'Vehicle Information', 'Color', 'WHITE'),
(1, 'Incident Information', 'Incident Date',null),
(93, 'Incident Information', 'Incident Date', '05/16/05'),
(5, 'Incident Information', 'Incident Date', null),
(93, 'Incident Information', 'Incident Hour', '10:20AM'),
(93, 'Incident Information', 'Country', 'USA'),
(1, 'Incident Information', 'Incident Hour', 'UNKPM'),
(1, 'Incident Information', 'Has it happened?', 'yes')
December 7, 2010 at 2:08 pm
First, switch from ntext to nvarchar(max), if you even need that.
If it's US license plate and VIN data, I guarantee you can fit that into something much more efficient, like varchar(8) for the license, and varchar(20) for the VIN.
I can't quite tell from this how the data is related back to itself, but I'm assuming you can from the pages. However, from what I can see, it looks like you should be able to normalize it pretty easily by creating tables for the categories, adding columns based on the "Field" data, and then inserting the data.
Pivoting on the fields would work, but given the sample here, I'd probably do it this way:
SELECT
COALESCE(PlateNos.counter, PlateStates.counter) AS ID,
PlateNo,
PlateSt
FROM
(SELECT
counter,
data AS PlateNo
FROM
dbo.report_test
WHERE
cat = 'Vehicle Information'
AND field = 'License Plate/Tail Number') AS PlateNos
FULL OUTER JOIN (SELECT
counter,
data AS PlateSt
FROM
dbo.report_test
WHERE
cat = 'Vehicle Information'
AND field = 'License Plate State') AS PlateStates
ON PlateNos.counter = PlateStates.counter ;
Add another derived table for each of the values you want, and you'll have data in tabular format. You'll want to add the "counter" for each derived table to the Coalesce function, in order to be sure you get one.
Go ahead an insert that into a table, and you'll be on your way to a database.
Does that help?
I also would redefine the base table to look more like this:
CREATE TABLE [dbo].[report_test](
[counter] [varchar](100) NOT NULL,
[cat] [varchar](100) NULL,
[field] [varchar](100) NULL,
[data] [varchar] (100) NULL
) ON [PRIMARY] ;
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
December 7, 2010 at 2:35 pm
GSquared -
yes, I'd already set up target tables to use more effecient types/sizes:
CREATE TABLE [dbo].[Vehicle_Info](
[Counter] [int] not null
,[Year] [nvarchar](5) null
,[Make] [nvarchar](20) null
,[License_State] [nvarchar](15) null
,[VIN] [nvarchar](20) null
,[Comments] [nvarchar](max) null
,[License_Plate] [nvarchar](30) null
,[Type] [nvarchar](30) null
,[Color] [nvarchar](30) null
) ON [PRIMARY]
Part of the problem is that the above reflects all the fields that ARE used, but few of the records actually use them all - and some of the records that have the field identified have empty data in them. I belive the use of the nchar type actually makes them 100 spaces instead of null. All the different "categories" are like that.
Could you comment your code with some explainations so that I can get a handle on what it does and try to change it to meet my needs? I'm obviously pretty green on TSQL.
thanks again,
Kevin
December 8, 2010 at 6:39 am
SELECT
COALESCE(PlateNos.counter, PlateStates.counter) AS ID, -- Pulls the "counter" value from the derived tables to create an ID number for the row
PlateNo,
PlateSt
FROM
-- First "derived table", gets the counter and the Plate Number
(SELECT
counter,
data AS PlateNo
FROM
dbo.report_test
WHERE
cat = 'Vehicle Information'
AND field = 'License Plate/Tail Number') AS PlateNos
-- Second "derived table" gets the License Plate State
-- Full Outer Join in case either derived table doesn't have a row for its "field"
FULL OUTER JOIN (SELECT
counter,
data AS PlateSt
FROM
dbo.report_test
WHERE
cat = 'Vehicle Information'
AND field = 'License Plate State') AS PlateStates
ON PlateNos.counter = PlateStates.counter ;
I'm just taking each value from "field", and turning that into a column, along with the "counter" in each case, by querying just the rows relevant to the value in "field".
So, the first sub-query (derived table=sub-query) gets all the license plate numbers, and the "counter". The second sub-query gets all the license plate states, and the "counter" value. A third would get the make, model, year, or whatever else you have "fields" for.
Then, using the "counter" value in each sub-query, I join them all together. I use Full Outer Joins, in case one "counter" doesn't have a license number (for example).
Then I use Coalesce to get the "counter" value, which will fill it in no matter which sub-queries have a value for that.
It's a brute-force pivot, but it gives you great control over your end results.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
December 8, 2010 at 3:24 pm
GSquared -
thanks very much for the code and explaination. I'm out sick today so was kindo slow responding, but I wanted you to know that I really appreciate your effort. I'll be back in the oriface tomorrow and will get a chance to play with it.
thanks,
Kevin
December 9, 2010 at 11:32 am
GSquared -
I've been playing around with your code and seem to be running into a problem. Here's the fleshed out version for the test data that I posted
SELECT
COALESCE(PlateNos.counter, PlateStates.counter, color.counter, vin.counter) AS ID, -- Pulls the "counter" value from the derived tables to create an ID number for the row
PlateNo,
PlateSt,
Color,
Vin
FROM
-- First "derived table", gets the counter and the Plate Number
(SELECT
counter,
data AS PlateNo
FROM
dbo.report_test
WHERE
cat = 'Vehicle Information'
AND field = 'License Plate/Tail Number') AS PlateNos
-- Second "derived table" gets the License Plate State
-- Full Outer Join in case either derived table doesn't have a row for its "field"
FULL OUTER JOIN (SELECT
counter,
data AS PlateSt
FROM
dbo.report_test
WHERE
cat = 'Vehicle Information'
AND field = 'License Plate State') AS PlateStates
ON PlateNos.counter = PlateStates.counter
-- Full Outer Join in case either derived table doesn't have a row for its "field"
FULL OUTER JOIN (SELECT
counter,
data AS Color
FROM
dbo.report_test
WHERE
cat = 'Vehicle Information'
AND field = 'Color') AS Color
ON PlateNos.counter = Color.counter
FULL OUTER JOIN (SELECT
counter,
data AS Vin
FROM
dbo.report_test
WHERE
cat = 'Vehicle Information'
AND field = 'Vin') AS Vin
ON Color.counter = Vin.counter ;
Even on this small dataset I end up with 2 rows with ID = 2.
I fleshed it out for the full table and am getting 13,251 rows but when I do a select distinct counter from that table, there's only 6,918.
Thanks to your explaination I do understand what the query is supposed to do.... but I'm lost on figuring out why it appears to work 'differently' than intended. Messing around with the join criteria seems to produce different output, but how do I determine which joins need which criteria?
thanks again,
Kevin
December 13, 2010 at 7:22 am
Select everything with ID = 2 all by itself. Does it have more than one entry for the same "field" anywhere?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
December 13, 2010 at 7:44 am
g'morning -
there doesn't seem to be any duplication for counter/field pairs. I believe the problem is that there is no field that is guarrenteed to have data in it so the joins get all hinky. I played around with the join criteria and was able to get several different result sets but all of them would have multiple rows on some of the counters.
You've given me a good lesson in how to look at this problem differently than I was, and I appreciate that. Anyway, I just fell back to VB6 and got the job done.
Thanks,
Kevin
December 14, 2010 at 6:54 am
Nested outer joins can do weird things like that. Glad you got something working.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply