May 29, 2012 at 10:42 am
Morning all-
I have a process that compares two tables (same structure) and finds the differences between them. I insert those differing records into a temp table, then compare a subset of those columns back to the original data to see whether those columns are different or the same, which tells me whether the data was updated or deleted (there's a rough sketch of this step after the examples below). Example:
Original data:
Column1 Column2 Column3
1 A B
2 C D
3 E F
4 G H
New Data:
Column1 Column2 Column3
1 A B
2 C I
3 J F
So when I run my except on the two tables it returns:
Column1 Column2 Column3
2 C D
3 E F
4 G H
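For reference, here's a minimal sketch of that comparison step, with OriginalData and NewData standing in for the real tables and the column types guessed just for illustration:
-- temp table to hold the rows that differ (same structure as the source tables)
CREATE TABLE #Diff (Column1 int, Column2 nvarchar(50), Column3 nvarchar(50));
-- EXCEPT returns the original rows that no longer appear unchanged in the new data
INSERT INTO #Diff (Column1, Column2, Column3)
SELECT Column1, Column2, Column3 FROM OriginalData
EXCEPT
SELECT Column1, Column2, Column3 FROM NewData;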
I've identified the columns as either "Key" or "Variable" in a mapping table. Key values cannot change: if they have, we consider the original row deleted, whereas if only the variable values have changed, we see the row as updated. Example:
Column1 Column2 Column3
K K V
So after I populate my temp table, I use the key columns and compare those to the original data to see if they match. If they match the new data, then I know the row is an update, because only the variable columns have changed. I then return the data from both the original and new data sets with FOR XML and insert those XML streams into a table like this:
Customer Date OldValues NewValues Action
1 5/28/12 <2, C, D> <2, C, I> Update
1 5/28/12 <3, E, F> null Delete
1 5/28/12 <4, G, H> null Delete
Since there's no row in the new table with matching key columns, we determine that the data was deleted and assign the action accordingly.
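Roughly, that key-column check looks like this (again using the placeholder names from above, with Column1 and Column2 as the key columns):
-- key columns still present in the new data: only variable columns changed, so it's an update
SELECT d.*
FROM #Diff d
WHERE EXISTS (SELECT 1 FROM NewData n
              WHERE n.Column1 = d.Column1 AND n.Column2 = d.Column2);
-- key columns no longer present in the new data: the row was deleted
SELECT d.*
FROM #Diff d
WHERE NOT EXISTS (SELECT 1 FROM NewData n
                  WHERE n.Column1 = d.Column1 AND n.Column2 = d.Column2);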
The process actually works well, but unfortunately, since I am returning the result set with FOR XML, I have to do this in a cursor and can't find a way to return it as a record set. This is causing the process to take several hours to complete instead of the minute or two it should.
I'm looking for suggestions on either changing the way I'm storing the data in the table, or some way to return the information as a record set instead of having to go through a cursor. I'm not married to using XML to store the values, but we're going to use the data later to let users search through it and then display the values to them. It seemed to me that storing the data as XML would make it easier to display back to the user on a webpage.
Thanks for any help you can provide.
May 29, 2012 at 10:50 am
Are you familiar with Inline Subqueries?
That will do what you want, in a single set-based query, without a cursor.
I'd need to see the structure of the actual table before I could provide an actual query, but they look like this:
SELECT ID,
(SELECT MyColumn1, MyColumn2
FROM dbo.MyLogTable
WHERE SourceID = MyTable.ID
FOR XML RAW, TYPE) as BeforeValues
FROM dbo.MyTable ;
Does that help?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 29, 2012 at 1:26 pm
That completely makes sense. I tried doing something similar earlier, but I think I put both joined tables inside the subquery, so it came back with all the rows. I did some preliminary testing and it looks like it's going to work. I'll update when I know for sure.
Thanks!
May 29, 2012 at 3:13 pm
Ok, so that works, BUT it isn't any faster than the other option using the cursor. I think the problem might actually be retrieving the data itself, not the speed of the cursor. We build the tables dynamically based on the text files we receive from our customers. Because we don't know the size of the data they're sending us, we default every column definition to nvarchar(max). I think this might be our actual problem, since we can't index any of these columns because of the nvarchar(max) data type. We've discussed creating an identity column on the tables, which would allow us to create an index. There is the option of changing the data types to nvarchar(4000), but I'm afraid we might start truncating data that way.
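One thing I may try first is checking the actual maximum lengths in the imported data before narrowing anything; something along these lines against one of the imported tables (column names borrowed from the query below, just as an example):
-- if none of these exceed 4000 (or 450 for a column we'd want in an index key,
-- since the index key limit is 900 bytes), narrowing the types should be safe for this load
SELECT MAX(LEN(SN)) AS MaxLenSN,
       MAX(LEN(SC)) AS MaxLenSC,
       MAX(LEN(DT)) AS MaxLenDT
FROM D66670.dbo.ATT;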
Looking at the execution plan when I run the query, it does a table scan on the source tables (each over 1 million rows) and then filters the results, followed by a couple of nested loops and the insert. Here's the query that we build dynamically:
INSERT INTO Deltadawn.dbo.DeltaHistory
SELECT '66670', 'ATT', 'May 29 2012 1:19PM',
       CONVERT(nvarchar(max),
              (SELECT *
               FROM D66670.dbo.ATT A
               WHERE ISNULL(A.SN,'|||||') = ISNULL(B.SN,'|||||')
                 AND ISNULL(A.SC,'|||||') = ISNULL(B.SC,'|||||')
                 AND ISNULL(A.DT,'|||||') = ISNULL(B.DT,'|||||')
               FOR XML PATH('ATT'), ELEMENTS XSINIL, TYPE)) AS OldValues,
       CONVERT(nvarchar(max),
              (SELECT *
               FROM DeltaScratch.dbo.ATT A
               WHERE ISNULL(A.SN,'|||||') = ISNULL(B.SN,'|||||')
                 AND ISNULL(A.SC,'|||||') = ISNULL(B.SC,'|||||')
                 AND ISNULL(A.DT,'|||||') = ISNULL(B.DT,'|||||')
               FOR XML PATH('ATT'), ELEMENTS XSINIL, TYPE)) AS NewValues,
       'DELETE'
FROM COMPARETABLE B;
May 30, 2012 at 6:36 am
Those IsNulls in the Where clause and Join predicates are also going to kill performance, unless I'm mistaken, regardless of indexing.
I would definitely move towards a more structured dataset with better sizing & typing on the columns. That will also help performance.
You'll need to do both.
What I've found useful before is to go ahead and make a staging environment that has nvarchar(max) columns for the initial import. That can be good for speed on bulk importing. Then go from there to a more tightly designed table, with error trapping for rows that would truncate or have a type mismatch. It might take a little bit to set up, but if this kind of import is an ongoing task, it will be worth the effort in the long run.
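Just to sketch the idea (the table and column names here are made up for illustration; size and type them to match your real feeds):
-- 1. raw staging table: everything nvarchar(max), loaded straight from the file
CREATE TABLE dbo.ATT_Raw   (SN nvarchar(max), SC nvarchar(max), DT nvarchar(max));
-- 2. tightly sized working table, plus a holding table for problem rows
CREATE TABLE dbo.ATT_Typed (SN nvarchar(50),  SC nvarchar(50),  DT nvarchar(50));
CREATE TABLE dbo.ATT_Error (SN nvarchar(max), SC nvarchar(max), DT nvarchar(max));
-- 3. rows that fit move to the typed table...
INSERT INTO dbo.ATT_Typed (SN, SC, DT)
SELECT SN, SC, DT
FROM dbo.ATT_Raw
WHERE (SN IS NULL OR LEN(SN) <= 50)
  AND (SC IS NULL OR LEN(SC) <= 50)
  AND (DT IS NULL OR LEN(DT) <= 50);
-- ...and anything that would truncate gets parked for review instead
INSERT INTO dbo.ATT_Error (SN, SC, DT)
SELECT SN, SC, DT
FROM dbo.ATT_Raw
WHERE LEN(SN) > 50 OR LEN(SC) > 50 OR LEN(DT) > 50;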
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 30, 2012 at 9:28 am
Well, I certainly don't argue with you there, but unfortunately there are about 10 different types of systems we get data from, with all different data types and nulls being valid values for any of the columns. We type the columns more strictly once we import them into our system, but the customers can customize their systems in any way, so we have to rebuild their data every time we import a file from them.
Thanks for the help. If you or anyone can think of anything else I would appreciate the assistance.
June 3, 2012 at 3:21 am
GSquared (5/30/2012)
Those IsNulls in the Where clause and Join predicates are also going to kill performance, unless I'm mistaken. Regardless of indexing....
Indeed, the ISNULLs should be removed to get more speed. Not using ISNULL() doesn't mean you can't deal with null values in the output. I'll show you an alternative below that works better.
Also, if you're going to store the data as nvarchar(max), your generated queries don't need to produce xml-typed data that is then converted to nvarchar: FOR XML PATH() returns nvarchar(max) if you leave the ", TYPE" off, so you don't need the CONVERT at all.
So you'd be better off generating something like this:
INSERT INTO Deltadawn.dbo.DeltaHistory
SELECT '66670','ATT',{ts '2012-05-29 13:19:00'},
(SELECT *
FROM D66670.dbo.ATT a
WHERE ((A.SN IS NULL AND B.SN IS NULL) OR A.SN = B.SN)
AND ((A.SC IS NULL AND B.SC IS NULL) OR A.SC = B.SC)
AND ((A.DT IS NULL AND B.DT IS NULL) OR A.DT = B.DT)
FOR XML PATH('ATT'), ELEMENTS XSINIL
) AS OldValues,
(SELECT *
FROM DeltaScratch.dbo.ATT a
WHERE ((A.SN IS NULL AND B.SN IS NULL) OR A.SN = B.SN)
AND ((A.SC IS NULL AND B.SC IS NULL) OR A.SC = B.SC)
AND ((A.DT IS NULL AND B.DT IS NULL) OR A.DT = B.DT)
FOR XML PATH('ATT'), ELEMENTS XSINIL
) AS NewValues,
'DELETE'
FROM COMPARETABLE B;
June 3, 2012 at 3:41 am
You can also give this a try to see what it does for your performance. It would do a lot better if at least some of the columns used in the matching were non-nullable, but I expect it to perform better in this situation too.
This query will, however, not return exactly the same results as your original one: it will return null values for all columns in OldValues or NewValues when no matching row exists in D66670.dbo.ATT or DeltaScratch.dbo.ATT respectively. This can easily be fixed by adding WHERE clauses to both subqueries, as shown in the commented-out lines, but you need at least one non-nullable column in those tables to determine whether a match was found. Since you chose, in your examples, to put ISNULL() on every column, I don't know which columns are nullable and which aren't, so I'll leave it to you to pick one.
INSERT INTO Deltadawn.dbo.DeltaHistory
SELECT '66670','ATT',{ts '2012-05-29 13:19:00'},
(SELECT A.*
-- WHERE A.<non-nullable-column> is not null
FOR XML PATH('ATT'), ELEMENTS XSINIL
) AS OldValues,
(SELECT C.*
-- WHERE C.<non-nullable-column> is not null
FOR XML PATH('ATT'), ELEMENTS XSINIL
) AS NewValues,
'DELETE'
FROM COMPARETABLE B
LEFT OUTER JOIN D66670.dbo.ATT A ON ((A.SN IS NULL AND B.SN IS NULL) OR A.SN = B.SN)
AND ((A.SC IS NULL AND B.SC IS NULL) OR A.SC = B.SC)
AND ((A.DT IS NULL AND B.DT IS NULL) OR A.DT = B.DT)
LEFT OUTER JOIN DeltaScratch.dbo.ATT C ON ((C.SN IS NULL AND B.SN IS NULL) OR C.SN = B.SN)
AND ((C.SC IS NULL AND B.SC IS NULL) OR C.SC = B.SC)
AND ((C.DT IS NULL AND B.DT IS NULL) OR C.DT = B.DT)
June 4, 2012 at 9:47 am
Great thanks! I'll try these later today.
Unfortunately, we really don't have control over the data we bring in for the most part, and they can change data types and null out pretty much any column based on their own custom configurations. Plus, I'm trying to make this generic enough for 10 different vendors with similar products, which makes it difficult to customize much. We could try to customize for individual vendors, but it would become a maintenance nightmare.
I'll let you know how these work.
June 4, 2012 at 12:09 pm
I think I'm just going to have to change one of the columns (which also happens to be one of the most selective) to nvarchar(50) so that I can create an index. Looking at the way the query is chunking data back to SSMS, and reviewing the query plan, I'm sure an index would fix my problem. I'm going to try that next and see if it does.
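For anyone curious, this is roughly what I'm planning to run (SN is just an example; the real column will be whichever turns out to be most selective):
-- narrow the column so it can be an index key (900-byte limit), then index both sides
ALTER TABLE D66670.dbo.ATT       ALTER COLUMN SN nvarchar(50) NULL;
ALTER TABLE DeltaScratch.dbo.ATT ALTER COLUMN SN nvarchar(50) NULL;
CREATE NONCLUSTERED INDEX IX_ATT_SN ON D66670.dbo.ATT (SN);
CREATE NONCLUSTERED INDEX IX_ATT_SN ON DeltaScratch.dbo.ATT (SN);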
Thanks again for all the help to everyone.
June 5, 2012 at 6:48 am
Sounds like you're going about it the right way. How'd the index test go?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
June 6, 2012 at 12:16 pm
Awesome! Worked like a charm. Comes back instantly now instead of the 5-6 hours that it used to take on the large data loads.
Thanks again for the help to all involved. Total PITA getting this working but will be worth it in the long run.
June 6, 2012 at 1:12 pm
Awesome!
And you're welcome.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon