January 2, 2008 at 2:40 pm
So this is the question I have: I have a data feed that sends me all the historical data it has, around 40 gigs of data. The data is loaded into a database; altogether there are over 20 different tables. The files we get from this feed on a monthly basis are split across different files.
Now this is what we want to do: we get the feed on a monthly basis, and 90% of the data we already have. We want to dedup this data and insert only the records that are new. BUT, we also need to verify that the records we already have haven't changed at all.
What would be the best way to do this? I was thinking of creating something like a checksum for each row to check against, because each row can have 10-20 columns, and otherwise I would need to verify every one of those columns to see if anything has changed.
Any suggestions? And NO, there isn't any way of telling which records are new and which records have been updated.
Thanks.
January 2, 2008 at 3:07 pm
One way to do it would be an outer join of the existing data against the incoming data. That would give you the rows that are new or have changed, but judging by the amount and complexity of the data you're describing, it might not perform well. The only way to find out would be to try.
A checksum (or hash), if put together correctly, might at least help eliminate the rows that are exact duplicates of existing rows. I'm not sure how accurate such a check would be on its own. Checksums aren't guaranteed to be unique; they have "collisions", where different data produces the same checksum/hash.
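For illustration only, here's a minimal sketch of the outer-join comparison with a checksum pre-filter. All table, key, and column names (dbo.Feed_Current, dbo.Feed_Staging, FeedKey, Col1-Col3) are made up for the example:

-- Rows that are new, or whose compared columns appear to have changed.
-- CHECKSUM is only a cheap pre-filter: collisions mean equal checksums
-- do not guarantee the rows are actually identical.
SELECT s.*
FROM dbo.Feed_Staging AS s
LEFT OUTER JOIN dbo.Feed_Current AS c
       ON c.FeedKey = s.FeedKey
WHERE c.FeedKey IS NULL                                  -- brand new row
   OR CHECKSUM(s.Col1, s.Col2, s.Col3)
      <> CHECKSUM(c.Col1, c.Col2, c.Col3);               -- probably changed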
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 2, 2008 at 4:04 pm
Another option (assuming you have the space) is to create a set of working tables, or a working database. Load the entire data set and then run compares: do inserts where the key values don't exist, do updates where the key values exist but some of the other values have changed, and maybe set a delete flag where the data no longer exists in the incoming set.
Given the large amount of data it could take a while to process and would certainly take up a lot of space, but it may be your best bet overall.
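Something like the following is what I mean, as a rough sketch. Table and column names (dbo.Feed_Current, dbo.Feed_Staging, FeedKey, Col1, Col2, IsDeleted) are hypothetical, and the <> comparisons would need ISNULL handling if the columns allow NULLs:

-- 1) Insert rows whose keys don't exist yet.
INSERT INTO dbo.Feed_Current (FeedKey, Col1, Col2)
SELECT s.FeedKey, s.Col1, s.Col2
FROM dbo.Feed_Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.Feed_Current AS c WHERE c.FeedKey = s.FeedKey);

-- 2) Update rows whose keys exist but whose values differ.
UPDATE c
SET    c.Col1 = s.Col1,
       c.Col2 = s.Col2
FROM dbo.Feed_Current AS c
JOIN dbo.Feed_Staging AS s ON s.FeedKey = c.FeedKey
WHERE c.Col1 <> s.Col1
   OR c.Col2 <> s.Col2;

-- 3) Flag rows that no longer appear in the incoming data set.
UPDATE c
SET    c.IsDeleted = 1
FROM dbo.Feed_Current AS c
WHERE NOT EXISTS (SELECT 1 FROM dbo.Feed_Staging AS s WHERE s.FeedKey = c.FeedKey);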
Kenneth
Kenneth Fisher
I was once offered a wizards hat but it got in the way of my dunce cap.
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
Link to my blog --> www.SQLStudies.com
January 2, 2008 at 4:06 pm
I have a few questions that need further clarification:
1) Why does historical data change?
2) Can you use a timestamp column in the data feed to only grab the needed rows?
http://msdn2.microsoft.com/en-us/library/ms182776.aspx
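If the source system could expose one, a rowversion (timestamp) column makes that kind of incremental pull straightforward. A purely hypothetical sketch, since the feed here is flat files from a vendor and this would only apply if they could add such a column on their side (table and column names are invented):

-- rowversion (the older name is timestamp) is bumped automatically on every insert/update.
CREATE TABLE dbo.SourceFeed
(
    FeedKey INT            NOT NULL PRIMARY KEY,
    Col1    NVARCHAR(100)  NULL,
    RowVer  ROWVERSION
);

-- Pull only rows changed since the last extract, using a saved high-water mark.
DECLARE @LastVer BINARY(8);
SELECT @LastVer = LastRowVer FROM dbo.ExtractLog;   -- hypothetical bookmark table

SELECT FeedKey, Col1, RowVer
FROM dbo.SourceFeed
WHERE RowVer > @LastVer;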
Now with that said, I would not suggest grabbing 40 GB of data and checking every row's values. It may be faster to delete the data in the table and bulk insert all of the new data.
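For what it's worth, the wholesale reload is about as simple as it gets. A sketch, assuming the monthly feed arrives as a delimited flat file (the path, terminators, and table name are placeholders):

-- Throw away the old copy and reload everything from the vendor file.
TRUNCATE TABLE dbo.Feed_Current;

BULK INSERT dbo.Feed_Current
FROM 'D:\Feeds\monthly_feed.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);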
January 2, 2008 at 4:45 pm
I'd second Adam's comment. If you have to load all the data to do the compares, you might as well just load the data as you do now.
I once worked on a system where I had a requirement to load large volumes of data with the same restrictions you mention. The requirement was to keep the database available for as long as possible, so we toyed with the idea of loading the new feed and comparing. In the end, we created a job that loaded all the new data into a differently named database (db_load), renamed the current database to temp_something, renamed the new database to the current production name, and then renamed temp_something back to the load database name (db_load). The renames were quick and required only seconds of downtime, and the process ran well for several years. The only trick was making sure there were no live connections when the renames took place. There are scripts on this site that can generate the commands to kill active processes for specific databases.
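Roughly, the swap looked like the following sketch (the database names are placeholders, and SET SINGLE_USER WITH ROLLBACK IMMEDIATE is one way to clear live connections before the rename):

-- Clear live connections on the current production database.
ALTER DATABASE FeedDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

ALTER DATABASE FeedDB      MODIFY NAME = temp_FeedDB;  -- park the current database
ALTER DATABASE db_load     MODIFY NAME = FeedDB;       -- promote the freshly loaded copy
ALTER DATABASE temp_FeedDB MODIFY NAME = db_load;      -- old copy becomes the next load target

ALTER DATABASE db_load SET MULTI_USER;                 -- reopen it for the next load cycle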
Hope that helps.
January 2, 2008 at 11:51 pm
CHECKSUM works great as long as you make sure not to include "non-comparable" columns such as Text, NText, Image, Cursor, and SQL_Variant. It won't tell you what changed, but it will help you quickly discern the rows that have, so you don't need to check as many rows. Since it will probably have very high cardinality over 10-20 columns, it's also a beautiful candidate for an index.
From there, you might want to make a second pass on the columns using CHECKSUM_AGG to narrow down which columns have changes...
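As a sketch of the setup (column names are hypothetical, and the TEXT/NTEXT/IMAGE columns have to be left out of the CHECKSUM):

-- Persisted, indexed checksum over the comparable columns.
ALTER TABLE dbo.Feed_Current
    ADD RowCheck AS CHECKSUM(Col1, Col2, Col3) PERSISTED;

CREATE INDEX IX_Feed_Current_RowCheck ON dbo.Feed_Current (RowCheck);

-- Per-column CHECKSUM_AGG comparisons between old and new data can hint
-- at which columns carry the changes.
SELECT CHECKSUM_AGG(CHECKSUM(Col1)) AS Col1Check,
       CHECKSUM_AGG(CHECKSUM(Col2)) AS Col2Check
FROM dbo.Feed_Staging;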
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2008 at 5:52 am
This feed is coming from one of our vendors, and they send the FULL data load on each run. We have seen in the past that data has changed (even though the vendor says they do not change historical data...hmm).
So the project requirement was to keep all the data, and compare the data values based on the primary keys that are set in the table.
Storage is not a constraint at all; we have plenty of it. When I was first given this project I was thinking of renaming the tables, which does take minimal time, but that cannot happen. The requirements of the project changed recently, and hence now I am wondering how to go about it.
The reason I was thinking of a checksum is that I figured it would be easier than comparing each and every column for each and every primary key match, which could mean a lot of queries and be very painful and time consuming.
Does anyone have any checksum code that I can use, or can anyone point me to an example of how to create a checksum? These fields are all nvarchars, ints, and floats, and the nvarchar fields are no larger than 100 characters.
Thanks.
January 3, 2008 at 6:03 am
Sorry, just found that CHECKSUM is actually built into SQL Server. I never realized this.
Overall, how does everyone feel about using CHECKSUM over comparing each and every column for each and every row?
January 3, 2008 at 10:21 am
Ravi Patel (1/3/2008)
This feed is coming from one of our vendors, and they send the FULL data load on each run.
If by FULL data load you mean they send ALL rows, both new and old, then I wouldn't waste my time trying to build a merge... I'd load it into a new table and point a synonym at it. Next time, I'd truncate the old table, load it with the new data, and repoint the same synonym at that. That makes the total contention on the table about 65 milliseconds or less. No locks, no blocks, no complex code.
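A sketch of the synonym flip (synonyms need SQL Server 2005 or later; all names here are placeholders):

-- Month 1: load dbo.FeedData_A and point the synonym at it.
CREATE SYNONYM dbo.FeedData FOR dbo.FeedData_A;

-- Month 2: reload the idle table, then flip the synonym.
TRUNCATE TABLE dbo.FeedData_B;
-- ... bulk load dbo.FeedData_B here ...

DROP SYNONYM dbo.FeedData;
CREATE SYNONYM dbo.FeedData FOR dbo.FeedData_B;  -- queries against dbo.FeedData now see the new load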
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2008 at 10:24 am
Ravi Patel (1/3/2008)
Sorry, just found that CHECKSUM is actually built into SQL Server. I never realized this. Overall, how does everyone feel about using CHECKSUM over comparing each and every column for each and every row?
Heh... it's a heck of a lot better than comparing 10-20 columns just to find out which rows MAY have changed.
And glad to see that someone else learned a new function "on the way"... 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2008 at 10:30 am
Jeff Moden (1/3/2008)
Ravi Patel (1/3/2008)
This feed is coming from one of our vendors, and they send the FULL data load on each run.
If by FULL data load you mean they send ALL rows, both new and old, then I wouldn't waste my time trying to build a merge... I'd load it into a new table and point a synonym at it. Next time, I'd truncate the old table, load it with the new data, and repoint the same synonym at that. That makes the total contention on the table about 65 milliseconds or less. No locks, no blocks, no complex code.
I foresee a few possible problems (and as always, they're based on assumptions):
1st: If the data is loaded into tables with identity columns that are used to reference the data elsewhere, you will almost certainly run into referential problems.
2nd: We were told the historical data does change from time to time, whether it should or not, so you run the risk of a row being removed from the data set. If that row was referred to elsewhere, you will get orphans.
While I'm at it, one other benefit of loading the whole data set into a workspace is that you get an opportunity to do some pre-processing without affecting your working data set: check for dups, orphans, general bad data, etc.
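For example, a quick duplicate check in the workspace might look like this (the database, table, and key names are made up):

-- Keys that appear more than once in the incoming load.
SELECT FeedKey, COUNT(*) AS DupCount
FROM workdb.dbo.Feed_Staging
GROUP BY FeedKey
HAVING COUNT(*) > 1;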
Just my 2 cents though
Kenneth
Kenneth Fisher
I was once offered a wizards hat but it got in the way of my dunce cap.
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
Link to my blog --> www.SQLStudies.com