January 21, 2016 at 5:33 am
salardx (1/21/2016)
Guys, please! We CANNOT have many indexes on the tables (neither one table for each month, nor 10,000 tables!)
because data is bulk loaded into the tables, 1,000 records per bulk, and having indexes slows it down!
Besides, as I said before, indexes increase the size of the database enormously!
I have also experienced database corruption many times, and at this size, if you hit corruption it takes days to restore a backup! (Although I'm thinking about using clustered or mirrored SQL Servers.)
But please focus on my main question: is there any faster way to handle uniqueness with a faster hashing algorithm?
First, the question you ask: can this be made faster with hashing? The answer is no, at least not on SQL Server (I have no idea about other platforms).
But I think you are barking up the wrong tree.
First, step back. If a problem looks too hard to solve, I always like to back up a few steps and ask whether I already took a wrong turn earlier, so that I don't accidentally keep digging myself into a deeper and deeper hole. Do you really need this data to be in a relational database? Why? If you do not use any of the specific features offered by a relational data store, consider using another type of data store.
Second, review what has been said. You say that you do not want indexes on the table because they slow down the data load and increase storage space. But why the plural? If you want to ensure uniqueness on a combination of 5 columns, then you need just a single index - on that specific column combination. Depending on what else you need to do with the data, you might even be able to make that the clustered index, in which case you use no extra storage space at all: the clustered index is where the table data is stored.
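A minimal sketch of that single index, assuming a hypothetical table dbo.Readings whose key is the five columns Col1 through Col5 (all names here are placeholders, not your actual schema):

-- One unique clustered index on the five-column combination enforces uniqueness
-- and doubles as the table's storage order, so it adds no separate copy of the data.
CREATE TABLE dbo.Readings
(
    Col1    int          NOT NULL,
    Col2    int          NOT NULL,
    Col3    datetime2    NOT NULL,
    Col4    varchar(20)  NOT NULL,
    Col5    varchar(20)  NOT NULL,
    Payload varchar(200) NULL
);

CREATE UNIQUE CLUSTERED INDEX UQ_Readings_Key
    ON dbo.Readings (Col1, Col2, Col3, Col4, Col5);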
Third, consider alternatives. You write that you need 10,000 tables because you want to do analysis on only current data and not be slowed down by old data, and because you want an easy way to archive data that is not needed anymore. All that is offered out of the box on a single table if you use table partitioning. And that makes for a solution that is probably much easier to manage.
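A rough sketch of monthly partitioning, still against the assumed dbo.Readings table and using its datetime2 column Col3 as the partitioning key. The boundary dates are only examples, and keep in mind that table partitioning is an Enterprise Edition feature on current SQL Server versions:

-- Monthly boundaries; new ones can be added later with ALTER PARTITION FUNCTION ... SPLIT.
CREATE PARTITION FUNCTION pf_Monthly (datetime2)
    AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-02-01', '2016-03-01');

CREATE PARTITION SCHEME ps_Monthly
    AS PARTITION pf_Monthly ALL TO ([PRIMARY]);

-- Rebuilding the clustered index on the partition scheme spreads the rows over
-- the monthly partitions; an old month can later be switched out to an archive
-- table as a metadata-only operation instead of maintaining thousands of tables.
CREATE UNIQUE CLUSTERED INDEX UQ_Readings_Key
    ON dbo.Readings (Col1, Col2, Col3, Col4, Col5)
    WITH (DROP_EXISTING = ON)
    ON ps_Monthly (Col3);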
Fourth, do a reality check. You are concerned about adding indexes for the insert speed, but then you say that it's 1000 rows per batch. Even with a bucketload of indexes on the table, adding a batch of 1000 rows should take less than a second. If not, you have different problems - probably locking and blocking but impossible to tell for sure without knowing more.
January 21, 2016 at 5:33 am
salardx (1/21/2016)
Guys, please! We CANNOT have many indexes on the tables (neither one table for each month, nor 10,000 tables!)
because data is bulk loaded into the tables, 1,000 records per bulk, and having indexes slows it down!
Besides, as I said before, indexes increase the size of the database enormously!
I have also experienced database corruption many times, and at this size, if you hit corruption it takes days to restore a backup! (Although I'm thinking about using clustered or mirrored SQL Servers.)
But please focus on my main question: is there any faster way to handle uniqueness with a faster hashing algorithm?
It's not the speed of the hash. And yeah, you can hash the columns and make that a computed column. Index that and it will be pretty darned fast; you can use CHECKSUM to do it. The issue is that you will get hash collisions. The only way to ensure absolutely unique values is to use a unique constraint.
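A quick sketch of that approach against the same kind of five-column key (table and column names are placeholders). CHECKSUM is cheap to compute and index, but because it collides, the unique constraint on the real columns is still what guarantees uniqueness; the hash index only gives you a fast way to find candidate duplicates.

-- Persisted computed column holding the CHECKSUM hash of the key columns.
ALTER TABLE dbo.Readings
    ADD KeyHash AS CHECKSUM(Col1, Col2, Col3, Col4, Col5) PERSISTED;

-- Non-unique index on the hash for fast lookups of possible duplicates.
CREATE INDEX IX_Readings_KeyHash ON dbo.Readings (KeyHash);

-- The actual uniqueness guarantee still comes from a constraint on the columns themselves.
ALTER TABLE dbo.Readings
    ADD CONSTRAINT UQ_Readings_Key5 UNIQUE (Col1, Col2, Col3, Col4, Col5);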
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
January 21, 2016 at 5:46 am
Thank you Hugo Kornelis and Grant Fritchey
Based on your responses, I think I found what I need to do to overcome this problem.
January 21, 2016 at 6:34 am
salardx (1/21/2016)
Thank you Hugo Kornelis and Grant Fritchey. Based on your responses I think I found what I need to do to overcome this problem.
You've been given some great guidance from several experienced and talented folks. Would you mind sharing which direction you're thinking of going in case it helps other readers?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
January 21, 2016 at 7:22 am
1) I would avoid using MD5 for uniqueness; see for example http://crypto.stackexchange.com/questions/1434/are-there-two-known-strings-which-have-the-same-md5-hash-value
2) I hope you are computing your hash outside SQL Server
3) It would then be more efficient and faster to do the load with a unique index on the hash column.
4) I bet you can get faster bulk inserts using more than 1000 rows per explicit transaction
5) What do you want to happen when there IS a duplicate value??
6) Are you 100% CERTAIN you meet all the requirements to get minimally logged bulk inserts?? This is an absolute must for optimal performance.
7) Note that you CAN get minimally logged INSERTs on tables with either nonclustered or clustered indexes. But to keep it that way for both index and data pages you need to do all 50M rows in one transaction; otherwise, from the second batch onward only the data pages are minimally logged.
8) It can be MUCH faster in some cases to insert all data into a heap table and then create your clustered and/or nonclustered index(es) (see the sketch after this list). Not sure, but my not-enough-coffee-yet brain says this would be more likely when the index is unique. Also, doing this will give you the smallest index structures if you build the index with a 100% fill factor.
9) Speaking of index size, a clustered index for a 50M row table adds a VERY small amount of additional space for the btree levels above the leaf.
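To make points 6 and 8 concrete, here is a rough sketch with hypothetical object names: bulk load into a heap under a TABLOCK hint (one of the requirements for minimal logging, along with the SIMPLE or BULK_LOGGED recovery model), then build the unique clustered index once at the end with a 100% fill factor.

-- Minimally loggable insert into a heap: table lock taken explicitly, target has no indexes yet.
INSERT INTO dbo.Readings_Heap WITH (TABLOCK)
        (Col1, Col2, Col3, Col4, Col5, Payload)
SELECT  Col1, Col2, Col3, Col4, Col5, Payload
FROM    dbo.Readings_Source;   -- stands in for whatever the actual bulk source is

-- Build the unique clustered index afterwards; 100% fill factor gives the smallest structure.
CREATE UNIQUE CLUSTERED INDEX UQ_Readings_Heap_Key
    ON dbo.Readings_Heap (Col1, Col2, Col3, Col4, Col5)
    WITH (FILLFACTOR = 100);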
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service