Performance & uniqueness

  • Hi guys,

    There's a situation that requires holding about 50,000,000 records in a table.

    While bulk-inserting into the table, we need to check the uniqueness of a combination of about 5 columns (usually no more than 100 bytes in total).

    So I was looking for a solution that provides both performance and uniqueness, such as adding an ID char(16) column that holds an MD5 hash of those columns.

    Is that a good idea, or will it hurt performance? Do you have a better approach for this situation?

    Thank you

  • What about just using a unique constraint?
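
    For instance, a minimal sketch assuming a hypothetical table dbo.Archive whose five key columns are Col1 through Col5 (substitute your real names and types):

        -- One unique constraint across the five columns; SQL Server then rejects
        -- any insert that would duplicate an existing combination of values.
        ALTER TABLE dbo.Archive
            ADD CONSTRAINT UQ_Archive_FiveCols
            UNIQUE (Col1, Col2, Col3, Col4, Col5);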

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    I didn't quite get you. Do you mean setting all those 5 columns as the primary key?!

    If so, I think that when it comes to checking 50,000,000 records, that's a terrible idea from a performance perspective!

    We just need to check the uniqueness of one column (which represents the uniqueness of the 5 columns)

    Will you please elaborate, perhaps by giving an example?

    I don't understand what you're going for. When you say you want to "check five columns' uniqueness by checking one column" you're stepping outside of standard mechanisms.

    What I'm suggesting is quite simply creating a unique index based on the five columns. That will enforce unique values. If you mean that you want to take five columns on 50 million rows and then, based on the values of those five columns, query all 50 million rows to check whether there are duplicates, then yeah, that's going to be a performance nightmare. Instead, the traditional data definition would be to say "These five columns define a unique row, so we will put a unique constraint/index on them." It doesn't have to be a primary key, although it can be. Whether or not that's a good idea depends on how these columns will be used for data retrieval as well as their business definition of uniqueness.

    If you're suggesting having a calculated column that combines your five columns and then putting a unique index on that, it's not likely to be faster than having an index on the five columns. The width of the key will be the same as the five-column key. If you're suggesting creating a hash value of the five columns and then indexing the hash value, that might be a lot faster, depending on how fast your hash calculations are, but you also have to deal with the fact that the hash is unlikely to be unique, so you'll have to handle hash collisions.
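
    A minimal sketch of that second approach, reusing the hypothetical dbo.Archive table and made-up column names from above (HASHBYTES and persisted computed columns are standard T-SQL; the rest is illustrative):

        -- Persisted computed column holding a hash of the five key columns.
        -- An index on it gives a narrow probe for "does this combination already
        -- exist?", but because hashes can collide, the real uniqueness guarantee
        -- still has to come from the constraint on the five columns themselves.
        ALTER TABLE dbo.Archive
            ADD KeyHash AS CAST(HASHBYTES('SHA2_256',
                    CONCAT(Col1, '|', Col2, '|', Col3, '|', Col4, '|', Col5))
                AS binary(32)) PERSISTED;

        CREATE INDEX IX_Archive_KeyHash ON dbo.Archive (KeyHash);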

    In general, if X number of columns makes a row unique, then, assuming we're not running into index key size limits, those columns are best defined as a unique constraint/index.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for the elaboration

    But there are some problems with creating a unique index based on five columns:

    Firstly, I'm dealing with a huge amount of data that has to be bulk-loaded into the database. Having an index on the table will definitely slow down the insert process!

    Secondly, as you mentioned, it requires more storage! Imagine I have 10,000 tables in a database, each of which can have up to 50 million records! If I create such an index on each table, the database size will be about 150 GB.

    Therefore we need a better solution that considers insert performance, uniqueness checking, and storage.

  • 150GB isn't a huge amount of storage these days. But in your original post, you were talking about "a table", and now you're talking about 10,000. Do all those tables have the same five columns to be made unique? If so, I'd say your database design is a much bigger problem than insert performance or additional disk space. A unique index is the correct way to implement your requirement. If that doesn't work for you, consider using a tool other than a relational database.

    John

  • The very best way to deal with scale is start with the fundamentals. Making good choices on the clustered index, primary key (not the same things), unique constraints, normalization, data types, etc., are the way to deal with performance issues right at the start of building your systems. You're telling me that you have a structure that is defined as unique, but instead of using standard practices, you want to do something abnormal, at the start, in preparation for performance problems that you haven't hit yet. I'm sorry, no, I just don't support that method of design and programming.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    150 GB is considered huge if you add that much every month!

    Look, there's no other choice for designing a perfect structure in this situation.

    There are loads of huge files (each up to 200 MB), and I wrote a C# program to store them in a SQL archive each month!

    My software generates a new database every month! Without indexes, it comes to about 30 GB per month.

    After shredding and loading the files into the monthly archives, the software needs to extract some aggregated reports for the owners! So having 10,000 tables is the best approach!

    Can you help with hashing, to get better performance? What tools (other than SQL Server) would you suggest?

  • salardx (1/21/2016)


    the software needs to extract some aggregated reports for the owners! so having 10,000 tables is the best approach!

    Please can you explain the causal link between the need to extract reports and having 10,000 tables being the best option? The opposite is true, in my experience - if you need to run reports, you need a properly normalised database design.

    Do any of the files that you load have the same structure? If so, why not put them into the same table? And why create a new database each month?

    I'm afraid I don't know much about non-SQL databases - somebody else may be able to answer that better than I can.

    John

  • Do any of the files that you load have the same structure? If so, why not put them into the same table? And why create a new database each month?

    Because, based on my experience, there are the following problems:

    1- If some bytes of the main file (mdf) get corrupted for any reason (mostly hardware problems), there's a high probability that the whole table becomes unavailable! I can't take that risk!

    2- Aggregating smaller tables would be much easier and faster! (I create smaller tables based on the more frequent queries, which sounds logical)

    3- Joining smaller tables (with some extra tables) would be faster without any "WHERE" clause!

    4- If I create just a single table for each month, after a while there will be some huge old tables that I don't even need to keep! The analysis on them has already been done! I just need to save them somewhere else as a backup of the files, so it becomes another burden to cut them out of the main database.

    Overall, there will be very rare situations where I end up using "UNION" between such structures!

    I think we'd better focus on my main question! How can I check the uniqueness of five columns quickly?

  • salardx (1/21/2016)


    1- If some bytes of the main file (mdf) get corrupted for any reason (mostly hardware problems), there's a high probability that the whole table becomes unavailable!

    Nope. If a page gets corrupt, that page is unreadable, not the entire table. Corruption is rare, and anyway that's one reason why we take backups and do integrity checks
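
    For reference, this is the shape of the standard tooling for that (hypothetical database name, page id and backup paths, just for illustration):

        -- Regular integrity check:
        DBCC CHECKDB (N'ArchiveDb') WITH NO_INFOMSGS, ALL_ERRORMSGS;

        -- If CHECKDB reports, say, page 1:12345 as damaged, only that page needs
        -- to be restored from backup, not the whole table or database:
        RESTORE DATABASE ArchiveDb PAGE = '1:12345'
            FROM DISK = N'X:\Backups\ArchiveDb_full.bak'
            WITH NORECOVERY;
        -- ...restore the subsequent log backups (including a tail-log backup) WITH NORECOVERY...
        RESTORE DATABASE ArchiveDb WITH RECOVERY;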

    2- Aggregating smaller tables would be much easier and faster!

    Nope. If correctly indexed, large tables aggregate very well. The only time your statement is correct is when you intend to have no indexes at all on a table

    3- Joining smaller tables (with some extra tables) would be faster without any "WHERE" clause!

    Again, no. If properly indexed, joins with WHERE clauses are just as fast as (or faster than) joins without.
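
    As a rough illustration of what "properly indexed" buys you (hypothetical table and column names): an index whose leading key matches the filter lets the WHERE clause become a seek, so an aggregate over a 50-million-row table only reads the slice it needs.

        -- Hypothetical reporting table: the index below lets the query seek to
        -- one month's rows instead of scanning everything.
        CREATE INDEX IX_MonthlyData_MonthOwner
            ON dbo.MonthlyData (ReportMonth, OwnerId)
            INCLUDE (Amount);

        SELECT OwnerId, SUM(Amount) AS TotalAmount
        FROM dbo.MonthlyData
        WHERE ReportMonth = '2016-01-01'      -- index seek, not a full scan
        GROUP BY OwnerId;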

    You're proposing a very strange, non-standard design here, based on reasons that just aren't true on well-designed systems and on (probably) untested assumptions. This is generally a recipe for a system that's unmanageable, hard to modify, hard to administer or all of the above.

    The best way to get a well-performing, scalable system is to get the fundamentals right and the design correct according to the currently understood best practices, and only to look for non-standard solutions if testing has shown that the standard solutions don't work, the non-standard one does, and the extra difficulty in understanding/modifying/administering is acceptable under the specific circumstances.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • salardx (1/21/2016)


    Do any of the files that you load have the same structure? If so, why not put them into the same table? And why create a new database each month?

    Because, based on my experience, there are the following problems:

    1- If some bytes of the main file (mdf) get corrupted for any reason (mostly hardware problems), there's a high probability that the whole table becomes unavailable! I can't take that risk!

    2- Aggregating smaller tables would be much easier and faster! (I create smaller tables based on the more frequent queries, which sounds logical)

    3- Joining smaller tables (with some extra tables) would be faster without any "WHERE" clause!

    Overall, there will be very rare situations where I end up using "UNION" between such structures!

    To a large degree you're attempting to avoid using SQL Server as a relational data storage engine and instead use it as some mechanism of file storage. That's actually a horrible way to go about it. It's never going to work that well for you because that's not what it's designed to do. You really would be better off looking into a different tool than SQL Server.

    1) Data corruption is pretty rare in my experience. It does happen, but that's why we have maintenance, monitoring, backups and backup testing. Recovery is something you may need on occasion (and on occasion is every 5-10 years, not once a week unless you have severe hardware problems, and then, let's fix the hardware problems). I wouldn't suggest designing the system the way you have based on the possibility of corruption. Now, security is a good reason to split databases even if the structure is the same. Don't let people into a db that contains data for a different set of people. Horizontally partitioning for performance is another reason for having multiple databases. But not the slim possibility of corruption. I've had, since the horrible days of SQL Server 6.5, maybe 2-3 corrupt databases and maybe 12-15 corrupt indexes. The corrupt databases were a pain, but we had backups. The corrupt indexes just involved dropping and recreating them, done.

    2) Smaller tables, less data, is faster. Can't argue. But, if your design leads to 10,000 tables, then I have to push back because we're talking about non-normalized storage and abnormal approaches that are going to be difficult to maintain and code against.

    3) JOIN is faster with less data. Yep. Again, can't argue. Scans faster than range seeks... eh, maybe, depending, lots of factors there. However, you're taking away any chance that the query optimizer has to help you. No WHERE clause, scans. Always. You'll never know if there were better methods: good indexes, excellent choices for the clustered index resulting in good performance - none of that is possible.

    I'm trying to help, but I think we're missing the fundamentals here that make standard (not perfect) approaches possible. In all seriousness, if you don't want to use relational storage, use a different data management system. You'll be happier with the options available.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    Guys! Please!

    We CANNOT have many indexes on the tables (neither with one table for each month, nor with 10,000 tables!)

    because data is being bulk-loaded into the tables, 1000 records per bulk! Having indexes slows it down!

    Besides, as I said before, indexes increase the size of the database enormously!!

    Also, I have experienced database corruption many times! And at this size, if you face corruption, it takes days to restore from a backup! (Although I'm thinking about using clustered or mirrored SQL Servers)

    But please focus on my main question: is there any faster way to handle uniqueness, with a faster hashing algorithm?

  • My suggestion is to use Oracle and take this conversation to an Oracle forum.

  • My suggestion is to use Oracle and take this conversation to an Oracle forum.

    Thank you,

    But I deeply believe that Microsoft has a solution for such situations.
