This one comes up a lot on the forums, often as advice given…
“You should use integers for the key rather than strings. The joins will be faster.”
It sounds logical. Surely integers are easier to compare than large complex strings. Question is, is it true?
This is going to take lots and lots of rows to get any significant results, so I’m going to do two sets of tests. The first compares query execution speed between string and integer data types on the join column, while keeping the size of the join column the same between the two. This is to tell whether there’s a difference purely because of the data type.
The second test will have both the data type and the size of the join columns differ, while the total size of the table row is kept the same. This is to answer the ‘string joins are slower because they are larger’ argument.
Test 1: Same key size, no indexes
The two sets of tables have the same-size join column – a bigint (8 bytes) in one and a char(8) (8 bytes) in the other.
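For reference, the tables look something along these lines. This is just a sketch – the exact definitions (including how the rows are padded out) are in the downloadable repro code, and the Filler column and the SomeArbStatus data type here are assumptions.

-- Approximate structure only; the point is that the join keys are the same size:
-- bigint (8 bytes) in one pair of tables, char(8) (8 bytes) in the other.
CREATE TABLE dbo.LookupTableInt (
    ID            BIGINT NOT NULL PRIMARY KEY,
    SomeArbStatus VARCHAR(50)          -- data type assumed for illustration
);

CREATE TABLE dbo.TestingJoinsInt (
    ID            INT IDENTITY PRIMARY KEY,
    IntForeignKey BIGINT NOT NULL,     -- joins to LookupTableInt.ID
    Filler        CHAR(100)            -- hypothetical padding column
);

CREATE TABLE dbo.LookupTableString (
    ID            CHAR(8) NOT NULL PRIMARY KEY,
    SomeArbStatus VARCHAR(50)          -- data type assumed for illustration
);

CREATE TABLE dbo.TestingJoinsString (
    ID            INT IDENTITY PRIMARY KEY,
    StrForeignKey CHAR(8) NOT NULL,    -- joins to LookupTableString.ID
    Filler        CHAR(100)            -- hypothetical padding column
);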
SELECT t1.ID, t2.ID, t1.IntForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsInt t1
INNER JOIN dbo.LookupTableInt t2 ON t1.IntForeignKey = t2.ID
GO

SELECT t1.ID, t2.ID, t1.StrForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsString t1
INNER JOIN dbo.LookupTableString t2 ON t1.StrForeignKey = t2.ID
GO
First up, a check of what Statistics IO and Statistics Time show for the two, and whether there’s any difference in execution plan. (Full repro code is available for download, link at the end of the post.)
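(For anyone running the scripts themselves, the statistics output below comes from switching these on in the session before running the test queries.)

SET STATISTICS IO ON;
SET STATISTICS TIME ON;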
Int joins:
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsInt’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableInt’. Scan count 1, logical reads 735, physical reads 0.
SQL Server Execution Times:
CPU time = 2433 ms, elapsed time = 32574 ms.
String joins:
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsString’. Scan count 1, logical reads 66036, physical reads 0.
Table ‘LookupTableString’. Scan count 1, logical reads 735, physical reads 0.
SQL Server Execution Times:
CPU time = 3744 ms, elapsed time = 33947 ms.
The execution plans are the same, but that shouldn’t really be a surprise. With no nonclustered indexes there have to be table scans (or clustered index scans) and, with the result sets not ordered by the join key, a hash join is about the only join type that could be used efficiently here.
The CPU time is the interesting thing. That’s over 50% more CPU time from the string join in this run. To check that the difference is consistent and not a once-off, I’m going to run each test 10 times, use Profiler to catch the durations and CPU times, and aggregate them.
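The aggregation itself is nothing fancy. Assuming the trace is saved to a file, something along these lines does the job – the file path and the LIKE filters are placeholders, not necessarily what I used.

-- Average the CPU and duration per query form from a saved Profiler trace file.
SELECT
    CASE WHEN CAST(TextData AS NVARCHAR(MAX)) LIKE '%String%'
         THEN 'String join' ELSE 'Int join' END AS JoinType,
    AVG(CPU) AS AvgCPU_ms,                 -- CPU is reported in milliseconds
    AVG(Duration / 1000) AS AvgDuration_ms -- Duration is in microseconds in the trace file
FROM sys.fn_trace_gettable(N'C:\Traces\IntVsStringJoins.trc', DEFAULT)
WHERE CAST(TextData AS NVARCHAR(MAX)) LIKE '%TestingJoins%'
GROUP BY CASE WHEN CAST(TextData AS NVARCHAR(MAX)) LIKE '%String%'
              THEN 'String join' ELSE 'Int join' END;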
That’s a notable difference in the average CPU usage: on average, 31% more CPU from the string join than from the integer join.
Maybe an index will fix things…
Test 2: Same key size, indexes on join column
Same tables, just with a nonclustered index added on the foreign key column.
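The indexes themselves are nothing more complicated than this (the index names here are mine, not necessarily what’s in the repro code):

CREATE NONCLUSTERED INDEX idx_TestingJoinsInt_IntForeignKey
    ON dbo.TestingJoinsInt (IntForeignKey);

CREATE NONCLUSTERED INDEX idx_TestingJoinsString_StrForeignKey
    ON dbo.TestingJoinsString (StrForeignKey);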
Int joins:
Table ‘TestingJoinsInt’. Scan count 1, logical reads 4654.
Table ‘LookupTableInt’. Scan count 1, logical reads 735.
SQL Server Execution Times:
CPU time = 2043 ms, elapsed time = 30993 ms.
String joins:
Table ‘TestingJoinsString’. Scan count 1, logical reads 4654.
Table ‘LookupTableString’. Scan count 1, logical reads 735.
SQL Server Execution Times:
CPU time = 2995 ms, elapsed time = 32904 ms.
The one scan has changed to an index scan and the join type is now a merge join (the indexes provide the rows in join-key order), but the plan still has the same form (as would be expected) and there’s still a fairly substantial difference in the CPU times.
On average a 50% increase in CPU time. That’s pretty extreme.
Test 3: Same row size, no indexes
For this test, the join columns are now different sizes. (This is the second script in the attached repro code, for anyone using that.) I’m using an int in the one table and a char(24) in the other. This is probably a little more realistic: if strings are being used as keys and join columns, there’s a very good chance that they will be longer than an int would be.
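Again, just to give an idea of the structure – a sketch only, with the Filler columns and their widths assumed, sized so that the int table’s rows come out the same width as the string table’s (the real definitions are in the repro code):

CREATE TABLE dbo.TestingJoinsInt2 (
    ID            INT IDENTITY PRIMARY KEY,
    IntForeignKey INT NOT NULL,       -- 4-byte key
    Filler        CHAR(120)           -- hypothetical: 20 bytes wider to make up for the narrower key
);

CREATE TABLE dbo.TestingJoinsString2 (
    ID            INT IDENTITY PRIMARY KEY,
    StrForeignKey CHAR(24) NOT NULL,  -- 24-byte key
    Filler        CHAR(100)           -- hypothetical padding column
);

-- The lookup tables follow the same pattern: LookupTableInt2 keyed on an int,
-- LookupTableString2 keyed on a char(24).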
Straight into the testing. The queries have the same form as before; the table names are the only things that have changed.
SELECT t1.ID, t2.ID, t1.IntForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsInt2 t1
INNER JOIN dbo.LookupTableInt2 t2 ON t1.IntForeignKey = t2.ID
GO

SELECT t1.ID, t2.ID, t1.StrForeignKey, t2.SomeArbStatus
FROM dbo.TestingJoinsString2 t1
INNER JOIN dbo.LookupTableString2 t2 ON t1.StrForeignKey = t2.ID
GO
Int joins:
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsInt2’. Scan count 1, logical reads 64342, physical reads 0.
Table ‘LookupTableInt2’. Scan count 1, logical reads 685, physical reads 0.
SQL Server Execution Times:
CPU time = 2293 ms, elapsed time = 30839 ms.
String joins:
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘TestingJoinsString2’. Scan count 1, logical reads 64342, physical reads 0.
Table ‘LookupTableString2’. Scan count 1, logical reads 688, physical reads 0.
SQL Server Execution Times:
CPU time = 4290 ms, elapsed time = 36742 ms.
And we’re back to the same plan as in the first test – clustered index scans and a hash join, for the same reasons. I seem to have messed up somewhere in trying to keep the tables exactly the same size; still, a difference of 3 reads on the lookup table is hardly significant.
Just as in all the previous tests, the average CPU usage on the string join is markedly higher. This time it’s nearly 100% greater than for the int joins.
Test 4: Same row size, index on join column
In this test I’m breaking my own test rules a bit. While the table row size is the same between the two tables, the index row size is not. Still, I feel that’s fair enough, as it reflects what would be done on a real system (no one pads out indexes for no reason).
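Same kind of indexes as in the second test, just on the second set of tables (again, the index names are mine):

CREATE NONCLUSTERED INDEX idx_TestingJoinsInt2_IntForeignKey
    ON dbo.TestingJoinsInt2 (IntForeignKey);

CREATE NONCLUSTERED INDEX idx_TestingJoinsString2_StrForeignKey
    ON dbo.TestingJoinsString2 (StrForeignKey);

-- The char(24) key makes the second index a good deal wider than the first,
-- which is where the 'same row size' rule gets bent.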
Int joins:
Table ‘TestingJoinsInt2’. Scan count 1, logical reads 3407, physical reads 0.
Table ‘LookupTableInt2’. Scan count 1, logical reads 685, physical reads 0.
SQL Server Execution Times:
CPU time = 2075 ms, elapsed time = 30459 ms.
String joins:
Table ‘TestingJoinsString2’. Scan count 1, logical reads 9625, physical reads 0.
Table ‘LookupTableString2’. Scan count 1, logical reads 688, physical reads 0.
SQL Server Execution Times:
CPU time = 3775 ms, elapsed time = 34028 ms.
Same plan as the second test. I hope no one’s surprised by that.
And now we’re at over a 100% increase in average CPU time for these two. The differing row sizes (with the correspondingly different page counts) will be contributing to that, but only contributing, not causing it, since we were seeing similar increases in the earlier cases where the sizes matched.
Conclusion
Is the use of integer data types better for join columns than strings? It certainly does appear so, and not insignificantly either. Bear in mind though that what I was doing here was a bit extreme: 2.5 million rows in a query with no filters applied. This shouldn’t be something that ever gets done in a real system. So it’s not a case that YMMV(1), it’s a case that it almost certainly will. You probably will see different results, but it is definitely something worth testing when planning data types for a DB.
While this may not be the final nail in the coffin for natural keys, it is worth keeping in mind when choosing between natural and artificial keys for a system, especially one likely to process large numbers of rows, such as a decision support/data warehouse system. Test carefully with expected data volumes and expected loads if you’re considering natural keys, and decide based on the results of those tests.
Repro code: Int vs String joins
(1) YMMV = ‘Your mileage may vary’, a colloquialism meaning that you may well get different results from me.