July 10, 2009 at 3:14 am
The checksum can be considered "Good enough".
But you still will have to compensate for the duplicates; ie delete the duplicates and insert 217 new records (in your example), check them for duplicates and so on... Until exactly 1 million unique records are created.
See my pseudocode in previous post. That's one example of how to do this fast and set-based.
The first insert to a temp table uses a cross join similar to create a tally numbers table. The number of records created must be at least the number of records already existing in the Codes table, plus the number of new records wanted, to guarantee @NewRecordsWantedCount to be exact.
Then insert TOP (@NewRecordsWantedCount) into Codes table with a NOT EXISTS check for already used/stored codes.
Also, purely mathematical, there are a total of 5,079,110,400 combinations when you want 8 characters out of the original 20 and don't want duplicate character within code.
If you don't care about duplicate character within code, there are 25,600,000,000 combinations total.
So another way to do this is to pre-generate all possible combinations one and for all, and have a status column added to the table denoting "Free or taken". This way the "creation" of 1 million records would be a simple
UPDATE TOP (1000000)
Codes
SET Taken = 1
OUTPUT inserted.Code
WHERE Taken = 0
N 56°04'39.16"
E 12°55'05.25"
July 10, 2009 at 3:33 am
Peso (7/10/2009)
The checksum can be considered "Good enough".
So it's not a "BAD choice" any more then? 😀
Peso (7/10/2009)
But you still will have to compensate for the duplicates; ie delete the duplicates and insert 217 new records (in your example), check them for duplicates and so on... Until exactly 1 million unique records are created.
No. I would encourage you to run my posted script and then examine the execution plan to see how it works. I don't know where you get 217 from (there were only in fact 19 duplicates - from 280 matches on the hash - from 25K candidate codes.
Peso (7/10/2009)
See my pseudocode in previous post. That's one example of how to do this fast and set-based.
The same code exists in my script! With the hash_code optimization added.
Peso (7/10/2009)
The first insert to a temp table uses a cross join similar to create a tally numbers table. The number of records created must be at least the number of records already existing in the Codes table, plus the number of new records wanted, to guarantee @NewRecordsWantedCount to be exact.
Recall that the code-generation CLR function is expensive. Creating sufficient codes for a 1M row table is inifficient given the likelihood of collisions. My script uses an iterative approach with 'knobs and levers' to tune the query to the application. Note that more codes than requested are generated for exactly this reason. Once again, please run it.
Paul
July 10, 2009 at 3:48 am
In fact I'll do it for you. Actual execution plan attached.
Note that the @MaxGen tuning parameter is deliberately set low - at 100 - to allow the script to demonstrate multiple passes. This would not be desirable in practice; this is a demo.
Even so, also note that on subsequent passes, if a small number of target codes are required SQL Server will choose a Hash Match Flow Distinct - so a minimal number of extra codes are in fact generated.
Anyway, on to the plan:
On this run, our aim is to insert 25K new unique codes into the table containing 125K unique codes already.
The code generation part of the script generates 25,100 rows.
Two of these rows match on the hash_code, so 25,002 rows flow through the first TOP.
Both rows turn out to be real matches to existing codes.
(Note that only 2 loop join singleton lookups are done to determine the matches.)
The remaining 25,000 unique codes flow on to be inserted.
So, to insert 25K new unique codes, we generate 25,100 codes, use 25,002 of them, and perform only 2 code comparisons, the rest of the work is carried out using the hash codes.
Paul
July 10, 2009 at 3:54 am
More math on CHECKSUM.
When you calculate the CHECKSUM for a 8-character string you utilizie the full INT (32 bit) without overflowing the cyclic XOR.
It means the checksum for any given 8-character string, at least have 16,777,215 other strings giving the same checksum, if full ascii 0-255 is used.
N 56°04'39.16"
E 12°55'05.25"
July 10, 2009 at 3:59 am
Heh. Ok, so that was the version based on HashBytes using SHA1. My mistake.
The attached plan is a run using CHECKSUM.
This time 175K code existed already, we're still adding 25K new ones.
This run happened to require a second execution - note the second run uses the Flow Distinct, so to generate the missing 27 rows, only 28 codes are generated.
With actual execution plan turned off, and a warm data cache, my very ordinary single-core 2GHz Pentium 4 Mobile laptop completes the whole process of inserting 25K new unique codes into a table with 250K existing records in less than one second.
Paul
July 10, 2009 at 4:03 am
Peso (7/10/2009)
When you calculate the CHECKSUM for a 8-character string you utilizie the full INT (32 bit) without overflowing the cyclic XOR. It means the checksum for any given 8-character string, at least have 16,777,215 other strings giving the same checksum, if full ascii 0-255 is used.
And how many other strings not having the same checksum? :laugh:
Fact is, it works. Forget the theory (interesting as it is) - check the plans.
BTW the codes are VARCHAR(10), and the length varies, so not sure why the 8-character thing is significant.
July 10, 2009 at 4:42 am
Hi Peso , i pretty much agree with you psuedo-code but your statement of..
as long as the first creation guarantees unique output.
Gets things a little circular 🙂
Maybe when i get a chance ill play around with a full solution ie inserting the values to a table as a primary key, but i suspect that , iterating until total rows inserted = number required would be best. I admit that i havent even considered IGNORE_DUP_KEY , would be interesting to experiment with though.
Im keen to hear back from the OP that we have even help to solve his issues.
July 10, 2009 at 6:20 am
Something similar to this, if "random" also mean unique.
The drawback is that the numbers generated per "batch" are similar, but still unique in the batch.
DECLARE@Codes TABLE
(
Code CHAR(8) PRIMARY KEY CLUSTERED
)
DECLARE@Sample TABLE
(
ID INT PRIMARY KEY CLUSTERED,
Char1 AS ((ID / POWER(CAST(19 AS BIGINT), 7)) % 19) PERSISTED,
Char2 AS ((ID / POWER(CAST(19 AS BIGINT), 6)) % 19) PERSISTED,
Char3 AS ((ID / POWER(CAST(19 AS BIGINT), 5)) % 19) PERSISTED,
Char4 AS ((ID / POWER(CAST(19 AS BIGINT), 4)) % 19) PERSISTED,
Char5 AS ((ID / POWER(CAST(19 AS BIGINT), 3)) % 19) PERSISTED,
Char6 AS ((ID / POWER(CAST(19 AS BIGINT), 2)) % 19) PERSISTED,
Char7 AS ((ID / POWER(CAST(19 AS BIGINT), 1)) % 19) PERSISTED,
Char8 AS ((ID / POWER(CAST(19 AS BIGINT), 0)) % 19) PERSISTED,
Code CHAR(8) NOT NULL
)
DECLARE@WantedCodes INT,
@x INT
SET@WantedCodes = 333
SELECT@x = COUNT(*)
FROM@Codes
INSERT@Sample
(
ID,
Code
)
SELECTNumber,
''
FROMdbo.TallyNumbers
WHERENumber < @x + @WantedCodes
DECLARE@Token TABLE
(
ID TINYINT IDENTITY(0, 1) PRIMARY KEY CLUSTERED,
CHR CHAR(1) NOT NULL
)
DECLARE@Chars CHAR(19)
SET@Chars = '4679CDFGHJKLNPRTVXY'
INSERT@Token
(
CHR
)
SELECTSUBSTRING(@Chars, Number + 1, 1) AS CHR
FROMdbo.TallyNumbers
WHERENumber < DATALENGTH(@Chars)
ORDER BYNEWID()
UPDATEs
SETs.Code = t1.CHR + t2.CHR + t3.CHR + t4.CHR + t5.CHR + t6.CHR + t7.CHR + t8.CHR
FROM@Sample AS s
INNER JOIN@Token AS t1 ON t1.ID = s.Char1
INNER JOIN@Token AS t2 ON t2.ID = s.Char2
INNER JOIN@Token AS t3 ON t3.ID = s.Char3
INNER JOIN@Token AS t4 ON t4.ID = s.Char4
INNER JOIN@Token AS t5 ON t5.ID = s.Char5
INNER JOIN@Token AS t6 ON t6.ID = s.Char6
INNER JOIN@Token AS t7 ON t7.ID = s.Char7
INNER JOIN@Token AS t8 ON t8.ID = s.Char8
INSERT@Codes
(
Code
)
SELECTTOP (@WantedCodes)
s.Code
FROM@Sample AS s
WHERENOT EXISTS (SELECT * FROM @Codes AS c WHERE c.Code = s.Code)
SELECT*
FROM@Codes
N 56°04'39.16"
E 12°55'05.25"
July 10, 2009 at 6:45 am
Peso (7/10/2009)
Something similar to this, if "random" also mean unique.The drawback is that the numbers generated per "batch" are similar, but still unique in the batch.
Adding the missing definition for dbo.TallyNumbers, and populating with 1M rows:
CREATE TABLE dbo.TallyNumbers (Number INT NOT NULL PRIMARY KEY)
INSERT dbo.TallyNumbers (Number)
SELECTTOP (1000000)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROMmaster.sys.all_columns C1, master.sys.all_columns C2, master.sys.all_columns C3
Running the script produces:
[font="Courier New"]Msg 2627, Level 14, State 1, Line 69
Violation of PRIMARY KEY constraint 'PK__#164452B__A25C5AA6182C9B23'. Cannot insert duplicate key in object 'dbo.@Codes'.
The statement has been terminated.
[/font]
:blink:
July 10, 2009 at 6:55 am
Your TallyNumbers table must begin with 0 (zero).
N 56°04'39.16"
E 12°55'05.25"
July 10, 2009 at 7:18 am
Peso (7/10/2009)
Your TallyNumbers table must begin with 0 (zero).
Sheesh.
*exits*
July 10, 2009 at 8:34 am
If the CHECKSUM is deemed as acceptable function for hashing out values (1 chance in 16,777,215), using this code gives one chance in 3,047,466,240 to get a duplicate value!
DECLARE @Codes TABLE
(
Code CHAR(8) PRIMARY KEY CLUSTERED
)
DECLARE @Sample TABLE
(
ID INT PRIMARY KEY CLUSTERED,
Char1 AS ((ID / POWER(CAST(19 AS BIGINT), 7)) % 19) PERSISTED,
Char2 AS ((ID / POWER(CAST(19 AS BIGINT), 6)) % 19) PERSISTED,
Char3 AS ((ID / POWER(CAST(19 AS BIGINT), 5)) % 19) PERSISTED,
Char4 AS ((ID / POWER(CAST(19 AS BIGINT), 4)) % 19) PERSISTED,
Char5 AS ((ID / POWER(CAST(19 AS BIGINT), 3)) % 19) PERSISTED,
Char6 AS ((ID / POWER(CAST(19 AS BIGINT), 2)) % 19) PERSISTED,
Char7 AS ((ID / POWER(CAST(19 AS BIGINT), 1)) % 19) PERSISTED,
Char8 AS ((ID / POWER(CAST(19 AS BIGINT), 0)) % 19) PERSISTED
)
DECLARE @WantedCodes INT
SET @WantedCodes = 333
INSERT @Sample
(
ID,
Code
)
SELECT Number,
''
FROM dbo.TallyNumbers
WHERE Number < @WantedCodes
DECLARE @Token TABLE
(
ID TINYINT IDENTITY(0, 1) PRIMARY KEY CLUSTERED,
CHR CHAR(1) NOT NULL
)
DECLARE @Chars CHAR(19)
SET @Chars = '4679CDFGHJKLNPRTVXY'
INSERT @Token
(
CHR
)
SELECT SUBSTRING(@Chars, Number + 1, 1) AS CHR
FROM dbo.TallyNumbers
WHERE Number < DATALENGTH(@Chars)
ORDER BY NEWID()
INSERT Codes (Code)
SELECT t1.CHR + t2.CHR + t3.CHR + t4.CHR + t5.CHR + t6.CHR + t7.CHR + t8.CHR
FROM @Sample AS s
INNER JOIN @Token AS t1 ON t1.ID = s.Char1
INNER JOIN @Token AS t2 ON t2.ID = s.Char2
INNER JOIN @Token AS t3 ON t3.ID = s.Char3
INNER JOIN @Token AS t4 ON t4.ID = s.Char4
INNER JOIN @Token AS t5 ON t5.ID = s.Char5
INNER JOIN @Token AS t6 ON t6.ID = s.Char6
INNER JOIN @Token AS t7 ON t7.ID = s.Char7
INNER JOIN @Token AS t8 ON t8.ID = s.Char8
N 56°04'39.16"
E 12°55'05.25"
July 20, 2009 at 8:17 am
My Problem is that my base table has 130 Million codes
and I have to check if the created 100 Million codes a unique
against my 130 Millions codes and itself.
Therefore all runs very slow
July 20, 2009 at 8:39 am
scziege (7/20/2009)
My Problem is that my base table has 130 Million codesand I have to check if the created 100 Million codes a unique
against my 130 Millions codes and itself.
Therefore all runs very slow
Welcome back , so which of the multiple hints and suggestions given to you have you tried ?
July 21, 2009 at 12:25 am
Every hint, the problem ist that I have to check wether the key exists
in a table of 130 Million rows and after that I have to create 100 Million
unique keys.
Regards
Viewing 15 posts - 31 through 45 (of 50 total)
You must be logged in to reply to this topic. Login to reply