July 8, 2009 at 1:16 am
Thank's for all the good hints:
I modified my UDF an now I can show the code:
public static SqlString RNGCharacterMask(int codesize, string characters)
{
int maxSize = codesize;
int minSize = codesize;
char[] chars = new char[20];
string a;
a = characters;
chars = a.ToCharArray();
int size = maxSize;
byte[] data = new byte[1];
RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider();
crypto.GetNonZeroBytes(data);
size = maxSize;
data = new byte;
crypto.GetNonZeroBytes(data);
StringBuilder result = new StringBuilder(size);
foreach (byte b in data)
{ result.Append(chars); }
return result.ToString();
}
Hope that clearify my problem a little bit.
Thank's in advance
July 8, 2009 at 1:37 am
Ouch.
I would modify the CLR Udf to returns a data set of '@amount' Codes and store those to a temp table.
Then delete duplicates with the codes table and re-execute the UDF returning the rowcount from the delete.
Repeat until there are no duplicate rows.
This will minimize the amount of round trips to the udf , plus the SQL logic is much more set based.
Ive never used the RNGCryptoServiceProvider class but you may only have to have one instance of if it if GetBytes returns a different value at each call.
As another thought though does all this function do return a string of variable(edit: random) characters from '34679CDFGHJKLNPRTVXY' ?
If so a pure TSQL solution should be pretty simple
Post back if this isn't clear
July 8, 2009 at 3:01 am
Can you please provide me a sample TSQL
Application which replaces the UDF Function ?
ThanK's
July 8, 2009 at 3:37 am
try this , ill be interested to hear how it performs in comparison.
Note that the case statement on the call to GetRandomCode is required otherwise you will get the same value returned.
On my dev server it creates 15,000,000 rows in about 11 mins, cutting down the size of the union to your limited set should improve matters.
CREATE TABLE dbo.Numbers (Num INT NOT NULL PRIMARY KEY CLUSTERED);
GO
DECLARE @i INT;
SELECT @i = 1;
WHILE @i <= 10000
BEGIN
INSERT INTO dbo.Numbers(Num) VALUES (@i);
SELECT @i = @i + 1;
END;
Create View VwNewId
as
Select New_Id = newid()
go
Drop Function GetRandomCode
go
Create Function GetRandomCode(@CharsNeeded integer)
returns table
as
return
(
with cteCharsNeeded(Character)
as(
select '1' union all select '2' union all select '3' union all select '4' union all select '6' union all select '7' union all select '8' union all select '9' union all Select '0' union all
select 'A' union all select 'B' union all select 'C' union all select 'D' union all select 'E' union all select 'F' union all select 'G' union all select 'H' union all Select 'I' union all
select 'J' union all select 'K' union all select 'L' union all select 'M' union all select 'N' union all select 'O' union all select 'P' union all select 'Q' union all Select 'R' union all
select 'S' union all select 'T' union all select 'U' union all select 'V' union all select 'W' union all select 'X' union all select 'Y' union all select 'Z' union all
select 'a' union all select 'b' union all select 'c' union all select 'd' union all select 'e' union all select 'f' union all select 'g' union all select 'h' union all select 'i' union all
select 'j' union all select 'k' union all select 'l' union all select 'm' union all select 'n' union all select 'o' union all select 'p' union all select 'q' union all select 'r' union all
select 's' union all select 't' union all select 'u' union all select 'v' union all select 'w' union all select 'x' union all select 'y' union all select 'z'
),
ctenumbers(num)
as
(
Select Num from numbers
where Num <= @CharsNeeded
),
cteRandomChars(num,c)
as
(
select num,c = chars.c
from ctenumbers cross apply (select top 1 Character from cteCharsNeeded,vwNewId where num = num order by new_id) as chars(c)
)
select (
select c as [text()]
from cteRandomChars
for xml path('')) as Random
)
go
drop table #res
go
create table #res
(
random char(8)
)
insert into #res
select Random from sysobjects cross apply GetRandomCode(case when id is not null then 8 else null end)
July 8, 2009 at 5:30 am
Ok give this a try...
1,000,000 random strings in 24Seconds
CREATE TABLE dbo.Numbers (Num INT NOT NULL PRIMARY KEY CLUSTERED);
GO
DECLARE @i INT;
SELECT @i = 1;
WHILE @i <= 1000
BEGIN
INSERT INTO dbo.Numbers(Num) VALUES (@i);
SELECT @i = @i + 1;
END;
go
update statistics Numbers with fullscan
go
drop View VwNewCheck
go
Create View VwNewCheck
as
Select abs(checksum(NewId())) as New_Id
go
Drop Function GetRandomCode
go
Create Function GetRandomCode(@StrLen integer,@CharsNeeded char(62))
returns table
as
return
(
with cteRandomChars(num,c)
as
(
Select Num,substring(@CharsNeeded,(Select VwNewCheck.new_id%(len(@CharsNeeded)-1)+1 from VwNewCheck where num = num ),1)
from numbers
where Num <= @StrLen
)
select (
select c as [text()]
from cteRandomChars
for xml path('')) as random
)
go
create table #random
(
random char(8)
)
insert into #random
select top 15000000 Random from sysobjects cross apply sysobjects so1 cross apply GetRandomCode(case when so1.id is not null or sysobjects.id is not null then 8 else 0 end,'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')
select @@rowcount
July 8, 2009 at 6:47 am
Once you have a way of generating many codes all at once, you need a way to efficiently check for duplicates, and to reach your target for new unique codes.
The script below illustrates some techniques. I wrote this before seeing Dave Ballantyne's posts, so I use a different method to generate 'random' codes.
This site will ruin the formatting, so I have attached it as a file as well.
Paul
-- Safe database
USE tempdb;
GO
-- Drop our test table if it exists
IF OBJECT_ID(N'dbo.UniqueCodes', N'U') IS NOT NULL DROP TABLE dbo.UniqueCodes;
GO
-- Create the test table
CREATE TABLE dbo.UniqueCodes
(
codeVARCHAR(10) NOT NULL UNIQUE,-- our unique codes
hash_codeINTEGER NOT NULL,-- a hash of the unique code to optimize lookups
);
GO
-- Create an index on the hash code
CREATE NONCLUSTERED INDEX nc1 ON dbo.UniqueCodes (hash_code);
--
-- Main script start
-- (you may have to run this several times before unique code collisions start to occur)
--
DECLARE@RowsToAdd BIGINT,-- Unique codes to add to the table
@MaxGen BIGINT,-- Maximum number of random codes to generate (some may already exist)
@MaxPasses INT, -- Maximum number of passes to make before giving up
@Magic BIGINT, -- A small number of extra rows to cope with generated duplicates and duplicates in the table (optimization, not critical)
@pass INT-- The current pass
--
-- Change these values to experiment
--
SELECT@RowsToAdd = 25000,-- We want to add these many codes
@MaxGen = 25100,-- Prepared to generate up to this many codes on each pass
@Magic = 25,-- This magic number can help produce a plan with a flow distinct if not many rows are needed
@MaxPasses = 5,-- Prepared to try five times
@pass = 0-- First pass (don't change this)
-- Optional
-- UPDATE STATISTICS dbo.UniqueCodes;
-- Loop while we have unique codes to find, and have passes remaining
WHILE(@RowsToAdd > 0 AND @pass 0.2 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +
CASE WHEN RAND() > 0.3 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +
CASE WHEN RAND() > 0.4 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +
CASE WHEN RAND() > 0.5 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +
CASE WHEN RAND() > 0.6 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +
CASE WHEN RAND() > 0.7 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +
CASE WHEN RAND() > 0.8 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +
CASE WHEN RAND() > 0.9 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END
FROMmaster.sys.columns C1, master.sys.columns C2, master.sys.columns C3
),
A (data) AS
(
-- Generate the hash code (a simple CHECKSUM will do here)
-- add filter out any duplicate codes we generated
SELECTDISTINCT
TOP (@RowsToAdd + @Magic)
N.data
FROMRandom N
),
B (data, hash_code, may_clash) AS
(
-- Create a flag which is set if the code may already exist in the target table
-- If the flag is zero, the code definitely does not already exist, so we can add it
SELECTTOP (@RowsToAdd + @Magic)
data,
CHECKSUM(data),
CASE WHEN EXISTS (SELECT * FROM dbo.UniqueCodes U WHERE U.hash_code = CHECKSUM(A.data)) THEN 1 ELSE 0 END
FROMA
)
-- Add a maximum of @RowsToAdd rows from CTE 'B'
-- The flag allows the optimizer to use start-up filters to optimize lookups on the existing data
INSERT dbo.UniqueCodes (code, hash_code)
SELECTTOP (@RowsToAdd)
B.data, B.hash_code
FROMB
WHEREmay_clash = 0 -- Safe to add if the flag is zero
OR(
-- If the flag is set, we need to check the code explicitly, since
-- the hash shows a possible match
may_clash = 1
AND NOT EXISTS
(
-- Can add if the code does not already exist
SELECT*
FROMdbo.UniqueCodes U
WHEREU.code = B.data
)
)
OPTION(RECOMPILE);-- The optimal plan may change significantly between calls
-- Adjust the number of codes we need to find, and increment the pass counter
SELECT@RowsToAdd = @RowsToAdd - @@ROWCOUNT,
-- Informational: pass number & rows remaining
IF @RowsToAdd > 0 SELECT [Pass #] = @pass, [Rows Remaining] = @RowsToAdd;
END;
-- Error if we failed
IF@RowsToAdd > 0 RAISERROR('Failed to add the required number of codes - %i remain', 16, 1, @RowsToAdd);
GO
SELECT [Total Unique Codes] = COUNT_BIG(*) FROM dbo.UniqueCodes;
July 10, 2009 at 12:32 am
Hi,
While your table has more then 100 mil. why you don't try to crete triggers and one another table. Changes on primary table such as delete or insert should update row on another table.
The new tabel can have only one column type of int or long int and on this column can be saved the total number of rows in primary table. After that you don;t need to scan all 100 mil rows.
July 10, 2009 at 12:47 am
It's not the function that takes time here.
It's the LOOP. See this row of code
if ((select count(code) from codes where code = @tempcode) = 0)
For each created code, you COUNT the Codes table if the code already exists. And THAT will take a LONG TIME.
Try this instead
IF NOT EXISTS (SELECT * FROM Codes WHERE Code = @TempCode)
And you will be better off using IGNORE_DUP_KEY on your table and insert all created tempcodes. If successful, the @@ROWCOUNT will be 1 and @@ERROR = 0, if not successful @@ROWCOUNT = 0 AND @@ERROR 0.
Then you can keep track of the successful inserts and increment a variable.
Rewrite the code and make it work.
N 56°04'39.16"
E 12°55'05.25"
July 10, 2009 at 12:52 am
Try this
CREATE TABLE [dbo].[Codes](
[Code] [char](10) NOT NULL,
CONSTRAINT [PK_Codes] PRIMARY KEY CLUSTERED
(
[Code] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = ON, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
ALTER PROCEDURE [dbo].[CreateCodes]
(
@amount bigint
)
AS
SET NOCOUNT ON
set @amount = abs(@amount )
while @amount > 0
begin
insertcodes
(
code
)
SELECT [dbo].[RNGCharacterMask](8, '34679CDFGHJKLNPRTVXY')
IF @@ERROR = 0 AND @@ROWCOUNT = 1
SET @Amount = @Amount - 1
end
N 56°04'39.16"
E 12°55'05.25"
July 10, 2009 at 1:37 am
Peso (7/10/2009)
It's not the function that takes time here.It's the LOOP.
That has already been covered in a previous post.
It rather misses the point anyway - there is no need for a loop, a set-based solution will be much faster (especially since the expensive random-generating class is instantiated for each call). That has also already been covered by several people.
Peso (7/10/2009)
And you will be better off using IGNORE_DUP_KEY on your table and insert all created tempcodes
There is are some important caveats to this. (Many people avoid IGNORE_DUP_KEY for these, and other, reasons)
First, the IGNORE_DUP_KEY should be set on a clustered index only. It would be unwise to change the definition to make the primary key non-clustered. Adding a future non-clustered index with IGNORE_DUP_KEY would produce the same effect: the query plan changes dramatically, making it cheaper to check for and eliminate potential duplicates in the query rather than relying on the engine to handle the duplicates. See this blog entry by Craig Freeman for full details.
The second caveat is to note that IGNORE_DUP_KEY only works with INSERT statements. Any UPDATE statement that would result in a duplicate key will raise an error and roll back as normal.
Third, one needs to be aware that if a key does not exist in the table, but the INSERT set contains duplicates for that key value, one row will be inserted, and one will be rejected. Exactly which row is inserted is non-deterministic. One could argue that this breaks the normally atomic behaviour of a transaction where either all rows are inserted or none are.
Fourth, you should be aware that some client APIs misinterpret SQL Server's 'duplicate key was ignored' warning as an error.
Paul
Given that the above considerations are neither well documented, nor well-known, it would make sense to highlight these issues when recommending a solution based on IGNORE_DUP_KEY.
July 10, 2009 at 1:46 am
Yes, I have read the previous posts.
But why suggest an alternative that returns an error saying "Sorry mate, couldn't create as many codes as you wanted."
This behaviour wasn't there before and why introduce it at all?
OP just wants the stored procedure to create the number of codes he wants with no fuzz.
I admit the set-based way to create the codes is WAY more efficient but doing this, you also need to take care of the duplicates in the temp table.
One way or another, you're bound to check for number of unique records created in one way or another.
And the CHECKSUM approach for hash value? What can I say more than it's a BAD choice.
See here how checksum is calculated. It will provide very many false positives... CHECKSUM('bQ') = CHECKSUM('aa') and CHECKSUM('ABA') = CHECKSUM('ACQ')
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832
N 56°04'39.16"
E 12°55'05.25"
July 10, 2009 at 2:19 am
Or this pseudo code..
INSERT INTO #Temp (Code)
SELECT ... (record count equals the existing number of codes in Codes table plus the new wanted number).
insert into Codes
SELECT TOP (@wantedrecords)
from #Temp AS t
where not exist (select * from codes as c where c.code = t.code)
It only requires two set-based operations and will always work without any errors, as long as the first creation guarantees unique output.
N 56°04'39.16"
E 12°55'05.25"
July 10, 2009 at 2:19 am
Peso (7/10/2009)
But why suggest an alternative that returns an error saying "Sorry mate, couldn't create as many codes as you wanted." This behaviour wasn't there before and why introduce it at all?
Because it is a script to illustrate technique, not a cut-and-paste solution, as I stated explicitly in that post.
Peso (7/10/2009)
I admit the set-based way to create the codes is WAY more efficient but doing this, you also need to take care of the duplicates in the temp table. One way or another, you're bound to check for number of unique records created in one way or another.
Absolutely. So one might as well do it in a robust and efficient manner.
Peso (7/10/2009)
And the CHECKSUM approach for hash value? What can I say more than it's a BAD choice.
The comment in the code states that CHECKSUM is sufficient for the demo script. If you are that concerned about collisions, use HashBytes instead. I would argue that CHECKSUM is faster and eliminates sufficient joins to make it well worth doing. Try running the script (instead of making assumptions) - CHECKSUM is a perfectly usable hash function.
Please also see my comments regarding your choice of IGNORE_DUP_KEY in my previous post.
Paul
July 10, 2009 at 2:36 am
Hi Peso ,
I think you are being a little unfair..
checksumming the newid value is pretty safe due to the size (in terms of bytes) or the uniqueidentifier type. You could reverse engineer 2 ids to have the same checksum , but after abs(checksum) 1 million newid()'s i have 217 collisions. I think generating an 8 character random string with this method is as random as any method can be. Having populated 2million rows with my test script , i only have 1 duplicate string. I think that is is 'fit for purpose'.
July 10, 2009 at 3:12 am
Hey Dave,
Your logic also applies to the hash_code calculated on my UniqueCodes table as well - since the hash is for the code column which is VARCHAR(10).
In my current UniqueCodes table I have 845K rows. Duplicate hash values based on the CHECKSUM exist for 29K hash keys, which is 3%. Seems reasonable.
In a new run, 25K potentially unique codes generate 280 matches on the hash key, with 19 codes turning out to be unique. That compares well to the alternative of doing a join between those 25K rows and the 845K rows in the main table.
Paul
Viewing 15 posts - 16 through 30 (of 50 total)
You must be logged in to reply to this topic. Login to reply