Code generation

  • Absolutly Great job

  • Lowell (5/11/2010)


    Jeff Moden (5/11/2010)


    1.6 million random 8 character codes that no one can guess and they're guaranteed to be unique within the set... takes about 25 seconds on my 8 year old desktop...

    you must have a monster "old" desktop; I've got a decent 2.8Gig developer machine and it took 5 times that long; 01:29 to run your code,01:28 on second run.

    25 seconds....your old developer machine plugged into a SAN? or did you use DBCC TIMEWARP with this?

    While it's true that I keep the flux capacitor on the hardware version of DBCC TIMEWARP loaded up with fresh hampster poo for the ultimate in "flash warping" when bit 77 is set, I forgot to mention that the SELECT * FROM #MyHead was not included in that time period.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • scziege (5/11/2010)


    Absolutly Great job

    I love a satisfied customer. Thanks for the feedback. Most folks on this thread had the right idea to begin with...

    Now... a favor from you, please. Could you tell us why and what you need these for and why a simple IDENTITY column wouldn't have been good enough? It would really be nice to know the details behind the business decisions for things like this. Thanks for your time.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, that is an interesting take on Random..... :w00t:

    The first ten codes in the sample I tried...

    10000180

    10000F52

    10001875

    10001D17

    1000221F

    10003478

    100035E7

    10003763

    10006415

    100081B2

    Is there some bias towards groups of characters in that solution ....

    Or am I seeing it wrong?

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (5/11/2010)


    Jeff, that is an interesting take on Random..... :w00t:

    The first ten codes in the sample I tried...

    10000180

    10000F52

    10001875

    10001D17

    1000221F

    10003478

    100035E7

    10003763

    10006415

    100081B2

    Is there some bias towards groups of characters in that solution ....

    Or am I seeing it wrong?

    Heh... good question, actually. The "bias" isn't really "bias" It's because we have 1.6 million random numbers starting at 10000000 because I didn't want to have any numbers that began with leading zeros. The other "bias" is only because the PARTITION clause sorts the final output. If we had sorted the final output using the whole of NEWID() for 1.6 million rows, the first 8 characters would form a similar "pattern" of ascent.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Magoo... here's what I'm talking about... if I take out the partitioning that supports the dupe check, then you probably get more like what you're expecting...

    WITH

    cteFirstGen AS

    ( --=== Gen enough 8 character random codes to cover possible dupes for 1.6 million codes

    SELECT TOP 2000000 LEFT(NEWID(),8) AS RandomCode

    FROM Master.sys.All_Columns ac1

    CROSS JOIN Master.sys.All_Columns ac2

    )

    ,

    cteNumberDupes AS

    ( --=== Number the codes so we can select unique random codes

    SELECT --ROW_NUMBER() OVER (PARTITION BY RandomCode ORDER BY (SELECT NULL)) AS Occurance,

    RandomCode

    FROM cteFirstGen

    WHERE RandomCode > '10000000'

    ) --=== Select 1.6 million unique random codes only

    SELECT TOP 1600000 RandomCode

    INTO #MyHead

    FROM cteNumberDupes

    -- WHERE Occurance = 1

    SELECT * FROM #MyHead

    The problem is that there's no guarantee that there's no dupe. You can use a DISTINCT if you don't mind waiting the extra time... it will double the amount of time it takes to execute this way....

    WITH

    cteFirstGen AS

    ( --=== Gen enough 8 character random codes to cover possible dupes for 1.6 million codes

    SELECT TOP 2000000 LEFT(NEWID(),8) AS RandomCode

    FROM Master.sys.All_Columns ac1

    CROSS JOIN Master.sys.All_Columns ac2

    )

    ,

    cteNumberDupes AS

    ( --=== Number the codes so we can select unique random codes

    SELECT --ROW_NUMBER() OVER (PARTITION BY RandomCode ORDER BY (SELECT NULL)) AS Occurance,

    RandomCode

    FROM cteFirstGen

    WHERE RandomCode > '10000000'

    ) --=== Select 1.6 million unique random codes only

    SELECT DISTINCT TOP 1600000 RandomCode

    INTO #MyHead

    FROM cteNumberDupes

    -- WHERE Occurance = 1

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hey Jeff,

    our customer want to make a promotion on a product package with this codes.

    The customer can buy the product and insert the code onto a web page, after that

    he will win or loose.

    But do you think the codes are safe or is it possible to guess the code, on the web page

    we use captchas.

    Thanks

    Thorsten

  • Hey,

    another stupid question. Is it possible to check the codes against a table in which I store old codes ?

    The main point is the performance we have to look.

    We have a table with 100 Million rows and should insert 5 Million rows which not exists in the database.

    Thanks in adavance

    Thorsten

  • My 2p worth:

    Jeff has shown you a framework for generating numbers.

    I personally think you could consider generating more numbers and disallowing any with any repeating digits.

    You should probably consider going case sensitive e.g. Ax9G3bgt4 to make it harder to guess.

    You could (if you wanted) generate all possible combinations of 8 alphanumeric characters (about 2.4 trillion I think unless you want case sensitivity, then it's about 200 trillion ) which would take a couple of weeks to do on a desktop pc at least. (Previous experience shows you can generate a trillion rows in about 6 days) and then pick 1.6 million values from there, which gives you pretty good odds of not being guessed.

    Or you could generate batches of random strings and add them to your pool, seeded to deliberately scatter the values so that you don't get silly ones like AAAAAAAA.

    You have to define more clearly the parameters of the request to get a really solid answer.

    Having just seen your other question, I say again - tell us the whole requirement.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Great explanation, Jeff! That's one thing I wanted to ask.

    It's strange to see that partition by seems to be faster than distinct. I would have expected to be equally fast (or slow).

    Just out of curiosity, how did you come up with this? I suppose this is something you already had in your toolbox.

    -- Gianluca Sartori

  • Gianluca Sartori (5/12/2010)


    Great explanation, Jeff! That's one thing I wanted to ask.

    It's strange to see that partition by seems to be faster than distinct. I would have expected to be equally fast (or slow).

    Just out of curiosity, how did you come up with this? I suppose this is something you already had in your toolbox.

    I'm not 100% sure of the "internals" but it seemed logical to me that the partition would do a single sort whereas DISTINCT has to check each value against a range of other values and then, if within the range, check for the actual value. I'm no expert on the internals so rather than guess, I tried both. The partitioning won. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • scziege (5/12/2010)


    Hey Jeff,

    our customer want to make a promotion on a product package with this codes.

    The customer can buy the product and insert the code onto a web page, after that

    he will win or loose.

    But do you think the codes are safe or is it possible to guess the code, on the web page

    we use captchas.

    Thanks

    Thorsten

    With random numbers with such a relatively small domain, someone could get lucky with a guess. If you were to use a full NEWID(), instead, it would be like someone guessing the exact linear mile of travel you were in across a diameter of something like 17 quadrillion Milkyway galaxies.

    It would be a good test, as well. If you can't successfully and correctly enter a 36 character code, you don't deserve to win. 😛

    Maggo is correct... if you're going to use these codes for such a contest, codes with repeating adjacent characters should be eliminated and to prevent "mistrakes", things like lower case "L" and "1" and upper case "0" and zero should be eliminated from the mix. That's actually a lot easier to do than what most people would think although it does take a little extra time to generate. I'll play with this tonight because it IS an interesting problem. I've got a "random password generator" bit of code that I can steal a bit of code from.

    Someone will likely beat me to it, though... I've got to get to work. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • scziege (5/12/2010)


    Hey Jeff,

    our customer want to make a promotion on a product package with this codes.

    The customer can buy the product and insert the code onto a web page, after that

    he will win or loose.

    But do you think the codes are safe or is it possible to guess the code, on the web page

    we use captchas.

    Thanks

    Thorsten

    Heh... I guess there were no takers while I was at work. Ok, no problem. We've got this covered.

    Thorsten, I don't know what your company makes or sells but I'm thinking they owe me a bit for this one. 😉 Ask them if they can send me a couple of boxes of whatever they make. I'm sure that I can find someone or a homeless shelter in the area that needs whatever they sell. Or, if they want, they can make a donation to their favorite charity in my name, please.

    The whole thing takes 00:01:40 to run on my 8 year old single P4 CPU desktop and does a lot of things mentioned on this thread. Because random letters frequently spell out swear words, I added a "fix" for that little problem, as well. I also have it pick the "winning" code and store it in a table as well as storing the 1.6 million random codes in another table. If you need more than 1 winning code, then change the TOP 1 to TOP n where "n" is the number of winning codes you want the machine to pick. The winning code table also has a "Place" number in case of such an eventuality. The random code table also has an IDENTITY column so whoever prints up inserts can keep track of what they printed.

    We could check against the previous random code list you have but I'm going to hold out for a couple of more boxes of whatever your company sells for a homeless shelter. 😉

    Here's the code... as usual, the details are documented in the code itself...

    --===== Conditionally drop the permanent random code and winner tables

    -- Uncomment if you want to test for reruns

    IF OBJECT_ID('dbo.jbmRandomCode','U') IS NOT NULL

    DROP TABLE dbo.jbmRandomCode

    ;

    IF OBJECT_ID('dbo.jbmWinnerCode','U') IS NOT NULL

    DROP TABLE dbo.jbmWinnerCode

    ;

    --===== Generate 1.6 Million random codes where no two adjacent characters

    -- are alike. We also remove all vowels to keep from accidently

    -- spelling swear words and we remove things that could be mistaken

    -- for each other such as:

    -- Capital "O" and "0" (zero)

    -- Lower Case "L" and "1" (one)

    -- Lower Case "o" removed just to avoid all confusion

    -- Jeff Moden

    WITH

    cteBytes AS

    ( --=== Generate 2.2 Million random 8 byte numbers.

    -- We need this many so that when we remove numbers that

    -- have identical adjacent characters and any dupes,

    -- we still have enough to gen 1.6 Million random codes.

    -- This is done at the byte level because NEWID is expensive to run.

    -- If we ran 8 individual NEWID()'s for each code, this code would

    -- take a lot longer.

    SELECT TOP 2200000 CAST(NEWID() AS BINARY(8)) AS Bytes

    FROM Master.sys.All_Columns ac1

    CROSS JOIN Master.sys.All_Columns ac2

    )

    ,

    cteCharacters AS

    ( --=== This is just the list of characters that we have to chose from.

    -- Vowels and certain other characters have been removed to avoid confusion.

    -- Note that both strings must be absolutely identical here.

    SELECT LEN('BCDFGHJKLMNPQRSTVWXYZbcdfghjkmnpqrstvwxyz23456789') AS CharLen,

    'BCDFGHJKLMNPQRSTVWXYZbcdfghjkmnpqrstvwxyz23456789' AS Characters

    )

    ,

    cteChars AS

    ( --=== This takes each byte from the random 8 byte binary number and converts each into

    -- a human readable character by using the random byte to select from the allowed

    -- character list. Note that "modulo" (%) plays an important role here.

    SELECT SUBSTRING(c.Characters,CAST(SUBSTRING(b.Bytes,1,1) AS INT)%c.CharLen+1,1) AS Char1,

    SUBSTRING(c.Characters,CAST(SUBSTRING(b.Bytes,2,1) AS INT)%c.CharLen+1,1) AS Char2,

    SUBSTRING(c.Characters,CAST(SUBSTRING(b.Bytes,3,1) AS INT)%c.CharLen+1,1) AS Char3,

    SUBSTRING(c.Characters,CAST(SUBSTRING(b.Bytes,4,1) AS INT)%c.CharLen+1,1) AS Char4,

    SUBSTRING(c.Characters,CAST(SUBSTRING(b.Bytes,5,1) AS INT)%c.CharLen+1,1) AS Char5,

    SUBSTRING(c.Characters,CAST(SUBSTRING(b.Bytes,6,1) AS INT)%c.CharLen+1,1) AS Char6,

    SUBSTRING(c.Characters,CAST(SUBSTRING(b.Bytes,7,1) AS INT)%c.CharLen+1,1) AS Char7,

    SUBSTRING(c.Characters,CAST(SUBSTRING(b.Bytes,8,1) AS INT)%c.CharLen+1,1) AS Char8

    FROM cteBytes b

    CROSS JOIN cteCharacters c

    )

    ,

    cteRandomCode AS

    ( --=== Now we take all the individual letters and slam them together to make a random code.

    -- In the process, we make sure that no two adjacent characters are alike.

    SELECT Char1+Char2+Char3+Char4+Char5+Char6+Char7+Char8 AS RandomCode

    FROM cteChars

    WHERE Char1<>Char2 AND Char2<>Char3 AND Char3<>Char4 AND Char4<>Char5 AND Char5<> Char6 AND Char6<>Char7 AND Char7<>Char8

    )

    ,

    cteNumberDupes AS

    ( --=== We need to remove dupes and this numbers one or more occurances of the same code

    -- with an incrementing number to count occurances of that same code.

    SELECT ROW_NUMBER() OVER (PARTITION BY RandomCode ORDER BY RandomCode) AS Occurance,

    RandomCode

    FROM cteRandomCode

    ) --=== This selects the top 1.6 million first occurances of all the codes we haven't

    -- rejected so far. Since we only select the first occurance of any given code,

    -- this effectively removes dupes and does so much faster than using DISTINCT.

    -- The final result is stored in a new table called dbo.jbmRandomCode.

    SELECT TOP 1600000

    IDENTITY(INT,1,1) AS RowNum,

    RandomCode COLLATE Latin1_General_BIN AS RandomCode --For case sensitivity

    INTO dbo.jbmRandomCode

    FROM cteNumberDupes

    WHERE Occurance = 1

    ;

    --===== And now, pick what will be the winning number and store it in a new table.

    -- Change the TOP 1 to however many 1st, 2nd, 3rd, etc place winners you want.

    SELECT TOP 1

    IDENTITY(INT,1,1) AS Place,

    RandomCode COLLATE Latin1_General_BIN AS RandomCode --For case sensitivity

    INTO dbo.jbmWinnerCode

    FROM dbo.jbmRandomCode

    ORDER BY NEWID()

    ;

    {edit} Added brute force case sensitivity using COLLATE Latin1_General_BIN just to make life a little easier...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mister.magoo (5/12/2010)


    My 2p worth:

    Jeff has shown you a framework for generating numbers.

    I personally think you could consider generating more numbers and disallowing any with any repeating digits.

    You should probably consider going case sensitive e.g. Ax9G3bgt4 to make it harder to guess.

    You could (if you wanted) generate all possible combinations of 8 alphanumeric characters (about 2.4 trillion I think unless you want case sensitivity, then it's about 200 trillion ) which would take a couple of weeks to do on a desktop pc at least. (Previous experience shows you can generate a trillion rows in about 6 days) and then pick 1.6 million values from there, which gives you pretty good odds of not being guessed.

    Or you could generate batches of random strings and add them to your pool, seeded to deliberately scatter the values so that you don't get silly ones like AAAAAAAA.

    You have to define more clearly the parameters of the request to get a really solid answer.

    Having just seen your other question, I say again - tell us the whole requirement.

    I've incorporated some of the good ideas from your post into the new code I posted above. Thanks, Magoo. Using the characters chosen for the codes (49 in all) and removing any codes having duplicate adjacent letters, it works out to about 33 trillion combinations. Heh... we should start a lottery so old DBA's can retire. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Brilliant answer Jeff!

    I like this a lot, but every time I see NEWID() used like this it just makes me yearn for a decent random number generator so that you could just

    select top 1600000 Number from RandomNumbers

    :unsure:

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

    Viewing 15 posts - 16 through 30 (of 36 total)

    You must be logged in to reply to this topic. Login to reply