Random fill.

  • Try this...simply referencing a column from the target table in the APPLY, using OUTER instead of CROSS and updating the referenced column as well forces the randomisation to work on a row-by-row basis...at least in my tests, although I have no reference to show it is guaranteed.

    update t

    set something=x.rnd,somethingelse=x.bb

    from target t

    outer apply (select top(1) t.somethingElse,use_value.column1 from use_value order by newid()) x(bb,rnd)

    where x.rnd is not null;

    MM




  • mister.magoo (4/19/2014)


    Try this...simply referencing a column from the target table in the APPLY, using OUTER instead of CROSS and updating the referenced column as well forces the randomisation to work on a row-by-row basis...at least in my tests, although I have no reference to show it is guaranteed.

    update t

    set something=x.rnd,somethingelse=x.bb

    from target t

    outer apply (select top(1) t.somethingElse,use_value.column1 from use_value order by newid()) x(bb,rnd)

    where x.rnd is not null;

    Thanks, short and elegant. But all rows get exactly the same value assigned.

    (Running time was under 2 minutes; 700,000 target rows, 4,000 source rows.)

    For each run the value is different, but the same for all rows.
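
    A quick way to check whether the assignment really differs per row is to count the distinct values that ended up in the column (a minimal sketch, using the example table names from this thread):

    select count(distinct something) as distinct_values_assigned
    from target;

    If the APPLY is evaluated once for the whole statement this returns 1; if it is evaluated per row, it should return something close to the number of distinct values in use_value.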

    Ben

    Thanks

  • Like I said, I couldn't guarantee it, although on the test data you supplied, it worked for me where the other methods failed.

    You definitely used an outer apply, and updated both columns from the outer apply?

    MM




  • mister.magoo (4/19/2014)


    Like I said, I couldn't guarantee it, although on the test data you supplied, it worked for me where the other methods failed.

    You definitely used an outer apply, and updated both columns from the outer apply?

    YES, BUT:

    Translating your solution to my situation, I introduced an error: I used an alias twice, so the inner query did not reference the target table.

    The corrected query is now running. (Over 2 minutes now.)

    (Over 4 minutes now; this is with the target over 700,000 rows and the values over 4,000 rows.)

    I'll be back in 10 minutes; if the script hasn't finished by then, I'll abort it.

    Edit :

    space for the result.

    Made some more mistakes, but I think I have the results now.

    And: (Msg 9002, Level 17, State 4, Line 2
    The transaction log for database 'tempdb' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases.)
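
    For reference, the check that error message points at is a simple query against sys.databases:

    -- why can't the tempdb log space be reused?
    select name, log_reuse_wait_desc
    from sys.databases
    where name = 'tempdb';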

    3:00 minutes running time and as far as I can see the result is correct.

    Now I first have to rebuild my test dataset and do the run again.

    I have to find a solution for using an extra column from the target table. Although this column is not changed, I need a column name for it; I could pick a random one from INFORMATION_SCHEMA.
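
    A minimal sketch of picking such a column name from INFORMATION_SCHEMA.COLUMNS (the table name 'Target' and the variable are illustrative only; the chosen name would still have to be spliced into the generated update with QUOTENAME):

    declare @col sysname;

    -- pick an arbitrary column name from the target table
    select top (1) @col = COLUMN_NAME
    from INFORMATION_SCHEMA.COLUMNS
    where TABLE_NAME = 'Target'
    order by newid();

    select quotename(@col) as column_to_reference;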

    At the moment I am coming up with an alternate solution, but that takes some development, so that is not ready yet.

    (Involving a large table containing the random values).

    Thanks,

    Ben

  • Yes, I can see that taking a long time as it is performing the outer apply ( selecting 1 from 4000 ) for each of the 700000 target rows.

    I was thinking that a set of 700,000 random values from the source data would be quicker, but the sample size didn't warrant that approach.

    I'm sure that will be a much more performant method.

    MM




  • (Code is used for more than 60 or 70 tables in three different databases).

    Out of curiosity... what's the reasoning behind this proc?


  • mister.magoo (4/19/2014)


    Yes, I can see that taking a long time as it is performing the outer apply ( selecting 1 from 4000 ) for each of the 700000 target rows.

    I was thinking that a set of 700,000 random values from the source data would be quicker, but the sample size didn't warrant that approach.

    I'm sure that will be a much more performant method.

    At the moment I am running into a number of problems.

    But am working on:

    First prepare a random table, which can be used for all updates:

    CREATE TABLE [dbo].[#Ben](

    [id] [bigint] NULL,

    [nr] [bigint] IDENTITY(1,1) NOT NULL,

    CONSTRAINT [PK_BEN] PRIMARY KEY CLUSTERED

    (

    [nr] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    --

    -- Script below can create up to 4 Gig of rows with a identity and a 'random' number

    --

    ;

    WITH

    L0 AS(SELECT 0 AS c UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0), -- 4

    L1 AS(select 0 as x from L0 A, L0 B, L0 C, L0 D), -- 4 ^4 = 256

    L2 AS(select 0 as x from L1 A, L1 B, L1 C, L1 D), -- (4 ^ 4) ^4 = 4 Giga

    L9 AS(Select *, row_number() OVER(PARTITION BY x order by x ) as nr from L2) -- add rownumbers

    insert into #Ben select CHECKSUM(NEWID()) from L9 where nr <= 1000000

    -- to drop the table

    -- drop table #ben

    Then use this table in between the target and the use_value table.

    The random number has to be scaled to the size of the value table.

    Code for this would be:

    declare @Max_number int

    select @max_number = count(*) from use_value

    ;

    WITH

    T AS (SELECT *, ROW_NUMBER() OVER (order by (select NULL)) NR1 FROM Target),

    R AS (SELECT *, CONVERT(integer, ROUND(RAND(CHECKSUM(id)) * @max_number, 0, 1)) + 1 AS AAA FROM #Ben),

    V AS (SELECT * FROM use_value)

    UPDATE T SET something = V.column1 FROM T JOIN R ON R.nr = T.NR1 JOIN V ON R.AAA = V.number WHERE 1 = 1 -- A condition can be placed here.

    But at the moment I am running into 'trouble', so I have not confirmed whether the code works, works correctly, or works efficiently.

    But I am working on this. At the moment the system is not responsive at all, so this has to wait.

    Thanks for your solution,

    Ben

    With the above technique, I'll try to write some scripts to generate data for 700,000 rows in a target table,

    and 4,000 rows in the use_value table.

    But this has to wait for the system to return to normal.

  • J Livingston SQL (4/19/2014)


    (Code is used for more than 60 or 70 tables in three different databases).

    Out of curiosity... what's the reasoning behind this proc?

    Code is used for several reasons.

    Mostly to scramble production data for testing.

    For example, all street names (target) can be replaced by street names from a list (use_value).

    The total procedure is built independently of the 'target' database and can be used on 'any' database. It was once built for a specific database, but now we use it on several databases.

    Ben

  • maybe food for thought....

    the random data is a bit simplistic...but relatively large.

    not sure if this meets your requirements.

    /*set up some simple test data*/

    use tempdb

    go

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

    SELECT TOP 1000000

    Something = '1' + CAST(Abs(Checksum(Newid()) % 90000000 ) AS varchar(50)),

    SomethingElse =cast ( '0' AS varchar(50))

    INTO Target

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    CROSS JOIN sys.all_columns ac3

    /*set 'somethingelse' to equal 'something' for testing*/

    UPDATE Target

    SET SomethingElse = Something

    SELECT TOP 10000

    number = IDENTITY(INT, 1, 1),

    col1 = CHAR(Abs(Checksum(Newid())) % 26 + 65) + CHAR(Abs(Checksum(Newid())) % 26 + 65) + CHAR(Abs(Checksum(Newid())) % 26 + 65) + CHAR(Abs(Checksum(Newid())) % 26 + 65) + CHAR(Abs(Checksum(Newid())) % 26 + 65),

    col2 = 1 + CAST(Abs(Checksum(Newid()) % 9 + 1) AS INT)

    INTO Use_value

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    CROSS JOIN sys.all_columns ac3

    ALTER TABLE [dbo].[Use_value] ADD CONSTRAINT [PK_Use_value] PRIMARY KEY CLUSTERED ([number] ASC)

    --SELECT TOP (100) Something,

    --SomethingElse

    --FROM Target

    --SELECT TOP (100) number,

    --col1,

    --col2

    --FROM Use_value

    /*possible solution*/

    SELECT TOP (100) Something,

    SomethingElse

    FROM Target

    order by Something

    /*create new table with random selection*/

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'T1') drop table T1

    DECLARE @cnt AS INT

    SELECT @cnt = count(*) FROM use_value

    SELECT *,

    randCol = (1 + CAST(Abs(Checksum(Newid()) % @cnt) AS INT))

    INTO T1

    FROM Target

    DROP TABLE Target

    SELECT T1.Something,

    Use_value.col1 AS SomethingElse

    INTO Target

    FROM T1

    INNER JOIN Use_value ON T1.randCol = Use_value.number

    SELECT TOP (100) Something,

    SomethingElse

    FROM Target

    order by Something


  • J Livingston SQL (4/19/2014)


    maybe food for thought....

    the random data is a bit simplistic...but relatively large.

    not sure if this meets your requirements.

    First thanks for your contribution.

    In your solution, the Target table gets dropped and recreated.

    I do not know how 'efficient' this is, but there are a number of reasons why I do not want to drop the table.

    This mainly has to do with referential integrity, indexes, and maybe triggers; they would all have to be attended to.

    Another reason is that the table gets reorganised. The structure of the table has to remain intact as far as possible, including the fragmentation and deferred rows (if they exist); rebuilding really affects this a lot.

    The update can affect the fragmentation and deferred rows; I still have to examine how much they are affected when using these update statements on a number of rows. (Another use of the script.)

    Thanks for your contribution and time.

    Today I'll try to get the code finalised.

    Two choices remain at the moment:

    1. Mister.Magoo's approach.

    Advantage: the code performs well and is elegant.

    Disadvantage: an extra column from the Target table is needed, which makes the code more complex.

    2. Creating an extra table with 'random' numbers, numbered from 1 up to at least the maximum number of rows of the largest table.

    Advantage: simple code for the target tables.

    The random table is only produced once (more control during testing).

    The problem at the moment is performance issues. (Can this be solved?)

    I hope to finish the code today and will then report on the choice I have made.

    Ben

  • The problem I see with a truly random fill is that you need to select one value from 4,000 for each of the 700,000 target rows, so I can see two approaches to building that 700K set of random values.

    1. Assign a number to each "use_value" row (you already have "number", so we can use that) and then select from that set of numbers (checksum, newid, abs, modulus)

    update t

    set something=v.column1

    from target t

    join use_value v

    on v.number = abs(checksum(newid(),t.something))%(select count(*) from use_value)+1

    where 1=1;

    2. Build a pool of values that is larger than the target set and select a "random" set of the right size from that pool - the larger the pool, the more apparently random your values will be, but the slower the query will perform...

    with source_cte(N,column1) as

    (

    select top((select count(*) from target)) row_number() over(order by newid()), column1

    from (select top(10*(select count(*) from target)/(select count(*) from use_value)) 1 from million_3) a(dummy)

    ,use_value

    ), target_cte(N,something) as

    (

    select row_number() over(order by newid()), something

    from target

    )

    update target_cte

    set something = column1

    from source_cte

    where source_cte.N = target_cte.N;

    I find option 1 to be quicker and possibly would have "better randomisation" results, although I am no statistician, so cannot verify that. Both methods return similar numbers for this query:

    select something,count(*) from target group by something;

    MM




  • mister.magoo (4/21/2014)


    The problem I see with a truly random fill is that you need to select one value from 4,000 for each of the 700,000 target rows, so I can see two approaches to building that 700K set of random values.

    1. Assign a number to each "use_value" row (you already have "number", so we can use that) and then select from that set of numbers (checksum, newid, abs, modulus)

    update t

    set something=v.column1

    from target t

    join use_value v

    on v.number = abs(checksum(newid(),t.something))%(select count(*) from use_value)+1

    where 1=1;

    2. Build a pool of values that is larger than the target set and select a "random" set of the right size from that pool - the larger the pool, the more apparently random your values will be, but the slower the query will perform...

    with source_cte(N,column1) as

    (

    select top((select count(*) from target)) row_number() over(order by newid()), column1

    from (select top(10*(select count(*) from target)/(select count(*) from use_value)) 1 from million_3) a(dummy)

    ,use_value

    ), target_cte(N,something) as

    (

    select row_number() over(order by newid()), something

    from target

    )

    update target_cte

    set something = column1

    from source_cte

    where source_cte.N = target_cte.N;

    I find option 1 to be quicker and possibly would have "better randomisation" results, although I am no statistician, so cannot verify that. Both methods return similar numbers for this query:

    select something,count(*) from target group by something;

    At the moment I am running into a variety of problems.

    1. Performance, which is not good or not consistent.

    2. All the values are the same in a column.

    3. The use of one script for different situations. (A generic script where table names and column names get replaced; see the sketch after this list.)

    4. Some constructions fail because of a lack of log space for tempdb. *)

    (One of the databases used up most of the logging disk, during my experimenting. Yes it was in Simple mode).

    5. I keep making mistakes. (Format and language differences) **)
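
    As an illustration of the generic-script idea in point 3, a rough sketch of a placeholder-driven update (the <<TABLE>>/<<COLUMN>> markers, the REPLACE step and the sp_executesql call are illustrative only, not the actual procedure; the inner update uses Mister.Magoo's option 1 quoted above):

    declare @template nvarchar(max) = N'
    update t
    set <<COLUMN>> = v.column1
    from <<TABLE>> t
    join use_value v
      on v.number = abs(checksum(newid(), t.<<COLUMN>>)) % (select count(*) from use_value) + 1;';

    -- splice the actual table and column names into the template
    declare @sql nvarchar(max) =
        replace(replace(@template, N'<<TABLE>>', N'dbo.target'), N'<<COLUMN>>', N'something');

    exec sys.sp_executesql @sql;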

    I tried your first suggestion in this post, but aborted it because it was taking far longer than some previous solutions. I have not investigated why that happens, and I lack the time to do that thoroughly. (Sorry.)

    My own suggestion of building an 'in-between' table with a number to join the target and a random number to join the use_value table did work, but not consistently.

    So for practical reasons I am going to fall back to the 'old' solution and create and use an extra RND column with each target table.

    I do not like this (it is not elegant), but at least it did work consistently.
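
    For reference, a rough sketch of the general shape of such an extra-RND-column approach (this is only a guess at the shape, not the actual 'old' script; the table and column names follow the examples in this thread):

    alter table dbo.target add RND int null;
    go

    -- newid() in the set list of a plain update is evaluated per row
    update t
    set RND = abs(checksum(newid())) % (select count(*) from use_value) + 1
    from dbo.target t;

    -- join on the persisted random index
    update t
    set something = v.column1
    from dbo.target t
    join use_value v on v.number = t.RND;

    -- clean up the helper column afterwards
    alter table dbo.target drop column RND;
    go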

    Later on I will try to clean up my act and try the suggested solutions again. But this has to wait until next weekend, I think. Sorry.

    ------------------------------------------------------------------------------------------

    ------------------------------------------------------------------------------------------

    For testing randomness I 'produce':

    TEL: a count of the field, grouped on the value of that field.

    FREQ: the count of TEL, grouped on TEL.

    A running total of TEL*FREQ, ordered by TEL.

    Plot of the curve:

    http://goo.gl/9ZwpSy

    The plot I make is a scatter diagram of TEL versus the running total of (FREQ*TEL). See the picture in the link.

    It is a bell-shaped curve (normal distribution), a bit jagged because of the variations, but still recognisable as bell shaped.

    The S curve is the area under the bell. (If normalised to 100 percent it would be a percentile curve). This shape should be smooth.

    (A good random procedure should deliver a smooth S curve. This is a reasonable indication, though not a conclusive one.)
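
    For reference, a sketch of how TEL, FREQ and the running total could be produced in one query (assuming the Target.Something column from the examples above, and SQL Server 2012+ for the windowed running total):

    ;with tel_cte as
    (
        -- TEL: how often each value occurs in the scrambled column
        select something, count(*) as TEL
        from target
        group by something
    ),
    freq_cte as
    (
        -- FREQ: how many distinct values share the same TEL
        select TEL, count(*) as FREQ
        from tel_cte
        group by TEL
    )
    select TEL,
           FREQ,
           -- running total of TEL*FREQ, ordered by TEL (the S curve)
           sum(TEL * FREQ) over (order by TEL rows unbounded preceding) as running_total
    from freq_cte
    order by TEL;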

    Thanks all for your contributions;

    for now I will give this a rest and hopefully come back with some conclusive results next week.

    Ben

    **)

    I have translated my code in 'four' ways:

    1. From generic code which has placeholders for the table name, column name, etc.

    2. From my tables to the tables I use in the example.

    3. Dutch versus English; in my original code there is some Dutch (Nummer instead of Number, kolom instead of column).

    4. I was not consistent in the alias names (read: letters) for the tables.

  • I would guess the first one is not performing well because of a lack of a clustered index on the "number" column?

    Anyway, I hope you can find a good solution and let us know how you get on as this is an interesting problem 😀

    MM




  • I realise I forgot to mention: the first method above (using the numeric lookup on the use_value table) worked in about 8 seconds on my PC with a ~700K target table and ~4K use_value table, with a clustered primary key on "number".

    So it is a method that can work as long as you have the supporting index.
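
    If the table cannot simply be rebuilt with a clustered primary key, a unique nonclustered index on "number" covering the value column should serve the same purpose (a sketch; the index name and the INCLUDE column are illustrative):

    create unique nonclustered index IX_use_value_number
        on dbo.use_value (number)
        include (column1);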

    MM



