April 18, 2014 at 6:37 am
Two tables,
First table has one numeric field from 1 to n (consequtive).
(thousands of rows).
And one or more columns to hold values.
Second table has a large number of rows (millions).
A number of fields. One or more fields have to be filled randomly with a value from the first table.
In the simplest scenario one column from table two has to get a random value from table 1.
Example Table1
1 B
2 E
3 N
Table2 Before
x y z
p g m
a b c
p q r
d e f
Second column is target column after the fill table 2 could look like
Table2
x E z
p B m
a B c
p N r
d E f
Any simple and fast script for this ?
Ben
April 18, 2014 at 6:51 am
This could be one way to do it.
UPDATE T2
SET SomeColumn = RandomValues.SomeValue
FROM Table2 AS T2
CROSS APPLY (
SELECT TOP(1) SomeValue
FROM Table1
ORDER BY CHECKSUM(NEWID())
) AS RandomValues
-- Gianluca Sartori
April 18, 2014 at 7:23 am
spaghettidba (4/18/2014)
This could be one way to do it.
UPDATE T2
SET SomeColumn = RandomValues.SomeValue
FROM Table2 AS T2
CROSS APPLY (
SELECT TOP(1) SomeValue
FROM Table1
ORDER BY CHECKSUM(NEWID())
) AS RandomValues
Thanks for your code example,
But it produces only a single value in the target column.
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value
Create table Target
(
something varchar(30),
SomethingElse varchar(30),
full_name_txt varchar(30)
)
insert into Target values('a','g','m')
insert into Target values('b','h','n')
insert into Target values('c','i','o')
insert into Target values('d','j','p')
insert into Target values('d','k','q')
insert into Target values('f','l','r')
Create table Use_value
(
number int identity,
column1 varchar(30),
column2 varchar(30)
)
insert into Use_value values('rx','r1')
insert into Use_value values('ry','r2')
insert into Use_value values('rp','r3')
insert into Use_value values('rg','r4')
insert into Use_value values('rc','r5')
insert into Use_value values('rd','r6')
UPDATE T2
SET full_name_txt = RandomValues.column1
FROM Target AS T2
CROSS APPLY (
SELECT TOP(1) column1
FROM Use_Value
ORDER BY CHECKSUM(NEWID())
) AS RandomValues
select * from Target
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value
somethingSomethingElsefull_name_txt
agry
bhry
ciry
djry
dkry
flry
A working example:
But this is not simple.
Although it is fast.
With a CTE I could use a virtual column RND, but that was slow in performance.
--
-- Step 1 add an extra column (rnd) to a table.
-- Step 2 determine the maximum number of rows in the feed table. (count(*) and max(number) should have the same value).
-- Step 3 give the extra column a random value.
-- Step 4 set the column with the 'appointed' random value'
-- Step 5 Remove the extra column
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value
Create table Target
(
something varchar(30),
SomethingElse varchar(30),
full_name_txt varchar(30)
)
insert into Target values('a','g','m')
insert into Target values('b','h','n')
insert into Target values('c','i','o')
insert into Target values('d','j','p')
insert into Target values('d','k','q')
insert into Target values('f','l','r')
Create table Use_value
(
number int identity,
column1 varchar(30),
column2 varchar(30)
)
insert into Use_value values('rx','r1')
insert into Use_value values('ry','r2')
insert into Use_value values('rp','r3')
insert into Use_value values('rg','r4')
insert into Use_value values('rc','r5')
insert into Use_value values('rd','r6')
BEGIN TRANSACTION
ALTER TABLE Target ADD
rnd int NULL
COMMIT
declare @Max_number int
select @max_number = count(*) from Target
UPDATE Target SET rnd =
convert(integer,(ROUND(rand(checksum(newid()))* @max_number,0,1)))+1
-- fill the column according to the random number
UPDATE T SET full_name_txt = V.column1
FROM Target T
JOIN Use_value V ON rnd = V.number
select * from Target
BEGIN TRANSACTION
ALTER TABLE Target
DROP COLUMN rnd
COMMIT
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value
/*
-- produces
somethingSomethingElsefull_name_txtrnd
agrc5
bhry2
circ5
djry2
dkry2
flrx1
*/
Sorry that I didn't come up with an example.
And now have changed the names of the tables/columns Sorry. (:blush:)
Ben
April 18, 2014 at 7:34 am
My bad.
The subquery returns always the same value because it is not correlated with the outer query.
Well, looks like you solved your issue.
-- Gianluca Sartori
April 18, 2014 at 8:23 am
spaghettidba (4/18/2014)
My bad.The subquery returns always the same value because it is not correlated with the outer query.
Well, looks like you solved your issue.
This was the old solution.
But I often run into trouble with this old script. **)
It is part of a larger and more complex script.
I would prefere a 'simpler' script without the extra columns.
So I am trying to get rid of the extra RND column.
Functional I succeeded with a CTE script but the performance is far less than the solution with the extra columns.
So I am still looking for a simpler script which does perform well.
(Have not succeeded yet).
Ben.
**)
The extra columns can give all sort of problems.
Collation problems.
Code can not be rerun if aborted in a previous run.
Makes rows longer and therefore can create deferred rows.
April 18, 2014 at 8:47 am
You don't want to modify the target table structure, as that will have lots of overhead.
Does the target table have a unique key column(s)? If not, can you add an IDENTITY column to the target table to serve as a key?
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
April 18, 2014 at 9:20 am
WITH T1 AS (
SELECT *, randowCol = NEWID()
FROM Target
),
T2 AS (
SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)
FROM T1
)
UPDATE T2
SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)
-- Gianluca Sartori
April 18, 2014 at 9:43 am
ScottPletcher (4/18/2014)
You don't want to modify the target table structure, as that will have lots of overhead.
Actually adding a RND column to the table, using it and removing it hardly produces overhead. That solution was a lot faster than using a virtual RND column.
(The extra RND column can become a real problem in some scenario's, it's not a performance problem, but does disturb a number of things).
ScottPletcher (4/18/2014)
Does the target table have a unique key column(s)? If not, can you add an IDENTITY column to the target table to serve as a key?
No I cannot change the table permanently, only temporary as adding and later on removing the RND column. Adding an IDENTITY gives far far more overhead than adding the RND, so that is not a desired solution. The script is used on serveral very different databases on a number of tables. Some have an identity, some have unique key columns, but there are variations. (Code is used for more than 60 or 70 tables in three different databases).
But how would a Unique key and/or an identity help. Maybe I can adjust the code to that. I have full control over the code/script not over the table desing.
Ben.
**)
In my current version I have made the code for adding and removing the RND column more dynamic removing a part of the issues. But some issues still remain.
April 18, 2014 at 10:23 am
spaghettidba (4/18/2014)
WITH T1 AS (
SELECT *, randowCol = NEWID()
FROM Target
),
T2 AS (
SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)
FROM T1
)
UPDATE T2
SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)
Just came home for dinner and just saw this solution.
I'll try this after dinner.
Thanks, hope that this works better than my CTE solution, this would reduce the complexity of the code.
(Better for maintenance and usage of the code).
Thanks, I'll report later on.
Ben
April 18, 2014 at 1:08 pm
ben.brugman (4/18/2014)
spaghettidba (4/18/2014)
WITH T1 AS (
SELECT *, randowCol = NEWID()
FROM Target
),
T2 AS (
SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)
FROM T1
)
UPDATE T2
SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)
Just came home for dinner and just saw this solution.
I'll try this after dinner.
Thanks, hope that this works better than my CTE solution, this would reduce the complexity of the code.
(Better for maintenance and usage of the code).
Thanks, I'll report later on.
Ben
Sorry but this did not work.
The Rn number was 1 to the max of the target table. (Not the scale of the Use_value table).
Numbers of RN Ran far higher than the number of rows in the Use_table, producing an error that a NULL could not be inserted.
I could take a modula function on that but then it would be a repeating list not a random list.
Thanks, but this did not work.
I'll show the slow CTE sollution which worked but very slowly.
But have to adjust that to the names of the example.
(And run a test how long that is taking).
Ben
April 18, 2014 at 1:25 pm
print '-- Start of use_value.full_name_txt : '+convert(varchar(30),getdate(),126)
--
-- Get the max from the use_value table.
--
declare @Max_number int
select @max_number = count(*) from use_value
print 'number use_value, full_name_txt :'+convert(varchar(10), @max_number)
;
WITH
T AS (SELECT *, convert(integer,(ROUND(rand(checksum(newid()))* @max_number,0,1))) AS RND2 FROM Target)
UPDATE T SET full_name_txt = b.column1
FROM T
JOIN use_value B
ON rnd2 = B.number
print '-- END of use_value.full_name_txt : '+convert(varchar(30),getdate(),126)
This code is now allready running for over 10 minutes. (not finished yet) (still running 20 minutes)(aborted it after 27 minutes)
The code with the extra RND field run for less than 2 minutes.
Ben
April 18, 2014 at 4:53 pm
Well, it worked on the sample data you provided.
If you gave us more significant sample data, maybe someone could come up with something better.
-- Gianluca Sartori
April 18, 2014 at 5:42 pm
spaghettidba (4/18/2014)
Well, it worked on the sample data you provided.If you gave us more significant sample data, maybe someone could come up with something better.
Both tables where of the same size. I should have provided larger sample tables and of different sizes.
But even with the tables of the same size your algoritm always produces a table were each value from the use_value table is used exactly once.
If the target table would be 3 rows, only the first three rows from the use_value table would be used one time each.
So this is not very random.
Had I used tables of different sizes his effect would have been more visible.
Sorry to have provided tables of the same sizes.
See the code below.
(Only the first rows of use_value are used and each only once).
Ben
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value
Create table Target
(
something varchar(30),
SomethingElse varchar(30),
full_name_txt varchar(30)
)
insert into Target values('a','g','m')
insert into Target values('b','h','n')
insert into Target values('c','i','o')
insert into Target values('d','j','p')
-- insert into Target values('d','k','q')
-- insert into Target values('f','l','r')
-- insert into Target values('f2','l3','r3')
-- insert into Target values('f1','l2','r3')
Create table Use_value
(
number int identity,
column1 varchar(30),
column2 varchar(30)
)
insert into Use_value values('111','r1')
insert into Use_value values('222','r2')
insert into Use_value values('333','r3')
insert into Use_value values('444','r4')
insert into Use_value values('555','r5')
insert into Use_value values('666','r6')
insert into Use_value values('777','r6')
insert into Use_value values('888','r6')
insert into Use_value values('999','r6')
;
WITH T1 AS (
SELECT *, randowCol = NEWID()
FROM Target
),
T2 AS (
SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)
FROM T1
)
UPDATE T2
SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)
select * from Target
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target
If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value
April 19, 2014 at 3:16 am
Oh, I see what you mean.
Try this:
WITH T1 AS (
SELECT *, randowCol = NEWID()
FROM Target
),
T2 AS (
SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)
FROM T1
),
T3 AS (
SELECT *, randowCol = NEWID()
FROM Use_value
),
T4 AS (
SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)
FROM T3
)
UPDATE T2
SET full_name_txt = (SELECT column1 FROM T4 AS V WHERE V.RN = T2.RN )
-- Gianluca Sartori
April 19, 2014 at 4:53 am
spaghettidba (4/19/2014)
Oh, I see what you mean.Try this:
Tried the solution.
And I am trying to understand the solution.
First the solution does not work if the target table is larger than the use_value table.
Understanding the solution.
At first I did not (and maybe still do not) understand the solution.
T2 gives an unique incremental number for each row.
T4 gives an unique incremental number for each row.
So connecting these two numbers 'should' give only one unique combination for each row.
Running the software the result is that some number are used more then once or not. (As we want for the random requirement).
I did not understand this.
So I made a breakdown of the solution.
Created a T2 table by using : SELECT * INTO T2 from T2
Created a T2 table by using : SELECT * INTO T4 from T4
Both at the end of the with statement.
Then I run the update statement which can then be run exactly as it is at the end of the with statement.
Then the affected column contains unique values.
So clearly breaking the With statement down to distinct statements gives a different result.
A conclusion for the moment could be:
Within the WITH statement the T3 and T4 component are executed once for each target row.
This could be an explanation why my slow solution is so extremely slow. Also using this solution on a target table with a lot of rows, this solution would be very slow.
This is an important learning moment for me. Here is an example were the WITH construction behaves different from a 'stepped' constructing.
The example:
The example is kept 'small' so it is easier to inspect the result. The actual number or rows varies a lot but typical for the target is millions of rows and for the use_value it is thousands of row. But there are also occurences where the target and or the use_value table have a very limited number of rows. The target table even might be empty. The use_value table should at least have one row.
There was only one example, I kept the target small so that it would be easy to inspect the endresult.
I hoped that the problem was stated clearly enought to give this limited example. Sorry for that.
Thanks for your time and attention,
at least now I have an excelent example where the WITH statement behaves differently from a stepped solution.
And this could be the anwser why my solution was extremely slow. (Thanks for your help with that).
(Probably because the statements are not deterministic, by the random function).
Thanks,
Ben
Viewing 15 posts - 1 through 15 (of 28 total)
You must be logged in to reply to this topic. Login to reply