select random records but for each category

Question

Post reply

select random records but for each category

Forum Etiquette: How to post Reporting Services problems

[/url]

Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
[/url]

How to Post Performance Problems - by Gail Shaw
[/url]

Jeff Moden SSC Guru Points: 1004356 More actions · Answer 1

Paul White (3/14/2010)
(though it is a shame that NEWID is illegal in a function):

Old gaming advice... if you can't move forward, backward, left, right, up, down, then change "color". 😉

You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.

CREATE VIEW MyNewID

AS

SELECT NEWID() AS ANewID

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150467 More actions · Answer 2

Jeff Moden (3/14/2010)
You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.

Thanks Jeff! I had read that before, but did not make the connection. It will complicate the SELECT a bit, but I will have a go...

The function will be non-deterministic of course, but that's kinda the point I suppose.

Shame though, because the optimizer really hates non-deterministic things.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004356 More actions · Answer 3

But not too complicated...

SELECT TOP (1)

nc = C.n,

nt = T.n,

nr = R.n

FROM dbo.Number C,

dbo.Number T,

dbo.Number R

WHERE C.n BETWEEN 3 AND 19

AND T.n BETWEEN 3 AND 19

AND R.n BETWEEN 3 AND 19

AND C.n + T.n + R.n = 25

ORDER BY

(SELECT ANewID FROM dbo.MyNewID) --This used to be NEWID()

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150467 More actions · Answer 4

Jeff Moden (3/14/2010)
But not too complicated...

There is another one inside the CROSS APPLY. Not difficult either, I am just hoping the query plan holds together.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Paul White SSC Guru Points: 150467 More actions · Answer 5

OK, all done. I had to include the RAND in the wrapper view too, for the same reason.

Here is the full script:

USE tempdb;

-- Drop the in-line function

IF OBJECT_ID(N'dbo.GetSourceSample', N'IF')

IS NOT NULL

DROP FUNCTION dbo.GetSourceSample

GO

-- Drop the view wrapping NEWID

IF OBJECT_ID(N'dbo.Random', N'V')

IS NOT NULL

DROP VIEW dbo.Random;

GO

-- Drop the combinations table

IF OBJECT_ID(N'dbo.Combinations', N'U')

IS NOT NULL

DROP TABLE dbo.Combinations;

GO

-- Drop our test table

IF OBJECT_ID(N'dbo.Source', N'U')

IS NOT NULL

DROP TABLE dbo.Source;

GO

-- Create numbers table if required

IF OBJECT_ID(N'dbo.Number', N'U')

IS NULL

BEGIN

CREATE TABLE dbo.Number (n SMALLINT PRIMARY KEY);

-- 736 SMALLINTs exactly fill one 8K page

INSERT dbo.Number (n)

SELECT TOP (736)

ROW_NUMBER() OVER (ORDER BY (SELECT 0))

FROM master.sys.columns C1,

master.sys.columns C2,

master.sys.columns C3;

END;

GO

-- Create the test table

CREATE TABLE dbo.Source

(

row_id INTEGER IDENTITY PRIMARY KEY,

vehicle_type CHAR(1) NOT NULL,

vehicle_id INTEGER NOT NULL

);

GO

-- View wrapping NEWID

CREATE VIEW dbo.Random

WITH SCHEMABINDING

AS SELECT n = RAND(CHECKSUM(NEWID()));

GO

-- All possible combinations of values [3...19]

-- that sum to 25

SELECT row_id = IDENTITY(INT, 0, 1),

nc = C.n,

nt = T.n,

nr = R.n

INTO dbo.Combinations

FROM dbo.Number C,

dbo.Number T,

dbo.Number R

WHERE C.n BETWEEN 3 AND 19

AND T.n BETWEEN 3 AND 19

AND R.n BETWEEN 3 AND 19

AND C.n + T.n + R.n = 25

-- Create clustered primary key

ALTER TABLE dbo.Combinations

ADD PRIMARY KEY (row_id)

WITH (FILLFACTOR = 100);

GO

-- Add 100,000 random rows, roughly evenly distributed

-- across vehicle types

INSERT dbo.Source

(

vehicle_type,

vehicle_id

)

SELECT TOP (100000)

CASE

-- Chance of being a car

WHEN RAND(CHECKSUM(NEWID(), N1.n)) <= 0.3333 THEN 'C'

-- Chance of being a truck

WHEN RAND(CHECKSUM(NEWID(), N2.n)) <= 0.3333 THEN 'T'

-- Chance of being an RV

ELSE 'R'

END,

-- Random vehicle id

ABS(CHECKSUM(NEWID()))

FROM dbo.Number N1,

dbo.Number N2,

dbo.Number N3;

GO

-- A helpful index

CREATE UNIQUE INDEX [UQ dbo.Source vehicle_type, row_id (vehicle_id)]

ON dbo.Source

(vehicle_type ASC, row_id ASC)

INCLUDE (vehicle_id)

WITH (FILLFACTOR = 100);

GO

-- Show the number of vehicles of each type

SELECT vehicle_type,

row_count = COUNT_BIG(*)

FROM dbo.Source

GROUP BY

vehicle_type

ORDER BY

vehicle_type;

GO

-- Create the function

CREATE FUNCTION dbo.GetSourceSample ()

RETURNS TABLE

WITH SCHEMABINDING

AS RETURN

WITH SelectionCounts

AS (

-- Number of rows to select from each group

-- Minimum 3, total for all groups = 25

-- Choose one of the possible combinations at random

SELECT TOP (1)

nc = C.n,

nt = T.n,

nr = R.n

FROM dbo.Number C,

dbo.Number T,

dbo.Number R

WHERE C.n BETWEEN 3 AND 19

AND T.n BETWEEN 3 AND 19

AND R.n BETWEEN 3 AND 19

AND C.n + T.n + R.n = 25

ORDER BY

(SELECT R.n FROM dbo.Random R)

),

GroupCounts

AS (

--Find the total number of vehicles of each type

SELECT vehicle_type,

group_size = COUNT_BIG(*)

FROM dbo.Source

WHERE vehicle_type = 'C'

GROUP BY

vehicle_type

UNION ALL

SELECT vehicle_type,

group_size = COUNT_BIG(*)

FROM dbo.Source

WHERE vehicle_type = 'T'

GROUP BY

vehicle_type

UNION ALL

SELECT vehicle_type,

group_size = COUNT_BIG(*)

FROM dbo.Source

WHERE vehicle_type = 'R'

GROUP BY

vehicle_type

),

Parameters

AS (

SELECT GC.group_size,

GC.vehicle_type,

sample_size =

CASE GC.vehicle_type

WHEN 'C' THEN SC.nc

WHEN 'T' THEN SC.nt

WHEN 'R' THEN SC.nr

ELSE 0

END

FROM GroupCounts GC,

SelectionCounts SC

),

RandomRows

AS (

-- Generate a random row number for each row required from

-- each group. Maximum rows returned = 25 * group_count

--

-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

-- force the query plan to calculate the random row number

-- before the final join

SELECT TOP (9223372036854775807)

P.vehicle_type,

Random.rn

FROM Parameters P

CROSS

APPLY (

SELECT TOP (P.sample_size)

rn = CONVERT(BIGINT, R.n * P.group_size + 1)

FROM dbo.Number N,

dbo.Random R

ORDER BY

N.n ASC

) Random

ORDER BY

P.vehicle_type,

Random.rn

),

NumberedSourceRows

AS (

-- Number the source rows, restarting the numbering for each group

SELECT S.row_id,

S.vehicle_type,

S.vehicle_id,

rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

FROM dbo.Source S

)

-- Fetch the numbered rows that match the random row number, per group

SELECT NSR.row_id,

NSR.vehicle_type,

NSR.vehicle_id

FROM RandomRows RR

JOIN NumberedSourceRows NSR

ON NSR.vehicle_type = RR.vehicle_type

AND NSR.rn = RR.rn;

GO

SET STATISTICS TIME ON;

SELECT GSS.row_id,

GSS.vehicle_type,

GSS.vehicle_id

FROM dbo.GetSourceSample() GSS;

SET STATISTICS TIME OFF;

GO

-- Drop the in-line function

IF OBJECT_ID(N'dbo.GetSourceSample', N'IF')

IS NOT NULL

DROP FUNCTION dbo.GetSourceSample

-- Drop the view wrapping NEWID

IF OBJECT_ID(N'dbo.Random', N'V')

IS NOT NULL

DROP VIEW dbo.Random;

-- Drop the combinations table

IF OBJECT_ID(N'dbo.Combinations', N'U')

IS NOT NULL

DROP TABLE dbo.Combinations;

-- Drop our test table

IF OBJECT_ID(N'dbo.Source', N'U')

IS NOT NULL

DROP TABLE dbo.Source;

Plan stayed exactly the same, 94ms average CPU time.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

SQL_NuB Hall of Fame Points: 3666 More actions · Answer 6

Thanks for the input, its greatly appreciated. However, I'm trying some of the suggested queries and it keeps kicking me out an error:

SELECT TOP (1)

nc = C.n,

nt = T.n,

nr = R.n

FROM dbo.Feed C,

dbo.Feed T,

dbo.Feed R

WHERE C.n BETWEEN 3 AND 19

AND T.n BETWEEN 3 AND 19

AND R.n BETWEEN 3 AND 19

AND C.n + T.n + R.n = 25

ORDER BY

(SELECT ANewID FROM dbo.MyNewID)

I get:

Msg 207, Level 16, State 1, Line 8

Invalid column name 'n'.

what does that mean?

Paul White SSC Guru Points: 150467 More actions · Answer 7

SQL_NuB (3/15/2010)
Thanks for the input, its greatly appreciated. However, I'm trying some of the suggested queries and it keeps kicking me out an error: Invalid column name 'n'.

That was 'concept code' from Jeff to me to illustrate a point, it is not meant to run.

You should be able to run my last contribution though - the one marked 'full script'. That includes Jeff's idea.

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

SQL_NuB Hall of Fame Points: 3666 More actions · Answer 8

SQL_NuB

Hall of Fame

Points: 3666

March 15, 2010 at 7:46 am

#1133712

I got an error on that was well referring to "N"

Paul White SSC Guru Points: 150467 More actions · Answer 9

SQL_NuB (3/15/2010)
I got an error on that was well referring to "N"

Oh Lord. You're not using SQL Server 2000 are you?

In case it is a copy/paste error, the full script is attached to this post as a text file.

edit: fixed 2005-only bug!

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

SQL_NuB Hall of Fame Points: 3666 More actions · Answer 10

LOL,

No, I'm using 2005, I tried it in 08 to see if was an SQL version issue, but my 08 crashed. (good times)

Paul White SSC Guru Points: 150467 More actions · Answer 11

SQL_NuB (3/15/2010)
No, I'm using 2005, I tried it in 08 to see if was an SQL version issue, but my 08 crashed. (good times)

:laugh:

Actually, I just tried it on my 2005 instance and there is a bug! This line:

SELECT row_id = IDENTITY(INTEGER, 0, 1), needs to read:

SELECT row_id = IDENTITY(INT, 0, 1),

I have changed my post and the text file attachment.

Both forms work in 2008 - but not in 2005! :w00t:

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

mister.magoo SSC-Forever Points: 47174 More actions · Answer 12

I realise this has been answered, but I liked the challenge....

Here's mine : not as quick or elegant as Paul/Jeff but I think the results are to my mind "more random" and the code is compact.

;WITH DATA AS

(

--here is the 10 percent part

SELECT TOP 10 PERCENT

row_id,

source.vehicle_type,

vehicle_id,

--here is where a decision is made about which rows are part of the "three" per vehicle type

CASE

WHEN ROW_NUMBER() OVER(PARTITION BY source.vehicle_type ORDER BY NEW.ID) <4

THEN 0

ELSE 1

END AS primary_rows

FROM source

-- generate a new guid for each row and use it to sort

CROSS APPLY(SELECT NEWID()) AS NEW(ID)

-- sorting using a column number !ugh! but how else to do it?

ORDER BY 4, NEW.ID

)

-- now restrict to max 25 rows

SELECT TOP 25 row_id,vehicle_type,vehicle_id

FROM DATA

ORDER BY primary_rows

-- ensuring the "three" per type are always selected first

Results of Paul/Jeff on 201 runs:

vehicle_type count

T1739

R1653

C1633

Note the even distrubution...

Results of MM on 201 runs:

vehicle_type count

T1137

R2226

C1662

Note the resulting distribution reflects the source data distribution more closely...

Source data distribution:

vehicle_type count

T22128

R44485

C33387

There is an issue with the previous solution (Paul's) in that it goes wrong when the source table does not have enough records. If for example, you only have 2 type C records, the output looks like this:

[font="Courier New"]row_idvehicle_typevehicle_id

99999C1004507871

100000C1320721006

9512R116494338

10254R1472793309

22707R326756140

34558R446140356

38903R583366585

79984R971756239

88665R881544120

93146R1713443659

97315R1165209366

98760R2042364618

16998T134639693

31935T519661065

50739T551201165

73209T1276413656

89283T40684550[/font]

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

Jeff Moden SSC Guru Points: 1004356 More actions · Answer 13

mister.magoo (3/15/2010)
There is another problem with the previous solution

Which previous solution? 🙂

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150467 More actions · Answer 14

mister.magoo (3/15/2010)
I think the results are to my mind "more random"

:blink: Really? 😉

There are some good points in your post, but I have to disagree with that one.

Your code is compact (though arguably who cares once it is in a function?) but slower, as you say.

I will give some more thought to this later.

I don't want to come across as just being critical - I enjoyed your post. It is an interesting challenge isn't it?

Side note: I had an email from the OP last night clarifying the real requirements. He should be posting them here soon, just to add an interesting twist.

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

mister.magoo SSC-Forever Points: 47174 More actions · Answer 15

Jeff Moden (3/15/2010)
mister.magoo (3/15/2010)
There is another problem with the previous solution
Which previous solution? 🙂

The one Paul provided http://www.sqlservercentral.com/Forums/Attachment5424.aspx

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);