SELECT 1 random row query help, please

Question

Post reply

SELECT 1 random row query help, please

kiril.lazarov.77

Hall of Fame

Points: 3742
More actions
December 8, 2015 at 6:28 pm

#306990

Hi all,
While the query below works as expected, it's very slow when I run it on a table with 1million+ rows. Is there an alternative to ORDER BY NEWID()?
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT NULL, '77765-555-0990'
SELECT TOP 1 Id,
CategoryId,
SKU
FROM @t
ORDER BY NEWID()
I also tried the below code (taken from https://msdn.microsoft.com/en-us/library/cc441928.aspx) but it sometimes returns nothing so is unreliable and changing "< 10" to "> 0" always returns the first row. Please help me getting one single & random row from the above data set efficiently. Thank you.
SELECT TOP 1 Id,
CategoryId,
SKU
FROM @t
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

Aditya Daruka Old Hand Points: 339 More actions · Answer 1

Here is something to play with

DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)

INSERT INTO @T ( CategoryId, SKU)

SELECT 3200, '978-654-321' UNION ALL

SELECT 3200, '000-258-963' UNION ALL

SELECT 3200, '969-555-010' UNION ALL

SELECT NULL, '888-555-010' UNION ALL

SELECT NULL, '53-555-64' UNION ALL

SELECT NULL, '888-555-66' UNION ALL

SELECT NULL, '23-555-222' UNION ALL

SELECT NULL, '77765-555-0990'

DECLARE @RandomRowNumber BIGINT = (SELECT CAST(ROUND((COUNT(*) - 1) * Rand() + 1, 0) AS BIGINT) FROM @T) ;

;WITH dataCTE AS

(

SELECT *

, RN = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))

FROM @T

)

SELECT *

FROM dataCTE

WHERE RN = @RandomRowNumber ;

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 2

Aditya Daruka (12/8/2015)
Here is something to play with
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT NULL, '77765-555-0990'
DECLARE @RandomRowNumber BIGINT = (SELECT CAST(ROUND((COUNT(*) - 1) * Rand() + 1, 0) AS BIGINT) FROM @T) ;
;WITH dataCTE AS
(
SELECT *
, RN = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM @T
)
SELECT *
FROM dataCTE
WHERE RN = @RandomRowNumber ;

Replace one of those two clustered index scans with a seek:

SELECT *

FROM (SELECT RandomRowNumber = 1+ABS(CHECKSUM(NEWID()))%COUNT(*) FROM @T) t

CROSS APPLY (SELECT TOP 1 * FROM @T WHERE RandomRowNumber > Id ORDER BY Id DESC) x

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

serg-52 SSCrazy Eights Points: 9913 More actions · Answer 3

See my next post, don't know how to delete this error posting.

serg-52 SSCrazy Eights Points: 9913 More actions · Answer 4

Most probably there were a lot of DELETEs an INSERTs on a table with 1million+ rows. So IDs hardly are evenly distributed on (1 .. count(*)) interval or even on (Min(id) ..Max(id) ) interval. Output will not be really random.

For example

DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)

INSERT INTO @T ( CategoryId, SKU)

SELECT 3200, '978-654-321' UNION ALL

SELECT 3200, '000-258-963' UNION ALL

SELECT 3200, '969-555-010' UNION ALL

SELECT NULL, '888-555-010' UNION ALL

SELECT NULL, '53-555-64' UNION ALL

SELECT NULL, '888-555-66' UNION ALL

SELECT NULL, '23-555-222' UNION ALL

SELECT NULL, '77765-555-0990'

;

delete from @t where id>3 and CategoryId=3200;

SELECT *

FROM (SELECT RandomRowNumber = min(id)+1+ABS(CHECKSUM(NEWID()))%(max(id)-min(id)+1) FROM @T) t

CROSS APPLY (SELECT TOP 1 * FROM @T WHERE RandomRowNumber > Id ORDER BY Id DESC) x

This will return Id=3 far frequently then others.

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 5

You simply MUST NOT order all rows to get one!! That is an immensely costly operation!

One quick-and-easy way is to get min and max ID values and create a WHILE loop and generate a random variable int value between those (variety of ways to accomplish this) and then seek on that value with a WHERE id = @id filter. This will be a VERY efficient PK index seek operation. If you don't ever do any deletes and you start with 1 and go up you simply need to get the max value and you will be almost guaranteed to get a hit on the first call, only missing if someone rolled back an insert. Even if you have fairly large ranges of missing records this will almost certainly be faster than scanning all rows into a temp table and doing the row_number, NEWID() thing.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 6

serg-52 (12/9/2015)
Most probably there were a lot of DELETEs an INSERTs on a table with 1million+ rows. So IDs hardly are evenly distributed on (1 .. count(*)) interval or even on (Min(id) ..Max(id) ) interval. Output will not be really random.
For example
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT NULL, '77765-555-0990'
;
delete from @t where id>3 and CategoryId=3200;
SELECT *
FROM (SELECT RandomRowNumber = min(id)+1+ABS(CHECKSUM(NEWID()))%(max(id)-min(id)+1) FROM @T) t
CROSS APPLY (SELECT TOP 1 * FROM @T WHERE RandomRowNumber > Id ORDER BY Id DESC) x
This will return Id=3 far frequently then others.

Yes it will. You could mitigate that somewhat by using TOP(n), whatever value you want for n, and choosing one of those rows at random - but see Kevin's post below for an alternative method where ID's have been deleted.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

serg-52 SSCrazy Eights Points: 9913 More actions · Answer 7

TheSQLGuru (12/9/2015)
You simply MUST NOT order all rows to get one!! That is an immensely costly operation!
One quick-and-easy way is to get min and max ID values and create a WHILE loop and generate a random variable int value between those (variety of ways to accomplish this) and then seek on that value with a WHERE id = @id filter. This will be a VERY efficient PK index seek operation. If you don't ever do any deletes and you start with 1 and go up you simply need to get the max value and you will be almost guaranteed to get a hit on the first call, only missing if someone rolled back an insert. Even if you have fairly large ranges of missing records this will almost certainly be faster than scanning all rows into a temp table and doing the row_number, NEWID() thing.

I think it totally depends on how much "random" this random row is required to be. If (p(id) >0 for any id) is quite enough then i agree, no need to enumerate rows. If more strict (p(id1)=p(id2) for any id1,id2) is required then i can't see how to achieve this without row_numbering ids first in general case.

kiril.lazarov.77 Hall of Fame Points: 3742 More actions · Answer 8

Thank you everyone for the replies, very helpful!!

Unfortunately there's no guarantee that there would be no gaps. Also, I need ti implement a logic similar to this. E.g. pick a random row if the declared CategoryId exists, otherwise pick a random row from the set where categoryId is null. I know it's ugly but I just want to demonstrate the logic and expected results. Thank you.

DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)

INSERT INTO @T ( CategoryId, SKU)

SELECT 3200, '978-654-321' UNION ALL

SELECT NULL, '53-555-64' UNION ALL

SELECT 3200, '969-555-010' UNION ALL

SELECT NULL, '888-555-010' UNION ALL

SELECT NULL, '888-555-66' UNION ALL

SELECT 3200, '969-555-010' UNION ALL

SELECT NULL, '23-555-222' UNION ALL

SELECT 3200, '000-258-963' UNION ALL

SELECT 3200, '969-555-010' UNION ALL

SELECT 6565, '969-555-010' UNION ALL

SELECT 6565, '23-555-222' UNION ALL

SELECT 6565, '000-258-963' UNION ALL

SELECT NULL, '888-555-010' UNION ALL

SELECT 55, '888-555-66' UNION ALL

SELECT 55, '969-555-010' UNION ALL

SELECT 33, '969-555-010' UNION ALL

SELECT NULL, '77765-555-0990'

DECLARE @DS1MinValue INT

DECLARE @DS1MaxValue INT

DECLARE @DS2MinValue INT

DECLARE @DS2MaxValue INT

DECLARE @CategoryId INT = 6565

CREATE TABLE #TBL (Id INT, Rn INT, CategoryId INT, SKU VARCHAR(50))

INSERT INTO #TBL(Id, Rn, CategoryId, SKU)

SELECT Id, row_number() over(partition by case when categoryId is null then 0 else categoryId end order by case when categoryId is null then 0 else categoryId end ) AS Rn, CategoryId, SKU

from @t

SELECT @DS1MinValue= (select MIN(Rn) from #TBL where categoryId = @CategoryId)

SELECT @DS1MaxValue= (select MAX(Rn) from #TBL where categoryId = @CategoryId)

SELECT @DS2MinValue= (select MIN(Rn) from #TBL where categoryId is null)

SELECT @DS2MaxValue= (select MAX(Rn) from #TBL where categoryId is null)

DECLARE @DS1Rnd INT = (SELECT ROUND(((@DS1MaxValue - @DS1MinValue) * RAND() + @DS1MinValue), 0))

DECLARE @DS2Rnd INT = (SELECT ROUND(((@DS2MaxValue - @DS2MinValue) * RAND() + @DS2MinValue), 0))

if exists (select Id from #TBL where CategoryId = @CategoryId)

begin

-- pick a random row from data set where CategoryId IS NOT NULL

SELECT * fROM #TBL WHERE CategoryId = @CategoryId AND Rn = @DS1Rnd

end

else

begin

-- pick a random row from data set where CategoryId IS NULL

SELECT * fROM #TBL WHERE CategoryId IS NULL AND Rn = @DS2Rnd

end

DROP TABLE #TBL

ScottPletcher SSC Guru Points: 101099 More actions · Answer 9

Create a separate index of just the IDs. Then select a random ID using ORDER BY NEWID(), then use that ID to get the rest of the columns. You certainly don't want to have to sort the entire row.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 10

Handling gaps with my mechanism really is simple and efficient. Been there, done that, got the tshirt.

But your new requirement about category makes it much more likely you will need to do the sort thing, although possibly with an exists check first. If row exists with your category, do the sort/rownumber thing for that category. If it doesn't exist do the sort/rownumber thing for null categories. As Scott said, these would hopefully all be narrow-index hits.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

kiril.lazarov.77 Hall of Fame Points: 3742 More actions · Answer 11

TheSQLGuru (12/9/2015)
Handling gaps with my mechanism really is simple and efficient. Been there, done that, got the tshirt.
But your new requirement about category makes it much more likely you will need to do the sort thing, although possibly with an exists check first. If row exists with your category, do the sort/rownumber thing for that category. If it doesn't exist do the sort/rownumber thing for null categories. As Scott said, these would hopefully all be narrow-index hits.

Hi Kevin, Many thanks for your help. So basically you're saying I should use the script I attached a couple of comments above? Any way I can make that faster? That's very slow when I run it on my db. Thanks

kiril.lazarov.77 Hall of Fame Points: 3742 More actions · Answer 12

ScottPletcher (12/9/2015)
Create a separate index of just the IDs. Then select a random ID using ORDER BY NEWID(), then use that ID to get the rest of the columns. You certainly don't want to have to sort the entire row.

Thanks for your response. Can you provide an example, please? Index on the Ids? That's my primary key (clustered index) so not sure what you mean. Thanks!!

ScottPletcher SSC Guru Points: 101099 More actions · Answer 13

kiril.lazarov.77 (12/10/2015)
ScottPletcher (12/9/2015)
Create a separate index of just the IDs. Then select a random ID using ORDER BY NEWID(), then use that ID to get the rest of the columns. You certainly don't want to have to sort the entire row.
Thanks for your response. Can you provide an example, please? Index on the Ids? That's my primary key (clustered index) so not sure what you mean. Thanks!!

Depends on the width of the row. If it's much more than 4 bytes -- and usually it is -- you'l need the extra index to prevent a full table scan. In fact, SQL loves clustered indexes so much, you might even have to force it to use the nonclus index on Id.

CREATE UNIQUE NONCLUSTERED INDEX [index_name] ON [table_name] ( Id ) WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON ) ON [filegroupname];

DECLARE @Id int

SELECT TOP (1) @Id = Id

FROM [table_name] WITH (INDEX([index_name]))

ORDER BY NEWID()

SELECT * --col,col,col...

FROM [table_name]

WHERE Id = @Id

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

ScottPletcher SSC Guru Points: 101099 More actions · Answer 14

kiril.lazarov.77 (12/10/2015)
ScottPletcher (12/9/2015)
Create a separate index of just the IDs. Then select a random ID using ORDER BY NEWID(), then use that ID to get the rest of the columns. You certainly don't want to have to sort the entire row.
Thanks for your response. Can you provide an example, please? Index on the Ids? That's my primary key (clustered index) so not sure what you mean. Thanks!!

Depends on the width of the row. If it's much more than 4 bytes -- and usually it is -- you'l need the extra index to prevent a full table scan. In fact, SQL loves clustered indexes so much, you might even have to force it to use the nonclus index on Id.

CREATE UNIQUE NONCLUSTERED INDEX [index_name] ON [table_name] ( Id ) WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON ) ON [filegroupname];

DECLARE @Id int

SELECT TOP (1) @Id = Id

FROM [table_name] WITH (INDEX([index_name]))

ORDER BY NEWID()

SELECT * --col,col,col...

FROM [table_name]

WHERE Id = @Id

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".