December 8, 2015 at 6:28 pm
Hi all,
While the query below works as expected, it's very slow when I run it on a table with 1million+ rows. Is there an alternative to ORDER BY NEWID()?
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT NULL, '77765-555-0990'
SELECT TOP 1 Id,
CategoryId,
SKU
FROM @t
ORDER BY NEWID()
I also tried the below code (taken from https://msdn.microsoft.com/en-us/library/cc441928.aspx) but it sometimes returns nothing so is unreliable and changing "< 10" to "> 0" always returns the first row. Please help me getting one single & random row from the above data set efficiently. Thank you.
SELECT TOP 1 Id,
CategoryId,
SKU
FROM @t
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10
December 8, 2015 at 7:42 pm
Here is something to play with
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT NULL, '77765-555-0990'
DECLARE @RandomRowNumber BIGINT = (SELECT CAST(ROUND((COUNT(*) - 1) * Rand() + 1, 0) AS BIGINT) FROM @T) ;
;WITH dataCTE AS
(
SELECT *
, RN = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM @T
)
SELECT *
FROM dataCTE
WHERE RN = @RandomRowNumber ;
December 9, 2015 at 3:39 am
Aditya Daruka (12/8/2015)
Here is something to play with
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT NULL, '77765-555-0990'
DECLARE @RandomRowNumber BIGINT = (SELECT CAST(ROUND((COUNT(*) - 1) * Rand() + 1, 0) AS BIGINT) FROM @T) ;
;WITH dataCTE AS
(
SELECT *
, RN = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM @T
)
SELECT *
FROM dataCTE
WHERE RN = @RandomRowNumber ;
Replace one of those two clustered index scans with a seek:
SELECT *
FROM (SELECT RandomRowNumber = 1+ABS(CHECKSUM(NEWID()))%COUNT(*) FROM @T) t
CROSS APPLY (SELECT TOP 1 * FROM @T WHERE RandomRowNumber > Id ORDER BY Id DESC) x
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 9, 2015 at 4:25 am
See my next post, don't know how to delete this error posting.
December 9, 2015 at 4:27 am
Most probably there were a lot of DELETEs an INSERTs on a table with 1million+ rows. So IDs hardly are evenly distributed on (1 .. count(*)) interval or even on (Min(id) ..Max(id) ) interval. Output will not be really random.
For example
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT NULL, '77765-555-0990'
;
delete from @t where id>3 and CategoryId=3200;
SELECT *
FROM (SELECT RandomRowNumber = min(id)+1+ABS(CHECKSUM(NEWID()))%(max(id)-min(id)+1) FROM @T) t
CROSS APPLY (SELECT TOP 1 * FROM @T WHERE RandomRowNumber > Id ORDER BY Id DESC) x
This will return Id=3 far frequently then others.
December 9, 2015 at 6:07 am
You simply MUST NOT order all rows to get one!! That is an immensely costly operation!
One quick-and-easy way is to get min and max ID values and create a WHILE loop and generate a random variable int value between those (variety of ways to accomplish this) and then seek on that value with a WHERE id = @id filter. This will be a VERY efficient PK index seek operation. If you don't ever do any deletes and you start with 1 and go up you simply need to get the max value and you will be almost guaranteed to get a hit on the first call, only missing if someone rolled back an insert. Even if you have fairly large ranges of missing records this will almost certainly be faster than scanning all rows into a temp table and doing the row_number, NEWID() thing.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 9, 2015 at 6:15 am
serg-52 (12/9/2015)
Most probably there were a lot of DELETEs an INSERTs on a table with 1million+ rows. So IDs hardly are evenly distributed on (1 .. count(*)) interval or even on (Min(id) ..Max(id) ) interval. Output will not be really random.For example
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT NULL, '77765-555-0990'
;
delete from @t where id>3 and CategoryId=3200;
SELECT *
FROM (SELECT RandomRowNumber = min(id)+1+ABS(CHECKSUM(NEWID()))%(max(id)-min(id)+1) FROM @T) t
CROSS APPLY (SELECT TOP 1 * FROM @T WHERE RandomRowNumber > Id ORDER BY Id DESC) x
This will return Id=3 far frequently then others.
Yes it will. You could mitigate that somewhat by using TOP(n), whatever value you want for n, and choosing one of those rows at random - but see Kevin's post below for an alternative method where ID's have been deleted.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 9, 2015 at 6:43 am
TheSQLGuru (12/9/2015)
You simply MUST NOT order all rows to get one!! That is an immensely costly operation!One quick-and-easy way is to get min and max ID values and create a WHILE loop and generate a random variable int value between those (variety of ways to accomplish this) and then seek on that value with a WHERE id = @id filter. This will be a VERY efficient PK index seek operation. If you don't ever do any deletes and you start with 1 and go up you simply need to get the max value and you will be almost guaranteed to get a hit on the first call, only missing if someone rolled back an insert. Even if you have fairly large ranges of missing records this will almost certainly be faster than scanning all rows into a temp table and doing the row_number, NEWID() thing.
I think it totally depends on how much "random" this random row is required to be. If (p(id) >0 for any id) is quite enough then i agree, no need to enumerate rows. If more strict (p(id1)=p(id2) for any id1,id2) is required then i can't see how to achieve this without row_numbering ids first in general case.
December 9, 2015 at 1:15 pm
Thank you everyone for the replies, very helpful!!
Unfortunately there's no guarantee that there would be no gaps. Also, I need ti implement a logic similar to this. E.g. pick a random row if the declared CategoryId exists, otherwise pick a random row from the set where categoryId is null. I know it's ugly but I just want to demonstrate the logic and expected results. Thank you.
DECLARE @T TABLE (Id INT IDENTITY(1,1) PRIMARY KEY, CategoryId INT NULL, SKU VARCHAR(60) NOT NULL)
INSERT INTO @T ( CategoryId, SKU)
SELECT 3200, '978-654-321' UNION ALL
SELECT NULL, '53-555-64' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT NULL, '888-555-66' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT NULL, '23-555-222' UNION ALL
SELECT 3200, '000-258-963' UNION ALL
SELECT 3200, '969-555-010' UNION ALL
SELECT 6565, '969-555-010' UNION ALL
SELECT 6565, '23-555-222' UNION ALL
SELECT 6565, '000-258-963' UNION ALL
SELECT NULL, '888-555-010' UNION ALL
SELECT 55, '888-555-66' UNION ALL
SELECT 55, '969-555-010' UNION ALL
SELECT 33, '969-555-010' UNION ALL
SELECT NULL, '77765-555-0990'
DECLARE @DS1MinValue INT
DECLARE @DS1MaxValue INT
DECLARE @DS2MinValue INT
DECLARE @DS2MaxValue INT
DECLARE @CategoryId INT = 6565
CREATE TABLE #TBL (Id INT, Rn INT, CategoryId INT, SKU VARCHAR(50))
INSERT INTO #TBL(Id, Rn, CategoryId, SKU)
SELECT Id, row_number() over(partition by case when categoryId is null then 0 else categoryId end order by case when categoryId is null then 0 else categoryId end ) AS Rn, CategoryId, SKU
from @t
SELECT @DS1MinValue= (select MIN(Rn) from #TBL where categoryId = @CategoryId)
SELECT @DS1MaxValue= (select MAX(Rn) from #TBL where categoryId = @CategoryId)
SELECT @DS2MinValue= (select MIN(Rn) from #TBL where categoryId is null)
SELECT @DS2MaxValue= (select MAX(Rn) from #TBL where categoryId is null)
DECLARE @DS1Rnd INT = (SELECT ROUND(((@DS1MaxValue - @DS1MinValue) * RAND() + @DS1MinValue), 0))
DECLARE @DS2Rnd INT = (SELECT ROUND(((@DS2MaxValue - @DS2MinValue) * RAND() + @DS2MinValue), 0))
if exists (select Id from #TBL where CategoryId = @CategoryId)
begin
-- pick a random row from data set where CategoryId IS NOT NULL
SELECT * fROM #TBL WHERE CategoryId = @CategoryId AND Rn = @DS1Rnd
end
else
begin
-- pick a random row from data set where CategoryId IS NULL
SELECT * fROM #TBL WHERE CategoryId IS NULL AND Rn = @DS2Rnd
end
DROP TABLE #TBL
December 9, 2015 at 1:18 pm
Create a separate index of just the IDs. Then select a random ID using ORDER BY NEWID(), then use that ID to get the rest of the columns. You certainly don't want to have to sort the entire row.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
December 9, 2015 at 3:25 pm
Handling gaps with my mechanism really is simple and efficient. Been there, done that, got the tshirt.
But your new requirement about category makes it much more likely you will need to do the sort thing, although possibly with an exists check first. If row exists with your category, do the sort/rownumber thing for that category. If it doesn't exist do the sort/rownumber thing for null categories. As Scott said, these would hopefully all be narrow-index hits.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 10, 2015 at 5:59 pm
TheSQLGuru (12/9/2015)
Handling gaps with my mechanism really is simple and efficient. Been there, done that, got the tshirt.But your new requirement about category makes it much more likely you will need to do the sort thing, although possibly with an exists check first. If row exists with your category, do the sort/rownumber thing for that category. If it doesn't exist do the sort/rownumber thing for null categories. As Scott said, these would hopefully all be narrow-index hits.
Hi Kevin, Many thanks for your help. So basically you're saying I should use the script I attached a couple of comments above? Any way I can make that faster? That's very slow when I run it on my db. Thanks
December 10, 2015 at 6:01 pm
ScottPletcher (12/9/2015)
Create a separate index of just the IDs. Then select a random ID using ORDER BY NEWID(), then use that ID to get the rest of the columns. You certainly don't want to have to sort the entire row.
Thanks for your response. Can you provide an example, please? Index on the Ids? That's my primary key (clustered index) so not sure what you mean. Thanks!!
December 10, 2015 at 8:19 pm
kiril.lazarov.77 (12/10/2015)
ScottPletcher (12/9/2015)
Create a separate index of just the IDs. Then select a random ID using ORDER BY NEWID(), then use that ID to get the rest of the columns. You certainly don't want to have to sort the entire row.Thanks for your response. Can you provide an example, please? Index on the Ids? That's my primary key (clustered index) so not sure what you mean. Thanks!!
Depends on the width of the row. If it's much more than 4 bytes -- and usually it is -- you'l need the extra index to prevent a full table scan. In fact, SQL loves clustered indexes so much, you might even have to force it to use the nonclus index on Id.
CREATE UNIQUE NONCLUSTERED INDEX [index_name] ON [table_name] ( Id ) WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON ) ON [filegroupname];
DECLARE @Id int
SELECT TOP (1) @Id = Id
FROM [table_name] WITH (INDEX([index_name]))
ORDER BY NEWID()
SELECT * --col,col,col...
FROM [table_name]
WHERE Id = @Id
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
December 10, 2015 at 8:19 pm
kiril.lazarov.77 (12/10/2015)
ScottPletcher (12/9/2015)
Create a separate index of just the IDs. Then select a random ID using ORDER BY NEWID(), then use that ID to get the rest of the columns. You certainly don't want to have to sort the entire row.Thanks for your response. Can you provide an example, please? Index on the Ids? That's my primary key (clustered index) so not sure what you mean. Thanks!!
Depends on the width of the row. If it's much more than 4 bytes -- and usually it is -- you'l need the extra index to prevent a full table scan. In fact, SQL loves clustered indexes so much, you might even have to force it to use the nonclus index on Id.
CREATE UNIQUE NONCLUSTERED INDEX [index_name] ON [table_name] ( Id ) WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON ) ON [filegroupname];
DECLARE @Id int
SELECT TOP (1) @Id = Id
FROM [table_name] WITH (INDEX([index_name]))
ORDER BY NEWID()
SELECT * --col,col,col...
FROM [table_name]
WHERE Id = @Id
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 15 posts - 1 through 15 (of 16 total)
You must be logged in to reply to this topic. Login to reply