August 9, 2007 at 11:53 am
Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/awarren/3175.asp
August 22, 2007 at 9:59 pm
How about this?
http://askgabbar.blogspot.com/2007/03/pick-up-random-records-from-table.html
Didn't check it for performance though.
August 22, 2007 at 10:39 pm
I had a need for pseudo randomness (display a few "random" images from a gallery) that I addressed thusly:
-Assign a random value to each row as a column value and have a lookup index on this.
-Each time I want 10 random images, I grab TOP 10 order by abs(@currentRandom - persistedRandom) ASC. So this gives me the nearest 10 rows to whatever my random seek point is. The problem is you will get clusters of results for a given set of persistedRandom values.
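In rough T-SQL, a sketch of what I mean (dbo.Images and ImageId are just placeholder names, and the random column only needs populating once):
-- one-time setup: persist a pseudo-random value per row and index it
ALTER TABLE dbo.Images ADD persistedRandom FLOAT NOT NULL DEFAULT 0;
UPDATE dbo.Images
SET persistedRandom = CAST(CHECKSUM(NEWID()) & 0x7fffffff AS FLOAT) / 2147483647.0;
CREATE INDEX IX_Images_persistedRandom ON dbo.Images (persistedRandom);
-- per request: pick a random seek point and take the 10 nearest rows to it
DECLARE @currentRandom FLOAT;
SET @currentRandom = RAND();
SELECT TOP 10 ImageId
FROM dbo.Images
ORDER BY ABS(@currentRandom - persistedRandom) ASC;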
I'm sure there are better ways, but this worked decently for my needs.
August 23, 2007 at 2:51 am
I have also not had time to benchmark performance, but here is another technique that gives the illusion of randomisation:
ORDER BY SUBSTRING(REVERSE(REPLACE(table.textcolumn, ' ', '')), CONVERT(int, RAND() * 10), 1) ASC
This uses the text in a column from the query, reverses it and removes spaces, and then extracts a randomly selected character to order the results.
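For example, against a hypothetical dbo.Products table with a ProductName column, the whole query would look something like this (the +1 just keeps the start position in the 1-10 range, since SUBSTRING is 1-based):
SELECT TOP 10 ProductId, ProductName
FROM dbo.Products
ORDER BY SUBSTRING(REVERSE(REPLACE(ProductName, ' ', '')), CONVERT(int, RAND() * 10) + 1, 1) ASC;
As far as I know, RAND() is evaluated only once per query here, so the same character position gets used for every row within a single execution.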
Obviously this is limited to situations where you have a text column to work with...
August 23, 2007 at 6:50 am
How do you calculate the reads?
August 23, 2007 at 7:45 am
Minor point... at the end of the article, where you say you could set the uniqueID column as a primary key, I think you mean to set it as a clustered index, thus allowing rows to be returned in GUID order without any sorting effort. It's common to mix up the primary key and the clustered index of a table, as they often end up being one and the same when the defaults are accepted.
It is tricky getting random data in a set-based fashion. Perhaps you could do something along the lines of:
1. Having a unique, gap-free row number for each row in your table - an identity column would be perfect
2. Indexing by this column, either clustered or with at least a covering index
3. Using some mechanism to generate the row numbers to select - perhaps the pairs of digits in a generated GUID (or triplets, quads, etc. depending on how many rows you have)
Step #3 might be the lengthy and wasteful bit, though... Will give it some more thought.
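Something along these lines, maybe (completely untested sketch against a hypothetical dbo.MyTable with a gap-free identity column RowNum; I've used ABS(CHECKSUM(NEWID())) modulo the row count as a quick stand-in for picking digit pairs out of a GUID):
DECLARE @RowCount int, @WantedRows int, @i int;
SELECT @RowCount = COUNT(*) FROM dbo.MyTable;
SELECT @WantedRows = 10, @i = 0;
-- step 3: generate @WantedRows random row numbers in the range 1..@RowCount
-- (duplicates are possible, so this can come back with fewer than 10 distinct rows)
DECLARE @Picks TABLE (RowNum int);
WHILE @i < @WantedRows
BEGIN
INSERT @Picks VALUES (ABS(CHECKSUM(NEWID()) % @RowCount) + 1);
SET @i = @i + 1;
END
-- steps 1 & 2 pay off here: the join seeks straight into the identity index
SELECT t.*
FROM dbo.MyTable t
JOIN @Picks p ON t.RowNum = p.RowNum;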
August 23, 2007 at 8:00 am
When I first started to read this, I thought "Jeez... that's a little obvious, Andy. Hardly worth an article." Now that I see how some folks are trying to randomize data, I have to admit that such a short and sweet article about it is well overdue.
I don't believe that anyone will come up with a method that is either faster or more consistently random than just doing the ORDER BY NEWID()... but it is fun to watch folks try.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 23, 2007 at 8:36 am
Using "ORDER BY CHECKSUM(NEWID())" rather than "ORDER BY NEWID()" apparently produces values with a better random distribution.
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
August 23, 2007 at 9:23 am
Ian,
I think this is a good approach, and it's actually what some coworkers and I ended up with a while back (more them than me). Our problem involved noncontiguous ranges, so using an identity column was not sufficient. The solution was to use the row_number() ranking function and join against a temp table of random values.
I find that with 3.3 million rows, this approach uses about half the CPU and runs about twice as fast as ORDER BY NEWID() in this contrived test. Subsequent executions use about 20% of the CPU that ORDER BY NEWID() does, and are still about twice as fast.
So when you have many rows and a noncontiguous key to index off of, this approach appears to be superior, if not as simple.
-- BEGIN TEST PREP
create table test_table (number int primary key clustered, payload varchar(1));
WITH
-- first CTE which returns 10 rows (0-9)
digits AS (
SELECT 0 as Number
UNION SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
)
-- second CTE which returns 10 million rows by using
-- a CROSS JOIN on the first CTE
, dig AS (
SELECT
(millions.Number * 1000000)
+ (hThousands.Number * 100000)
+ (tThousands.Number * 10000)
+ (thousands.Number * 1000)
+ (hundreds.Number * 100)
+ (tens.Number * 10)
+ ones.Number AS Number
FROM digits AS ones
CROSS JOIN digits AS tens
CROSS JOIN digits AS hundreds
CROSS JOIN digits AS thousands
CROSS JOIN digits AS tThousands
CROSS JOIN digits AS hThousands
CROSS JOIN digits as millions
)
INSERT test_table(number, payload)
SELECT number, 'z'
FROM dig
WHERE number % 3 = 0
go
--END TEST PREP
set statistics time on
set statistics io on
--BEGIN TEST 1
select top 10 number, payload from test_table order by newid();
GO 2
--END TEST 1
--BEGIN TEST 2
DECLARE @TV table(rand_num int)
DECLARE @max int, @num int
select @num = 10, @max = count(*) from test_table;
-- generate @num random row numbers in the range 1..@max
DECLARE @i int
SET @i = 0
WHILE (@i < @num)
BEGIN
INSERT @TV values (ceiling(rand() * @max))
SET @i = @i + 1
END
SELECT *
FROM
(
select
row_number() over (order by number) row,
*
from test_table
) a
WHERE a.row in (SELECT rand_num from @TV)
GO 2
--END TEST 2
August 23, 2007 at 10:28 am
Jeff, I'll grant it is a simple topic - I figured by now I'd get clobbered by some statistics major for even using the word random! Maybe someone with a stats background could analyze all the options posed to see which is the most random / best random? :-)
August 23, 2007 at 11:57 am
I actually needed to use this on something, so I've just spent an hour playing around with it.
Here's some testing I've just done on a Messages table I have which has just over a million rows.
It compares the NEWID() method, the CHECKSUM(NEWID()) method, and a 3rd method I've just come up with (which is heavily based on the 'important' note in this link: http://msdn2.microsoft.com/en-us/library/ms189108.aspx).
Sadly, I've not had time to play around with the other methods suggested here, and I'm just going to go for Method 3 for the time being (for my data, it's more than twice as fast as newid() on average, and up to 6 times as fast for small sample sizes).
Here's the code I used, with the results underneath...
--table to hold results
CREATE TABLE #Results (NumberOfTableRows INT, NumberOfRows INT, ProbabilityOfEachRow FLOAT,
Method1_TimeInMs INT, Method2_TimeInMs INT, Method3_TimeInMs INT)
DECLARE @Counter TINYINT
SET @Counter = 0
WHILE @Counter <= 10 --perform the comparison this many times
BEGIN
DECLARE @NumberOfRows INT
SET @NumberOfRows = power(10, rand()*5) --random sample size to test with
--
DECLARE @d DATETIME, @t1 INT, @t2 INT, @t3 INT
SET @d = GETDATE();
--Method 1
SELECT TOP (@NumberOfRows) MessageId FROM dbo.Messages ORDER BY NEWID()
--
SET @t1 = DATEDIFF(ms, @d, GETDATE()); SET @d = GETDATE();
--Method 2
SELECT TOP (@NumberOfRows) MessageId FROM dbo.Messages ORDER BY CHECKSUM(NEWID())
--
SET @t2 = DATEDIFF(ms, @d, GETDATE()); SET @d = GETDATE();
--Method 3
DECLARE @NumberOfTableRows INT, @ProbabilityOfEachRow FLOAT
SELECT @NumberOfTableRows = COUNT(*) from dbo.Messages
SET @ProbabilityOfEachRow = (1+SQRT(1.0/@NumberOfRows)*10) * @NumberOfRows / @NumberOfTableRows --weighted overestimate
SELECT TOP (@NumberOfRows) MessageId FROM Messages --select required number from overestimated sample
WHERE @ProbabilityOfEachRow >= CAST(CHECKSUM(NEWID(), MessageId) & 0x7fffffff AS FLOAT) / CAST (0x7fffffff AS INT)
--
SET @t3 = DATEDIFF(ms, @d, GETDATE())
--Insert results
INSERT #Results SELECT @NumberOfTableRows, @NumberOfRows, @ProbabilityOfEachRow, @t1, @t2, @t3
--
SET @Counter = @Counter + 1
END
--select results and tidy up
SELECT *, NumberOfTableRows * ProbabilityOfEachRow as NumberOfRows_OverEstimate FROM #Results
SELECT AVG(Method1_TimeInMs) AS 'Method1', AVG(Method2_TimeInMs) AS 'Method2', AVG(Method3_TimeInMs) AS 'Method3' FROM #Results
--DROP TABLE #Results
/*
NumberOfTableRows NumberOfRows ProbabilityOfEachRow Method1_TimeInMs Method2_TimeInMs Method3_TimeInMs NumberOfRows_OverEstimate
----------------- ------------ ---------------------- ---------------- ---------------- ---------------- -------------------------
1079894 7365 0.00761481754808218 2326 1796 1406 8223.19578126865
1079894 49655 0.0480448452758635 2940 2420 1580 51883.3401443334
1079894 16593 0.0165582345926519 2373 1860 1513 17881.1381871972
1079894 63666 0.0612923233001256 3220 2626 1653 66189.2121778658
1079894 15 4.97547291325635E-05 2143 1546 530 53.7298334618805
1079894 18703 0.0185857029646503 2486 1843 1530 20070.589117308
1079894 12255 0.0123734579816154 2266 1763 1406 13362.0230335986
1079894 8403 0.00863017924625119 2220 1703 1450 9319.67878695118
1079894 10 3.85433909269649E-05 2283 1576 390 41.6227766016838
1079894 1 1.01861849403738E-05 2190 1593 376 11
1079894 6 2.82387877215567E-05 2153 1593 266 30.4948974277828
(11 row(s) affected)
--Average time in ms
Method1 Method2 Method3
----------- ----------- -----------
2418 1847 1100
(1 row(s) affected)
*/
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
August 23, 2007 at 12:05 pm
Ryan,
Approach #3 is very interesting, but I can't imagine it outperforms the solution I posted on page 1 of this thread. If you end up testing that one out as well I would be very curious to see how it compares.
August 23, 2007 at 5:31 pm
You know me, Ryan... extraordinary claims require extraordinary proof... got code to prove that?
--Jeff Moden
Change is inevitable... Change for the better is not.
August 23, 2007 at 5:36 pm
Heh... I'm no stats ninja, either... but it's a heck of a lot more random than the other solutions I've seen, with fairly even and unpredictable results over both small and large ranges of data.
I've tested it about a gazillion times in the past... I'm happy with it... someone else can write some code to prove otherwise.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 23, 2007 at 10:55 pm
Dang... sorry about that... I responded to the first page and didn't look on the 2nd page... sorry Ryan.
--Jeff Moden
Change is inevitable... Change for the better is not.