March 4, 2014 at 10:33 am
KoldCoffee (3/4/2014)
Hi Dixie Flatline, I saw this on another post and was trying to understand it, so I appreciate that you explained the code more in this version. I will study it. Thanks.
No problem, KC. Be happy to answer any questions.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 4, 2014 at 10:42 am
The Dixie Flatline (3/4/2014)
Masterful? Do you need to borrow money or something, Jason?
Well, since you asked... :-D
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
March 4, 2014 at 11:47 am
Here is a patch for the original code, which was intended to find a small number of rows (50). It assumes that the key column N in the Million_3 table is reasonably dense (not mostly gaps). It gains speed by avoiding a full table scan and sort.
Essentially it generates three times the number of randomIDs needed for the result set, then joins that set of randomIDs to the Million_3 table (ensuring that only "hits" are returned), and does a DISTINCT TOP (@rows) to eliminate duplicates. While it's theoretically possible for fewer than @rows to be returned, it is so much faster for small values of @rows that a loop could check that @rows were actually returned and rerun the query if necessary.
The attached code will work with the Million_3 table from yesterday's example.
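Since the attachment itself isn't reproduced here, below is a minimal sketch of the approach as described, assuming Million_3 has an integer key column N with values running from 1 to about 1,000,000 (the column name and range are assumptions; the actual attachment may differ):
---------------------
DECLARE @rows int = 50;

-- Hypothetical reconstruction of the patch described above.
WITH randomIDs AS
(
    SELECT TOP (@rows * 3)                        -- 3x the rows needed, to absorb gaps and duplicates
           ABS(CHECKSUM(NEWID())) % 1000000 + 1 AS N
    FROM sys.all_columns                          -- any convenient row source with enough rows
)
SELECT DISTINCT TOP (@rows) m.*                   -- DISTINCT eliminates duplicate hits
FROM randomIDs r
JOIN dbo.Million_3 m
  ON m.N = r.N;                                   -- the join ensures only "hits" are returned
---------------------
In practice you would wrap this in a loop that checks @@ROWCOUNT and reruns the query if fewer than @rows rows came back, as noted above.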
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 4, 2014 at 12:16 pm
Just a small note: which method you choose may depend on your environment. By that I mean some environments are more disk bound than CPU bound, some vice versa, etc.
If you look at the method I posted back on 12/16/2010 at 11:17:04 PM, you will see that it is slightly slower than the method Dixie posted. However, it is less resource intensive. Granted, the difference is rather trivial, but perhaps it is of interest to someone. Here are the statistical differences:
Dixie's:
Table 'Million_3'. Scan count 5, logical reads 14085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1980 ms, elapsed time = 822 ms.
Lamprey's:
Table 'Million_3'. Scan count 1, logical reads 9418, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1450 ms, elapsed time = 1476 ms.
March 4, 2014 at 12:32 pm
Personally, I use a dart board. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
March 4, 2014 at 12:36 pm
Lamprey13 (12/16/2010)
Here is another method I've seen/used that is pretty similar to the ORDER BY NEWID(), but much faster. NOTE: Replace "SomeColumnName" with some column or set of columns from your table; PKs work well.
SELECT TOP 10 * FROM MyTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SomeColumnName) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
I like the CHECKSUM idea because you can use it for non-integer keys. However, there is a caveat.
This technique would certainly be fast, but it's subject to what I refer to as "frontloading". That is to say, because it takes the first ten rows that match its WHERE condition, there is a tendency to pull rows from the first part of the table, by whatever sort order is specified.
Running the above code against the Million_3 table, it always produced selections taken from the first couple thousand rows. Depending on the data, this may defeat the randomness of the sampling.
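One quick way to see the frontloading for yourself (a hypothetical check, not from the original post, again assuming an integer key column N in Million_3) is to capture the keys the query returns and look at how deep into the table they reach:
---------------------
-- Sample with the CHECKSUM filter, then inspect the range of keys that came back.
SELECT TOP 10 N
INTO #sample
FROM dbo.Million_3
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), N) & 0x7fffffff AS float) / CAST(0x7fffffff AS int);

SELECT MIN(N) AS lowest_key,
       MAX(N) AS highest_key    -- consistently small values betray frontloading
FROM #sample;

DROP TABLE #sample;
---------------------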
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 4, 2014 at 12:37 pm
Jeff Moden (3/4/2014)
Personally, I use a dart board. 😛
How hard do you have to fling a pork chop to make it stick, Jeff? 😀
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 4, 2014 at 12:48 pm
The Dixie Flatline (3/4/2014)
Lamprey13 (12/16/2010)
Here is another method I've seen/used that is pretty similar to the ORDER BY NEWID(), but much faster. NOTE: Replace "SomeColumnName" with some column or set of columns from your table; PKs work well.
SELECT TOP 10 * FROM MyTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SomeColumnName) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
I like the CHECKSUM idea because you can use it for non-integer keys. However, there is a caveat.
This technique would certainly be fast, but it's subject to what I refer to as "frontloading". That is to say, because it takes the first ten rows that match its WHERE condition, there is a tendency to pull rows from the first part of the table, by whatever sort order is specified.
Running the above code against the Million_3 table, it always produced selections taken from the first couple thousand rows. Depending on the data, this may defeat the randomness of the sampling.
Fully agreed.
Like you mentioned, it does work on other data types. However, it does skew things a bit in its "randomness." I did get slightly different results, in that it always got over 2m. So, that may not be random enough for some.
Just food for thought, depending on people's needs. 🙂
March 4, 2014 at 1:10 pm
Okay, I just had to do some testing.
If frontloading isn't an issue, you can use the following in place of the CHECKSUM test and get similar results with fewer CPU cycles. However, as Jason pointed out, selection of random rows is often a one-off task. Like you said, it depends on the circumstances of each problem.
---------------------
;WITH cte AS
(
    SELECT *,
           ABS(CHECKSUM(NEWID())) % 1000 AS randomNo   -- random integer 0-999 per row
    FROM Million_3
)
SELECT TOP 50 *
FROM cte
WHERE randomNo < 10;   -- keeps roughly 1% of the rows; TOP 50 takes the first of those
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
July 12, 2018 at 5:34 pm
The Dixie Flatline - Friday, December 17, 2010 11:53 AM
Here is an example of pulling 20 random rows from a million row table. Note to Jeff Moden: It may not be "Nasty Fast" but it's "Wicked Quick" :-D
Had to make this an attachment. For some reason, SSC is blocking my posts.
Add a random seed to it to make it more random. Use the thousandths-of-a-second integer as the seed. The odds of hitting the exact same thousandth of a second are pretty slim, and it makes the generator start at a random point in the random number sequence. Computers can't generate truly random numbers; they draw on a pseudo-random sequence produced by a fixed algorithm, and successive random numbers simply follow that sequence (so not really random). If you start up the computer and follow the exact same sequence of calls, you will generate the same random numbers each time. RAND(DATEPART(ms, GETDATE())) will make it start out at a fairly random point in the sequence each time; it's hard to hit the Enter key at the exact same thousandth of a second.
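A minimal sketch of the seeding idea, assuming a SQL Server session (note that DATEPART(ms, ...) returns only 0-999, so this gives at most 1,000 distinct starting points):
---------------------
-- RAND(seed) reseeds the connection's random sequence and returns the first value;
-- subsequent unseeded RAND() calls continue from that point.
SELECT RAND(DATEPART(ms, GETDATE()));   -- seed from the current millisecond
SELECT RAND();                          -- next value in the reseeded sequence
---------------------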