February 22, 2010 at 1:42 pm
Hi,
We are currently taking a small sample from a 12-million row table several times a day and we need it to perform as fast as possible. Here is the stored procedure code we currently use, passing in the sample size (@count) and some filters (@forumID, @keyword):
SET NOCOUNT ON;

SELECT TOP (@count) k.ID, k.Keyword
FROM Keyword k WITH (NOLOCK)
INNER JOIN Topic__Keyword tk WITH (NOLOCK)
    ON k.ID = tk.Keyword_ID
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), k.ID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int)
    AND k.SubCatID = @forumID
    AND k.ID <> @keyword
GROUP BY k.ID, k.Keyword
I would greatly appreciate your comments if you know any way to select a true random sample faster, or to optimize the query.
Thanks,
Alan
February 22, 2010 at 1:50 pm
You could use random row sampling in SSIS.
February 22, 2010 at 2:36 pm
Might I suggest reading:
Limiting Result Sets by Using TABLESAMPLE at:
http://technet.microsoft.com/en-us/library/ms189108(SQL.90).aspx
Note:
When TABLESAMPLE is used against databases that are upgraded to SQL Server 2005, the compatibility level of the database must be set to 90. To set the database compatibility level, see sp_dbcmptlevel (Transact-SQL).
You can use TABLESAMPLE to quickly return a sample from a large table when either of the following conditions is true:
The sample does not have to be a truly random sample at the level of individual rows.
Rows on individual pages of the table are not correlated with other rows on the same page.
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
February 22, 2010 at 2:40 pm
Surely...ORDER BY NEWID()
would be faster than
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), k.ID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
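For comparison, the ORDER BY NEWID() version of the query would look something like this (a sketch only, reusing the table and parameter names from the original procedure):

```sql
SELECT TOP (@count) k.ID, k.Keyword
FROM Keyword k
INNER JOIN Topic__Keyword tk
    ON k.ID = tk.Keyword_ID
WHERE k.SubCatID = @forumID
  AND k.ID <> @keyword
GROUP BY k.ID, k.Keyword
ORDER BY NEWID();  -- the random sort key forces a sort of every qualifying row
```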
For better assistance in answering your questions, please read this.
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop (Jeff Moden)
February 22, 2010 at 4:47 pm
ChrisM@home (2/22/2010)
Surely... ORDER BY NEWID() would be faster
An ORDER BY clause means a sort. Given that

We are currently taking a small sample from a 12-million row table

a 10 percent sample would mean sorting 1,200,000 rows ... not a quick task. Of course, the percentage selected could also be reduced, which would also eliminate the need for the TOP (@Count) clause.
With or without those clauses, click either the estimated or actual query execution plan icon in SSMS, review the displayed plan, and then decide which T-SQL you want to use.
February 25, 2010 at 3:18 pm
Thanks. We had tried ORDER BY NEWID() and as Ron suggested it was definitely slower.
TABLESAMPLE didn't work because we need a sample of joined rows and you can only perform a sample on each table.
SSIS is intriguing...
February 26, 2010 at 1:27 pm
ssc-574207
A little off-the-wall thinking here. I have never attempted this; it is only presented to get your thinking moving in a different direction.
1. Use TABLESAMPLE on table "A" to SELECT INTO a TEMP table or for that matter a VIEW.
2. Join the Temp table (VIEW) to table "B" to arrive at your final sample.
A drawback (and it could be major) is that you may not have many matches between table "A" and table "B". You would have to test the sample percentage, or the number of rows returned from table "A", and adjust for the mismatch if it were substantial enough to interfere with your final objective.
Like I said, off-the-wall thinking, just meant to nudge you in a different direction.
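A rough sketch of steps 1 and 2, reusing the original poster's table names (the 1 percent sample rate is purely illustrative):

```sql
-- Step 1: page-level sample from the large table into a temp table.
SELECT k.ID, k.Keyword, k.SubCatID
INTO #KeywordSample
FROM Keyword AS k TABLESAMPLE (1 PERCENT);

-- Step 2: join the temp table to the second table for the final sample.
SELECT TOP (@count) s.ID, s.Keyword
FROM #KeywordSample AS s
INNER JOIN Topic__Keyword AS tk
    ON s.ID = tk.Keyword_ID
WHERE s.SubCatID = @forumID
  AND s.ID <> @keyword
GROUP BY s.ID, s.Keyword;
```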
February 26, 2010 at 1:57 pm
Have you tried joining to a table containing "random" ID's?
IF OBJECT_ID('tempdb..#Sampler') IS NOT NULL DROP TABLE #Sampler

DECLARE @TableSize INT, @SampleSize INT
SET @TableSize = 1200000
SET @SampleSize = @TableSize/10 -- 10%

-- Generate @SampleSize pseudo-random IDs in the range 0 to @TableSize - 1.
-- Duplicates are possible; the cross join is just a cheap row source.
SELECT TOP(@SampleSize) SampleID = ABS(CHECKSUM(NEWID()))%@TableSize
INTO #Sampler
FROM master.sys.columns a, master.sys.columns b

CREATE CLUSTERED INDEX [SampleID] ON #Sampler ([SampleID] ASC)
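To produce the final sample you would then join #Sampler back to the keyword table, assuming (and this is an assumption) that Keyword.ID values are contiguous integers starting at or near zero:

```sql
SELECT k.ID, k.Keyword
FROM #Sampler s
INNER JOIN Keyword k
    ON k.ID = s.SampleID
INNER JOIN Topic__Keyword tk
    ON tk.Keyword_ID = k.ID;
```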
Cheers
ChrisM
February 26, 2010 at 8:35 pm
ChrisM@home (2/26/2010)
Have you tried joining to a table containing "random" ID's?
DROP TABLE #Sampler
DECLARE @TableSize INT, @SampleSize INT
SET @TableSize = 1200000
SET @SampleSize = @TableSize/10 -- 10%
SELECT TOP(@SampleSize) SampleID = ABS(CHECKSUM(NEWID()))%@TableSize
INTO #Sampler
FROM master.sys.columns a, master.sys.columns b
CREATE CLUSTERED INDEX [SampleID] ON #Sampler ([SampleID] ASC)
Cheers
ChrisM
Hmmmm.... that gives me an idea... Thanks, Chris.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2010 at 6:28 am
Is the ID field on the Keyword table a sequence number? Is it guaranteed to be contiguous? If so, simply determine the lowest (MIN) and highest (MAX) current ID, generate some random numbers inline, and use them to seek into the table directly. This method works especially well for small samples, but tends to be less efficient when a very large number of rows is involved. Probably worth exploring.
Oddly enough, I used this very technique earlier today when answering another post. Anyone interested can see the post here: http://www.sqlservercentral.com/Forums/FindPost873813.aspx
The relevant section is marked something like 'Setting date_time to NULL on random rows...'
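A minimal sketch of the MIN/MAX technique against the poster's Keyword table, assuming ID really is contiguous (gaps would silently shrink the sample):

```sql
DECLARE @min int, @max int;
SELECT @min = MIN(ID), @max = MAX(ID) FROM Keyword;

-- Generate @count random IDs in [@min, @max], then seek into the table.
SELECT k.ID, k.Keyword
FROM (SELECT TOP (@count)
             @min + ABS(CHECKSUM(NEWID())) % (@max - @min + 1) AS RandID
      FROM master.sys.columns a
      CROSS JOIN master.sys.columns b) AS r
INNER JOIN Keyword k
    ON k.ID = r.RandID;
```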
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
February 27, 2010 at 8:37 am
Jeff Moden (2/26/2010)
Hmmmm.... that gives me an idea... Thanks, Chris.
You're welcome, Jeff - it's always interesting to see what surprises you pull out of a hat when you have one of these moments 😉
February 27, 2010 at 9:56 am
ssc-574207 (2/25/2010)
Thanks. We had tried ORDER BY NEWID() and as Ron suggested it was definitely slower. TABLESAMPLE didn't work because we need a sample of joined rows and you can only perform a sample on each table.
SSIS is intriguing...
You may have already sussed this but could you post the query of joined rows? I'd like to play with this a bit because it's come up more than once. Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.