February 22, 2010 at 1:42 pm
Hi,
We are currently taking a small sample from a 12-million row table several times a day and we need it to perform as fast as possible. Here is the stored procedure code we currently use, passing in the sample size (@count) and some filters (@forumID, @keyword):
SET NOCOUNT ON;

SELECT TOP (@count) k.ID, k.Keyword
FROM Keyword k WITH (NOLOCK)
INNER JOIN Topic__Keyword tk WITH (NOLOCK)
    ON k.ID = tk.Keyword_ID
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), k.ID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int)
    AND k.SubCatID = @forumID
    AND k.ID <> @keyword
GROUP BY k.ID, k.Keyword
I would greatly appreciate your comments if you know any way to select a true random sample faster, or to optimize the query.
Thanks,
Alan
February 22, 2010 at 1:50 pm
You could use random row sampling in SSIS.
February 22, 2010 at 2:36 pm
Might I suggest reading:
Limiting Result Sets by Using TABLESAMPLE at:
http://technet.microsoft.com/en-us/library/ms189108(SQL.90).aspx
Note:
When TABLESAMPLE is used against databases that are upgraded to SQL Server 2005, the compatibility level of the database must be set to 90. To set the database compatibility level, see sp_dbcmptlevel (Transact-SQL).
You can use TABLESAMPLE to quickly return a sample from a large table when either of the following conditions is true:
The sample does not have to be a truly random sample at the level of individual rows.
Rows on individual pages of the table are not correlated with other rows on the same page.
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
February 22, 2010 at 2:40 pm
Surely...ORDER BY NEWID()
would be faster than
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), k.ID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
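For comparison, the ORDER BY NEWID() version of the query would look something like this (a sketch only, reusing the table and parameter names from the original procedure):

```sql
SELECT TOP (@count) k.ID, k.Keyword
FROM Keyword k
INNER JOIN Topic__Keyword tk
    ON k.ID = tk.Keyword_ID
WHERE k.SubCatID = @forumID
  AND k.ID <> @keyword
GROUP BY k.ID, k.Keyword
ORDER BY NEWID();  -- the random sort key forces a sort of every qualifying row
```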
For better assistance in answering your questions, please read this.
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop (Jeff Moden)
February 22, 2010 at 4:47 pm
ChrisM@home (2/22/2010)
Surely... ORDER BY NEWID() would be faster
An ORDER BY clause means a sort. Given that

We are currently taking a small sample from a 12-million row table

a 10 percent sample would mean sorting 1,200,000 rows ... not a quick task. Of course, the percentage selected could also be reduced, which would also eliminate the need for the TOP (@Count) clause.
With or without those clauses, click either the estimated or actual query execution plan icon in SSMS, review the displayed plan, and then decide which T-SQL you want to use.
February 25, 2010 at 3:18 pm
Thanks. We had tried ORDER BY NEWID() and as Ron suggested it was definitely slower.
TABLESAMPLE didn't work because we need a sample of joined rows and you can only perform a sample on each table.
SSIS is intriguing...
February 26, 2010 at 1:27 pm
ssc-574207
A little off-the-wall thinking here. I have never attempted this; it is only presented to get your thinking moving in a different direction.
1. Use TABLESAMPLE on table "A" to SELECT INTO a TEMP table or for that matter a VIEW.
2. Join the Temp table (VIEW) to table "B" to arrive at your final sample.
A drawback (and it could be major) is that you may not have many matches between table "A" and table "B". You would have to test the sample percentage, or the number of rows returned from table "A", and adjust for the mismatch if it were substantial enough to interfere with your final objective.
Like I said, off-the-wall thinking, just meant to nudge you in a different direction.
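A rough sketch of steps 1 and 2, reusing the original poster's table names (the 1 percent sample rate is purely illustrative):

```sql
-- Step 1: page-level sample from the large table into a temp table.
SELECT k.ID, k.Keyword, k.SubCatID
INTO #KeywordSample
FROM Keyword AS k TABLESAMPLE (1 PERCENT);

-- Step 2: join the temp table to the second table for the final sample.
SELECT TOP (@count) s.ID, s.Keyword
FROM #KeywordSample AS s
INNER JOIN Topic__Keyword AS tk
    ON s.ID = tk.Keyword_ID
WHERE s.SubCatID = @forumID
  AND s.ID <> @keyword
GROUP BY s.ID, s.Keyword;
```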
February 26, 2010 at 1:57 pm
Have you tried joining to a table containing "random" ID's?
IF OBJECT_ID('tempdb..#Sampler') IS NOT NULL DROP TABLE #Sampler

DECLARE @TableSize INT, @SampleSize INT
SET @TableSize = 1200000
SET @SampleSize = @TableSize/10 -- 10%

-- Generate @SampleSize pseudo-random IDs in the range 0 to @TableSize - 1.
-- Duplicates are possible; the cross join is just a cheap row source.
SELECT TOP(@SampleSize) SampleID = ABS(CHECKSUM(NEWID()))%@TableSize
INTO #Sampler
FROM master.sys.columns a, master.sys.columns b

CREATE CLUSTERED INDEX [SampleID] ON #Sampler ([SampleID] ASC)
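To produce the final sample you would then join #Sampler back to the keyword table, assuming (and this is an assumption) that Keyword.ID values are contiguous integers starting at or near zero:

```sql
SELECT k.ID, k.Keyword
FROM #Sampler s
INNER JOIN Keyword k
    ON k.ID = s.SampleID
INNER JOIN Topic__Keyword tk
    ON tk.Keyword_ID = k.ID;
```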
Cheers
ChrisM
February 26, 2010 at 8:35 pm
ChrisM@home (2/26/2010)
Have you tried joining to a table containing "random" ID's?
DROP TABLE #Sampler
DECLARE @TableSize INT, @SampleSize INT
SET @TableSize = 1200000
SET @SampleSize = @TableSize/10 -- 10%
SELECT TOP(@SampleSize) SampleID = ABS(CHECKSUM(NEWID()))%@TableSize
INTO #Sampler
FROM master.sys.columns a, master.sys.columns b
CREATE CLUSTERED INDEX [SampleID] ON #Sampler ([SampleID] ASC)
Cheers
ChrisM
Hmmmm.... that gives me an idea... Thanks, Chris.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2010 at 6:28 am
Is the ID field on the Keyword table a sequence number? Is it guaranteed to be contiguous? If so, simply determine the lowest (MIN) and highest (MAX) current ID, generate some random numbers inline, and use them to seek into the table directly. This method works especially well for small samples, but tends to be less efficient when a very large number of rows is involved. Probably worth exploring.
Oddly enough, I used this very technique earlier today when answering another post. Anyone interested can see the post here: http://www.sqlservercentral.com/Forums/FindPost873813.aspx
The relevant section is marked something like 'Setting date_time to NULL on random rows...'
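A minimal sketch of the MIN/MAX technique against the poster's Keyword table, assuming ID really is contiguous (gaps would silently shrink the sample):

```sql
DECLARE @min int, @max int;
SELECT @min = MIN(ID), @max = MAX(ID) FROM Keyword;

-- Generate @count random IDs in [@min, @max], then seek into the table.
SELECT k.ID, k.Keyword
FROM (SELECT TOP (@count)
             @min + ABS(CHECKSUM(NEWID())) % (@max - @min + 1) AS RandID
      FROM master.sys.columns a
      CROSS JOIN master.sys.columns b) AS r
INNER JOIN Keyword k
    ON k.ID = r.RandID;
```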
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
February 27, 2010 at 8:37 am
Jeff Moden (2/26/2010)
Hmmmm.... that gives me an idea... Thanks, Chris.
You're welcome, Jeff - it's always interesting to see what surprises you pull out of a hat when you have one of these moments 😉
February 27, 2010 at 9:56 am
ssc-574207 (2/25/2010)
Thanks. We had tried ORDER BY NEWID() and as Ron suggested it was definitely slower. TABLESAMPLE didn't work because we need a sample of joined rows and you can only perform a sample on each table.
SSIS is intriguing...
You may have already sussed this but could you post the query of joined rows? I'd like to play with this a bit because it's come up more than once. Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.