Performance: Every nth item

  • Hi,

    I have a large table, with records for items at locations at different dates.

    I need to return field 'item' for every 'nth' record ('location' and 'date' don't matter), and I have a query which does what I need.

    However, performance is not good when run on a real database with 100 million+ rows.

    In real life, instead of returning every 5th record, I might return every 10,000th record.

    I have included my current query, and some test data.

    Is there a better way of returning the information I need?

    (PS: How do people include the nice boxes with scrollbars in their posts?)

    Thanks,

    Andrew

    ----------------------------------------------------

    -- Query I have now:
    SELECT nth.ITEM
    FROM (
        SELECT allItems.ITEM, ROW_NUMBER() OVER (ORDER BY ITEM) AS RowNum
        FROM (
            SELECT TEST.ITEM
            FROM TEST
            WHERE TEST.LOCATION IN ('L1','L2','L3')
        ) allItems
    ) nth
    WHERE nth.RowNum % 5 = 0

    ------------------------------------------------

    -- Test Data
    CREATE TABLE TEST(
        ITEM char(30),
        LOCATION char(6),
        DATE datetime,
        AMOUNT decimal(8, 0)
    )

    CREATE UNIQUE CLUSTERED INDEX TEST_INDEX ON TEST
    (
        LOCATION ASC,
        ITEM ASC,
        DATE ASC
    )

    INSERT INTO TEST
    SELECT 'A01','L1','Jan 1 2007 12:00AM','50' UNION ALL
    SELECT 'A01','L1','Feb 1 2007 12:00AM','20' UNION ALL
    SELECT 'A12','L1','Jan 1 2007 12:00AM','20' UNION ALL
    SELECT 'A13','L1','Jan 1 2007 12:00AM','20' UNION ALL
    SELECT 'A13','L1','Feb 1 2007 12:00AM','25' UNION ALL
    SELECT 'A13','L1','Mar 1 2007 12:00AM','35' UNION ALL
    SELECT 'A15','L1','Mar 1 2007 12:00AM','10' UNION ALL
    SELECT 'A02','L1','Jan 1 2007 12:00AM','10' UNION ALL
    SELECT 'A03','L1','Jan 1 2007 12:00AM','10' UNION ALL
    SELECT 'A05','L1','Jan 1 2007 12:00AM','40' UNION ALL
    SELECT 'A05','L1','Feb 1 2007 12:00AM','50' UNION ALL
    SELECT 'A05','L1','Mar 1 2007 12:00AM','20' UNION ALL
    SELECT 'A06','L1','Jan 1 2007 12:00AM','5' UNION ALL
    SELECT 'A07','L1','Jan 1 2007 12:00AM','6' UNION ALL
    SELECT 'A09','L1','Jan 1 2007 12:00AM','2' UNION ALL
    SELECT 'A09','L1','Feb 1 2007 12:00AM','30' UNION ALL
    SELECT 'A09','L1','Mar 1 2007 12:00AM','1' UNION ALL
    SELECT 'A01','L2','Jan 1 2007 12:00AM','10' UNION ALL
    SELECT 'A01','L2','Mar 1 2007 12:00AM','1' UNION ALL
    SELECT 'A10','L2','Mar 1 2007 12:00AM','2' UNION ALL
    SELECT 'A14','L2','Feb 1 2007 12:00AM','20' UNION ALL
    SELECT 'A02','L2','Jan 1 2007 12:00AM','5' UNION ALL
    SELECT 'A04','L2','Jan 1 2007 12:00AM','2' UNION ALL
    SELECT 'A04','L2','Feb 1 2007 12:00AM','3' UNION ALL
    SELECT 'A05','L2','Jan 1 2007 12:00AM','10' UNION ALL
    SELECT 'A05','L2','Feb 1 2007 12:00AM','15' UNION ALL
    SELECT 'A05','L2','Mar 1 2007 12:00AM','25' UNION ALL
    SELECT 'A08','L2','Jan 1 2007 12:00AM','10' UNION ALL
    SELECT 'A08','L2','Feb 1 2007 12:00AM','20' UNION ALL
    SELECT 'A08','L2','Mar 1 2007 12:00AM','20' UNION ALL
    SELECT 'A09','L2','Mar 1 2007 12:00AM','2' UNION ALL
    SELECT 'A01','L3','Jan 1 2007 12:00AM','200' UNION ALL
    SELECT 'A11','L3','Feb 1 2007 12:00AM','10' UNION ALL
    SELECT 'A05','L3','Jan 1 2007 12:00AM','1' UNION ALL
    SELECT 'A05','L3','Feb 1 2007 12:00AM','2' UNION ALL
    SELECT 'A05','L3','Mar 1 2007 12:00AM','3' UNION ALL
    SELECT 'A08','L3','Feb 1 2007 12:00AM','30' UNION ALL
    SELECT 'A08','L3','Mar 1 2007 12:00AM','35' UNION ALL
    SELECT 'A09','L3','Mar 1 2007 12:00AM','1'

performance is slow because your query with row_number() will number every row regardless of whether or not you retrieve it.

    are you looking for random/sample data? if so, you can use the TABLESAMPLE clause:

    SELECT ... FROM ...
    TABLESAMPLE (10 PERCENT)

    or

    SELECT ... FROM ...
    TABLESAMPLE (200 ROWS)

    here's another suggestion from SQL Server 2005 docs:

    If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

    SELECT * FROM Sales.SalesOrderDetail
    WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
                / CAST(0x7fffffff AS int)

    p.s.: to get the scrollbars inside a message, wrap your code in [code] tags.

  • Thanks Antonio,

    No, I am not looking for random data - I need every nth item, ordered by item.

    The real database has (many) other tables, including an item table, with exactly one record per item, and an item-location table, with exactly one record for each item-location pair where the item exists at that location.

    In table TEST, an item may have any number of records, including 0.

    Would I get better performance (still looking for the 'item' in every nth record in TEST, ordered by item) by joining on one of these other tables?

    Andrew

eliminating some of the nested derived tables might help:

    with Skipper (item, rowNum)
    as
    (
        SELECT item, ROW_NUMBER() OVER (ORDER BY item)
        FROM test
        WHERE location in ('L1','L2','L3')
    )
    select * from Skipper where rowNum % 3 = 0

    of course, location needs to have a supporting index for the in() criteria to help you.

    still, if you have a 100M+ row table, using row_number() to get every Nth row will be slow. it has to number every row in the table regardless of whether you're getting every 5th or every 50,000th row.

    if you added an indexed identity column to the table, using ident_col % N = 0 still will not use the index and you'd still be table scanning, but at least it wouldn't be using tempdb to store the temporary row_number().
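
    for illustration, a minimal sketch of that identity-column approach, assuming you're free to alter the table (the RowID column and IX_TEST_RowID index names are just made up):

    ALTER TABLE TEST ADD RowID int IDENTITY(1,1) NOT NULL

    CREATE INDEX IX_TEST_RowID ON TEST (RowID)

    -- every 10,000th row by the surrogate sequence; as noted above, the
    -- modulo predicate still scans, but there's no tempdb work to number rows
    SELECT ITEM
    FROM TEST
    WHERE RowID % 10000 = 0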

  • antonio.collins (4/1/2008)

    if you added an indexed identity column to the table, using ident_col % N = 0 will not use the index and you'd still be table scanning but at least it wouldn't be using tempdb to store the temporal row_number().

    Using an identity will only work if they never have a rolled-back insert and never delete a record. I actually doubt that ROW_NUMBER() causes you to scan every record, as the WHERE clause is processed before the select list by the query processor. You are probably getting a table scan because of the IN operator. You may be better off with a join on the item_location table, using that to limit your locations. You may get better index usage that way, especially since you will eliminate any items not in a location. Something like (stealing some code from antonio :D):

    with Skipper (item, rowNum)
    as
    (
        SELECT T.item, ROW_NUMBER() OVER (ORDER BY T.item)
        FROM test T
        Join item_location I
            On T.item = I.item
            And T.location = I.location
        WHERE I.location in ('L1','L2','L3')
    )
    select * from Skipper where rowNum % 3 = 0

    This may get you better index usage, but the overhead of the join may negate that.

    How often does this query run? What is the business reason for wanting every nth item? I only ask because this seems like an odd request and I am interested in the answer.

  • Thanks, Jack.

    Yes, we do inserts into this table, and rollbacks are possible. However, we don't really need the 'exact' nth item.

    The business reason: We have an application which works on the data. In this particular case, items from many locations are being combined into one (e.g. head office) location. The real-life table has many other fields; to get the combined record, some fields are summed, and other columns follow a number of different business rules. The application does all this processing and writes the 'head office' item back to the database. However, we can't cope with loading all records into memory at once, so we do it in increments. The purpose of this query is to try to get (approximately) x rows at a time. The application remembers item 10000, item 20000, etc., and each iteration loads records "where item >= 'A1' and item < 'C1'", then "where item >= 'C1' and item < 'F2'", etc.

    If it is possible, I would like to get my query against the target 'TEST' table working for a lot of data. If it can't be done, I will have to use another approach. We could, for instance, use our (much smaller) item table to get every 'nth' item for this purpose, but some items will have hundreds of records in table 'TEST', and some items will have few or no records.
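
    For what it's worth, here is a bare-bones sketch of one of the incremental loads described above (the boundary values just stand in for the items the application remembered):

    DECLARE @From char(30), @To char(30)
    SET @From = 'A1' -- remembered from the previous pass
    SET @To = 'C1'   -- the next 'nth' item

    SELECT ITEM, LOCATION, DATE, AMOUNT
    FROM TEST
    WHERE ITEM >= @From AND ITEM < @To
    ORDER BY ITEM, LOCATION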

    Thanks,

    Andrew

  • This is actually very similar to the classic "paging problem"... I'll see if I can whip up an example tomorrow (today actually... it's after midnight here) ...

    --Jeff Moden



  • What is the Primary Key of the table called "Test"?

    --Jeff Moden



  • Thanks Jeff,

    We don't have a Primary Key defined on the table.

    We do have a unique clustered index on (LOCATION, ITEM, DATE).

    Should we define this combination of columns as the Primary Key too?

    Andrew

  • abcdefghy (4/1/2008)


    Thanks, Jack.

    Yes, we do inserts into this table, and rollbacks are possible. However, we don't really need the 'exact' nth item.

    The business reason: We have an application which works on the data. In this particular case, items from many locations are being combined into one (e.g. head office) location. The real-life table has many other fields; to get the combined record, some fields are summed, and other columns follow a number of different business rules. The application does all this processing and writes the 'head office' item back to the database. However, we can't cope with loading all records into memory at once, so we do it in increments. The purpose of this query is to try to get (approximately) x rows at a time. The application remembers item 10000, item 20000, etc., and each iteration loads records "where item >= 'A1' and item < 'C1'", then "where item >= 'C1' and item < 'F2'", etc.

    If it is possible, I would like to get my query against the target 'TEST' table working for a lot of data. If it can't be done, I will have to use another approach. We could, for instance, use our (much smaller) item table to get every 'nth' item for this purpose, but some items will have hundreds of records in table 'TEST', and some items will have few or no records.

    Thanks,

    Andrew

    So you really aren't looking for every nth item; you are looking for n records at a time. These are two very different requirements. Couldn't you load each location separately? This would, in theory, limit the rows in your example to 1/3 each time.

    What is the primary or unique key on the destination table? You could use SELECT TOP N FROM source LEFT JOIN dest ON unique_key WHERE destination.uniquekey IS NULL.
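
    Something like this, assuming a destination table (DEST is a placeholder name) keyed the same way as TEST:

    SELECT TOP (10000) s.ITEM, s.LOCATION, s.DATE, s.AMOUNT
    FROM TEST s
    LEFT JOIN DEST d
        ON d.ITEM = s.ITEM
        AND d.LOCATION = s.LOCATION
        AND d.DATE = s.DATE
    WHERE d.ITEM IS NULL -- only rows not yet in the destination
    ORDER BY s.ITEM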

  • abcdefghy (4/6/2008)


    Thanks Jeff,

    We don't have a Primary Key defined on the table.

    We do have a unique clustered index on (LOCATION, ITEM, DATE).

    Should we define this combination of columns as the Primary Key too?

    Andrew

    Essentially a unique clustered index is a primary key. A primary key defines uniqueness and, in SQL Server, defaults to a clustered index when created.
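
    If you did want it spelled out as a constraint, something like this sketch would do it (PK_TEST is just a name I picked). You would have to drop the existing clustered index first, since a table can only have one, and the key columns have to be NOT NULL:

    DROP INDEX TEST_INDEX ON TEST

    -- primary key columns must not allow nulls
    ALTER TABLE TEST ALTER COLUMN LOCATION char(6) NOT NULL
    ALTER TABLE TEST ALTER COLUMN ITEM char(30) NOT NULL
    ALTER TABLE TEST ALTER COLUMN DATE datetime NOT NULL

    ALTER TABLE TEST
        ADD CONSTRAINT PK_TEST PRIMARY KEY CLUSTERED (LOCATION, ITEM, DATE)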

  • Jack Corbett (4/6/2008)


    So you really aren't looking for every nth item; you are looking for n records at a time. These are two very different requirements. Couldn't you load each location separately? This would, in theory, limit the rows in your example to 1/3 each time.

    What is the primary or unique key on the destination table? You could use SELECT TOP N FROM source LEFT JOIN dest ON unique_key WHERE destination.uniquekey IS NULL.

    In the real application there might be hundreds of locations being consolidated into one combined location (the source table is also the destination table). There may (or may not) be an existing combined record. Often, a combined record does not change. We don't want to do unnecessary updates when nothing has changed, and definitely don't want to do one update per source location for every combined item. Some business rules mean we need to compare a field in all locations to decide the value in the combined location - i.e. while we are consolidating an item we need all source locations for the item in memory at once. Sometimes a location moves from one region to another.

    Perhaps there are more efficient ways of doing the work, but this way has been working well enough. All I was wondering was: is there a way I can find (approximately) every nth item, so that we can consistently process (for example) about 10,000 records every iteration, instead of processing 10 items on one iteration and 50,000 items (running into memory problems) on the next? My original query does this, but too slowly.

    N records, instead of every nth item, would work for me, as long as it gave me every source location for each item; it would not work if the last item loaded half of its source locations on one pass and the other half on the next. The fact that we need all source locations at once is why I took the approach in my original post.

    I guess a reasonable approach might be to take your 'top N' suggestion and, to avoid the case where only some source locations are loaded for the last item, just throw the last item away and process it next iteration. :)
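
    Roughly like this, I suppose (@LastItem is remembered between iterations, the batch size is arbitrary, and the final pass would need to skip the delete so the very last item isn't lost):

    DECLARE @LastItem char(30)
    SET @LastItem = '' -- from the previous iteration

    SELECT TOP (10000) ITEM, LOCATION, DATE, AMOUNT
    INTO #batch
    FROM TEST
    WHERE ITEM > @LastItem
    ORDER BY ITEM

    -- the last item may have loaded only some of its locations, so throw it
    -- back; it becomes the first item of the next pass
    DELETE FROM #batch
    WHERE ITEM = (SELECT MAX(ITEM) FROM #batch)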

  • Jack Corbett (4/6/2008)


    Essentially a unique clustered index is a primary key. A primary key defines uniqueness and, in SQL Server, defaults to a clustered index when created.

    True... it also rejects nulls and "assists" other indexes.

    --Jeff Moden



  • Composite PK's are wonderful... but not for this problem. It would be so much easier if you had an IDENTITY column named RowNum because, like I said, this is nothing more than a classic paging problem and this is nasty fast. If you have problems with parameter sniffing, just convert the SELECT to dynamic SQL and execute that. Clustered index should be on the RowNum column with a unique nonclustered index on what you are currently calling the primary key.

    Here's the simplicity of the code if you were to add the RowNum column...

    --===== Declare the local variables
    DECLARE @PageSize INT      --How many rows to appear per page
    DECLARE @PageNum  INT      --What page number to appear
    DECLARE @Skip     INT      --Working variable for where to start for page
    DECLARE @sql VARCHAR(8000) --Holds dynamic SQL

    --===== Set the local variables for pagesize and page
         -- PageSize and PageNum would be parameters in a stored proc
    SET @PageSize = 100
    SET @PageNum  = 4000
    SET @Skip     = @PageSize*@PageNum

    SELECT t.*
    FROM dbo.Test t,
         (
          SELECT TOP (@PageSize) *
          FROM dbo.Test WITH (NOLOCK)
          WHERE RowNum NOT IN (SELECT TOP (@Skip) RowNum
                               FROM dbo.Test
                               ORDER BY RowNum)
          ORDER BY RowNum
         ) d
    WHERE t.RowNum = d.RowNum

    --Jeff Moden



  • Thanks everyone for your help.

    Much appreciated.

    Andrew
