Keeping a distinct list of inserted values ....

  • I am trying to keep a distinct list of inserted values from a table. For example I have a table that lists 'customers' on any given day (around 7000 each day). I have a web app the lists distinct customer last names to query on. I am thinking that maybe an insert trigger that fires during an insert into the customer table and checks for that customer name in the distinct table and if its not found add it? Does this sound reasonable? Will the trigger slow down the insert into the master table? Does anyone have a sample create trigger script that would help me along?

    Thanks everyone for their comments and suggestions!:D

  • 7000 is not a large number; it seems to me that it might be more appropriate to have your select list just fetch the distinct records rather than take the overhead of a trigger to maintain this. What are you reasons for wanting to materialize this data?

  • Take a look at the OUTPUT clause as a mechanism to capture the data that is inserted as it gets inserted. This is much more effecient, as well as easier to maintain and more transparent, than using a trigger.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • That's 7000 records added daily - over time I would prefer not to query hundreds of thousands of records to get the distinct list.

  • Crud. I completely misread your question. Disregard my last post. I'm an idiot.

    Why can't you simply do a join against your table? Especially since we're talking about a distinct list, so it should have a very high selectivity, you should be able to write a query that runs just fine. You'd only be looking at two million rows a year. That's not that big. 20 million in ten years? And that assumes that every name entered is unique. What happens when you start getting duplicate hits? Less data. So you don't need a trigger, you just need a join on the insert such that it doesn't allow an insert where the name exists. Doing a join (not an individual look up) will make it set based... Should be fine.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for the reply Grant,

    I guess I need some clarification. I am using a stored procedure insert to insert records into a table - the primary key is not the customer name but another identifier not related to the customer. I call the stored procedure each time I read a record out of a log file. Could you give me an example? I could simply query the distinct table and insert the customer name if it does not exist but I think that would be too slow.

    Here is my stored procedure (if it helps any) - I cut it down to make it easily to read.

    CREATE proc [dbo].[MessageContentInsert]

    @messageid varchar(50),

    @customer varchar(40),

    @city varchar(40),

    @state varchar(40),

    as

    insert into MessageContent(messageid,customer,city,state)

    values (@messageid,@customer,@city,@state)

    Thanks again!

  • Just make sure you have an index that has customer Name first and insert using a WHERE NOT EXISTS clause on customer name -- should be very fast.

    Edit:

    Something else you might want to think about is the row-by-row process used for inserting this data. Transact SQL works far more efficiently with sets of data than it does with row-by-row processes. It might be worthwhile to change your process to take advantage of set-based processing rather than record-based processing.

  • I tried it two ways. I set up a table with a million rows for the tests.

    This way works very well, with a nice index seek and only four reads and runs in 1ms, but only works off of one name at a time.

    DECLARE @test-2 NVARCHAR(75)

    SET @test-2 = 'aacfjipbavrosganqinf'

    IF NOT EXISTS ( SELECT Customername

    FROM namelist

    WHERE customername = @test-2 )

    INSERT INTO namelog ( LogName )

    VALUES ( @test-2 )

    Here's an example where you pass in all the list of names as a set, using XML but you could supply the set in some other fashion, and then uses a join to eliminate the list. This one takes a lot longer, 12ms. It still does a very nice index seek and nested loop on the XML data. Parsing the XML data takes the most time in the query.

    (substitute an angled bracked for the square brackets)

    DECLARE @test-2 XML

    SET @test-2 = '[root]

    [customername name="aacfjipbavrosganqinf"][/customername]

    [customername name="54321" ][/customername]

    [/root]'

    INSERT INTO namelog ( LogName )

    SELECT l.customername

    FROM ( SELECT Ord.value('(@name)', 'nvarchar(75)') AS customername

    FROM @test.nodes('/root/customername') AS x ( Ord )

    ) AS l

    LEFT JOIN namelist n

    ON n.customername = l.customername

    WHERE n.customername IS NULL

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks Grant!

    Your solution is great. Thanks for sharing!

  • Having this in the insert trigger shouldn't be too bad.

    Create trigger trg_Name

    on Customers after insert as

    ...

    insert into DistinctTable(Name)

    select i.Name

    from inserted i

    left join DistinctTable d

    on i.Name = d.Name

    where d.Name is null;

    ...

    Only Name values not already existing in DistinctTable are inserted. Assuming the Name column in DistinctTable is indexed, the performance should be good. You would only use a Distinct query against the large Customers table after first creating DistinctTable. From then on, the trigger will maintain it with very little performance hit.

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

  • Tomm Carr (3/19/2008)


    Only Name values not already existing in DistinctTable are inserted. Assuming the Name column in DistinctTable is indexed, the performance should be good. You would only use a Distinct query against the large Customers table after first creating DistinctTable. From then on, the trigger will maintain it with very little performance hit.

    Sorry to disagree - but there's no guarantee that the batch isn't carrrying duplicate names, so the trigger would have to insert DISTINCT names.

    Another way would be to use EXISTS

    Create trigger trg_Name

    on Customers after insert as

    ...

    insert into DistinctTable(Name)

    select distinct i.Name

    from inserted i

    where not EXISTS (

    select null

    from DistinctTable d

    where d.name=i.name

    );

    ...

    Same idea - just seems to run a little bit faster than the IN clause.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Oops. One of those oversights that could go a really long time before manifesting itself. :blush:

    However, I still like the join better. It's simple and easy to maintain -- and the optimizer should turn both ways into the same ex-plan.

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply