Query to find data over range of years

  • I seem to be mind-numb today, as I cannot figure out a good way to do the following query. I am trying to find what customers have been with the company since 2003,

    select distinct CustNumber from dbo.InvoicesView

    where CustNumber in (select distinct CustNumber

    from dbo.InvoicesView where Year(InvoiceDate) = '2003')

    and CustNumber in (select CustNumber

    from InvoicesView where Year(InvoiceDate) = '2004')

    and CustNumber in (select distinct CustNumber

    from dbo.InvoicesView where Year(InvoiceDate) = '2005')

    and CustNumber in (select distinct CustNumber

    from dbo.InvoicesView where Year(InvoiceDate) = '2006')

    and CustNumber in (select distinct CustNumber

    from dbo.InvoicesView where Year(InvoiceDate) = '2007')

    There must be a more elegant way to run this query... I am just not thinking of it at the moment! 🙁

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • How about?

    select distinct CustNumber

    from dbo.InvoicesView

    where Year(InvoiceDate) >= 2003

  • since YEAR() function returns an integer, you could use where Year() > 2003 instead:

    select distinct CustNumber

    from dbo.InvoicesView

    where CustNumber in

    ( select distinct CustNumber

    from dbo.InvoicesView

    where Year(InvoiceDate) >=2003)

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Are you after continuous employment, i.e. no breaks?

    In which case try

    select CustNumber

    from dbo.InvoicesView

    where Year(InvoiceDate) between 2003 and 2007 -- 5 years

    group by CustNumber

    having count(distinct Year(InvoiceDate))=5

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Unfortunately, I believe that query will give me all customers in the database. I specifically need to find customers that have invoices in 2003, 2004, 2005, 2006 and 2007.

    So, if a customer has an invoice in year 2007 they will match the criteria in the query

    select distinct CustNumber

    from dbo.InvoicesView

    where Year(InvoiceDate) >= 2003

    🙂

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • It's hardly optimal, but you could:

    SELECT...

    FROM...

    WHERE YEAR(OrderDate) IN (2001,2002,2003,2007)

    You could add those values to a temp table & join on it. That would probably be better than than in IN statement.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I like Mark's because it will only return those customers who have orders in all the included years. Much more precise.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Just in case anyone is interested:

    My query using all of the derived tables took 1 hour, 13 minutes to run.

    Mark's query using the group/having clause took 14 minutes, 32 seconds.

    The database that is being queried has 89+ Million records and is a very "WIDE" table (lots of fields per record).

    Thanks to Mark for the suggestion! 🙂 🙂

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • The simplest solution I can think of uses derived tables or CTEs and joins to them. But you said you don't want to do that. Any particular reason to not use those?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • DB_Newbie2007 (1/25/2008)


    Just in case anyone is interested:

    My query using all of the derived tables took 1 hour, 13 minutes to run.

    Mark's query using the group/having clause took 14 minutes, 32 seconds.

    The database that is being queried has 89+ Million records and is a very "WIDE" table (lots of fields per record).

    Thanks to Mark for the suggestion! 🙂 🙂

    If this is going to be a regularly run query you may want to consider modifying your table to have an indexed year column as using the Year() function is limiting use of indexes (or indices whichever you prefer). What does your query plan look like?

  • The other thing that's probably hurting the performance is the Distinct operator in the derived tables. That's a lot of extra work for the database engine, but doesn't actually affect the end result of the query.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • To answer GSquared question, my experience is that derived tables (like the ones I used) or CTEs are not as fast or efficient being more "creative" with the query (as you see from my last post, the "creative" query from Mark ran much faster). 🙂

    To answer Jack, only a few queries will be run using this data (some data mining). If I was planning on running this daily or weekly, I would look into optimizing the query, test out some indexes and statistics, etc., look into the cost of the index versus disk space and other query performance, etc. (I hope I am learning from this site!!!!) 😉

    Argue for your limitations, and sure enough they're yours (Richard Bach, Illusions)

  • Just ran some tests.

    Mark's solution was much faster than derived tables. Tried it on 50k rows of data. Difference of 14 seconds average (derived tables) vs a maximum run time of 128 milliseconds (group by).

    Tried adding a computed, persisted column on the year of the date, querying against that cut the execution time in half (five runs with, five without). Scans was 1 in each case, both ways, computed column cut logical reads by c. 10%.

    Added an index on the customer number and the computed year columns, cut execution time in half again. Average run time at this point of 28 milliseconds.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • When you want data from a date range, it is usually better to do the selection in the form of:

    where MyDate >= @StartDatetime and MyDate < @EndDatetime

    This will allow the query to take advantage of any index on the datetime column.

    For your query, I would try this:

    select

    a.CustNumber

    from

    (

    select

    aa.CustNumber

    from

    dbo.InvoicesView aa

    where

    aa.InvoiceDate >= '20030101' and

    aa.InvoiceDate < '20080101'

    group by

    aa.CustNumber,

    year(aa.InvoiceDate)

    ) a

    group by

    a.CustNumber

    having

    count(*) = 5

    order by

    a.CustNumber

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply