Query Question

  • Hi,

    Could someone cast their eye over this query and tell me if what i think it is going to do is correct.

    I am looking to extract data from the wce_ilr table and insert into wce_contact table but i only want to insert data that does not already exist in wce_contact table, this is very important.

    The criteria to match on is wce_contact.postalcode, wce_contact.company against wce_ilr.employer_postcode = wce_contact.postalcode AND wce_ilr.company_name where these match ignore those records and insert ones that don't match.

    I have not added the insert part to my query as i only need to work on the select section, i have the insert part working fine.

    Thanks in advance

    Below is the select part of my query.

    SELECT wce_ilr.EDRS_no AS Uniqueid, MAX(wce_ilr.EDRS_no) AS EDRS_no, MAX(wce_ilr.company_Name) AS Employer_name, max(wce_contact.company) as comp,

    MAX(wce_ilr.Employer_Address_1) AS Address1, MAX(wce_ilr.Employer_Address_2) AS Address2, MAX(wce_ilr.Employer_Address_3) AS Address3,

    MAX(wce_ilr.Employer_Address_4) AS Address4, MAX(wce_ilr.Employer_postcode) AS Postcode,

    'Company' AS Record_Type

    FROM wce_ilr LEFT OUTER JOIN

    wce_contact ON wce_ilr.employer_postcode = wce_contact.postalcode AND wce_ilr.company_name =wce_contact.company COLLATE database_default

    WHERE (wce_contact.EDRS IS NULL)

    AND wce_ilr.EDRS_no IS NOT NULL

    GROUP BY wce_ilr.EDRS_no

    ORDER BY wce_ilr.EDRS_no

  • It looks basically OK, but I'm curious why you're aggregating the data & selecting the max value. Are there multiple entries with the same key value? That's a bit of a relational problem, and how do you know that the MAX(Address) is the address value that you want? Assuming unique entries for the key values, I don't see why you'd need to do this, this way.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for the reply.

    There are many rows with the same company details so a company might appear 50 + times in the table but i only want one row for each company where an wce_ilr.edrs_no exists. There are 110,000 rows in wce_ilr table so it's hard to look over them to see if the query i used is working ok. I tried distinct but that was no good and someone recommended the max function.

    Once i have the correct results they will be inserted in to the wce_contact table and create a parent contact record so i can't have duplicates.

    If you can advise me differently i am always happy to learn. Thanks!

  • Well, instead of MAX and an aggregate function, I'd suggest using TOP 1 and an ORDER BY if you can supply a column that can be ordered in such a way that you get the value you want. It's going to work better than the aggregate from a performance stand point, but only if you can supply a good ordering column.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for that, will try it out and let you know how i get on.

  • Hi, I tried to find a good ordering field but have nothing.

    I'm not really sure what you could use to get this using Top 1. Am i right in thinking each duplicated record no matter how many dups per company would need to have one identified differently from the rest to be able to use Top 1?

  • I would recommend ROW_NUMBER(). How can you determine, if a company has 50 rows, which one row you would like to see? Are they all the same for the columns that you are selecting?

    BTW, it sounds like you have some design issues. If you can, I would definetly recommend normalizing.

  • sward (8/4/2009)


    Hi, I tried to find a good ordering field but have nothing.

    I'm not really sure what you could use to get this using Top 1. Am i right in thinking each duplicated record no matter how many dups per company would need to have one identified differently from the rest to be able to use Top 1?

    That's true, but by selecting MAX you're just going to get a random value. If that works... OK, but it seems to point to a flaw in your storage design.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • The problem is the design of the table but it's the best i have to work with as i have no control over design or the data populating it.

    The table stores per row: a student and their contact data, the company they are associated to and their contact data and the course that learner is on which could be many courses per learner equating to many rows in the table per learner. There is over 5 years worth of data in this table. There is no consistency with the data no specific uniqueid company id's which means i have to look at a formula to search the table and only show me one row per company hence the max() function it might not be the best way but it will only show one company record.

    any thoughts still appreciated. Thanks

  • That's tough. But in terms of performance, you could still use a TOP with an ORDER BY and arrive at one row, if somewhat randomly. It usually performs better, but your mileage may vary, etc.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hello Again,

    I need some clarity on this max() function I am using. As you will see from the above posts I have lots of data in a badly designed table. As long as I can get consistent results from this data with static queries am going to use I will be happy.

    Below is a very small example of some data and results. What I need to be sure on and I can't be sure from the replies so far is,

    Does the max() function always give the same result every time it's run, can I be sure that the 1 instance of the duplicated row is the same as last time I run the query?

    Does the function look at every column and how does it determine which to display? Could it ever for example put the value Brighton for the company Tesco when it should always be London? In other words can the data get swapped around giving me false cross record information?

    Because there might be 6 Tesco records someone might have entered the postcodes wrong as in the below example 3 are 'nw17rt' and 3 are 'nw17rj' I only need to show one Tesco record and to be honest it's not important which one it is but it must be the same one every time I run the query or I will be in trouble.

    If anyone could answer these questions I would be most appreciative as I am working to a deadline to tidy this data and have no unique id's to work with for company records.

    create table company(company varchar(50), contact varchar(50), address varchar(50), city varchar(50), county varchar(50), postcode varchar(50))

    insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rt')

    insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rt')

    insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rt')

    insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rj')

    insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rj')

    insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rj')

    insert into company values ('woolworths', 'james roberts', 2, 'brighton', 'sussex', 'tm456tl')

    insert into company values ('woolworths', 'james roberts', 2, 'brighton', 'sussex', 'tm456tl')

    insert into company values ('woolworths', 'james roberts', 2, 'brighton', 'sussex', 'tm456tl')

    insert into company values ('woolworths', 'james roberts', 2, 'brighton', 'sussex', 'tm456tl')

    insert into company values ('barnets', 'peter smith', 3, 'main road', 'sussex', 'pt56ylt')

    insert into company values ('argos', 'jim harris', 4, 'brighton', 'sussex', 'tm456tl')

    insert into company values ('camping co', 'jim harris', 5, 'brighton', 'sussex', 'we987ly')

    SELECT MAX(company) AS Expr1, MAX(contact) AS Expr2, MAX(address) AS Expr3, MAX(city) AS Expr4, MAX(county) AS Expr5, MAX(postcode) AS Expr6

    FROM company

    GROUP BY company

    Results:

    argos jim harris 4brightonsussextm456tl

    barnets peter smith3main road sussexpt56ylt

    camping cojim harris 5brightonsussexwe987ly

    tesco john smith1londonkentnw17rt

    woolworthsjames roberts2brightonsussextm456tl

  • Here is what I would recommend, instead:

    SELECT *

    FROM (SELECT ROW_NUMBER() OVER (PARTITION BY company ORDER BY company) r, *

    FROM company) A

    WHERE r = 1

    If you have a flag that it would make sense to sort by, you can add it to your order by. For example, if you have a modified date, you may want to pull the latest entry for a company. You could use "ORDER BY ModifiedDate DESC" in your ROW_NUMBER statement. This will pull one and only one row for a specific company, so you will never get street from one line and zip from another.

  • Hi, thanks for your reply. I am working through it now to make sure i understand what is happening.

    When i first run it i get this error "The OVER SQL construct or statement is not supported." however after saying ok to this error it produces the results i would expect. Iam using SQL Express maybe that is the reason for the error.

    Am i right in thinking the following statement does this:

    First Selects all the rows in the company table, it checks the company column and where it see's a duplicated company name it gives it it's own unique row number at which point it get's partitioned to only show the first row of each company. This is then ordered by the company name.

    Is that right? I have not used ROW_NUMBER() OVER PARTITION BY before so want to make sure my understanding is correct.

    Thanks again.

    SELECT *

    FROM (SELECT ROW_NUMBER() OVER (PARTITION BY company

    ORDER BY company) r, *

    FROM company) A

    WHERE r = 1

  • Check BOL

    ROW_NUMBER is a member of the RANKING FUNCTIONS available in 2005 & 2008. I am not 100% on the internals, but because SQL Server is building the rowset, it is able to optimize certain pseudo-aggregates as it pulls additional rows. This has an overhead, but because of the design and the fact that it is building this as it builds your recordset (an assumption on my part), these ranking functions are much quicker than doing seperate sub queries etc.

    To your specific issue... "ROW_NUMBER()" starts the command... " OVER ( ORDER BY X)" is required. The order by tells the ROW_NUMBER function how to order it's row numbering. The optional "PARTION BY" causes it to reset the numbering on specific groups. I use this quite a bit when I am working with temporal tables (I track history on tables where no history exist using journals from a dissimilar database, but only care about the status at the end of a day) and have found it an invaluable tool when you need to use data that is not quite optimal in its structure.

    If you have any questions feel free to ask, but it might be better for you in the long run to search SQL Server Central for articles on ROW_NUMBER. There are quite a few people that have a better understanding of it and can explain things better than I.

  • Hi Again,

    I have executed this query in Full SQL 2005 Server and it worked! I got the row number i expected. As long as the data is consitent and not mixed the data in the returned rows then finally i think i am there... well nearly. The sample code above was for a dummy table to get a proof of concept.

    Here is the actual code and structure i used based on the above example. I needed to extract unique learner records one row per learner (distinct rows) from table wce_ilr and insert them into wce_contact. The following code worked perfectly to do that.

    INSERT INTO wce_contact (UNIQUEID,contact,firstname,middle_name,lastname,stu_id,dob,

    gender,national_insurance_no,Learning_Difficulties,Learning_Diff,Disability,ethnicity,

    address1,address2,address3,city,postalcode,phone,emailaddress,mobilephone,record_type)

    SELECT stu_id, Student_Forename + ' ' + Surname, Student_Forename, middlename,

    Surname, stu_id, Date_of_Birth, Sex, NI_Number, Learning_Difficulty, Learning_Diff,

    Disability, Ethnicity, Student_Address_1, Student_Address_2, Student_Address_3, Student_Address_4,

    Student_Postcode, Telephone, Email_Address, Mobile_Number, 'Learner'

    FROM (SELECT ROW_NUMBER() OVER (PARTITION BY stu_id

    ORDER BY stu_id) r, *

    FROM wce_ilr) A

    WHERE r = 1

    The next step is to every day cross reference those tables and where a learner does not exist in the wce_ilr table but does in the wce_contact table copy that learner over to the wce_ilr table. So to test this i tried the following code but i get this error:

    Error for following code:

    Msg 8156, Level 16, State 1, Line 1

    The column 'Disability' was specified multiple times for 'A'.

    SELECT stu_id, Student_Forename + ' ' + Surname, Student_Forename, middlename,

    Surname, stu_id, Date_of_Birth, Sex, NI_Number, Learning_Difficulty, Learning_Diff,

    Disability, Ethnicity, Student_Address_1, Student_Address_2, Student_Address_3, Student_Address_4,

    Student_Postcode, Telephone, Email_Address, Mobile_Number, 'Learner'

    FROM (SELECT ROW_NUMBER() OVER (PARTITION BY wce_ilr.stu_id

    ORDER BY wce_ilr.stu_id) r, *

    FROM wce_ilr LEFT OUTER JOIN

    wce_contact ON wce_ilr.stu_id = wce_contact.stu_id COLLATE database_default

    WHERE (wce_contact.stu_id IS NULL)

    AND wce_ilr.stu_id IS NOT NULL) A

    WHERE r = 1

    Here is the bit i added to my working exmaple that copies all the rows without checking for duplicates.

    LEFT OUTER JOIN

    wce_contact ON wce_ilr.stu_id = wce_contact.stu_id COLLATE database_default

    WHERE (wce_contact.stu_id IS NULL)

    AND wce_ilr.stu_id IS NOT NULL

    Can anyone see why i am getting this error and if this is the best way to cross reference the table?

    Thanks for any help.

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply