August 3, 2009 at 2:57 am
Hi,
Could someone cast an eye over this query and tell me whether it will do what I think it will do?
I am looking to extract data from the wce_ilr table and insert it into the wce_contact table, but I only want to insert data that does not already exist in the wce_contact table; this is very important.
The criteria to match on are wce_ilr.employer_postcode = wce_contact.postalcode AND wce_ilr.company_name = wce_contact.company. Where both match, ignore those records and insert only the ones that don't match.
I have not added the INSERT part to my query, as I only need to work on the SELECT section; I have the INSERT part working fine.
Thanks in advance.
Below is the SELECT part of my query.
SELECT wce_ilr.EDRS_no AS Uniqueid, MAX(wce_ilr.EDRS_no) AS EDRS_no, MAX(wce_ilr.company_Name) AS Employer_name, MAX(wce_contact.company) AS comp,
MAX(wce_ilr.Employer_Address_1) AS Address1, MAX(wce_ilr.Employer_Address_2) AS Address2, MAX(wce_ilr.Employer_Address_3) AS Address3,
MAX(wce_ilr.Employer_Address_4) AS Address4, MAX(wce_ilr.Employer_postcode) AS Postcode,
'Company' AS Record_Type
FROM wce_ilr LEFT OUTER JOIN
wce_contact ON wce_ilr.employer_postcode = wce_contact.postalcode AND wce_ilr.company_name = wce_contact.company COLLATE database_default
WHERE (wce_contact.EDRS IS NULL)
AND wce_ilr.EDRS_no IS NOT NULL
GROUP BY wce_ilr.EDRS_no
ORDER BY wce_ilr.EDRS_no
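For reference, the "only insert data that does not already exist" check could also be written with NOT EXISTS instead of the LEFT JOIN / IS NULL test. This is only a minimal sketch of that part of the logic (it does not deal with the one-row-per-company question), using the same column names as the query above:
SELECT i.EDRS_no, i.company_Name, i.Employer_postcode
FROM wce_ilr i
WHERE i.EDRS_no IS NOT NULL
  AND NOT EXISTS (SELECT 1
                  FROM wce_contact c
                  WHERE c.postalcode = i.Employer_postcode
                    AND c.company = i.company_Name COLLATE database_default) -- same match criteria as the join above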
August 3, 2009 at 7:10 am
It looks basically OK, but I'm curious why you're aggregating the data and selecting the max value. Are there multiple entries with the same key value? That's a bit of a relational problem, and how do you know that MAX(Address) is the address value you want? Assuming unique entries for the key values, I don't see why you'd need to do it this way.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 3, 2009 at 8:31 am
Thanks for the reply.
There are many rows with the same company details, so a company might appear 50+ times in the table, but I only want one row for each company where a wce_ilr.EDRS_no exists. There are 110,000 rows in the wce_ilr table, so it's hard to look over them to see whether the query I used is working correctly. I tried DISTINCT but that was no good, and someone recommended the MAX function.
Once I have the correct results they will be inserted into the wce_contact table to create a parent contact record, so I can't have duplicates.
If you can advise me differently I am always happy to learn. Thanks!
August 3, 2009 at 9:18 am
Well, instead of MAX and an aggregate function, I'd suggest using TOP 1 and an ORDER BY, if you can supply a column that can be ordered in such a way that you get the value you want. It's going to work better than the aggregate from a performance standpoint, but only if you can supply a good ordering column.
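As a rough sketch only: to get one row per company with TOP (1) you would normally pair it with something like CROSS APPLY, ordering by whatever column makes sense. Here date_added is purely hypothetical, standing in for an ordering column that may or may not exist in wce_ilr:
SELECT x.*
FROM (SELECT DISTINCT company_Name
      FROM wce_ilr
      WHERE EDRS_no IS NOT NULL) c
CROSS APPLY (SELECT TOP (1) i.EDRS_no, i.company_Name, i.Employer_Address_1, i.Employer_postcode
             FROM wce_ilr i
             WHERE i.company_Name = c.company_Name
             ORDER BY i.date_added DESC) x  -- date_added is a hypothetical ordering column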
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 3, 2009 at 10:32 am
Thanks for that, I will try it out and let you know how I get on.
August 4, 2009 at 12:37 am
Hi, I tried to find a good ordering field but have nothing.
I'm not really sure what you could use to get this using TOP 1. Am I right in thinking that each duplicated record, no matter how many duplicates there are per company, would need one identified differently from the rest to be able to use TOP 1?
August 4, 2009 at 9:24 am
I would recommend ROW_NUMBER(). How can you determine, if a company has 50 rows, which one row you would like to see? Are they all the same for the columns that you are selecting?
BTW, it sounds like you have some design issues. If you can, I would definitely recommend normalizing.
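A minimal sketch of what that could look like against the wce_ilr table described above, assuming company_Name is the grouping column (the ORDER BY inside the OVER clause is just a placeholder; pick whatever ordering suits your data):
SELECT *
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY company_Name
                                ORDER BY EDRS_no) AS rn, *
      FROM wce_ilr) a
WHERE rn = 1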
August 5, 2009 at 5:53 am
sward (8/4/2009)
Hi, I tried to find a good ordering field but have nothing. I'm not really sure what you could use to get this using TOP 1. Am I right in thinking that each duplicated record, no matter how many duplicates there are per company, would need one identified differently from the rest to be able to use TOP 1?
That's true, but by selecting MAX you're just going to get a random value. If that works... OK, but it seems to point to a flaw in your storage design.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 10, 2009 at 12:07 am
The problem is the design of the table, but it's the best I have to work with as I have no control over the design or the data populating it.
The table stores, per row: a student and their contact data, the company they are associated with and its contact data, and the course that learner is on. A learner can be on many courses, which equates to many rows in the table per learner. There is over 5 years' worth of data in this table. There is no consistency in the data and no specific unique company IDs, which means I have to find a way to search the table and show only one row per company, hence the MAX() function; it might not be the best way, but it does show only one company record.
Any thoughts still appreciated. Thanks.
August 10, 2009 at 7:02 am
That's tough. But in terms of performance, you could still use a TOP with an ORDER BY and arrive at one row, if somewhat randomly. It usually performs better, but your mileage may vary, etc.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 11, 2009 at 6:58 am
Hello again,
I need some clarity on the MAX() function I am using. As you will see from the posts above, I have lots of data in a badly designed table. As long as I can get consistent results from this data with the static queries I am going to use, I will be happy.
Below is a very small example of some data and results. What I need to be sure of, and I can't be sure from the replies so far, is:
Does the MAX() function always give the same result every time it's run? Can I be sure that the one instance of the duplicated row is the same as the last time I ran the query?
Does the function look at every column, and how does it determine which to display? Could it, for example, ever put the value Brighton for the company Tesco when it should always be London? In other words, can the data get swapped around, giving me false cross-record information?
There might be 6 Tesco records where someone has entered the postcodes incorrectly; in the example below, 3 are 'nw17rt' and 3 are 'nw17rj'. I only need to show one Tesco record, and to be honest it's not important which one it is, but it must be the same one every time I run the query or I will be in trouble.
If anyone could answer these questions I would be most appreciative, as I am working to a deadline to tidy this data and have no unique IDs to work with for company records.
create table company(company varchar(50), contact varchar(50), address varchar(50), city varchar(50), county varchar(50), postcode varchar(50))
insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rt')
insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rt')
insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rt')
insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rj')
insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rj')
insert into company values ('tesco', 'john smith', 1, 'london', 'kent', 'nw17rj')
insert into company values ('woolworths', 'james roberts', 2, 'brighton', 'sussex', 'tm456tl')
insert into company values ('woolworths', 'james roberts', 2, 'brighton', 'sussex', 'tm456tl')
insert into company values ('woolworths', 'james roberts', 2, 'brighton', 'sussex', 'tm456tl')
insert into company values ('woolworths', 'james roberts', 2, 'brighton', 'sussex', 'tm456tl')
insert into company values ('barnets', 'peter smith', 3, 'main road', 'sussex', 'pt56ylt')
insert into company values ('argos', 'jim harris', 4, 'brighton', 'sussex', 'tm456tl')
insert into company values ('camping co', 'jim harris', 5, 'brighton', 'sussex', 'we987ly')
SELECT MAX(company) AS Expr1, MAX(contact) AS Expr2, MAX(address) AS Expr3, MAX(city) AS Expr4, MAX(county) AS Expr5, MAX(postcode) AS Expr6
FROM company
GROUP BY company
Results:
argos       jim harris      4   brighton    sussex   tm456tl
barnets     peter smith     3   main road   sussex   pt56ylt
camping co  jim harris      5   brighton    sussex   we987ly
tesco       john smith      1   london      kent     nw17rt
woolworths  james roberts   2   brighton    sussex   tm456tl
August 11, 2009 at 11:15 am
Here is what I would recommend, instead:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY company ORDER BY company) r, *
FROM company) A
WHERE r = 1
If you have a flag that it would make sense to sort by, you can add it to your order by. For example, if you have a modified date, you may want to pull the latest entry for a company. You could use "ORDER BY ModifiedDate DESC" in your ROW_NUMBER statement. This will pull one and only one row for a specific company, so you will never get street from one line and zip from another.
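For example, a sketch with a hypothetical ModifiedDate column (the sample company table above does not actually have one; it stands in for whatever "latest entry" column you have):
SELECT *
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY company
                                ORDER BY ModifiedDate DESC) AS r, *  -- ModifiedDate is hypothetical
      FROM company) A
WHERE r = 1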
August 12, 2009 at 5:09 am
Hi, thanks for your reply. I am working through it now to make sure I understand what is happening.
When I first run it I get the error "The OVER SQL construct or statement is not supported." However, after clicking OK on the error it produces the results I would expect. I am using SQL Express; maybe that is the reason for the error.
Am I right in thinking the following statement does this:
It first selects all the rows in the company table; where it sees a duplicated company name in the company column it gives each row its own row number, at which point the rows are partitioned so that only the first row of each company is shown. This is then ordered by the company name.
Is that right? I have not used ROW_NUMBER() OVER (PARTITION BY ...) before, so I want to make sure my understanding is correct.
Thanks again.
SELECT *
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY company
ORDER BY company) r, *
FROM company) A
WHERE r = 1
August 12, 2009 at 6:32 am
ROW_NUMBER is one of the ranking functions available in 2005 and 2008. I am not 100% sure on the internals, but because SQL Server is building the rowset, it is able to optimize certain pseudo-aggregates as it pulls additional rows. This has an overhead, but because of the design, and the fact that it is building this as it builds your recordset (an assumption on my part), these ranking functions are much quicker than doing separate subqueries etc.
To your specific issue... "ROW_NUMBER()" starts the command; the "OVER (ORDER BY X)" part is required, and the ORDER BY tells the ROW_NUMBER function how to order its row numbering. The optional "PARTITION BY" causes it to reset the numbering for specific groups. I use this quite a bit when I am working with temporal tables (I track history on tables where no history exists, using journals from a dissimilar database, but only care about the status at the end of a day) and have found it an invaluable tool when you need to use data that is not quite optimal in its structure.
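As a quick illustration against the sample company table posted above: per_company_rn goes back to 1 every time the company changes, while overall_rn just keeps counting (the ordering columns here are only examples).
SELECT company, postcode,
       ROW_NUMBER() OVER (ORDER BY company) AS overall_rn,
       ROW_NUMBER() OVER (PARTITION BY company ORDER BY postcode) AS per_company_rn
FROM company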
If you have any questions feel free to ask, but it might be better for you in the long run to search SQL Server Central for articles on ROW_NUMBER. There are quite a few people who have a better understanding of it and can explain things better than I can.
August 12, 2009 at 6:34 am
Hi Again,
I have executed this query on the full SQL Server 2005 and it worked! I got the row numbers I expected. As long as the data is consistent and the returned rows don't mix data from different records, I think I am finally there... well, nearly. The sample code above was for a dummy table, as a proof of concept.
Here is the actual code and structure I used, based on the example above. I needed to extract unique learner records, one row per learner (distinct rows), from the wce_ilr table and insert them into wce_contact. The following code worked perfectly to do that.
INSERT INTO wce_contact (UNIQUEID,contact,firstname,middle_name,lastname,stu_id,dob,
gender,national_insurance_no,Learning_Difficulties,Learning_Diff,Disability,ethnicity,
address1,address2,address3,city,postalcode,phone,emailaddress,mobilephone,record_type)
SELECT stu_id, Student_Forename + ' ' + Surname, Student_Forename, middlename,
Surname, stu_id, Date_of_Birth, Sex, NI_Number, Learning_Difficulty, Learning_Diff,
Disability, Ethnicity, Student_Address_1, Student_Address_2, Student_Address_3, Student_Address_4,
Student_Postcode, Telephone, Email_Address, Mobile_Number, 'Learner'
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY stu_id
ORDER BY stu_id) r, *
FROM wce_ilr) A
WHERE r = 1
The next step is to cross-reference those tables every day and, where a learner does not exist in the wce_ilr table but does in the wce_contact table, copy that learner over to the wce_ilr table. So to test this I tried the following code, but I get this error:
Error for following code:
Msg 8156, Level 16, State 1, Line 1
The column 'Disability' was specified multiple times for 'A'.
SELECT stu_id, Student_Forename + ' ' + Surname, Student_Forename, middlename,
Surname, stu_id, Date_of_Birth, Sex, NI_Number, Learning_Difficulty, Learning_Diff,
Disability, Ethnicity, Student_Address_1, Student_Address_2, Student_Address_3, Student_Address_4,
Student_Postcode, Telephone, Email_Address, Mobile_Number, 'Learner'
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY wce_ilr.stu_id
ORDER BY wce_ilr.stu_id) r, *
FROM wce_ilr LEFT OUTER JOIN
wce_contact ON wce_ilr.stu_id = wce_contact.stu_id COLLATE database_default
WHERE (wce_contact.stu_id IS NULL)
AND wce_ilr.stu_id IS NOT NULL) A
WHERE r = 1
Here is the bit I added to my working example that copies all the rows without checking for duplicates.
LEFT OUTER JOIN
wce_contact ON wce_ilr.stu_id = wce_contact.stu_id COLLATE database_default
WHERE (wce_contact.stu_id IS NULL)
AND wce_ilr.stu_id IS NOT NULL
Can anyone see why I am getting this error, and whether this is the best way to cross-reference the tables?
Thanks for any help.