Easy (I hope) join question

Question

Post reply

Easy (I hope) join question

taigovinda

Default port

Points: 1406
More actions
August 1, 2013 at 2:53 pm

#280743

Hi,
I have two tables with many fields in them. I am only interested in a few of the fields, and both of the tables have records that are duplicates on the fields I am interested in. I think if I do a left outer join A on B I will get all the records in A, but some of them will become duplicated because they exist in B more than once. How do I do a join so that I keep all the records from table A and fill in the missing blanks, but do not create additional rows? Here's a small example:
table A
Name | Address
Tai | #
table B
Name | Address
Tai | 123 Main St
Tai | 123 Main St
Tai | 123 Main St
Mike | 456 Broadway
desired output:
Tai 123 Main St
I do not expect there will ever be more than one distinct address for each name. However, if there is more than one distinct address then I would take one randomly and don't care if there are others that I did not get. I've done something similar to this but I had to use ROW_NUMBER() OVER (PARTITION BY... which given my newbie status seems complicated and it seemed like there should be an easier way.
Thanks for the help!
Tai

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply

Sean Lange SSC Guru Points: 286609 More actions · Answer 1

DISTINCT?

GROUP BY?

It is kind of hard to say without more details about what your actual query looks like.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

taigovinda Default port Points: 1406 More actions · Answer 2

My data example was flawed... suppose it is this instead. Then how would the query be written? Thanks.

table A

Name | Address

Tai | #

Mike | #

table B

Name | Address

Tai | 123 Main St

Tai | 897 Main St

Tai | 123 Main St

Mike | 456 Broadway

desired output:

Tai | 123 Main St

Mike | 456 Broadway

OR

desired output:

Tai | 897 Main St

Mike | 456 Broadway

(I don't care which of the two outputs I get... If I use "select a.name, b.address from a left outer join b on a.name = b.name" then I will end up with more than the two rows I am after.)

Lowell SSC Guru Points: 323755 More actions · Answer 3

here's my best guess for you: in this case,note how i provided a CREATE TABLE and sample data via INSERT INTO?

if you can do the same in future posts, you'll be able to get tested, working answers from our volunteers.

here i'm using row_number to partition by the name, and since we don't care which record we get, the order by doesn't have much of an impact.

the second query, where i limit it just to the first matching row is what i think you are after:

CREATE TABLE #MySampleData([Name] varchar(30),[Address]varchar(30))

INSERT INTO #MySampleData

SELECT 'Tai ',' 123 Main St' UNION ALL

SELECT 'Tai ',' 897 Main St' UNION ALL

SELECT 'Tai ',' 123 Main St' UNION ALL

SELECT 'Mike ',' 456 Broadway' UNION ALL

SELECT 'Mike ',' 456 Broadway'

SELECT row_number() over(partition by name order by name) AS RW,* FROM #MySampleData

select * FROM

(

SELECT row_number() over(partition by name order by name) AS RW,* FROM #MySampleData

)

WHERE RW = 1

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Sean Lange SSC Guru Points: 286609 More actions · Answer 4

Here are two examples of how you could do that.

--method #1

with MyAddresses as

(

select a.name, b.address, ROW_NUMBER() over (partition by a.name order by (select null)) as RowNum

from a left outer join b on a.name = b.name

)

select *

from MyAddresses

where RowNum = 1

--Method #2

select *

from

(

select a.name, b.address, ROW_NUMBER() over (partition by a.name order by (select null)) as RowNum

from a left outer join b on a.name = b.name

)x

where x.RowNum = 1

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

taigovinda Default port Points: 1406 More actions · Answer 5

Thanks to both of you for the code. And thanks for the tip on posting code to create the sample data.

I am actually already doing it the way you both have shown... as I am just getting started though it is very good to know that is the right way, though, instead of a bloated way.

Tai

Alan Burstein SSC Guru Points: 61152 More actions · Answer 6

This is what I came up with

--SAMPLE DATA

IF OBJECT_ID('tempdb..#table_A') IS NOT NULL

DROP TABLE #table_A;

IF OBJECT_ID('tempdb..#table_B') IS NOT NULL

DROP TABLE #table_B;

CREATE TABLE #table_A (name varchar(20), [address] varchar(20));

CREATE TABLE #table_B (name varchar(20), [address] varchar(20));

INSERT #table_A VALUES ('Tai','#'),('Mike','#');

INSERT #table_B VALUES

('Tai','123 Main St'),('Tai','897 Main St'),('Tai','123 Main St'),

('Mike','456 Broadway'),('Mike','456 Broadway');

--The query

WITH distinct_folks(name, r, [address]) AS

(SELECT DISTINCTname, RANK() OVER (PARTITION BY name ORDER BY [address]),

[address]

FROM #table_B)

SELECT b.name, b.[address]

FROM distinct_folks AS b

JOIN #table_A AS a ON b.name=a.name

WHERE r=1

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

taigovinda Default port Points: 1406 More actions · Answer 7

Further confirmation that this is the right way to do it... I just had expected there would be something that did not require two queries and was thus simpler. Thanks Alan.

Tai

Lowell SSC Guru Points: 323755 More actions · Answer 8

I just noticed every one of us skipped over a basic group by as an alternative, and went straight to row_number.

this is probably the simplest:

SELECT

name,

min([address]) as address

From MySampleData

GROUP BY name

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Sean Lange SSC Guru Points: 286609 More actions · Answer 9

Lowell (8/2/2013)
I just noticed every one of us skipped over a basic group by as an alternative, and went straight to row_number.
this is probably the simplest:
SELECT
name,
min([address]) as address
From MySampleData
GROUP BY name

haha you're right Lowell. I think for me it was the "doesn't matter which one" so I went to row_number instinctively so i could 'randomize' it with a useless order by. Good catch.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

taigovinda Default port Points: 1406 More actions · Answer 10

Good point... For the sample data that is definitely smarter. In my actual data I have about 6 fields that are like 'address' in that there could be multiple distinct values and I only want to pick one - I suppose I could put min() for each one but that would probably be more expensive, right?

Sean Lange SSC Guru Points: 286609 More actions · Answer 11

taigovinda (8/2/2013)
Good point... For the sample data that is definitely smarter. In my actual data I have about 6 fields that are like 'address' in that there could be multiple distinct values and I only want to pick one - I suppose I could put min() for each one but that would probably be more expensive, right?

Actually MIN would be less expensive than using Row_Number. Since you say you only want to pick one, you have to either make a decision about the rules for which one or get a random one.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

taigovinda Default port Points: 1406 More actions · Answer 12

taigovinda

Default port

Points: 1406

August 2, 2013 at 11:22 am

#1638289

Thanks!

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 13

taigovinda (8/2/2013)
Good point... For the sample data that is definitely smarter. In my actual data I have about 6 fields that are like 'address' in that there could be multiple distinct values and I only want to pick one - I suppose I could put min() for each one but that would probably be more expensive, right?

Careful... if you put MIN on each column, you could end up having Detroit in California. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

taigovinda Default port Points: 1406 More actions · Answer 14

Thanks for realizing that I needed that caution!

(I had to test the code that Lowell provided in order to understand that the same address would not be returned for both rows. I see, hopefully correctly, that it is the group by which ensures Mike doesn't get Tai's address but would not prevent mismatches within one person's distinct addresses...)