Join two tables that do not have matching keys

Question

Post reply

Join two tables that do not have matching keys

henryvuong1000

SSC Eights!

Points: 821

February 25, 2017 at 6:41 pm

#335269

I have two tables like this:

Items

ItemID (PK)	Title
101	iPhone 5
102	iPhone 6
103	Blackberry Storm
104	Samsung Galaxy S1
105	Samsung Galaxy S6

Sales

SaleID (PK)	Reference	Title
1	103-B00	Blackberry Storm
2	101-IK5	iPhone 5
3	105-ISG	Samsung Galaxy S6
4	104-UN3	Samsung Galaxy S1
5	101-IK5	iPhone 5
6	102-HWB	iPhone 6
7	105-ISG	Samsung Galaxy S6
8	101-IK5	iPhone 5
9	102-HWB	iPhone 6

I want to join them using "ItemID" column in Items table and "Reference" column in Sales table. As you can see, the values of two columns are not matching, but the part before the hyphen in each value of "Reference" column is matched with values in "ItemID" column. How can I join them? I know it's odd to have two columns with such values, but I did not create these tables and must play with what's available. Thanks

March 22, 2017 at 3:31 am

This was removed by the editor as SPAM

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

Henrico Bekker One Orange Chip Points: 27652 More actions · Answer 1

Join on LEFT(3, reference) - assuming the nrs will always be less than a 1000.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This thing is addressing problems that dont exist. Its solution-ism at its worst. We are dumbing down machines that are inherently superior. - Gilfoyle

pietlinden SSC Guru Points: 63512 More actions · Answer 2

DECLARE @Ref VARCHAR(7) = '103-B00'; SELECT LEFT(@Ref,CHARINDEX('-',@Ref,1)-1);

Ed Wagner SSC Guru Points: 287026 More actions · Answer 3

The problem is that isolation of the part of the column from your Sales table is going to need to be calculated for every row in the table. This isn't every row in the result set (because it isn't built yet) but every row in the table. As the number of rows goes up, so will the performance problems. This is only of the problems with having values concatenated into one column instead of keeping the values separate.

If you're stuck with the existing structure, can you add a column? Specifically, can you add a persisted computed column to hold the value of the foreign key? If so, you'll isolate the calculation of your value to when the row is written and save yourself from having to calculate it over the whole table every time the tables are joined. If not, then you'll save yourself some work in the future by using the approach pietlinden suggested. The work you save will be when the values to isolate goes above 999, when you'll have to switch to finding the position of the delimiter anyway.

The best way is to fix the table structure, but I understand if you're stuck with it.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Neil Burton SSC-Insane Points: 23010 More actions · Answer 4

CREATE TABLE #items ---- Build test tables, credit to the great Jeff Moden for this bit. ( ItemID INT IDENTITY(101,1) ,Title VARCHAR(6) ) ALTER TABLE #items ADD CONSTRAINT PK_ItemID PRIMARY KEY CLUSTERED (ItemID) WITH FILLFACTOR = 90 GO

INSERT INTO #items SELECT TOP 100 -- Create 100 unique items Title = CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))

FROM sys.all_columns ac1 CROSS JOIN sys.all_columns ac2 GO

CREATE TABLE #sales ( SaleID INT IDENTITY(1,1) ,Reference VARCHAR(20) ,Title VARCHAR(40) )

ALTER TABLE #sales ADD CONSTRAINT PK_SaleID PRIMARY KEY CLUSTERED (SaleID) WITH FILLFACTOR = 90

GO INSERT INTO #sales SELECT ---- Add 10000 sales for each item, 1000000 rows in total CAST(i.itemID AS VARCHAR(6)) + '-' + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) AS Reference ,i.Title FROM #Items i GO 10000

/* The actual query starts here */ SELECT i.ItemID ,i.Title ,der.CleanReference ,der.Reference ,der.Title ,der.SaleID FROM #Items i JOIN ( SELECT CAST(LEFT(s.Reference,CHARINDEX('-',s.Reference)-1) AS INT) AS CleanReference ,s.Reference ,s.Title ,s.SaleID FROM #Sales s ) der ON der.CleanReference = i.ItemID DROP TABLE #Items ,#sales

Does this do what you want? I can't help thinking it's too obvious and there's a glaring flaw though, The whole thing runs in about 22 seconds when the tables are built each time and in about 10 seconds when the tables aren't dropped. That's on a crappy test box so on something with a bit of oomph it might be pretty quick.

--- Edited to make the data more realistic

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 5

SELECT *

FROM Items i

INNER JOIN Sales s

ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'

-- Retains SARGability of s.Reference

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Neil Burton SSC-Insane Points: 23010 More actions · Answer 6

ChrisM@Work - Wednesday, March 22, 2017 6:25 AM
SELECT *
FROM Items i
INNER JOIN Sales s
ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'
-- Retains SARGability of s.Reference

I'm not sure what the advantages of that are Chris, I was under the impression that functions in JOINs was to be avoided. It's considerably slower than my version on my laptop against the same million row tables.

--Edited to correct a mistake in terminology.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 7

Neil Burton - Wednesday, March 22, 2017 6:53 AM
ChrisM@Work - Wednesday, March 22, 2017 6:25 AM
SELECT *
FROM Items i
INNER JOIN Sales s
ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'
-- Retains SARGability of s.Reference
I'm not sure what the advantages of that are Chris, I was under the impression that functions in WHERE clauses were to be avoided. It's considerably slower than my version on my laptop against the same million row tables.

Put an index on s.reference and restrict the output columns to i.ItemID and s.Reference.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

CREATE INDEX ix_Stuff ON #sales (Reference) INCLUDE (Title)

SELECT i.ItemID, s.Title

FROM #Items i

INNER JOIN #Sales s

ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'

-- 1,000,000 / 00:00:04

Execution plan shows an index seek for #Sales.

Edit: "blackholing" the output to eliminate return & display of the 1000000-row result set.

SET

STATISTICS TIME ON

DECLARE @ItemID INT, @Title VARCHAR(40)

SELECT @ItemID = i.ItemID, @Title = s.Title

FROM #Items i

INNER JOIN #Sales s

ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'

SET STATISTICS TIME OFF

-- (1 row(s) affected)

-- SQL Server Execution Times:

-- CPU time = 1214 ms, elapsed time = 402 ms.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Neil Burton SSC-Insane Points: 23010 More actions · Answer 9

I see. It's logical that a function on the right of an ON (which is what I meant when I said WHERE clause above) won't affect SARGability.

It certainly speeds things up having the index in place. Over the million rows my version still seems to edge it though. When there is a WHERE clause in place mine is definitely faster.

SELECT i.ItemID ,der.Reference FROM #Items i JOIN ( SELECT CAST(LEFT(s.Reference,CHARINDEX('-',s.Reference)-1) AS INT) AS CleanReference ,s.Reference FROM #Sales s

) der ON der.CleanReference = i.ItemID WHERE der.Reference LIKE '10%'

SELECT i.itemid ,s.reference FROM #Items i

INNER JOIN #Sales s ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%' WHERE s.Reference LIKE '10%'

Out of those two queries mine is consistently much quicker.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 10

Neil Burton - Wednesday, March 22, 2017 8:33 AM
I see. It's logical that a function on the right of an ON (which is what I meant when I said WHERE clause above) won't affect SARGability.
It certainly speeds things up having the index in place. Over the million rows my version still seems to edge it though. When there is a WHERE clause in place mine is definitely faster.

SELECT i.ItemID ,der.Reference FROM #Items i JOIN ( SELECT CAST(LEFT(s.Reference,CHARINDEX('-',s.Reference)-1) AS INT) AS CleanReference ,s.Reference FROM #Sales s
) der ON der.CleanReference = i.ItemID WHERE der.Reference LIKE '10%'
SELECT i.itemid ,s.reference FROM #Items i
INNER JOIN #Sales s ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%' WHERE s.Reference LIKE '10%'
Out of those two queries mine is consistently much quicker.

It's quicker because a hash match will process 1000000 rows faster than 1000000 individual seeks. LIKE won't play with a hash join.
If you change the row restriction to a filter on #Item (quite likely in the wild) you see the nested loops join (with seeks) overtakes the blunt-force hash join, by a significant amount:

--Run each batch individually several times

DECLARE@ItemID INT, @Reference VARCHAR(50),@Title VARCHAR(50) = (SELECT TOP 1Title FROM #Items)

SET STATISTICS TIME ON

SELECT

@ItemID = i.ItemID

,@Reference= der.Reference

FROM#Items i

JOIN (

SELECT

CAST(LEFT(s.Reference,CHARINDEX('-',s.Reference)-1) AS INT) ASCleanReference

,s.Reference

FROM#Sales s

)der ON der.CleanReference = i.ItemID

WHERE i.Title= @Title

SET STATISTICS TIME OFF

--CPU time = 672 ms, elapsed time = 134ms.

GO

DECLARE@ItemID INT, @Reference VARCHAR(50),@Title VARCHAR(50) = (SELECT TOP 1Title FROM #Items)

SET STATISTICS TIME ON

SELECT

@ItemID = i.itemid

,@Reference= s.reference

FROM#Items i

INNER JOIN#Sales s

ON s.ReferenceLIKE CAST(i.ItemIDAS VARCHAR(50)) + '-%'

WHERE i.Title= @Title

SET STATISTICS TIME OFF

--CPU time = 16 ms, elapsed time = 9 ms.

GO

DECLARE@ItemID INT, @Reference VARCHAR(50),@Title VARCHAR(50) = (SELECT TOP 1Title FROM #Items)

SET STATISTICS TIME ON

SELECT

@ItemID = i.itemid

,@Reference= s.reference

FROM#Items i

INNER JOIN#Sales s

ON s.ReferenceBETWEEN CAST(i.ItemIDAS VARCHAR(50)) AND CAST(i.ItemIDAS VARCHAR(50)) + 'Â¬'

WHERE i.Title= @Title

SET STATISTICS TIME OFF

--CPU time = 16 ms, elapsed time = 5 ms.

GO

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden