LIKE operator with millions of records giving slow performance

Question

Post reply

LIKE operator with millions of records giving slow performance

Santhosh Yamsani

Old Hand

Points: 388
More actions
July 29, 2010 at 4:13 am

#224760

Hii all..!!
I have Two tables named dbo.Person and dbo.Person_exclusion.These two tables contain millions of records.
Structure of dbo.Person
SNo | Name | Age | ExclusionID
Structure of dbo.Person_Exclusion
Name_similar | ID
So my basic requirement is to
update ( dbo.Person.ExclusionID) with (dbo.Person_Exclusion.ID ) when
(dbo.Person.Name) like (dbo.Person_Exclusion.Name_similar)
For Example:
sample records in dbo.Person:
SNo | Name | Age | ExclusionID
1 | Santhosh | 22 | Null
2 | Sandeep | 21 | Null
3 | Raju | 20 | Null
4 | Raja |18 | Null
5 | Rajesh | 25 | Null
sample records in dbo.Person_exclusion:
Name_similar | ID
sa% | 56
san% | 53
sand% | 57
%fgbm | 89
raj% | 84
raje% | 76
ghji%plk | 63
Resultant set should be:
SNo | Name | Age | ExclusionID
1 | Santhosh | 22 | 53
2 | Sandeep | 21 | 57
3 | Raju | 20 | 84
4 | Raja |18 | 84
5 | Rajesh | 25 | 76
I tried in following way :
update a set a.ExclusionID = b.ID
FROM dbo.Person a
join dbo.Person_exclusion b
on a.Name like b.Name_similar
But it is taking loooooooong time to execute.Almost hours to execute.So can u please suggest new and best performance way which fulfills my requirement.
THANKS IN ADVANCE

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply

Pulivarthi Sasidhar SSCertifiable Points: 6706 More actions · Answer 1

U are joining the tables on Varchar field that to not with matching conditionds....

Joining with the like operation ....from above scenario it creates more records...

like 'san%'

It will joins all name starts with 'san'

like 'sa%'

It will joins all names starts with sa and the all above also...

So the joining condidtion creating more records.

Better to use join with some identity or ID field...

Create index on the table.

Thanks,

Sasidhar Pulivarthi

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 2

Santhosh Yamsani (7/29/2010)
...
I tried in following way :
update a set a.ExclusionID = b.ID
FROM dbo.Person a
join dbo.Person_exclusion b
on a.Name like b.Name_similar
But it is taking loooooooong time to execute.Almost hours to execute.So can u please suggest new and best performance way which fulfills my requirement.
THANKS IN ADVANCE

1st. Join on LIKE cannot be expected to be very fast. Especially as some of yours "like" values starts from wild character, the indices will not be used.

2nd. How many records is selected for update? If millions in one go, then you cannot expect it happen too fast as well.

Can you advise how many records you have in dbo.Person_exclusion table?

What you can try to do is separate your queries based on Name_similar values in to two: first will only operate with Name_similar values which have wildchar at the end, so index can be used, and the second one the rest of Name_similar.

Also you can try to update in batches: for example 100K of records at the time.

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Raj Deep Alapati Ten Centuries Points: 1118 More actions · Answer 3

(On behalf of santosh)

Hi

We understand the join from left side is returning more records. so we have a slight change in our requirement to return first record matched

Supppose in above case Santosh is matching with sa% (first record getting matched,ignore the other records ).so the id of the record is 56

I want to update the Exclusion id for 56..

Can you suggest me a best way in handling this. Joining on varchar column is unavoidable.

Thanks

Raj Deep

Raj Deep Alapati Ten Centuries Points: 1118 More actions · Answer 4

There are 5 million records in person table and 4 million in person exclude table

We are trying in different ways but couldn't find a convincing solution. please suggest what can be the best option

Thanks

Raj Deep

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 5

Based on your "new" requirements, I suggest you to read my previous post which can give you an idea of how to try to speed up your process...

To make it faster you will need to split your query.

Split set of data in your Person_exclusion based on the Position of wildchar and the length of the "like" value into separate temp tables (use "select into for speed"), then join to them for update of your person table, add the filter to only update records which doesn't have ExclusionID set. You might also need to do it in batches of 100K

You will need to test it to find the most optimal split option.

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Santhosh Yamsani Old Hand Points: 388 More actions · Answer 6

Hii all..!!

Actually dbo.person structure is

SNo | Name | Age |

and Structure of dbo.Person_exclusion is

| Name_Similar |

My aim is : To Exclude all the names from dbo.person which matches with Name_similar in dbo.Person_exclusion.

So for that I added one column to both tables i.e.,( ID(row_number) to dbo.Person_exclusion and ExclusionID(NULL) to dbo.Person) and i updated the ExclusionId with ID where it Name and Name_similar matches..

Then i can exclude all the names by using a condtion ExclusionID not null...

This is my approach.

Can I do the Exclusion process in any other way..As my process taking loong time..Can any one suggest new way of approach for this EXCLUSION process....

Thanks in ADVANCE

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 7

Can you test how long the following query will run and how many records does it returns:

SELECT a.*

FROM dbo.Person a

LEFT JOIN dbo.Person_exclusion b

ON a.Name LIKE b.Name_similar

WHERE b.Name_similar IS NULL

You should understand that the most of time taken by your query is for resolving LIKE join. You can only significantly reduce this time by minimazing time for the join, that is why you need to do it in batches

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

Santhosh Yamsani (7/29/2010)
-- update a set a.ExclusionID = b.ID
SELECT COUNT(*)
FROM dbo.Person a
join dbo.Person_exclusion b
on a.Name like b.Name_similar

Try running this statement to see the size of the working table. That's why your query is slow.

Get some idea of the dupes within your table with -

SELECT Name, COUNT(*) FROM dbo.Person GROUP BY Name HAVING COUNT(*) > 1.

If you've got millions of persons in your list, and you only have the name to play with, it's probably going to give you a high proportion of false positives. You need tie-breaker columns - National ID, address elements.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Raj Deep Alapati Ten Centuries Points: 1118 More actions · Answer 9

Hi Chris,

Nice to hear from you.. I am have gone through couple of articles posted by you. This requirement is unavoidable to use like operator. Finally i got ready to use this batch statement. I request you members to help me in optimising,reduce and include the steps necessary to get this work done in minimum time

Declare @MaxCounter Int

Declare @URLID Int

Declare @PrevURLID Int

Select @MaxCounter= MAX(URLID) from DI_URL

SET @URLID=1

SET @PrevURLID=1

While @URLID<=@MaxCounter

Begin

Update URL2

Set ExclusionID= (Select Top 1 ID

From URL_EXCLUDE2 B

WHERE URLNAME lIKE B.[RULE])

FROM DI_URL2

where URLID>@PrevURLID and URLID<=@URLID

Set @PrevURLID= @URLID

SET @URLID = @URLID+100

End

For sample i tried with 200 records .even it is taking hell lot of time of more than 5 mins which i cant afford for 5 million records

About URL2 . there was primary clustered index on URLID

About URL_EXCLUDE2 .there was no indexes

Thanks in advance

Rajdeep

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 10

Eugene Elutin (7/29/2010)
Can you test how long the following query will run and how many records does it returns:
SELECT a.*
FROM dbo.Person a
LEFT JOIN dbo.Person_exclusion b
ON a.Name LIKE b.Name_similar
WHERE b.Name_similar IS NULL
You should understand that the most of time taken by your query is for resolving LIKE join. You can only significantly reduce this time by minimazing time for the join, that is why you need to do it in batches

Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 11

Chris Morris-439714 (7/29/2010)
Eugene Elutin (7/29/2010)
Can you test how long the following query will run and how many records does it returns:
SELECT a.*
FROM dbo.Person a
LEFT JOIN dbo.Person_exclusion b
ON a.Name LIKE b.Name_similar
WHERE b.Name_similar IS NULL
You should understand that the most of time taken by your query is for resolving LIKE join. You can only significantly reduce this time by minimazing time for the join, that is why you need to do it in batches
Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?

Chris, I think that OP has a few issues there.

First, we should remember that his ultimate aim is deleting of records.

Now, if they have 'Ed%' and 'Edwa%' in Name_similar, they should try

to dedupe it first, as the 'Edwa%' is subset of 'Ed%' and such deduping will reduce number of required records in their "Exclusion" table.

I have asked OP to perform the above query, as if the timing is acceptable and number of records returned is not large, it could be faster to insert the affected records into temp table then truncate original one and reinsert records back...

There is another advanced method they could use, it's bitmasking the name to quickly exclude "non-matches"...

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 12

Eugene Elutin (7/29/2010)
Chris Morris-439714 (7/29/2010)
Eugene Elutin (7/29/2010)
Can you test how long the following query will run and how many records does it returns:
SELECT a.*
FROM dbo.Person a
LEFT JOIN dbo.Person_exclusion b
ON a.Name LIKE b.Name_similar
WHERE b.Name_similar IS NULL
You should understand that the most of time taken by your query is for resolving LIKE join. You can only significantly reduce this time by minimazing time for the join, that is why you need to do it in batches
Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?
Chris, I think that OP has a few issues there.
First, we should remember that his ultimate aim is deleting of records.
Now, if they have 'Ed%' and 'Edwa%' in Name_similar, they should try
to dedupe it first, as the 'Edwa%' is subset of 'Ed%' and such deduping will reduce number of required records in their "Exclusion" table.
I have asked OP to perform the above query, as if the timing is acceptable and number of records returned is not large, it could be faster to insert the affected records into temp table then truncate original one and reinsert records back...
There is another advanced method they could use, it's bitmasking the name to quickly exclude "non-matches"...

I've just had a quick play with a table of over 1 million random-ish Western last names. I took a DISTINCT subset of all those with only three letters, then matched it back to the original table using LIKE. Around 3/4 of the original table matched. I think the logic needs investigation.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Eugene Elutin SSC Guru Points: 59322 More actions · Answer 13

Chris, it's look like OP data is not about person names, its URL names...

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)

How to post your question to get the best and quick help[/url]

Raj Deep Alapati Ten Centuries Points: 1118 More actions · Answer 14

Hi Eugine,

Yes,they are URL Names .i cant seggregate them with some criteria.. What i was thinking among these three approaches

1) First use Exists clause and update the ExclusionID with 0 or 1. After this i will operate only on records which are 1 (matched)

2) Tweak the Exclusion table into multiple subsets and operate in batch

3) I want it to be suggested by you 🙂

Thanks

Raj Deep