LIKE operator with millions of records giving slow performance

  • Hi all!

    I have two tables named dbo.Person and dbo.Person_exclusion. These two tables contain millions of records.

    Structure of dbo.Person

    SNo | Name | Age | ExclusionID

    Structure of dbo.Person_Exclusion

    Name_similar | ID

    So my basic requirement is to update dbo.Person.ExclusionID with dbo.Person_Exclusion.ID when dbo.Person.Name LIKE dbo.Person_Exclusion.Name_similar.

    For Example:

    sample records in dbo.Person:

    SNo | Name | Age | ExclusionID

    1 | Santhosh | 22 | Null

    2 | Sandeep | 21 | Null

    3 | Raju | 20 | Null

    4 | Raja | 18 | Null

    5 | Rajesh | 25 | Null

    sample records in dbo.Person_exclusion:

    Name_similar | ID

    sa% | 56

    san% | 53

    sand% | 57

    %fgbm | 89

    raj% | 84

    raje% | 76

    ghji%plk | 63

    Resultant set should be:

    SNo | Name | Age | ExclusionID

    1 | Santhosh | 22 | 53

    2 | Sandeep | 21 | 57

    3 | Raju | 20 | 84

    4 | Raja | 18 | 84

    5 | Rajesh | 25 | 76

    I tried the following:

    update a set a.ExclusionID = b.ID

    FROM dbo.Person a

    join dbo.Person_exclusion b

    on a.Name like b.Name_similar

    But it is taking a very long time to execute, almost hours. So can you please suggest a better-performing approach that fulfils my requirement?

    THANKS IN ADVANCE

  • You are joining the tables on a varchar field, and not with exact matching conditions.

    Joining with the LIKE operator, as in the scenario above, produces far more rows.

    like 'san%'

    joins every name starting with 'san'.

    like 'sa%'

    joins every name starting with 'sa', including all of the above as well.

    So the join condition produces many more rows.

    It is better to join on an identity or ID field.

    Create an index on the table.
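
    For example (a rough sketch; the index names are placeholders, and an index on Name can only help the LIKE join when the pattern does not start with a wildcard):

    -- Index on the probe side; only useful for patterns without a leading wildcard
    CREATE NONCLUSTERED INDEX IX_Person_Name
        ON dbo.Person (Name) INCLUDE (ExclusionID);

    -- Covering index on the pattern table so the join does not touch the base table
    CREATE NONCLUSTERED INDEX IX_PersonExclusion_NameSimilar
        ON dbo.Person_exclusion (Name_similar) INCLUDE (ID);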

    Thanks,

    Sasidhar Pulivarthi

  • Santhosh Yamsani (7/29/2010)


    ...

    I tried the following:

    update a set a.ExclusionID = b.ID

    FROM dbo.Person a

    join dbo.Person_exclusion b

    on a.Name like b.Name_similar

    But it is taking a very long time to execute, almost hours. So can you please suggest a better-performing approach that fulfils my requirement?

    THANKS IN ADVANCE

    1st: A join on LIKE cannot be expected to be very fast, especially as some of your LIKE values start with a wildcard character, so indexes will not be used.

    2nd: How many records are selected for update? If millions in one go, then you cannot expect it to be fast either.

    Can you advise how many records you have in the dbo.Person_exclusion table?

    What you can try is to split your query into two based on the Name_similar values: the first will operate only on Name_similar values which have the wildcard at the end, so an index can be used, and the second will handle the rest of Name_similar.

    You can also try to update in batches: for example, 100K records at a time.
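
    A rough sketch of that split (assuming the patterns contain no escaped wildcards; the temp table names are placeholders, and whether the optimizer actually uses an index seek is plan-dependent, so check the execution plan):

    -- Patterns whose only wildcard is a single trailing %, e.g. 'san%'
    SELECT Name_similar, ID
    INTO #TrailingWildcard
    FROM dbo.Person_exclusion
    WHERE Name_similar LIKE '%[%]'                                         -- ends with %
      AND LEFT(Name_similar, LEN(Name_similar) - 1) NOT LIKE '%[%_]%';     -- no other wildcards

    -- Everything else, e.g. '%fgbm' or 'ghji%plk'
    SELECT Name_similar, ID
    INTO #OtherPatterns
    FROM dbo.Person_exclusion
    WHERE NOT (Name_similar LIKE '%[%]'
               AND LEFT(Name_similar, LEN(Name_similar) - 1) NOT LIKE '%[%_]%');

    -- Index-friendly group first...
    UPDATE a
    SET a.ExclusionID = b.ID
    FROM dbo.Person a
    JOIN #TrailingWildcard b ON a.Name LIKE b.Name_similar
    WHERE a.ExclusionID IS NULL;

    -- ...then the rest, which will have to scan
    UPDATE a
    SET a.ExclusionID = b.ID
    FROM dbo.Person a
    JOIN #OtherPatterns b ON a.Name LIKE b.Name_similar
    WHERE a.ExclusionID IS NULL;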

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • (On behalf of Santhosh)

    Hi

    We understand that the join from the left side returns more records, so we have a slight change to our requirement: return the first record matched.

    Suppose, in the case above, Santhosh matches sa% (the first record matched; ignore the other records), so the ID of that record is 56.

    I want to update the ExclusionID to 56.

    Can you suggest the best way to handle this? Joining on a varchar column is unavoidable.
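
    A rough illustration of what we mean, using the sample tables above (note that "first" only makes sense with an explicit ordering; ordering by ID here is just an assumption):

    UPDATE a
    SET a.ExclusionID = x.ID
    FROM dbo.Person a
    CROSS APPLY ( SELECT TOP (1) b.ID
                  FROM dbo.Person_exclusion b
                  WHERE a.Name LIKE b.Name_similar
                  ORDER BY b.ID ) x              -- "first" = lowest ID; adjust as required
    WHERE a.ExclusionID IS NULL;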

    Thanks

    Raj Deep

  • There are 5 million records in the Person table and 4 million in the Person_exclusion table.

    We have tried different ways but couldn't find a convincing solution. Please suggest what the best option might be.

    Thanks

    Raj Deep

  • Based on your "new" requirements, I suggest you read my previous post, which should give you an idea of how to speed up your process...

    To make it faster you will need to split your query.

    Split the data in your Person_exclusion table into separate temp tables based on the position of the wildcard and the length of the LIKE value (use SELECT INTO for speed), then join to them to update your Person table, adding a filter so that you only update records which don't have ExclusionID set yet. You might also need to do it in batches of 100K.

    You will need to test it to find the most optimal split option.
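
    A very rough sketch of the batching part (the variable names and batch size are placeholders; it assumes SNo is the clustered key of dbo.Person, and the join could equally be against one of the temp tables from the split):

    DECLARE @BatchSize int
    DECLARE @FromSNo   int
    DECLARE @MaxSNo    int

    SET @BatchSize = 100000
    SET @FromSNo   = 0
    SELECT @MaxSNo = MAX(SNo) FROM dbo.Person

    WHILE @FromSNo < @MaxSNo
    BEGIN
        UPDATE a
        SET a.ExclusionID = b.ID
        FROM dbo.Person a
        JOIN dbo.Person_exclusion b
          ON a.Name LIKE b.Name_similar
        WHERE a.SNo > @FromSNo
          AND a.SNo <= @FromSNo + @BatchSize
          AND a.ExclusionID IS NULL             -- only touch rows not yet matched

        SET @FromSNo = @FromSNo + @BatchSize
    END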

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Hi all!

    Actually, the structure of dbo.Person is

    SNo | Name | Age |

    and the structure of dbo.Person_exclusion is

    | Name_Similar |

    My aim is to exclude from dbo.Person all the names which match Name_similar in dbo.Person_exclusion.

    So for that I added one column to each table (ID, a row number, to dbo.Person_exclusion, and ExclusionID, initially NULL, to dbo.Person), and I update ExclusionID with ID where Name matches Name_similar.

    Then I can exclude all the names by using the condition ExclusionID IS NOT NULL.

    This is my approach.

    Can I do the exclusion process in any other way? My process is taking a long time. Can anyone suggest a different approach for this exclusion process?

    Thanks in ADVANCE

  • Can you test how long the following query runs and how many records it returns:

    SELECT a.*

    FROM dbo.Person a

    LEFT JOIN dbo.Person_exclusion b

    ON a.Name LIKE b.Name_similar

    WHERE b.Name_similar IS NULL

    You should understand that most of the time taken by your query goes into resolving the LIKE join. You can only significantly reduce this time by minimising the time spent in the join; that is why you need to do it in batches.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Santhosh Yamsani (7/29/2010)


    -- update a set a.ExclusionID = b.ID

    SELECT COUNT(*)

    FROM dbo.Person a

    join dbo.Person_exclusion b

    on a.Name like b.Name_similar

    Try running this statement to see the size of the working table. That's why your query is slow.

    Get some idea of the dupes within your table with -

    SELECT Name, COUNT(*) FROM dbo.Person GROUP BY Name HAVING COUNT(*) > 1.

    If you've got millions of persons in your list, and you only have the name to play with, it's probably going to give you a high proportion of false positives. You need tie-breaker columns - National ID, address elements.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    Nice to hear from you. I have gone through a couple of the articles you posted. For this requirement, using the LIKE operator is unavoidable. Finally, I settled on the batch statement below. I ask you members to help me optimise it and to remove or add whatever steps are necessary to get this work done in the minimum time.

    Declare @MaxCounter Int
    Declare @URLID Int
    Declare @PrevURLID Int

    -- upper bound of the URLID range to walk through
    Select @MaxCounter = MAX(URLID) from DI_URL2

    -- start before the first batch so that rows 1-100 are picked up on the first pass
    SET @PrevURLID = 0
    SET @URLID = 100

    While @PrevURLID < @MaxCounter
    Begin
        -- update one batch of 100 rows at a time
        Update URL2
        Set ExclusionID = (Select Top 1 B.ID
                           From URL_EXCLUDE2 B
                           Where URL2.URLNAME like B.[RULE])   -- Top 1 without Order By picks an arbitrary matching rule
        From DI_URL2 URL2
        Where URL2.URLID > @PrevURLID and URL2.URLID <= @URLID

        Set @PrevURLID = @URLID
        Set @URLID = @URLID + 100
    End

    As a sample I tried it with 200 records; even that takes a lot of time, more than 5 minutes, which I can't afford for 5 million records.

    About URL2: there is a clustered primary key index on URLID.

    About URL_EXCLUDE2: there are no indexes.

    Thanks in advance

    Rajdeep

  • Eugene Elutin (7/29/2010)


    Can you test how long the following query runs and how many records it returns:

    SELECT a.*

    FROM dbo.Person a

    LEFT JOIN dbo.Person_exclusion b

    ON a.Name LIKE b.Name_similar

    WHERE b.Name_similar IS NULL

    You should understand that most of the time taken by your query goes into resolving the LIKE join. You can only significantly reduce this time by minimising the time spent in the join; that is why you need to do it in batches.

    Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Chris Morris-439714 (7/29/2010)


    Eugene Elutin (7/29/2010)


    Can you test how long the following query runs and how many records it returns:

    SELECT a.*

    FROM dbo.Person a

    LEFT JOIN dbo.Person_exclusion b

    ON a.Name LIKE b.Name_similar

    WHERE b.Name_similar IS NULL

    You should understand that most of the time taken by your query goes into resolving the LIKE join. You can only significantly reduce this time by minimising the time spent in the join; that is why you need to do it in batches.

    Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?

    Chris, I think the OP has a few issues there.

    First, we should remember that his ultimate aim is deleting records.

    Now, if they have 'Ed%' and 'Edwa%' in Name_similar, they should try to dedupe it first, as 'Edwa%' is a subset of 'Ed%', and such deduping will reduce the number of records required in their "exclusion" table.
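
    A sketch of that dedupe for prefix-only patterns (assuming a working table such as #TrailingWildcard holding only patterns whose single wildcard is a trailing %; this is only valid if you just need to know that a name is excluded, not which rule excluded it):

    -- Remove any pattern already covered by a shorter prefix pattern:
    -- 'Edwa%' is redundant when 'Ed%' exists, because the prefix 'Edwa' itself matches 'Ed%'
    DELETE longer
    FROM #TrailingWildcard AS longer
    WHERE EXISTS ( SELECT 1
                   FROM #TrailingWildcard AS shorter
                   WHERE LEN(shorter.Name_similar) < LEN(longer.Name_similar)
                     AND LEFT(longer.Name_similar, LEN(longer.Name_similar) - 1)
                         LIKE shorter.Name_similar );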

    I have asked the OP to run the above query; if the timing is acceptable and the number of records returned is not large, it could be faster to insert the affected records into a temp table, truncate the original one, and reinsert the records back...

    There is another, more advanced method they could use: bitmasking the name to quickly exclude "non-matches"...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Eugene Elutin (7/29/2010)


    Chris Morris-439714 (7/29/2010)


    Eugene Elutin (7/29/2010)


    Can you test how long the following query runs and how many records it returns:

    SELECT a.*

    FROM dbo.Person a

    LEFT JOIN dbo.Person_exclusion b

    ON a.Name LIKE b.Name_similar

    WHERE b.Name_similar IS NULL

    You should understand that most of the time taken by your query goes into resolving the LIKE join. You can only significantly reduce this time by minimising the time spent in the join; that is why you need to do it in batches.

    Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?

    Chris, I think the OP has a few issues there.

    First, we should remember that his ultimate aim is deleting records.

    Now, if they have 'Ed%' and 'Edwa%' in Name_similar, they should try to dedupe it first, as 'Edwa%' is a subset of 'Ed%', and such deduping will reduce the number of records required in their "exclusion" table.

    I have asked the OP to run the above query; if the timing is acceptable and the number of records returned is not large, it could be faster to insert the affected records into a temp table, truncate the original one, and reinsert the records back...

    There is another, more advanced method they could use: bitmasking the name to quickly exclude "non-matches"...

    I've just had a quick play with a table of over 1 million random-ish Western last names. I took a DISTINCT subset of all those with only three letters, then matched it back to the original table using LIKE. Around 3/4 of the original table matched. I think the logic needs investigation.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Chris, it looks like the OP's data is not about person names; it's URL names...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Hi Eugene,

    Yes, they are URL names. I can't segregate them by any particular criteria. What I was thinking of is one of these three approaches:

    1) First use an EXISTS clause to update ExclusionID with 0 or 1. After this I will operate only on the records which are 1 (matched); a rough sketch follows below.

    2) Split the exclusion table into multiple subsets and operate in batches.

    3) Whatever you would suggest 🙂
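
    For option 1, something along these lines is what I have in mind (a rough sketch only, using the URL table and column names from my earlier post; 1 marks a match, 0 no match):

    -- Pass 1: flag matches with an EXISTS test instead of joining row for row
    UPDATE a
    SET a.ExclusionID = CASE WHEN EXISTS ( SELECT 1
                                           FROM URL_EXCLUDE2 b
                                           WHERE a.URLNAME LIKE b.[RULE] )
                             THEN 1 ELSE 0 END
    FROM DI_URL2 a
    WHERE a.ExclusionID IS NULL        -- still best run in batches on URLID

    -- Pass 2: work only with the flagged rows
    SELECT a.*
    FROM DI_URL2 a
    WHERE a.ExclusionID = 1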

    Thanks

    Raj Deep
