July 29, 2010 at 4:13 am
Hii all..!!
I have Two tables named dbo.Person and dbo.Person_exclusion.These two tables contain millions of records.
Structure of dbo.Person
SNo | Name | Age | ExclusionID
Structure of dbo.Person_Exclusion
Name_similar | ID
So my basic requirement is to
update ( dbo.Person.ExclusionID) with (dbo.Person_Exclusion.ID ) when
(dbo.Person.Name) like (dbo.Person_Exclusion.Name_similar)
For Example:
sample records in dbo.Person:
SNo | Name | Age | ExclusionID
1 | Santhosh | 22 | Null
2 | Sandeep | 21 | Null
3 | Raju | 20 | Null
4 | Raja |18 | Null
5 | Rajesh | 25 | Null
sample records in dbo.Person_exclusion:
Name_similar | ID
sa% | 56
san% | 53
sand% | 57
%fgbm | 89
raj% | 84
raje% | 76
ghji%plk | 63
Resultant set should be:
SNo | Name | Age | ExclusionID
1 | Santhosh | 22 | 53
2 | Sandeep | 21 | 57
3 | Raju | 20 | 84
4 | Raja |18 | 84
5 | Rajesh | 25 | 76
I tried in following way :
update a set a.ExclusionID = b.ID
FROM dbo.Person a
join dbo.Person_exclusion b
on a.Name like b.Name_similar
But it is taking loooooooong time to execute.Almost hours to execute.So can u please suggest new and best performance way which fulfills my requirement.
THANKS IN ADVANCE
July 29, 2010 at 4:22 am
U are joining the tables on Varchar field that to not with matching conditionds....
Joining with the like operation ....from above scenario it creates more records...
like 'san%'
It will joins all name starts with 'san'
like 'sa%'
It will joins all names starts with sa and the all above also...
So the joining condidtion creating more records.
Better to use join with some identity or ID field...
Create index on the table.
Thanks,
Sasidhar Pulivarthi
July 29, 2010 at 4:35 am
Santhosh Yamsani (7/29/2010)
...I tried in following way :
update a set a.ExclusionID = b.ID
FROM dbo.Person a
join dbo.Person_exclusion b
on a.Name like b.Name_similar
But it is taking loooooooong time to execute.Almost hours to execute.So can u please suggest new and best performance way which fulfills my requirement.
THANKS IN ADVANCE
1st. Join on LIKE cannot be expected to be very fast. Especially as some of yours "like" values starts from wild character, the indices will not be used.
2nd. How many records is selected for update? If millions in one go, then you cannot expect it happen too fast as well.
Can you advise how many records you have in dbo.Person_exclusion table?
What you can try to do is separate your queries based on Name_similar values in to two: first will only operate with Name_similar values which have wildchar at the end, so index can be used, and the second one the rest of Name_similar.
Also you can try to update in batches: for example 100K of records at the time.
July 29, 2010 at 4:39 am
(On behalf of santosh)
Hi
We understand the join from left side is returning more records. so we have a slight change in our requirement to return first record matched
Supppose in above case Santosh is matching with sa% (first record getting matched,ignore the other records ).so the id of the record is 56
I want to update the Exclusion id for 56..
Can you suggest me a best way in handling this. Joining on varchar column is unavoidable.
Thanks
Raj Deep
July 29, 2010 at 4:44 am
There are 5 million records in person table and 4 million in person exclude table
We are trying in different ways but couldn't find a convincing solution. please suggest what can be the best option
Thanks
Raj Deep
July 29, 2010 at 5:06 am
Based on your "new" requirements, I suggest you to read my previous post which can give you an idea of how to try to speed up your process...
To make it faster you will need to split your query.
Split set of data in your Person_exclusion based on the Position of wildchar and the length of the "like" value into separate temp tables (use "select into for speed"), then join to them for update of your person table, add the filter to only update records which doesn't have ExclusionID set. You might also need to do it in batches of 100K
You will need to test it to find the most optimal split option.
July 29, 2010 at 6:34 am
Hii all..!!
Actually dbo.person structure is
SNo | Name | Age |
and Structure of dbo.Person_exclusion is
| Name_Similar |
My aim is : To Exclude all the names from dbo.person which matches with Name_similar in dbo.Person_exclusion.
So for that I added one column to both tables i.e.,( ID(row_number) to dbo.Person_exclusion and ExclusionID(NULL) to dbo.Person) and i updated the ExclusionId with ID where it Name and Name_similar matches..
Then i can exclude all the names by using a condtion ExclusionID not null...
This is my approach.
Can I do the Exclusion process in any other way..As my process taking loong time..Can any one suggest new way of approach for this EXCLUSION process....
Thanks in ADVANCE
July 29, 2010 at 6:59 am
Can you test how long the following query will run and how many records does it returns:
SELECT a.*
FROM dbo.Person a
LEFT JOIN dbo.Person_exclusion b
ON a.Name LIKE b.Name_similar
WHERE b.Name_similar IS NULL
You should understand that the most of time taken by your query is for resolving LIKE join. You can only significantly reduce this time by minimazing time for the join, that is why you need to do it in batches
July 29, 2010 at 7:04 am
Santhosh Yamsani (7/29/2010)
-- update a set a.ExclusionID = b.IDSELECT COUNT(*)
FROM dbo.Person a
join dbo.Person_exclusion b
on a.Name like b.Name_similar
Try running this statement to see the size of the working table. That's why your query is slow.
Get some idea of the dupes within your table with -
SELECT Name, COUNT(*) FROM dbo.Person GROUP BY Name HAVING COUNT(*) > 1.
If you've got millions of persons in your list, and you only have the name to play with, it's probably going to give you a high proportion of false positives. You need tie-breaker columns - National ID, address elements.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 29, 2010 at 8:11 am
Hi Chris,
Nice to hear from you.. I am have gone through couple of articles posted by you. This requirement is unavoidable to use like operator. Finally i got ready to use this batch statement. I request you members to help me in optimising,reduce and include the steps necessary to get this work done in minimum time
Declare @MaxCounter Int
Declare @URLID Int
Declare @PrevURLID Int
Select @MaxCounter= MAX(URLID) from DI_URL
SET @URLID=1
SET @PrevURLID=1
While @URLID<=@MaxCounter
Begin
Update URL2
Set ExclusionID= (Select Top 1 ID
From URL_EXCLUDE2 B
WHERE URLNAME lIKE B.[RULE])
FROM DI_URL2
where URLID>@PrevURLID and URLID<=@URLID
Set @PrevURLID= @URLID
SET @URLID = @URLID+100
End
For sample i tried with 200 records .even it is taking hell lot of time of more than 5 mins which i cant afford for 5 million records
About URL2 . there was primary clustered index on URLID
About URL_EXCLUDE2 .there was no indexes
Thanks in advance
Rajdeep
July 29, 2010 at 8:57 am
Eugene Elutin (7/29/2010)
Can you test how long the following query will run and how many records does it returns:SELECT a.*
FROM dbo.Person a
LEFT JOIN dbo.Person_exclusion b
ON a.Name LIKE b.Name_similar
WHERE b.Name_similar IS NULL
You should understand that the most of time taken by your query is for resolving LIKE join. You can only significantly reduce this time by minimazing time for the join, that is why you need to do it in batches
Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 29, 2010 at 9:33 am
Chris Morris-439714 (7/29/2010)
Eugene Elutin (7/29/2010)
Can you test how long the following query will run and how many records does it returns:SELECT a.*
FROM dbo.Person a
LEFT JOIN dbo.Person_exclusion b
ON a.Name LIKE b.Name_similar
WHERE b.Name_similar IS NULL
You should understand that the most of time taken by your query is for resolving LIKE join. You can only significantly reduce this time by minimazing time for the join, that is why you need to do it in batches
Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?
Chris, I think that OP has a few issues there.
First, we should remember that his ultimate aim is deleting of records.
Now, if they have 'Ed%' and 'Edwa%' in Name_similar, they should try
to dedupe it first, as the 'Edwa%' is subset of 'Ed%' and such deduping will reduce number of required records in their "Exclusion" table.
I have asked OP to perform the above query, as if the timing is acceptable and number of records returned is not large, it could be faster to insert the affected records into temp table then truncate original one and reinsert records back...
There is another advanced method they could use, it's bitmasking the name to quickly exclude "non-matches"...
July 29, 2010 at 10:23 am
Eugene Elutin (7/29/2010)
Chris Morris-439714 (7/29/2010)
Eugene Elutin (7/29/2010)
Can you test how long the following query will run and how many records does it returns:SELECT a.*
FROM dbo.Person a
LEFT JOIN dbo.Person_exclusion b
ON a.Name LIKE b.Name_similar
WHERE b.Name_similar IS NULL
You should understand that the most of time taken by your query is for resolving LIKE join. You can only significantly reduce this time by minimazing time for the join, that is why you need to do it in batches
Hi Eugene, looks like I'm treading on your toes here mate, and I'm in complete agreement with your thinking so far. What I'd suggest from here is firstly a dupecount of both tables, and then maybe working one table against an aggregate of the other?
Chris, I think that OP has a few issues there.
First, we should remember that his ultimate aim is deleting of records.
Now, if they have 'Ed%' and 'Edwa%' in Name_similar, they should try
to dedupe it first, as the 'Edwa%' is subset of 'Ed%' and such deduping will reduce number of required records in their "Exclusion" table.
I have asked OP to perform the above query, as if the timing is acceptable and number of records returned is not large, it could be faster to insert the affected records into temp table then truncate original one and reinsert records back...
There is another advanced method they could use, it's bitmasking the name to quickly exclude "non-matches"...
I've just had a quick play with a table of over 1 million random-ish Western last names. I took a DISTINCT subset of all those with only three letters, then matched it back to the original table using LIKE. Around 3/4 of the original table matched. I think the logic needs investigation.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 29, 2010 at 11:14 am
Chris, it's look like OP data is not about person names, its URL names...
July 29, 2010 at 11:59 pm
Hi Eugine,
Yes,they are URL Names .i cant seggregate them with some criteria.. What i was thinking among these three approaches
1) First use Exists clause and update the ExclusionID with 0 or 1. After this i will operate only on records which are 1 (matched)
2) Tweak the Exclusion table into multiple subsets and operate in batch
3) I want it to be suggested by you 🙂
Thanks
Raj Deep
Viewing 15 posts - 1 through 15 (of 20 total)
You must be logged in to reply to this topic. Login to reply