February 11, 2008 at 2:10 pm
Thanks Journeyman
That made sense! Just wondering though.....
If, for table A (the one that has the duplicates), I just run a DISTINCT query that returns only one record per duplicate (based on name, ID, email etc.),
then for table B I take the remaining IDs that are unique (that is, not in the distinct set),
and join tables A and B into a new table C,
I get a new table free of duplicates.
I'm trying out these queries at the moment.
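Roughly, they look something like this (TableA, TableB, TableC and the column names are just stand-ins for my real ones):

-- 1) One copy of each duplicate from table A goes into the new table C
SELECT DISTINCT Name, Id, Email
INTO   TableC
FROM   TableA;

-- 2) Then add the rows from table B whose Ids are not already in C
INSERT INTO TableC (Name, Id, Email)
SELECT Name, Id, Email
FROM   TableB
WHERE  Id NOT IN (SELECT Id FROM TableC);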
Thanks for your suggestions ...
February 13, 2008 at 10:01 am
I've deduped data before, but only manually (pressed for time...).
I would like to use SSIS for what it's really meant for, though...
May 9, 2008 at 4:04 am
Hello,
I did the same thing that you recommended, but I got a warning message which said:
"The buffer manager detected that the system was low on virtual memory, but
was unable to swap out any buffers. 16 buffers were considered and 16 were
locked. Either not enough memory is available to the pipeline because not
enough is installed, other processes are using it, or too many buffers are
locked."
My server is running SQLSERVER 2003:
MEMORY: 1 GB
PENTIUM III 512 MHz
I know it's an old setup, but my company uses it for testing and unfortunately I don't have the option to change it.
I tried limiting the memory in my package (256 MB) but I got the same warning message.
Do you have any idea how I can solve my problem?
Thank you 😉
May 9, 2008 at 12:06 pm
Could you give a little more detail on the issue you are facing, i.e. how many tables, how much data, etc., and what SSIS package are you using?
May 12, 2008 at 4:59 pm
Hello,
I have a table, TempClient, which contains duplicate records. To clean that table I built a process composed of:
- a query with aggregation on LastName, FirstName, Address, ZipCode and City (roughly the query sketched below)
- a Fuzzy Lookup whose reference table is the "aggClient" table, which contains only 3 de-duplicated fields: LastName, FirstName and ZipCode
- then a Conditional Split on Similarity >= 0.8 to retrieve the unique clients
- then a Fuzzy Grouping followed by a Conditional Split where the restriction is key_in == key_out
- finally an aggregation to retrieve the unique clients
- and an insert into my Client table
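The aggregation in that first step is roughly this (the column names are the ones listed above; it's only a sketch of the real query):

-- Collapse exact duplicates before the Fuzzy Lookup
SELECT   LastName, FirstName, Address, ZipCode, City
FROM     TempClient
GROUP BY LastName, FirstName, Address, ZipCode, City;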
When I ran that process on 1,000 records, or on 4 months of data, it worked,
but when I ran it on all the data, I got a memory allocation error.
My TempClient table contains more than 1,300,000 records.
Do you have any idea about my problem?
Thank you
hani 😉
May 16, 2008 at 11:00 am
Let me see if I have understood correctly: basically you're having a problem when you do a Fuzzy Lookup on a larger dataset.
So, one option would be to keep the buffer size as small as possible (the DefaultBufferSize and DefaultBufferMaxRows properties on the Data Flow task, if I remember right) and make sure the buffers are spooled to a drive that has plenty of free space (BufferTempStoragePath).
Secondly, this is what worked for me: I used T-SQL procedures to initially clean the dataset (when I first began, the data was about 4 years old and there were 45 million records!). After cleaning the data and building the routines to be automated, I am now operating on a much smaller dataset (only about 60K records or so) that I pull every week.
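Just to give you an idea, a stripped-down version of the kind of routine I mean is below; the table and column names are invented for the example, and it needs SQL Server 2005 or later for ROW_NUMBER():

-- Keep one row per logical customer and delete the rest.
-- dbo.CustomerStaging and its columns are made-up names for illustration.
WITH Ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY LastName, FirstName, Email   -- what counts as a duplicate
               ORDER BY     CreatedDate DESC             -- keep the most recent copy
           ) AS rn
    FROM   dbo.CustomerStaging
)
DELETE FROM Ranked
WHERE  rn > 1;   -- everything after the first copy is a duplicate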
I have started to use SSIS, but it's still slow.
Hope that helps...