August 2, 2019 at 8:50 am
Hi
I am exploring ideas for improving data quality in a new organisation I joined. The issue is that it is not straightforward to simply rewrite things and improve them, as there are hundreds of datasets we process every week. I ran some basic checks against a few tables and found about 40 tables with duplicates. They are exact duplicates, which means I am grouping by every field and checking whether any rows appear more than once.
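For illustration, the check looks something like this (the table and column names here are just placeholders):

-- Find rows that are exact duplicates by grouping on every column
SELECT Col1, Col2, Col3, COUNT(*) AS DupCount
FROM dbo.WeeklyFeed
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1;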
This is just a start, and sometimes the best way to improve things is to go back to the source and find out why duplicates were sent. In some cases the process has been set up in a way that will produce duplicates if the logic is not right. Now I am looking at further options for running checks on the datasets.
For example, checking that the dataset received is in line with the specification (i.e. catching a string field that actually contains a float, or an int field that contains a string). Then I could check the consistency of the data (i.e. we expected 500 new records this week but only received 5). I would like to come up with a few indicators like this and then produce a score.
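As a rough sketch of the kind of indicators I have in mind (TRY_CAST requires SQL Server 2012 or later; the table and column names are made up):

-- Type-conformance indicator: rows where a supposedly numeric field fails to convert
SELECT COUNT(*) AS BadTypeRows
FROM staging.WeeklyFeed
WHERE AmountText IS NOT NULL
  AND TRY_CAST(AmountText AS INT) IS NULL;

-- Volume-consistency indicator: this week's record count vs the expected ~500
SELECT COUNT(*) AS NewRecords
FROM staging.WeeklyFeed
WHERE LoadDate >= DATEADD(DAY, -7, GETDATE());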
What kinds of checks/indicators do people put in place to improve data quality? Basically, the team needs to give the business assurance that the data is credible, and if it is not, at least an approximate idea of its credibility. Are you aware of any tools which could do this? I have looked at one called Informatica Data Quality (IDQ).
Any help would be appreciated.
August 2, 2019 at 11:50 am
Have you looked at SQL Server Data Quality Services?
https://docs.microsoft.com/en-us/sql/data-quality-services/data-quality-services
August 2, 2019 at 12:42 pm
Thanks - yes, I am aware of this tool and mentioned it, but apparently, following an assessment, it was decided that it is not a fit for our needs. That assessment was carried out about a year ago.
August 2, 2019 at 8:43 pm
There are some tools like DQS, but I think they are fairly similar. There is also MDS (Master Data Services), which could help. What exactly are you looking for help with?
What I would really look at is rebuilding the process so that it loads the datasets into a staging table and then uses a SQL process to move over the non-duplicates, discarding the duplicates.
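A minimal sketch of that staging pattern (the staging and target tables here are placeholders; adjust the PARTITION BY list to whatever columns define a duplicate for you):

-- Keep one copy of each distinct row and move it from staging to the target
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY (SELECT NULL)) AS rn
    FROM staging.WeeklyFeed
)
INSERT INTO dbo.TargetTable (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM Ranked
WHERE rn = 1;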