How would you plan this data cleanse?

  • Hi all,

    I've got two tables (Client and Contacts), in two separate databases, on the same server, that need to be cleaned and merged with an Excel file (that also needs to be cleaned and merged) into one large csv file.

    Without having looked at the data, the first task seems to be to show the users some 'top 100 rows' sample data so I can nail down which of their criteria apply to which columns (I think the company who supplied the system may have gone bust, and we have no docu from them).  Both tables have at least 15 columns, which is a pain to print out and tile up.  Then, I'd also need to import the Excel data - which shouldn't be a problem.

    After that, I need to select 'firstname', 'lastname' and 'email_address' (although these column names might well vary between the three tables!) from each table, so that for example, anyone in the 'Client' table who bought a day ticket OR a week ticket etc (and there could well be a column for each) appears in the result set, but NOBODY is returned who doesn't have a valid email address.

    Now, I know that I've got to take a look at the table structures, but before I do, just a few questions?

    What would be the best way to define a correct email address within SQL?  I presume that it must contain a dot and an '@', both preceeded and succeeded by a minimum of one character?

    What's the best way to print out the sample data for user review and readability?

    What should the SQL look like, roughly?

    Any 'gotcha's I need to watch out for? 

     

    Thanks in advance,

     

    Jaybee.

  • You really don't have a technically hard problem, but just a very time consuming one!

    I can give you some pointers I used in creating a multi-functional "DataLoad",

    importing process I wrote for STC (now called SEEBEYOND), which they then wrapped

    with a freindly GUI interface and added it as one of many software products.

    First, import all existing data into the database totally unchanged with all fields well labeled.

    Secondly, create a new table for storing all cleansed data.

    Thirdly, Create a stored procedure for the cleaning process,

     filter input data on read & output clean data to new table,

     continue this process until results meet your expectations.

    In your procedure list your rules & comparisons, look ups,

     and data mining if needed. As your knowledge of the data

     increases so does your stored procedure logic.

    In summary, you never have to reload the dirty data as it

    stays unchanged in the origonal table, but you drop and create

    the output table during each attempt to have clean data. Also,

    you must drop, modify, and create your stored procedure, over and

    over, and over until you get it right, then select the data to file

    called newdata.csv (using BCP). Happy Hunting!

     


    Regards,

    Coach James

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply