Importing and normalizing

  • I have a single table formed from an import, which contains a variety of data on files.  I want to insert that data into my normalized table structure, but I'm unsure on the best way to do that.  The import table contains the following:

    Import

    Category tinyint,

    MD5_Hash varchar(500),

    SHA1_Hash varchar(500),

    UpdateDateTime datetime,

    ImportRecordId int

     

    The target tables are set up as follows:

    Media

    Id int primary key identity,

    Category tinyint,

    CreatedDatetime datetime,

    UpdatedDatetime datetime

     

    Hashes

    MediaId int,

    HashType tinyint,

    HashValue varchar(500)

     

    I have to pull the values from the import table, and insert them into the Media and Hashes table if they don't already exist there.  If they do exist in the target tables then I need to simply update them.  Updates are only needed if a category changes or another hash is added.  The MD5 is available for all records, but they other hashes may or may not be present.

     

    What's the best way to tackle this? My first thought was to check if the MD5 hash exists and insert if it does not.  Is that the best/only option?

  • Some sort of MERGE will probably do the trick.

    What is the relationship between the import table and the other two?

    That is, exactly how do you determine from one imported row whether an INSERT or UPDATE is required, for each table.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • The only way to determine whether an insert or update is needed is the presence of the MD5 hash in the Hashes table.  If it already exists, then the corresponding record also exists in the Media table.  If not, then it is a new file and needs to be added to both.  Other hashes are inconsistently added so they can't be used, and there aren't any unique values aside from the hashes.

     

    The import table is a dump of all of all the data, and I'm just trying to normalize it into the other two tables.  There are additional fields in the import table that I haven't included above, but they aren't unique to each record either.

  • So the link is from Hashes(MediaId) to Media(Id), is that correct?

    If so, is this a one-to-one relationship?

    If so, I don't see a lot of point in having it as a separate table ... can you explain your rationale for splitting it out?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • No, it will be one-to-many in most cases.  For some of these files, only the MD5 has was generated, but for others the SHA1 and PhotoDNA hashes were also generated for example.  The missing hashes may be added at some point in the future or the category (stored in the Media table) might change, which is why I need the update functionality instead of just insert.  Does that make sense?

  • Mark Harley wrote:

    No, it will be one-to-many in most cases.  For some of these files, only the MD5 has was generated, but for others the SHA1 and PhotoDNA hashes were also generated for example.  The missing hashes may be added at some point in the future or the category (stored in the Media table) might change, which is why I need the update functionality instead of just insert.  Does that make sense?

    Some sense, but not 100%! How can one row in the Hashes table link to many in the Media table? And conversely, how can one row in the Media table link to a many rows in the Hashes table. From the table structure provided, these relationships are not clear.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • There should only be one record for each file in the Media table, and there could be several hashes for each file. The complexity occurs when another import is conducted. This could introduce another record for the file, which could contain updated fields or new hashes. Ultimately, there should be one Media record to many Hashes records.  Does that make things clearer?

     

    Unfortunately I have little control over the shape and content of the Import table. Additional fields will be added from time to time, and new data will be provided regularly, but ultimately the relationship won’t change.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply