Importing and normalizing

Question

Post reply

Importing and normalizing

Mark Harley

SSCrazy

Points: 2259
More actions
May 27, 2019 at 2:37 pm

#3647568

I have a single table formed from an import, which contains a variety of data on files. I want to insert that data into my normalized table structure, but I'm unsure on the best way to do that. The import table contains the following:
Import
Category tinyint,
MD5_Hash varchar(500),
SHA1_Hash varchar(500),
UpdateDateTime datetime,
ImportRecordId int

The target tables are set up as follows:
Media
Id int primary key identity,
Category tinyint,
CreatedDatetime datetime,
UpdatedDatetime datetime

Hashes
MediaId int,
HashType tinyint,
HashValue varchar(500)

I have to pull the values from the import table, and insert them into the Media and Hashes table if they don't already exist there. If they do exist in the target tables then I need to simply update them. Updates are only needed if a category changes or another hash is added. The MD5 is available for all records, but they other hashes may or may not be present.

What's the best way to tackle this? My first thought was to check if the MD5 hash exists and insert if it does not. Is that the best/only option?

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 247224 More actions · Answer 1

Some sort of MERGE will probably do the trick.

What is the relationship between the import table and the other two?

That is, exactly how do you determine from one imported row whether an INSERT or UPDATE is required, for each table.

Mark Harley SSCrazy Points: 2259 More actions · Answer 2

The only way to determine whether an insert or update is needed is the presence of the MD5 hash in the Hashes table. If it already exists, then the corresponding record also exists in the Media table. If not, then it is a new file and needs to be added to both. Other hashes are inconsistently added so they can't be used, and there aren't any unique values aside from the hashes.

The import table is a dump of all of all the data, and I'm just trying to normalize it into the other two tables. There are additional fields in the import table that I haven't included above, but they aren't unique to each record either.

Phil Parkin SSC Guru Points: 247224 More actions · Answer 3

So the link is from Hashes(MediaId) to Media(Id), is that correct?

If so, is this a one-to-one relationship?

If so, I don't see a lot of point in having it as a separate table ... can you explain your rationale for splitting it out?

Mark Harley SSCrazy Points: 2259 More actions · Answer 4

No, it will be one-to-many in most cases. For some of these files, only the MD5 has was generated, but for others the SHA1 and PhotoDNA hashes were also generated for example. The missing hashes may be added at some point in the future or the category (stored in the Media table) might change, which is why I need the update functionality instead of just insert. Does that make sense?

Phil Parkin SSC Guru Points: 247224 More actions · Answer 5

Mark Harley wrote:

No, it will be one-to-many in most cases. For some of these files, only the MD5 has was generated, but for others the SHA1 and PhotoDNA hashes were also generated for example. The missing hashes may be added at some point in the future or the category (stored in the Media table) might change, which is why I need the update functionality instead of just insert. Does that make sense?

Some sense, but not 100%! How can one row in the Hashes table link to many in the Media table? And conversely, how can one row in the Media table link to a many rows in the Hashes table. From the table structure provided, these relationships are not clear.

Mark Harley SSCrazy Points: 2259 More actions · Answer 6

There should only be one record for each file in the Media table, and there could be several hashes for each file. The complexity occurs when another import is conducted. This could introduce another record for the file, which could contain updated fields or new hashes. Ultimately, there should be one Media record to many Hashes records. Does that make things clearer?

Unfortunately I have little control over the shape and content of the Import table. Additional fields will be added from time to time, and new data will be provided regularly, but ultimately the relationship won’t change.