January 7, 2011 at 10:08 am
Requirement : Storing a XML piece of data in hashed format in a column. This will be a PK column.
Data : The XML data might have fixed format of tags. But the values in tags will differ.
This probably changes the length of Hash generated.
At first glance, it appears obvious that choice could be VARBINARY which have flexibility of variable lengths.
But looking into fact that this column will be a PK column & data searches will happen on WHERE for this column.
So thinking of trade-offs as to should I go for ...
VARBINARY though PK and a Search column
OR
BINARY though size may not be exact but between certain variation??
Any ideas? Please let me know.
January 7, 2011 at 11:14 am
I'm a bit confused...the PK will be the hashed binary/varbinary column? And you need to search the column? That doesn't sound right at all.
I don't think you can search for any value that exists inside a hash. Isn't it true you cannot "unhash" a value, just compare a hash of a known value now to a hash that is stored, looking for an exact match? and not "part" of a hash?
You could use synchronous encryption, so you can extract out the values back again, but you'd have to unencrypt every column just to run a search.
Am I reading your plan wrong?
Lowell
January 7, 2011 at 11:31 am
Hi Lowell
You are correct in reading the details. May be my details were not complete
Yes it is surprising but still a requirement.
The field comes in as XML... Needs to hashed & then stored.
So thought of Hashing and storing as BINARY
While search, the search parameter comes in again as XML.
So to search, the plan is to hash search parameter value and then compare.
I have tested and this works perfect.
There is no requirement of un-hash and return the value this specific hashed column.
Hashed column acts as a key for returning a record with other columns in the record
Hashing will be done using any of the available alogorithm such as MD2, MD4, MD5, SHA, or SHA1
Now the main question remains is that incoming XML may vary, but not too much. So should it be stored as BINARY with certain size or VARBINARY
I understand in general, if we know in advance that size is fixed then suggested way would be to go for BINARY instead of VAR
But scenario is that the XML hash size may not turn out be exact fixed for all records...
It may vary to certain extent though may not be to a very large extent
Thanks
January 7, 2011 at 11:35 am
I'm with Lowell on this one, using a binary for a PKey raised flags for me immediately, I cannot think of a case where I would ever do this.
As far as varbinary vs. binary, if the value is variable in length then varbinary makes sense, however, again, I would be *extremely* reluctant to consider using a binary value let alone a variable length column for a PKey.
Hashes are usually (always?) one-way functions, you give it a value, it returns a hash, and a hash of two very similar values may or may NOT be even remotely similar. So I'm not sure what you hope to get out of a partial hash, I can almost guarantee it doesn't work how you think it will. By and large, as Lowell said, you can compare a known hash against an unknown has and see if they match, if they do you can generally assume they are generated from the same value.
Please give additional detail if we have missed something.
CEWII
January 7, 2011 at 6:20 pm
Hi Elliott, Lowell
Here is the psuedo code that I tried to come up with.
Question is What approproriate size of "InfoHash" BINARY should be?
I tried sampling and looks like BINARY(20) is the the one that works
Even if assuming XML of field InfoXML is 10 times as big than what below, the hash of it still fits in BINARY(20).
I still need some of your expert comments/suggestions for BINARY(20) about the idea of going with SIZE as 20.
Also with following psuedo sample details, your comments/suggestions on having InfoHash as PK
Or any other better way with this requirement.
----Create table
CREATE TABLE dbo.PersonsInfo(InfoHash BINARY(20), InfoXML XML, ColXML1 XML, ColXML2 XML, DateCreated DATETIME)
GO
--Create a PK (Not worried for now if it would be CLUSTERED or NONCLUSTERED)
ALTER TABLE [dbo].[PersonsInfo]
ADD CONSTRAINT [PK_PersonsInfo_InfoHash_DateCreated]
PRIMARY KEY CLUSTERED (InfoHash, DateCreated)
----Sample record insert
DECLARE @InfoXML XML
SET @InfoXML = '<?xml version="1.0" encoding="utf-8"?>
<Person xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:SchemaLocation="http://cdn.mysite.com/schemas/PersonService/1/Person.xsd">
<FName>Tom</FName>
<MName>Dick</MName>
<LName>Harry</LName>
<DOB>01/01/1965</DOB>
<SSN>123456789</SSN>
</Person>'
----Insert Data
INSERT INTO PersonsInfo(InfoHash, InfoXML, ColXML1, ColXML2, DateCreated)
SELECT HASHBYTES('SHA1', CONVERT(VARCHAR(MAX), @InfoXML)) AS 'InfoHash'
, @InfoXML AS 'InfoXML'
, '<test>hello some XML 1</test>' AS 'ColXML1'
, '<test>hello some XML 2</test>' AS 'ColXML2'
, GETDATE() AS 'DateCreated'
----Get data
SELECT * FROM PersonsInfo (NOLOCK)
WHERE InfoHash = HASHBYTES('SHA1', CONVERT(VARCHAR(MAX), @InfoXML))
January 7, 2011 at 6:32 pm
So you are planning to use the hash of an xml fragment as a part of a Primary Key.
What happens when the xml differs only in whitespace or layout ?
The xml is still equivalent, but the hash will change and your lookup will fail.
Is there not a key within the XML fragment that can be used instead of a hash?
Edit: Also - are you not concerned that a hash is a really bad primary key? it's not ordered in any way?
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
January 7, 2011 at 6:51 pm
I am concerned with hash as a bad primary key. But no choice.
Calling program have limitaions to provide person's info as XML to store, and then send the person's XML to search for any existing record.
Maintaining the consistency of XML that was passed while insert and sending the same XML for search is responsibility of calling application/program.
Plus by using a algorithm while hashing, and then using the same algorithm while serach/compare... Do you think still there can be issues?
Also if any other better solution if you can suggest would be appreciated.
Thanks
Deepak
January 8, 2011 at 1:54 am
My advice would be to extract the person's data from the xml and store that in columns in the database.
Then do exactly the same for a query - extract the data from the xml for the primary key and then lookup in your database table.
Hashing the xml and using that is just horrible IMHO and will almost certainly let you down.
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply