BINARY vs VARBINARY ---- How to decide for a given scenario

  • Requirement : Storing a XML piece of data in hashed format in a column. This will be a PK column.

    Data : The XML data might have fixed format of tags. But the values in tags will differ.

    This probably changes the length of Hash generated.

    At first glance, it appears obvious that choice could be VARBINARY which have flexibility of variable lengths.

    But looking into fact that this column will be a PK column & data searches will happen on WHERE for this column.

    So thinking of trade-offs as to should I go for ...

    VARBINARY though PK and a Search column

    OR

    BINARY though size may not be exact but between certain variation??

    Any ideas? Please let me know.

  • I'm a bit confused...the PK will be the hashed binary/varbinary column? And you need to search the column? That doesn't sound right at all.

    I don't think you can search for any value that exists inside a hash. Isn't it true you cannot "unhash" a value, just compare a hash of a known value now to a hash that is stored, looking for an exact match? and not "part" of a hash?

    You could use synchronous encryption, so you can extract out the values back again, but you'd have to unencrypt every column just to run a search.

    Am I reading your plan wrong?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Hi Lowell

    You are correct in reading the details. May be my details were not complete

    Yes it is surprising but still a requirement.

    The field comes in as XML... Needs to hashed & then stored.

    So thought of Hashing and storing as BINARY

    While search, the search parameter comes in again as XML.

    So to search, the plan is to hash search parameter value and then compare.

    I have tested and this works perfect.

    There is no requirement of un-hash and return the value this specific hashed column.

    Hashed column acts as a key for returning a record with other columns in the record

    Hashing will be done using any of the available alogorithm such as MD2, MD4, MD5, SHA, or SHA1

    Now the main question remains is that incoming XML may vary, but not too much. So should it be stored as BINARY with certain size or VARBINARY

    I understand in general, if we know in advance that size is fixed then suggested way would be to go for BINARY instead of VAR

    But scenario is that the XML hash size may not turn out be exact fixed for all records...

    It may vary to certain extent though may not be to a very large extent

    Thanks

  • I'm with Lowell on this one, using a binary for a PKey raised flags for me immediately, I cannot think of a case where I would ever do this.

    As far as varbinary vs. binary, if the value is variable in length then varbinary makes sense, however, again, I would be *extremely* reluctant to consider using a binary value let alone a variable length column for a PKey.

    Hashes are usually (always?) one-way functions, you give it a value, it returns a hash, and a hash of two very similar values may or may NOT be even remotely similar. So I'm not sure what you hope to get out of a partial hash, I can almost guarantee it doesn't work how you think it will. By and large, as Lowell said, you can compare a known hash against an unknown has and see if they match, if they do you can generally assume they are generated from the same value.

    Please give additional detail if we have missed something.

    CEWII

  • Hi Elliott, Lowell

    Here is the psuedo code that I tried to come up with.

    Question is What approproriate size of "InfoHash" BINARY should be?

    I tried sampling and looks like BINARY(20) is the the one that works

    Even if assuming XML of field InfoXML is 10 times as big than what below, the hash of it still fits in BINARY(20).

    I still need some of your expert comments/suggestions for BINARY(20) about the idea of going with SIZE as 20.

    Also with following psuedo sample details, your comments/suggestions on having InfoHash as PK

    Or any other better way with this requirement.

    ----Create table

    CREATE TABLE dbo.PersonsInfo(InfoHash BINARY(20), InfoXML XML, ColXML1 XML, ColXML2 XML, DateCreated DATETIME)

    GO

    --Create a PK (Not worried for now if it would be CLUSTERED or NONCLUSTERED)

    ALTER TABLE [dbo].[PersonsInfo]

    ADD CONSTRAINT [PK_PersonsInfo_InfoHash_DateCreated]

    PRIMARY KEY CLUSTERED (InfoHash, DateCreated)

    ----Sample record insert

    DECLARE @InfoXML XML

    SET @InfoXML = '<?xml version="1.0" encoding="utf-8"?>

    <Person xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:SchemaLocation="http://cdn.mysite.com/schemas/PersonService/1/Person.xsd">

    <FName>Tom</FName>

    <MName>Dick</MName>

    <LName>Harry</LName>

    <DOB>01/01/1965</DOB>

    <SSN>123456789</SSN>

    </Person>'

    ----Insert Data

    INSERT INTO PersonsInfo(InfoHash, InfoXML, ColXML1, ColXML2, DateCreated)

    SELECT HASHBYTES('SHA1', CONVERT(VARCHAR(MAX), @InfoXML)) AS 'InfoHash'

    , @InfoXML AS 'InfoXML'

    , '<test>hello some XML 1</test>' AS 'ColXML1'

    , '<test>hello some XML 2</test>' AS 'ColXML2'

    , GETDATE() AS 'DateCreated'

    ----Get data

    SELECT * FROM PersonsInfo (NOLOCK)

    WHERE InfoHash = HASHBYTES('SHA1', CONVERT(VARCHAR(MAX), @InfoXML))

  • So you are planning to use the hash of an xml fragment as a part of a Primary Key.

    What happens when the xml differs only in whitespace or layout ?

    The xml is still equivalent, but the hash will change and your lookup will fail.

    Is there not a key within the XML fragment that can be used instead of a hash?

    Edit: Also - are you not concerned that a hash is a really bad primary key? it's not ordered in any way?

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • I am concerned with hash as a bad primary key. But no choice.

    Calling program have limitaions to provide person's info as XML to store, and then send the person's XML to search for any existing record.

    Maintaining the consistency of XML that was passed while insert and sending the same XML for search is responsibility of calling application/program.

    Plus by using a algorithm while hashing, and then using the same algorithm while serach/compare... Do you think still there can be issues?

    Also if any other better solution if you can suggest would be appreciated.

    Thanks

    Deepak

  • My advice would be to extract the person's data from the xml and store that in columns in the database.

    Then do exactly the same for a query - extract the data from the xml for the primary key and then lookup in your database table.

    Hashing the xml and using that is just horrible IMHO and will almost certainly let you down.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

    Viewing 8 posts - 1 through 7 (of 7 total)

    You must be logged in to reply to this topic. Login to reply