Need to find out what combination of columns makes each record in table unique

  • I am working with a table that I have not seen before, and do not have a clear idea on what it is doing or what the data in it really means. There are 5+ million rows in this table.  and 550 columns. However I need to create a few rather complex queries against that table.

    There is no PK, no Clustered or Unique index on the table.  There are a bunch of non-unique indexes only.

    I need to come up with a T-SQL (or may be there is one somewhere already?.. I searched and did not find ..)

    that will show me what combination of columns marks a record in this table unique.  I know for the fact there is no one single column that has unique value per row but there are definitely several columns that are unique per row together. I just don't know which ones. Has anybody done anything like that before?

    Likes to play Chess

  • There could (and probably will be) be multiple combinations of columns which are unique, based on existing data. That does not tell you which of the combinations, if any, you should choose for a clustered index. You need to find a way of getting to grips with the data enough to understand roughly what is going on, and that will help you choose the right combination.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • If the number of candidate columns is small you could exhaustively test the possible combinations using a cursor and dynamic sql.

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • VoldemarG wrote:

    I am working with a table that I have not seen before, and do not have a clear idea on what it is doing or what the data in it really means. There are 5+ million rows in this table.  and 550 columns. However I need to create a few rather complex queries against that table.

    I would think you have to have at least some idea of what data is stored in that table - else you cannot even begin to write any queries, let alone *rather complex queries* against that table.

    With that in mind - I would use SQL Search by Redgate to search that database (and any other related databases) for reference to that table.  This table sounds like it is built from a combination of other tables and I suspect that there is some code that is used to build that table.  If that isn't the case - then I would hope there is other code that references that table, and examining that code should help in identifying how the table has been previously utilized.

    If I still couldn't find out - then I would start asking associates and/or the application owner - see if they have an idea on what the data represents and what could possibly be utilized to uniquely identify the rows.  Once you have an idea on what actually sources this data, then you can start looking at individual columns to determine which ones might possibly be candidate keys.

    Finally - look at the non-clustered indexes...there might just be one (or more) that were not created as unique, but are actually meant to be the PK or unique.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • May be Identifying Relationship Between Columns In SQL can be of use?

  • Have you asked anyone???

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • VoldemarG wrote:

    I am working with a table that I have not seen before, and do not have a clear idea on what it is doing or what the data in it really means. There are 5+ million rows in this table.  and 550 columns. However I need to create a few rather complex queries against that table.

    There is no PK, no Clustered or Unique index on the table.  There are a bunch of non-unique indexes only.

    I need to come up with a T-SQL (or may be there is one somewhere already?.. I searched and did not find ..)

    that will show me what combination of columns marks a record in this table unique.  I know for the fact there is no one single column that has unique value per row but there are definitely several columns that are unique per row together. I just don't know which ones. Has anybody done anything like that before?

    Try exploring statistics.

  • Best way I can think of is to do a little data exploration so you better understand the table (500+ columns... <sigh>), then, on a TEST system, try creating a unique constraint (don't do a clustered index yet, just nonclustered for testing, avoids data movement, makes dropping the index easy). Use the combination of columns that makes the most sense based on the structure. That's how I'd do it.

    Of course, then, I'd get to work normalizing the data structure and/or fixing the star schema.

    500 columns...

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • It's difficult for me to believe that such a table exists and no one knows anything about it.  Find who does and ask them.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    It's difficult for me to believe that such a table exists and no one knows anything about it.  Find who does and ask them.

    Very good point, as usual.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • With 550 columns it's not possible to test every combination of 2 or more columns to see if they are unique. It's a combinatorial problem with a search space of 3.6855 × 10^(165). Or to be more precise:

    3685510180489786476798393145496356338786055879312930105836138965083617346086082863365358130056307390177215209990980317284932211552660930305235775636164742230126362073

    I think Jeff has the right suggestion of finding someone who might know.

    Or maybe you could add an identity column and make it the primary key?

  • Jonathan AC Roberts wrote:

    With 550 columns it's not possible to test every combination of 2 or more columns to see if they are unique. It's a combinatorial problem with a search space of 3.6855 × 10^(165). Or to be more precise:

    3685510180489786476798393145496356338786055879312930105836138965083617346086082863365358130056307390177215209990980317284932211552660930305235775636164742230126362073

    I think Jeff has the right suggestion of finding someone who might know.

    That's a great calculator you've got there ... must be the size of a TV 🙂

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Jonathan AC Roberts wrote:

    With 550 columns it's not possible to test every combination of 2 or more columns to see if they are unique. It's a combinatorial problem with a search space of 3.6855 × 10^(165). Or to be more precise:

    3685510180489786476798393145496356338786055879312930105836138965083617346086082863365358130056307390177215209990980317284932211552660930305235775636164742230126362073

    For sure having some insight, any insight, into the meaning of the columns is essential.  Anyway, if the OP could boil it down to 10 candidate columns it's only 1,023 combinations (including single columns, excluding those it's 1,013).  Maybe that's doable.  If it's 12 columns that's 4,095 combinations which could take a while.

     

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Thanks everyone for your input.

    I really appreciate it.

    Will take all of it into account.  Happy Thanksgiving!

    Likes to play Chess

  • Thanks! Very helpful suggestion.

    Likes to play Chess

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply