Duplicates and ROW_NUMBER ()

  • Dear All,

    We use ROW_NUMBER () to partition results and hand-pick the needed records, mostly to address duplicate records scenario.

    However, how do we sense that there is a potential threat for duplicates to emrge? The answer could be "Well, you should know your data functionality". In reality, from a

    Developer standpoint, often we work on subject areas that we least know, nor is there sufficient documentation on data held in tables and how they are functionally related.

    In such cases, how do we technically (if not with Business Domain proficiency) identify potential pitfalls related to Duplicates? and hence resort to ROW_NUMBER () or any other method?

    thank you

  • etl2016 (6/9/2016)


    Dear All,

    We use ROW_NUMBER () to partition results and hand-pick the needed records, mostly to address duplicate records scenario.

    However, how do we sense that there is a potential threat for duplicates to emrge? The answer could be "Well, you should know your data functionality". In reality, from a

    Developer standpoint, often we work on subject areas that we least know, nor is there sufficient documentation on data held in tables and how they are functionally related.

    In such cases, how do we technically (if not with Business Domain proficiency) identify potential pitfalls related to Duplicates? and hence resort to ROW_NUMBER () or any other method?

    thank you

    Your question seems a little broad in scope and is therefore difficult to answer.

    Are you asking: "What problems are duplicates in my data likely to cause?" – this depends entirely on your environment/application.

    Or maybe: "How can I identify which applications are inserting duplicates?"

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • The subject you're asking about is called "data hygiene". It's about things like duplicates, unexpected NULL values, out-of-expected-range values, and a whole lot of other similar issues.

    It takes a very strong understanding of the business needs regarding the data. It's not something that can be done without a high level of domain knowledge.

    Without that, the only option is wait till something goes wrong, then fix it.

    With the domain knowledge, you can move into prevention. Some of that (a lot) goes into data architecture. Some into physical database design and implementation (which columns to have Not Null on, that kind of thing). And so on.

    But prevention takes detailed knowledge of the data, what it means, what it's for, who uses it and what impacts it has on them, and so on.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • etl2016 (6/9/2016)


    Dear All,

    We use ROW_NUMBER () to partition results and hand-pick the needed records, mostly to address duplicate records scenario.

    However, how do we sense that there is a potential threat for duplicates to emrge? The answer could be "Well, you should know your data functionality". In reality, from a

    Developer standpoint, often we work on subject areas that we least know, nor is there sufficient documentation on data held in tables and how they are functionally related.

    In such cases, how do we technically (if not with Business Domain proficiency) identify potential pitfalls related to Duplicates? and hence resort to ROW_NUMBER () or any other method?

    thank you

    I disagree with your statement "often we work on subject areas that we least know". I think that you are confusing knowledge with expertise. You don't need to be an expert in the subject area, but you should have a passing knowledge of your subject area.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • I will say that ROW_NUMBER() is a dangerous tool to use when trying to deduplicate data. It's dangerous in that unless the ORDER BY you provide can already provide a perfectly unambiguous sort , the ROW_NUMBER() value won't guarantee to return the same value for the same row in multiple query runs. If you HAVE to use something arbitrary like that - recommendation would be to at least use something consistent (e.g. an identity column) so that the same ordering will occur each time.

    We've had a few ugly scenarios like that where some od the devs used ROW_NUMBER() within the paging schemes, only to find that they don't get the detail for the record they clicked on.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (6/9/2016)


    I will say that ROW_NUMBER() is a dangerous tool to use when trying to deduplicate data. It's dangerous in that unless the ORDER BY you provide can already provide a perfectly unambiguous sort , the ROW_NUMBER() value won't guarantee to return the same value for the same row in multiple query runs. If you HAVE to use something arbitrary like that - recommendation would be to at least use something consistent (e.g. an identity column) so that the same ordering will occur each time.

    We've had a few ugly scenarios like that where some od the devs used ROW_NUMBER() within the paging schemes, only to find that they don't get the detail for the record they clicked on.

    Must say that I find your statement slightly odd Matt, the issue you are mentioning here has nothing to do with the functionality or the integrity of the ROW_NUMBER function but all to do with the developers' ignorance on how it works. So far I have never seen any collisions or discrepancies in the function but I've seen lots of naive and wrong applications of it. The fact that the partition grouping does not produce any collisions makes it one of the best tools in the box for de-duplication.

    😎

  • Eirikur Eiriksson (6/9/2016)


    Matt Miller (#4) (6/9/2016)


    I will say that ROW_NUMBER() is a dangerous tool to use when trying to deduplicate data. It's dangerous in that unless the ORDER BY you provide can already provide a perfectly unambiguous sort , the ROW_NUMBER() value won't guarantee to return the same value for the same row in multiple query runs. If you HAVE to use something arbitrary like that - recommendation would be to at least use something consistent (e.g. an identity column) so that the same ordering will occur each time.

    We've had a few ugly scenarios like that where some od the devs used ROW_NUMBER() within the paging schemes, only to find that they don't get the detail for the record they clicked on.

    Must say that I find your statement slightly odd Matt, the issue you are mentioning here has nothing to do with the functionality or the integrity of the ROW_NUMBER function but all to do with the developers' ignorance on how it works. So far I have never seen any collisions or discrepancies in the function but I've seen lots of naive and wrong applications of it. The fact that the partition grouping does not produce any collisions makes it one of the best tools in the box for de-duplication.

    😎

    True, the fault isn't with the tool but the wielder: it's such an easy kludge it does engender a bit of overconfidence, but the flaw still is in not looking past the immediate gratification :).

    That said I still stand by my recommendation to use an identity over row_number() in particular since I cannot seem to get folks to check their math before using ROW_NUMBER().

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply