March 8, 2021 at 1:37 pm
Not looking for any code, just ideas.
I have a series of queries which run regularly and return rows we consider potentially incorrect, or at least worth checking. These queries run against any database/table deemed particularly interesting/important.
Imagine a scenario where 75 rows of data are identified, across three different queries. 60 of these rows have been checked and found to be OK – not an issue. I'd like to exclude these rows from the results the next time the query runs, so that focus can be given to those rows which have not been assessed as OK.
I can think of no flexible, straightforward and easy-to-maintain way of excluding these results. Some ideas I've had:
If I had just one or two of these queries and they were fairly static, either of options one and two would work OK, with the first being easier to maintain. But that's not the case: queries can be added or removed fairly regularly (though the numbers are manageably low, once or twice a month); option three would handle this.
I can't be the first person thinking about implementing this. Any ideas or suggestions, anyone?
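For concreteness, the exclusion-table idea (option three, as discussed further down the thread) might be sketched roughly as below. All table, column and query details here are illustrative assumptions, not taken from the actual system:

```sql
-- Hypothetical sketch: a shared exclusion table holding rows already reviewed.
-- Table and column names are illustrative only.
CREATE TABLE dbo.CheckedRows
(
    SourceTable sysname       NOT NULL,  -- which monitored table the row came from
    SourceKey   nvarchar(100) NOT NULL,  -- the row's primary key value, stored as text
    CheckedOn   datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_CheckedRows PRIMARY KEY (SourceTable, SourceKey)
);

-- A monitoring query then excludes anything already marked as OK:
SELECT o.OrderID, o.Amount
FROM   dbo.Orders AS o
WHERE  o.Amount < 0                                -- the "suspect" condition
AND    o.OrderID NOT IN (SELECT CAST(c.SourceKey AS int)
                         FROM   dbo.CheckedRows AS c
                         WHERE  c.SourceTable = N'dbo.Orders');
```

One table serves every monitored query, so adding or removing a query needs no DDL changes, only new rows in the exclusion table.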
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
March 8, 2021 at 3:09 pm
You could add a tinyint (or bit) indicator to the table row itself indicating whether that row has been reviewed already. (Maybe tinyint so that there are bits for other statuses, if you decide you need them.) Of course that requires modifying the tables themselves, but it would be more efficient.
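A rough sketch of that idea, with the column name, default and status values being illustrative assumptions:

```sql
-- Hypothetical sketch of the in-row status flag described above.
ALTER TABLE dbo.Orders
    ADD ReviewStatus tinyint NOT NULL
        CONSTRAINT DF_Orders_ReviewStatus DEFAULT (0);  -- 0 = unreviewed, 1 = reviewed OK

-- The monitoring query then skips rows already reviewed:
SELECT o.OrderID, o.Amount
FROM   dbo.Orders AS o
WHERE  o.Amount < 0
AND    o.ReviewStatus = 0;
```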
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
March 8, 2021 at 4:54 pm
Taking a significantly different angle on this (apologies if it's too far outside the scope of your question), but could you use Change Data Capture (CDC) to identify changed rows? Then you could do something like this:
One time process per table:
Step 1: Select changes from CDC stored procedure
Step 2: Evaluate rows
Step 3: Store the highest lastlsn of returned rows
Repeated process per table:
Step 1: Select changes from CDC stored procedure since the lastlsn
Step 2: Evaluate rows
Step 3: Store the highest lastlsn of returned rows
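The repeated steps above might be sketched as follows. The capture-instance name (dbo_Orders) and the watermark table are illustrative assumptions; only the sys.fn_cdc_* functions are real CDC APIs:

```sql
-- Hypothetical sketch of one pass of the repeated per-table process.
DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Resume just after the LSN stored by the previous run
SELECT @from_lsn = sys.fn_cdc_increment_lsn(LastLsn)
FROM   dbo.CdcWatermark
WHERE  CaptureInstance = N'dbo_Orders';

SET @to_lsn = sys.fn_cdc_get_max_lsn();

-- Step 1: select changes since the stored LSN
SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');

-- Step 2 (evaluate rows) happens outside this snippet.

-- Step 3: persist the new high-water mark for the next run
UPDATE dbo.CdcWatermark
SET    LastLsn = @to_lsn
WHERE  CaptureInstance = N'dbo_Orders';
```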
Please note: I have only implemented this on a small test system. I have no idea how well it would scale in production.
March 9, 2021 at 10:13 am
Scott and as1981, thank you for taking the time to respond.
These responses at least suggest that I did not miss anything obvious.
Unless I have misunderstood, CDC does not achieve what I wish to achieve. If there is a row which has not been marked as OK, that row should continue being returned by the SELECT query. I cannot see how the proposed solution would accommodate this.
Scott's approach works well enough, though I do not wish to universally add an IsValid flag to our tables, nor do I want to have to make a DDL change to add the new column every time I wish to validate data on a table which has not previously been validated.
I think I will proceed with my suggested option 3.
March 9, 2021 at 12:27 pm
Apologies, I was thinking that the rows you need to check might all be new or have been updated. As you said, it won't pick up existing, unchanged rows.
March 9, 2021 at 1:11 pm
Apologies, I was thinking that the rows you need to check might all be new or have been updated. As you said, it won't pick up existing, unchanged rows.
I thought so. No need to apologise. I thank you once again for taking the time to respond.
March 9, 2021 at 1:43 pm
Also test whether doing an outer join to the approved table and looking for nulls might be faster than the NOT IN.
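For reference, the anti-join form might look like this (table and column names are illustrative assumptions, matching nothing in particular):

```sql
-- Hypothetical sketch: the NOT IN filter rewritten as a LEFT JOIN anti-join.
SELECT o.OrderID, o.Amount
FROM   dbo.Orders AS o
LEFT JOIN dbo.CheckedRows AS c
       ON c.SourceTable = N'dbo.Orders'
      AND c.SourceKey   = CAST(o.OrderID AS nvarchar(100))
WHERE  o.Amount < 0
AND    c.SourceKey IS NULL;   -- keep only rows with no match in the approved table
```

Worth noting: NOT EXISTS is often the safest of the three forms in T-SQL, because NOT IN returns no rows at all if the subquery produces any NULL.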
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 9, 2021 at 2:55 pm
Also test whether doing an outer join to the approved table and looking for nulls might be faster than the NOT IN.
Thanks, Grant. To be honest, I'd probably have implemented it this way anyway.
My original NOT IN text perhaps made the question easier to understand than a LEFT JOIN ... IS NULL.
March 9, 2021 at 6:47 pm
Option 3, but with the addition of extra fields.
And I would make it so that the PK can be removed and re-added. What if data was fixed a year ago and has gone bad again? The fact that it was reported and validated a year ago would cause it to be ignored now.
March 9, 2021 at 8:30 pm
Scott and as1981, thank you for taking the time to respond.
Scott's approach works well enough, though I do not wish to universally add an IsValid flag to our tables, nor do I want to have to make a DDL change to add the new column every time I wish to validate data on a table which has not previously been validated.
I think I will proceed with my suggested option 3.
No objections to a combined validation table. I would point out, though, that if you were to go the is_valid flag route, I'd just add it to all tables immediately, so there would be no future maintenance. This is much easier to do if all tables are only accessed via a view, so that the underlying physical table can be changed very easily with no effect on existing table uses.
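As a sketch of that view indirection (table, view and column names are illustrative assumptions; CREATE OR ALTER VIEW needs SQL Server 2016 SP1 or later):

```sql
-- Hypothetical sketch: callers query the view, so adding an is_valid
-- column to the base table needs no changes to existing code.
ALTER TABLE dbo.Orders
    ADD is_valid bit NOT NULL CONSTRAINT DF_Orders_is_valid DEFAULT (1);
GO
CREATE OR ALTER VIEW dbo.Orders_v
AS
SELECT OrderID, Amount    -- expose only the original columns
FROM   dbo.Orders;
```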
March 10, 2021 at 3:59 pm
Just a thought, and probably not a very good one. But in lieu of status flags validating rows, or exclusion tables, why not remove suspect/invalid rows to other tables with similar layouts? Perhaps with a few other columns such as the date removed, a code for the reason it was considered suspect, and the number of the query, as you described in your initial post. If and when rows are corrected, they can be restored to the "live" table.
Although there is the workload of copying them back and forth, there is no need for exclusion code. Your "normal" code would run faster for not having to test against an exclusion table every time. I understand that this approach isn't going to be usable if rows are only considered suspect for some queries, but not others. But I've always been of the opinion that garbage data is garbage data.
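The move-and-restore idea might be sketched like this. The quarantine table, reason code and query number are illustrative assumptions; the OUTPUT ... INTO clause is real T-SQL that moves the rows in a single statement:

```sql
-- Hypothetical sketch: quarantine suspect rows in a side table
-- (same layout as the live table plus audit columns), in one statement.
DELETE o
OUTPUT deleted.OrderID, deleted.Amount, SYSUTCDATETIME(), N'NEG_AMOUNT', 3
INTO   dbo.Orders_Suspect (OrderID, Amount, RemovedOn, ReasonCode, QueryNo)
FROM   dbo.Orders AS o
WHERE  o.Amount < 0;          -- the "suspect" condition

-- Once a row is corrected, restore it to the live table:
INSERT INTO dbo.Orders (OrderID, Amount)
SELECT s.OrderID, s.Amount
FROM   dbo.Orders_Suspect AS s
WHERE  s.ReasonCode = N'NEG_AMOUNT' AND s.Amount >= 0;
```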
Good luck, whichever way you go.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills