March 8, 2021 at 1:37 pm
Not looking for any code, just ideas.
I have a series of queries which run regularly and return rows we consider potentially incorrect, or at least worth checking. These queries run against any database/table deemed particularly interesting/important.
Imagine a scenario where 75 rows of data are identified, across three different queries. 60 of these rows have been checked and found to be OK – not an issue. I'd like to exclude these rows from the results the next time the query runs, so that focus can be given to those rows which have not been assessed as OK.
I can think of no flexible, straightforward and easy-to-maintain way of excluding these results. Some ideas I've had:
If I had just one or two of these queries and they were fairly static, either of options one and two would work OK, with the first being easier to maintain. But that's not the case: queries can be added or removed fairly regularly (though the numbers are manageably low, once or twice a month); option three would handle this.
I can't be the first person thinking about implementing this. Any ideas or suggestions, anyone?
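For concreteness, the exclusion-table idea (option three, as discussed further down the thread) might be sketched roughly as below. All table, column and query details here are illustrative assumptions, not taken from the actual system:

```sql
-- Hypothetical sketch: a shared exclusion table holding rows already reviewed.
-- Table and column names are illustrative only.
CREATE TABLE dbo.CheckedRows
(
    SourceTable sysname       NOT NULL,  -- which monitored table the row came from
    SourceKey   nvarchar(100) NOT NULL,  -- the row's primary key value, stored as text
    CheckedOn   datetime2     NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_CheckedRows PRIMARY KEY (SourceTable, SourceKey)
);

-- A monitoring query then excludes anything already marked as OK:
SELECT o.OrderID, o.Amount
FROM   dbo.Orders AS o
WHERE  o.Amount < 0                                -- the "suspect" condition
AND    o.OrderID NOT IN (SELECT CAST(c.SourceKey AS int)
                         FROM   dbo.CheckedRows AS c
                         WHERE  c.SourceTable = N'dbo.Orders');
```

One table serves every monitored query, so adding or removing a query needs no DDL changes, only new rows in the exclusion table.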
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
March 8, 2021 at 3:09 pm
You could add a tinyint (or bit) indicator to the table row itself indicating whether that row has been reviewed already. (Maybe tinyint so that there are bits for other statuses, if you decide you need them.) Of course that requires modifying the tables themselves, but it would be more efficient.
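A rough sketch of that idea, with the column name, default and status values being illustrative assumptions:

```sql
-- Hypothetical sketch of the in-row status flag described above.
ALTER TABLE dbo.Orders
    ADD ReviewStatus tinyint NOT NULL
        CONSTRAINT DF_Orders_ReviewStatus DEFAULT (0);  -- 0 = unreviewed, 1 = reviewed OK

-- The monitoring query then skips rows already reviewed:
SELECT o.OrderID, o.Amount
FROM   dbo.Orders AS o
WHERE  o.Amount < 0
AND    o.ReviewStatus = 0;
```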
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
March 8, 2021 at 4:54 pm
Taking a significantly different angle on this (apologies if it's too far outside the scope of your question), but could you use Change Data Capture (CDC) to identify changed rows? Then you could do something like this:
One time process per table:
Step 1: Select changes from CDC stored procedure
Step 2: Evaluate rows
Step 3: Store the highest lastlsn of returned rows
Repeated process per table:
Step 1: Select changes from CDC stored procedure since the lastlsn
Step 2: Evaluate rows
Step 3: Store the highest lastlsn of returned rows
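The repeated steps above might be sketched as follows. The capture-instance name (dbo_Orders) and the watermark table are illustrative assumptions; only the sys.fn_cdc_* functions are real CDC APIs:

```sql
-- Hypothetical sketch of one pass of the repeated per-table process.
DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Resume just after the LSN stored by the previous run
SELECT @from_lsn = sys.fn_cdc_increment_lsn(LastLsn)
FROM   dbo.CdcWatermark
WHERE  CaptureInstance = N'dbo_Orders';

SET @to_lsn = sys.fn_cdc_get_max_lsn();

-- Step 1: select changes since the stored LSN
SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');

-- Step 2 (evaluate rows) happens outside this snippet.

-- Step 3: persist the new high-water mark for the next run
UPDATE dbo.CdcWatermark
SET    LastLsn = @to_lsn
WHERE  CaptureInstance = N'dbo_Orders';
```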
Please note: I have only implemented this on a small test system. I have no idea how well it would scale in production.
March 9, 2021 at 10:13 am
Scott and as1981, thank you for taking the time to respond.
These responses at least suggest that I did not miss anything obvious.
Unless I have misunderstood, CDC does not achieve what I wish to achieve. If there is a row which has not been marked as OK, that row should continue being returned by the SELECT query. I cannot see how the proposed solution would accommodate this.
Scott's approach works well enough, though I do not wish to universally add an IsValid flag to our tables, nor do I want to have to make a DDL change to add the new column every time I wish to validate data on a table which has not previously been validated.
I think I will proceed with my suggested option 3.
March 9, 2021 at 12:27 pm
Apologies, I was thinking that the rows you need to check might all be new or have been updated. As you said, it won't pick up existing, unchanged rows.
March 9, 2021 at 1:11 pm
Apologies, I was thinking that the rows you need to check might all be new or have been updated. As you said, it won't pick up existing, unchanged rows.
I thought so. No need to apologise. I thank you once again for taking the time to respond.
March 9, 2021 at 1:43 pm
Also test whether doing an outer join to the approved table and looking for nulls might be faster than the NOT IN.
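For reference, the anti-join form might look like this (table and column names are illustrative assumptions, matching nothing in particular):

```sql
-- Hypothetical sketch: the NOT IN filter rewritten as a LEFT JOIN anti-join.
SELECT o.OrderID, o.Amount
FROM   dbo.Orders AS o
LEFT JOIN dbo.CheckedRows AS c
       ON c.SourceTable = N'dbo.Orders'
      AND c.SourceKey   = CAST(o.OrderID AS nvarchar(100))
WHERE  o.Amount < 0
AND    c.SourceKey IS NULL;   -- keep only rows with no match in the approved table
```

Worth noting: NOT EXISTS is often the safest of the three forms in T-SQL, because NOT IN returns no rows at all if the subquery produces any NULL.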
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 9, 2021 at 2:55 pm
Also test whether doing an outer join to the approved table and looking for nulls might be faster than the NOT IN.
Thanks, Grant. To be honest, I'd probably have implemented it this way anyway.
My original NOT IN text perhaps made the question easier to understand than a LEFT JOIN ... IS NULL.
March 9, 2021 at 6:47 pm
Option 3, but with the addition of extra fields.
And I would make it so that the PK can be removed and re-added. What if data was fixed a year ago and has gone bad again? The fact that it was reported and validated a year ago would cause it to be ignored now.
March 9, 2021 at 8:30 pm
Scott and as1981, thank you for taking the time to respond.
Scott's approach works well enough, though I do not wish to universally add an IsValid flag to our tables, nor do I want to have to make a DDL change to add the new column every time I wish to validate data on a table which has not previously been validated.
I think I will proceed with my suggested option 3.
No objections to a combined validation table. I would point out, though, that if you were to go the is_valid flag route, I'd just add it to all tables immediately, so there would be no future maintenance. This is much easier to do if all tables are only accessed via a view, so that the underlying physical table can be changed very easily with no effect on existing table uses.
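As a sketch of that view indirection (table, view and column names are illustrative assumptions; CREATE OR ALTER VIEW needs SQL Server 2016 SP1 or later):

```sql
-- Hypothetical sketch: callers query the view, so adding an is_valid
-- column to the base table needs no changes to existing code.
ALTER TABLE dbo.Orders
    ADD is_valid bit NOT NULL CONSTRAINT DF_Orders_is_valid DEFAULT (1);
GO
CREATE OR ALTER VIEW dbo.Orders_v
AS
SELECT OrderID, Amount    -- expose only the original columns
FROM   dbo.Orders;
```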
March 10, 2021 at 3:59 pm
Just a thought, and probably not a very good one. But in lieu of status flags validating rows, or exclusion tables, why not remove suspect/invalid rows to other tables with similar layouts? Perhaps with a few other columns such as the date removed, a code for the reason it was considered suspect, and the number of the query, as you described in your initial post. If and when rows are corrected, they can be restored to the "live" table.
Although there is the workload of copying them back and forth, there is no need for exclusion code. Your "normal" code would run faster for not having to test against an exclusion table every time. I understand that this approach isn't going to be usable if rows are only considered suspect for some queries, but not others. But I've always been of the opinion that garbage data is garbage data.
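The move-and-restore idea might be sketched like this. The quarantine table, reason code and query number are illustrative assumptions; the OUTPUT ... INTO clause is real T-SQL that moves the rows in a single statement:

```sql
-- Hypothetical sketch: quarantine suspect rows in a side table
-- (same layout as the live table plus audit columns), in one statement.
DELETE o
OUTPUT deleted.OrderID, deleted.Amount, SYSUTCDATETIME(), N'NEG_AMOUNT', 3
INTO   dbo.Orders_Suspect (OrderID, Amount, RemovedOn, ReasonCode, QueryNo)
FROM   dbo.Orders AS o
WHERE  o.Amount < 0;          -- the "suspect" condition

-- Once a row is corrected, restore it to the live table:
INSERT INTO dbo.Orders (OrderID, Amount)
SELECT s.OrderID, s.Amount
FROM   dbo.Orders_Suspect AS s
WHERE  s.ReasonCode = N'NEG_AMOUNT' AND s.Amount >= 0;
```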
Good luck, whichever way you go.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills