May 30, 2012 at 5:35 am
Hi All,
I have a bit of a puzzle and my brain decided it didn't want to come to work today. Please could you assist with a possible solution.
I need to obfuscate the business function and objects, but I am able to simplify with an example.
I have the following table
CREATE TABLE Account
(ID int IDENTITY(1, 1) NOT NULL
,AccountID int NOT NULL
,AccountVersion smallint NOT NULL
,IsActive bit
,CONSTRAINT PK_Account PRIMARY KEY CLUSTERED
(ID, AccountID, AccountVersion));
and lets add one account as a sample
INSERT INTO Account (AccountID, AccountVersion, IsActive)
VALUES
(2022, 0, 0)
,(2022, 1, 1)
,(2022, 2, 1)
,(2022, 3, 0)
,(2022, 4, 0)
,(2022, 5, 1)
,(2022, 6, 0)
,(2022, 7, 1);
OK, so lets say an account can exist with a status of active or inactive. Every time this status changes a new version is added. A new version may also be added under other conditions that does not affect the status. I need to summarise these status changes in one query. I refuse to use a CURSOR for this.
My expected output is as follows:
AccountID,StartVersion,EndVersion
2022,1,3
2022,5,6
2022,7,7
I hope this makes sense.
May 30, 2012 at 5:51 am
This assumes AccountVersion is contiguous
WITH CTE AS (
SELECT AccountID, AccountVersion, IsActive,
ROW_NUMBER() OVER(PARTITION BY AccountID ORDER BY AccountVersion DESC) AS rn1,
ROW_NUMBER() OVER(PARTITION BY AccountID,IsActive ORDER BY AccountVersion DESC) AS rn2
FROM Account)
SELECT AccountID,
MIN(AccountVersion) AS StartVersion,
MAX(AccountVersion) + CASE WHEN MIN(rn1)>1 THEN 1 ELSE 0 END AS EndVersion
FROM CTE
WHERE IsActive=1
GROUP BY AccountID,rn1-rn2
ORDER BY MIN(rn1) DESC;
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537May 31, 2012 at 2:38 am
May 31, 2012 at 5:06 am
Artoo22 (5/31/2012)
Thanks Mark, that works nicely.
Just checking... do you know HOW it works?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2012 at 3:28 pm
Jeff Moden (5/31/2012)
Artoo22 (5/31/2012)
Thanks Mark, that works nicely.Just checking... do you know HOW it works?
Well, I have learned that there aren't many people, if anyone at all, who know it all when it comes to SQL Server. If you want to succeed in a SQL Server career the key is to understand exactly what you are doing, thus I'm not a copy and paste developer.
To start with I ran the the given solution with a simple SELECT * FROM CTE to visualise the row numbering. I saw rn1 was a row number for each version starting with the most recent version. I also saw rn2 was a row number for each state.
The next part was more difficult to understand. I ran SELECT *, rn1-rn2 FROM CTE to understand the grouping. By studying the results I picked up that by grouping on rn1-rn2 would represent a change of state.
Now understanding the group by I then added WHERE IsActive = 1 to visualise the final data set we're working with, which is obviously because we need to show the start and end of active states.
Finally the start version of a state is the min version of that group and the last version is the max, but as we wanted to return the next version after a state is finished we added 1 (hence the warning about contiguous version numbers) except for the last group which does not have a next version.
So yeah, I spent time making sure I know how it works before implementing it.
June 1, 2012 at 8:10 am
Good job. With responses like this these forums become more useful. Not only is there the code posted to solve an issue, but an explanation of the approach used to understand how the code worked, makes this post more useful. thanks
Francis
June 1, 2012 at 2:59 pm
Artoo22 (5/31/2012)
Jeff Moden (5/31/2012)
Artoo22 (5/31/2012)
Thanks Mark, that works nicely.Just checking... do you know HOW it works?
Well, I have learned that there aren't many people, if anyone at all, who know it all when it comes to SQL Server. If you want to succeed in a SQL Server career the key is to understand exactly what you are doing, thus I'm not a copy and paste developer.
To start with I ran the the given solution with a simple SELECT * FROM CTE to visualise the row numbering. I saw rn1 was a row number for each version starting with the most recent version. I also saw rn2 was a row number for each state.
The next part was more difficult to understand. I ran SELECT *, rn1-rn2 FROM CTE to understand the grouping. By studying the results I picked up that by grouping on rn1-rn2 would represent a change of state.
Now understanding the group by I then added WHERE IsActive = 1 to visualise the final data set we're working with, which is obviously because we need to show the start and end of active states.
Finally the start version of a state is the min version of that group and the last version is the max, but as we wanted to return the next version after a state is finished we added 1 (hence the warning about contiguous version numbers) except for the last group which does not have a next version.
So yeah, I spent time making sure I know how it works before implementing it.
VERY cool. You really need to give yourself a good pat on the back because way too many people will take this simply as a CPR (Cut, Paste, Replace) example without ever taking the time to rip it appart, play with it, and understand it. I don't know who you work for but, with the attitude you just showed, they should really be happy to have you.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 2, 2012 at 12:11 am
June 3, 2012 at 11:21 am
Gosh, I don't believe so, Joe. A column constrained to a 1 or a 0 is still nothing more than a mostly useless "bit" of information whether it's a BIT or a SMALLINT data-type especially in the presence of a Start and End Date column. Having just two values, the column is nearly useless as the leading column of an index and offers information that is already available in the date columns.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 3, 2012 at 11:45 am
@Artoo22,
With consideration for the IsActive column and to continue the thought that I started on that subject with Joe... Rather than a "bit" column (I don't care what data type it is, it still only has two values which will cause index splits galore on INSERTS possibly causing timeouts in the app), I'll recommend a StartDate and an EndDate column to replace the IsActive column. I also recommend that the EndDate column be a NOT NULL column with a default value of '9999-12-30' (there are computational advantages to NOT using '9999-12-31' that I won't get into here).
That will allow several remarkable things to happen in the areas of performance and future reporting requirements.
First, if it's possible to reactivate an account, it will allow for SCD Type 6 rows to occur which is also great for auditing and historical purposes. Please see the following URL on SCD Type 6 (SCD = Slowly Changing Dimension).
Second, since you can put a dandy index on StartDate and EndDate (especially since EndDate IS NOT NULL), finding active accounts is highly "seekable" using something like the following to find the currently active accounts.
WHERE StartDate <= GETDATE() AND EndDate > GETDATE()
The reason for using the large EndDate instead of a null is two fold. It eliminates the NULL possibly making the index more effective and it keeps you from having to write code like...
WHERE EndDate IS NULL
or
WHERE EndDate IS NOT NULL
While that looks a lot more convenient, I believe it's less effective when trying to use an index.
Third, inserts for new accounts (IsActive = 1) and deactivations of an account (IsActive = 0) will be done by date rather than on a two value column which will greatly decrease extent splits in the non-clustered index or page splits in a clustered index especially if you have a decent FILLFACTOR other than 100 or 0. This not only reduces index maintenance but it also helps prevent timeouts while the app waits for the splits to occur on a busy system.
Last but not least, an InActive column will certainly tell you whether an account is active or inactive but it will not allow you to create reports that identify (for example) how many new accounts were gained in a particular month nor how many were lost in a particular month.
Just my 2 cents on what I've been able to use very effectively instead of an IsActive column.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 3, 2012 at 3:58 pm
Not quite. The T-SQL BIT is numeric and therefore NULL-able. It also cannot be extended. How many times have you seen a sex_code go from BIT to the ISO standard when we needed "legal person" and "unknown" in a table?
So what if BIT is NULL-able... so is a SMALLINT. You just need to add a NOT NULL constraint to it just like you did for SMALLINT.
So far as BIT columns going from a Yes/No to an ISO standard, that's why you and I are opposed to a BIT column or any two state column for that matter. That's why I'm opposed to having a column called "IsActive". By definition, such a column can only have a max of 3 states (Yes, No, Unknown) which, in the absence of other essential modeling features, has no temporal inference other than "now". If such features existed, I'd also consider such an IsActive column to be a duplication of data because the presence, absence, or value of a date column is certainly capable of answering the "IsActive?" question in a truly temporal manner.
The thing with any status is that the meta-data definition is that it is a "STATe of being" and therefore has a temporal dimension. The information principle requires that dimension be modeled.
Agreed for most status columns but not for 2 or 3 state columns like "IsActive". If the temporal dimension of such a condition is to be modeled (and I agree that it should be), the content of the temporal dimension will correctly indicate if something is active or not and having a separate "IsActive" flag (I thought you hated flags :w00t: ) is a simple duplication of data which is a violation of 1NF, is it not?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 3, 2012 at 5:55 pm
CELKO (6/3/2012)recorc
record
CELKO (6/3/2012)stratus
status
CELKO (6/3/2012)
>> I have the following table <<Even for a skeleton, this was bad code; we do not use bit flags, IDENTITY as a fake mag tape recorc number and other non-RDBMS constructs and names that violate ISO-11179 rules. The stratus of an account is a temporal event, so it is modeled with a (start_time end_time) half-one intervals. This is simple, basics, not fancy stuff.
Your real problem is bad DDL.
Joe, with all due respect, this is not my architecture. I would love to meet the architect and I pray there is no weapon at hand when I do. I work for a client as a database consultant. A third party has written the data access layer while a fourth party has written the front end. There is no way for me to change the design (or lack thereof) of the database.
CELKO (6/3/2012)>> OK, so lets say an account can exist with a status of active or inactive. Every time this status changes a new version is added. A new version may also be added under other conditions that does not affect the status. <<
No let's try a relational design instead of a βmock punch cardβ design. In fact, your mindset is so much into that design approach that you put commas in the front of each line(card) just like we did in the 1960's. Try this skeleton:
CREATE TABLE Accounts
(acct_nbr char(10) NOT NULL,
PRIMARY KEY (acct_nbr, start_date)
acct_status SMALLINT NOT NULL
CHECK (acct_status IN (0, 1))
start_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,
end_date DATE,
CHECK (start_date <= end_date)));
We can add more constraints to prevents gaps, etc.
Not my mindset but someones design that I have been forced to work with. It is painful.
CELKO (6/3/2012)>> I need to summaries these status changes in one query. I refuse to use a CURSOR for this. <<
Good, you know that cursors are the work of the devil π Thanks to your bad schema, you will have to built the table you should have designed with DML β row numbers and self-joins and over high overhead kludges. It is simple with the right DDKL.
SELECT acct_nbr, start_date, end_date
FROM Accounts
WHERE acct_status = 1;
You are assuming the bad schema is mine and you are also assuming I have the power to change such a schema. You know what they say about assumptions.
CELKO (6/3/2012)
>> I hope this makes sense. <<
Yes, but not in a good way. You need a book on data modeling and basic RDBMS
I have a good book titled "Joe Celko's Data and Databases: Concepts in Practice". Can you recommend any others?
June 3, 2012 at 6:06 pm
@jeff, as you can see from my reply to Joe the architecure is not mine, set in concrete and causes me great anguish and pain. Your reply is, however, valuable and worthy of a read.
Jeff Moden (6/3/2012)
@Artoo22,With consideration for the IsActive column and to continue the thought that I started on that subject with Joe... Rather than a "bit" column (I don't care what data type it is, it still only has two values which will cause index splits galore on INSERTS possibly causing timeouts in the app), I'll recommend a StartDate and an EndDate column to replace the IsActive column. I also recommend that the EndDate column be a NOT NULL column with a default value of '9999-12-30' (there are computational advantages to NOT using '9999-12-31' that I won't get into here).
That will allow several remarkable things to happen in the areas of performance and future reporting requirements.
First, if it's possible to reactivate an account, it will allow for SCD Type 6 rows to occur which is also great for auditing and historical purposes. Please see the following URL on SCD Type 6 (SCD = Slowly Changing Dimension).
Second, since you can put a dandy index on StartDate and EndDate (especially since EndDate IS NOT NULL), finding active accounts is highly "seekable" using something like the following to find the currently active accounts.
WHERE StartDate <= GETDATE() AND EndDate > GETDATE()
The reason for using the large EndDate instead of a null is two fold. It eliminates the NULL possibly making the index more effective and it keeps you from having to write code like...
WHERE EndDate IS NULL
or
WHERE EndDate IS NOT NULL
While that looks a lot more convenient, I believe it's less effective when trying to use an index.
Third, inserts for new accounts (IsActive = 1) and deactivations of an account (IsActive = 0) will be done by date rather than on a two value column which will greatly decrease extent splits in the non-clustered index or page splits in a clustered index especially if you have a decent FILLFACTOR other than 100 or 0. This not only reduces index maintenance but it also helps prevent timeouts while the app waits for the splits to occur on a busy system.
Last but not least, an InActive column will certainly tell you whether an account is active or inactive but it will not allow you to create reports that identify (for example) how many new accounts were gained in a particular month nor how many were lost in a particular month.
Just my 2 cents on what I've been able to use very effectively instead of an IsActive column.
Unfortunately, the IsActive column was merely used for my example and the actual status column in my database represents many states of an account. Each state though can be active or inactive and my report needs to show active states.
Jeff Moden (6/3/2012)
@Artoo22,there are computational advantages to NOT using '9999-12-31' that I won't get into here
I would love to hear more about that.
June 4, 2012 at 5:58 am
Artoo22 (6/3/2012)
Unfortunately, the IsActive column was merely used for my example and the actual status column in my database represents many states of an account. Each state though can be active or inactive and my report needs to show active states.Jeff Moden (6/3/2012)
@Artoo22,there are computational advantages to NOT using '9999-12-31' that I won't get into here
I would love to hear more about that.
Ah! Got it, Artoo. nThanks for the feedback.
On the 9999-12-30 thing... someone could probably write a chapter on it but I'll summarize. One of the best practices I've seen (and almost always use) when it comes to using paramaterized date ranges is to make sure the EndDate is a "whole" date, add 1 to it, and then check for dates less than that. When passing parameters for an "open" EndDate, I use 9999-12-30 because you cannot add 1 to 9999-12-31 and, since I don't want to make an exception for it when using the best practice, I use 9999-12-30 as the default EndDates to represent "open" dates in tables. It all makes it easy to stick with the best practice for date lookups of...
WHERE StartDate >= @StartDate AND EndDate < DATEADD(dd,1,@EndDate)
--Jeff Moden
Change is inevitable... Change for the better is not.
June 4, 2012 at 2:20 pm
Jeff Moden (6/4/2012)
On the 9999-12-30 thing... someone could probably write a chapter on it but I'll summarize. One of the best practices I've seen (and almost always use) when it comes to using paramaterized date ranges is to make sure the EndDate is a "whole" date, add 1 to it, and then check for dates less than that. When passing parameters for an "open" EndDate, I use 9999-12-30 because you cannot add 1 to 9999-12-31 and, since I don't want to make an exception for it when using the best practice, I use 9999-12-30 as the default EndDates to represent "open" dates in tables.
That makes a lot of sense, thanks.
CELKO (6/4/2012)
LOL! Been there myself. I suggested killing a programmer at one consulting job. The scary part was that I did not get a laugh at my presentation.
π
CELKO (6/4/2012)
Why, I recommend everything I have written except my old Wrox intro book π SMARTIES has all of the basic idioms for SQL; THINKING IN SETS helps with the leap from procedural code to data-driven programming.I also started a book for CRC. I got tried of discrete math books aimed at procedural programmers and I wanted one for database guys. It will not be out until 2013.
Cheers, my next book shall be SQL for SMARTIES.
Viewing 15 posts - 1 through 15 (of 40 total)
You must be logged in to reply to this topic. Login to reply