Just For Fun: An Impossible Delete
Recently a seemingly unsolvable question was posed on one of the SQLServerCentral forums and its solution serves as an interesting case study in some once common, but now rarely seen techniques in database manipulation.
First let's look at the problem in the poster's own words:
I use SQL server 2000. I have a table 'TAB' with 3 columns 'Name','Age','Sex' and there is no primary key. I have multiple duplicate data in my table. for eg..
ABC 24 M
ABC 24 M
LMN 27 M
LMN 27 M
LMN 27 M
PQRS 25 F
XYZ 24 M
XYZ 25 M

Now i would wish to 'DELETE' the 1st original row (Row no 1,3,6,7) and keep the other duplicates from above data. Now, If there is a row without any duplicates present.. it will be deleted (for eg Row no 6). The condition is i dont want to go for intermediate tables or have any additional identity column.
To summarize:
- We have a table with three columns: [Name], [Age], and [Sex]. (Let's call this table [SOURCE].)
- SOURCE has duplicate rows.
- We wish to delete the 'first' row from every set of duplicates.
- This also means that all singles are to be deleted as well.
To this the OP adds the following restrictions:
- No intermediate tables, and
- No additional identity column.
And implicit in all of this, of course, is that it must work on SQL Server 2000.
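For readers who want to follow along, here is a minimal setup script of my own (the table and column names come from the post; the datatypes are assumptions):

CREATE TABLE SOURCE
(
[Name] varchar(10) NOT NULL,
Age int NOT NULL,
Sex char(1) NOT NULL
)
GO
INSERT Into SOURCE
Select 'ABC',24,'M'
UNION ALL Select 'ABC',24,'M'
UNION ALL Select 'LMN',27,'M'
UNION ALL Select 'LMN',27,'M'
UNION ALL Select 'LMN',27,'M'
UNION ALL Select 'PQRS',25,'F'
UNION ALL Select 'XYZ',24,'M'
UNION ALL Select 'XYZ',25,'M'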
Cursors? Foiled Again
Like most others, I at first thought that, with these restrictions, this was impossible without resorting to something vile, like cursors. In fact, it seemed as though the problem had been constructed and artificially constrained with the intent of forcing us to admit that it could only be solved with a cursor. In that light, however, it occurred to me to treat this not as a practical problem that someone needed for their work, but rather as a SQL puzzle, much like some Questions of the Day: a convoluted technical problem meant to challenge us to find any feasible way to solve it, though still without cursors or loops, as a matter of principle.
And one other thing occurred to me as well: Although this problem and its restrictions seemed very contrived and artificial, this was a time when they would not have been out of place at all...
Flashback
"Back In The Day", when I started out in computers (early 70's) there we none of the Gigabyte memory spaces and Terabyte disk spaces that we have today. Then, disks were measured in Kilobytes and memory was measured in, well, just bytes. And any programmer worth their salt spent their time worrying about every bit of storage. In fact if you took the amount of time that some of us spend on performance today and tripled it, that would give you an idea of how much time we spent worrying about how much space things took up.
Now that might seem silly to us today, but in a time when disks had almost a million times less capacity than today, and the cheapest disks cost more than the average person's house, it made a lot of sense. Space affected everything about computers in those days: how much you could store in your program, how much your program could do, even what libraries your program could co-exist with and thus what functions you could use. It was universally understood that the principal way to make most programs faster was to reduce their "footprint" and other data needs.
In that era, the OP's restrictions would not have seemed unreasonable, they were in fact already assumed by everyone. Copious amounts of temporary space ("working space") were rarely available. In many cases a temporary file was out of the question and anyone who suggested a temporary or intermediate table would have been seen as suffering from delusional fantasies ("Sure! And someday we'll talk to each other with personal communicators, like they do on Star Trek!"). Anything even remotely like that would have taken extensive planning, and lots of down time.
And adding a column to an existing table? That was the worst of all. You would have to shut the system down, unload the table (which took hours), restructure the table, reload the data (again, hours), rebuild the indexes, and then bring the system back up. And then you would also have had to deal with the fact that every single line of existing code that referenced that data would break, and it would all have to be rewritten. That would take months. In general, it was easier to change jobs than it was to add a column to an existing table or file (and definitely faster).
The Temp In-Place Method
In this very different world of computers, the important, everyday techniques were very different too, and one of them is what I now call "Temp In-Place". (It had no name then; it was just how everyone did it.) With Temp In-Place, instead of using a temporary table to do your intermediate work (an unheard-of luxury), you would use the existing table that you already had, just with different key values so that your working rows did not conflict with the existing records.
Now, this may not seem like a terribly useful technique today, but it still does come in handy sometimes. Just last month, I used this "trick" to help out a customer with a particular data problem. (For this article, I have changed things somewhat, both to simplify it and to protect confidentiality.)
Let's say they had a large table defined like this:
CREATE TABLE CONTACT_NUMBERS
(
EmployeeID int NOT NULL,
CallOrder smallint NOT NULL,
PhoneNumber varchar(50) NOT NULL
)
GO
ALTER TABLE CONTACT_NUMBERS ADD CONSTRAINT
PK_CONTACT_NUMBERS_1 PRIMARY KEY CLUSTERED
(
EmployeeID,
CallOrder
)
Each on-call employee was supposed to have exactly two contact numbers in this table, a Primary contact number (CallOrder=1) and a Secondary (CallOrder=2), which is the order that they must be called in. It seems that a day-one bug in the client program had been assigning these numbers backwards for some time, so that there were now thousands of records with their CallOrder "1"s and "2"s reversed. Here's how to set up some test data to demonstrate this:
INSERT Into CONTACT_NUMBERS
Select 1001,1,'6095551111'
UNION ALL Select 1001,2,'2150004321'
UNION ALL Select 1002,1,'6095552222'
UNION ALL Select 1002,2,'2150004322'
UNION ALL Select 1003,1,'6095552223'
UNION ALL Select 1003,2,'2150004323'
UNION ALL Select 1004,1,'6095552224'
UNION ALL Select 1004,2,'2150004324'
In this case the 609* numbers should have been CallOrder=2 and the 215* numbers should have been CallOrder=1. So he tried this:
BEGIN Transaction
UPDATE CONTACT_NUMBERS
SET CallOrder = 2
Where CallOrder = 1
UPDATE CONTACT_NUMBERS
SET CallOrder = 1
Where CallOrder = 2
COMMIT Transaction
This fails with Primary Key violations (which, it turns out, was a good thing). He called me to ask for a suggestion on how to do it. And yes, I do know that there are other ways to do this correctly, especially with a CASE function; for reasons that I cannot get into, that was not an option here. Anyway, he wanted to avoid using a Temp table, so I suggested the Temp In-Place method as an alternative: instead of copying the records out and back again, changing the values as we went (and deleting the old versions in between), he could just change the CallOrder numbers to an unused number range first and then back to their correct 1's and 2's.
The trick with Temp In-Place is always finding a "safe" part of the key space (the range of values allowed for the key) to use for the temporary holding areas: a sub-range that is unused and will not otherwise break anything after we are done and still is large enough to hold all of our temporary rows without mixing them up. Fortunately, as I have also found for almost all Identity column cases, CallOrder did not have any negative numbers so we could use them as our temporary Keyspace:
BEGIN TRANSACTION
--move into temporary keyspace, switching values
UPDATE CONTACT_NUMBERS
SET CallOrder = -2
WHERE CallOrder = 1
UPDATE CONTACT_NUMBERS
SET CallOrder = -1
WHERE CallOrder = 2
--now move them back to their correct key values
UPDATE CONTACT_NUMBERS
SET CallOrder = 1
WHERE CallOrder = -1
UPDATE CONTACT_NUMBERS
SET CallOrder = 2
WHERE CallOrder = -2
COMMIT TRANSACTION
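As an aside, the CASE-based fix mentioned above, had it been an option, would have been even simpler; a sketch of what it might look like:

--swap the 1's and 2's in a single statement; because it is one
--statement, the key values are exchanged atomically, so the
--Primary Key never collides with itself the way the two-step
--UPDATE did
UPDATE CONTACT_NUMBERS
SET CallOrder = CASE CallOrder WHEN 1 THEN 2 ELSE 1 END
WHERE CallOrder IN (1, 2)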
Back to The Future
Returning to our original problem: obviously, we could do this with a Cursor. Although, technically, we cannot assume an order for the data, adding a Clustered Index covering all three columns would effectively impose one. However, I was not ready to give up and say that Cursors or Loops were necessary, even for this set of unreasonable restrictions. (Besides, as Jeff Moden pointed out, Cursors actually use an intermediate table under the hood anyway.)
The Identity Method
Without the restrictions, this is not a difficult problem to solve, as made clear by the responses of some folks who overlooked them. For instance, to do this with an additional identity column:
- Add a Clustered Index covering all three columns
- Add an Identity column [ID]:
Name Age Sex ID
----- ----------- ---- -----------
ABC 24 M 1
ABC 24 M 2
DEF 24 M 3
DEF 24 F 4
GHI 26 F 5
GHI 26 F 6
GHI 26 F 7
GHI 26 F 8
GHI 26 F 9
LMN 27 M 10
LMN 27 M 11
LMN 27 M 12
PQRS 25 F 13
XYZ 24 M 14
XYZ 25 M 15

- Now delete every row where the previous row (ID-1) does not have the same {Name, Age, Sex}:
DELETE From SOURCE
Where NOT EXISTS (Select * From SOURCE S2
Where S2.ID = SOURCE.ID-1
And S2.Name = SOURCE.Name
And S2.Age = SOURCE.Age
And S2.Sex = SOURCE.Sex
)
producing the desired result:
Name Age Sex ID
----- ----------- ---- -----------
ABC 24 M 2
GHI 26 F 6
GHI 26 F 7
GHI 26 F 8
GHI 26 F 9
LMN 27 M 11
LMN 27 M 12
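For completeness, the identity column itself could be added on SQL 2000 with a plain ALTER TABLE; a minimal sketch, assuming the [SOURCE] table above:

--numbers the existing rows; the clustered index added first makes
--the assignment follow {Name,Age,Sex} order in practice, though
--not by any documented guarantee
ALTER TABLE SOURCE ADD ID int IDENTITY(1,1) NOT NULL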
The Group By / Temp Table Method
It is also possible to do this without an Identity column if we use GROUP BY and an intermediate table instead, as demonstrated by one responder. However, she used a WHILE loop, which a Tally table makes unnecessary, as demonstrated below:
- Copy SOURCE into our #TEMP table, using GROUP BY to both count and eliminate the duplicates.
- Next, empty the SOURCE table so that we can insert our output rows there.
- Finally, regenerate the output rows using a Tally table, skipping over the first row in each group.
--Step 1: collapse the duplicates into #TEMP, keeping a count of each group
SELECT [Name],Age,Sex,Count(*) as [Cnt]
Into #TEMP
From SOURCE
Group By [Name],Age,Sex

--Step 2: empty the source table
DELETE From SOURCE

--Step 3: regenerate the rows from #TEMP, skipping the first of each group
INSERT Into SOURCE
Select [Name],Age,Sex
From #TEMP
Join dbo.Tally ON Tally.Number Between 2 and Cnt
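If you do not already have a Tally table (just a single-column table of sequential numbers), a minimal SQL 2000 version sufficient for this example can be built like this; the size and names here are my own choices:

SELECT TOP 8000 IDENTITY(int,1,1) as Number
INTO dbo.Tally
FROM master.dbo.syscolumns sc1
CROSS JOIN master.dbo.syscolumns sc2
GO
--a clustered PK keeps the Between joins fast
ALTER TABLE dbo.Tally ADD CONSTRAINT PK_Tally
PRIMARY KEY CLUSTERED (Number)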
However, with the restrictions, neither of these approaches would work.
There's No School like the Old School
Thinking through this problem, I became convinced that there had to be a way to solve it by taking one of these two methods and somehow conforming it to the restrictions. The Identity method could obviously be done using ROW_NUMBER(), but that was only available in SQL 2005. Without ROW_NUMBER(), this approach seemed hopeless to me. Even a Tally table would be hampered by an inability to get a 1-to-1 synch with the source rows.
The Group By/Temp table method seemed more promising and it occurred to me that I might be able to implement it using the previously mentioned Old School technique that I have come to call Temp In-Place.
I felt that this was just the thing to get me around the "No Temp Tables" restriction; however, there were still some problems to be solved. First, what keyspace could I safely use? This was not such a difficult problem, as the [Sex] column (which really should be called "Gender" nowadays) was only using two characters, M and F, so I could just use two other letters, say "A" and "B", for my temporary copies of the existing records. The second problem was considerably more difficult: the temporary table used in the Group By method actually has an additional column, used to store the record counts. Where was I going to store an entire extra column?
OK, I could have just added an extra column, which was not strictly against the rules (as long as it wasn't an Identity column), but by this time I was determined to solve it entirely within old school rules. Once again the [Sex] column was the answer. Under the rules of this problem, I could not assume that the [Sex] was any larger than CHAR(1), which is one byte.
Now one byte may not seem like a lot today, but in the 70's an entire family of programmers could live in the space provided by a single byte. Since the existing data only had two values (M and F), only one bit of the eight bits in a byte was needed for the current key values, leaving a luxurious seven (7) bits for whatever I wanted. This would allow me to store counts in the range of 0-127, which I was pretty sure would cover the maximum number of duplicates in this problem (if not, there were at least 4 more bits that I could squeeze out of the Name and Age columns).
The Solution: Temp In-Place and ReMapping ASCII
Putting all of these together, we can finally come up with a solution that fulfills all of the requirements and still avoids any Cursors or loops:
--======1: ReMap {M,F} to lowest bit:
Update SOURCE Set Sex = CHAR(CASE Sex When 'M' Then 0 Else 1 End)
--======2: Create counted rows, using upper 7 bits
Insert into SOURCE(Name, Age, Sex)
Select Name
, Age
, CHAR( (2*Count(*)) + Ascii(Sex) )
From SOURCE
Group By Name, Age, Sex
--======3: Remove the old rows
Delete from SOURCE
Where Ascii(Sex) <= 1
--======4: Generate the new, recounted rows
Insert into SOURCE(Name, Age, Sex)
Select Name
, Age
, CHAR(ASCII(Sex) & 1) --extract and preserve the [Sex] field
From SOURCE
Join Tally ON Tally.Number <= Ascii(Sex)/2
Where Tally.Number Between 2 and 127
--======5: Clean-up
Delete From SOURCE
Where Ascii(Sex) > 1
Update SOURCE
Set Sex = CASE ASCII(Sex) When 0 Then 'M' Else 'F' End
Breaking It Down
So, let's step through this routine and see how it works:
Step 1: Move Data Rows to Temporary Keyspace
The first thing that we need to do is deal with the existing M's and F's in the [Sex] column. To put it simply, a CHAR(1) column is a single byte encoded as ASCII, the physical storage scheme used to represent English characters as binary values. Thus CHAR(1) is the same size as TINYINT and stores the same range of binary values; the values are simply interpreted and treated differently.
For instance, the M's in the [Sex] column are stored as binary "01001101" if upper-case and "01101101" if lower-case. Likewise, the F's are stored as "01000110" in upper-case and "01100110" in lower (if you look closely, you can tell which bit makes a letter upper or lower case). These binary values are also the same as the TINYINT values 77 and 109 (the M's) and 70 and 102 (the F's), respectively.
This equivalence is what we can exploit for our compressed encoding. When necessary, we will use the Ascii() function to change the [Sex] column value into a TINYINT, use integer operations to manipulate the bits and then use the Char() function to change it back into a CHAR(1). Back in the day, this was called Mapping or Casting (By the way, the programmers who write compilers and OS's still use these techniques extensively today).
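A quick illustration of this round trip (a throwaway query of my own, not part of the solution):

Select ASCII('M') as AsInteger --77: the binary value behind 'M'
, CHAR(77) as AsCharacter --'M': the same byte cast back to CHAR(1)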
In order to make it easy for us to use 7 of the 8 bits in the [Sex] column for our counts, we will need to re-map the M's and F's to either the highest bit or the lowest bit. I think that the lowest bit is easier, so that's what I do:
Update SOURCE Set Sex = CHAR(CASE Sex When 'M' Then 0 Else 1 End)
When the [Sex] is an "M" (binary "01001101" or "01101101"), we change it to a TINYINT 0 (which is binary "00000000") and then use the CHAR function to cast it back to a CHAR(1) datatype. Similarly, we change the F's (binary "01000110" and "01100110") to 1's (binary "00000001"). Now, the [Sex] column for all of our records has either "00000000" or "00000001".
I now can use the highest 7 bits for whatever I want. So long as I preserve the value of the lowest bit, I can always restore the original M's and F's. From here on, when I show these values, I will use a "|" character to separate the two fields in the binary bits: (0000000|x).
Step 2: Make Counted Intermediate Rows in the Upper Keyspace
The first step in the original Group By/Temp table method was to create counts of the duplicate rows and save this in the intermediate TEMP table. In this step of my procedure, I will do the same thing, except instead of a temporary table, I will use the temporary keyspace above (0000000|x), by encoding the counts using the upper 7 bits and preserving the lowest bit values to still represent the coded M's and F's:
Insert into SOURCE(Name, Age, Sex)
Select Name
, Age
, CHAR( (2*Count(*)) + Ascii(Sex) )
From SOURCE
Group By Name, Age, Sex
The expression CHAR( (2*Count(*)) + Ascii(Sex) ) for the third column is the key to this. The counts, which can range anywhere from 1 up to 127, are doubled, which makes them range from 2 to 254, but only the even numbers. Because the lowest-bit values add either a 0 or a 1 to the total, they either leave it even or bump it to the next higher odd number, so the encoded values range from 2 to 255.
In binary, the [Sex] column of these new rows ranges from "0000001|0" (for an M row with a duplicate count of 1) to "1111111|1" (for an F row with a duplicate count of 127). The original data rows are still there, and recall that they have binary values of either "0000000|0" for M or "0000000|1" for F, so they are effectively the records whose duplicate count = 0.
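To see the arithmetic in action, here is a throwaway example of my own: three duplicate F rows (sex bit = 1) encode as CHAR(2*3 + 1) = CHAR(7), binary "0000011|1", and decode back just as easily:

Select CHAR(2*3 + 1) as EncodedSex --CHAR(7): a count of 3 in the upper bits, F in the lowest
, ASCII(CHAR(7))/2 as GroupCount --3: integer division drops the sex bit
, ASCII(CHAR(7)) & 1 as SexBit --1: the preserved F bit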
Step 3: Removing The Old Rows
Just as in the original Group By/Temp table procedure, we need to delete the original rows so that we can copy the re-counted new ones into their place. Since their [Sex] values can only be "0000000|0" or "0000000|1", all we have to do is delete the rows whose [Sex] values are 1 or less:
Delete from SOURCE
Where Ascii(Sex) <= 1
Step 4: Generating The New, Recounted Rows
This is the same as the last step of the original procedure. We are still using a Tally table to generate the new rows, and we are still starting at Number=2 to skip the first rows, effectively deleting them. We are just coding the values differently to keep within our keyspace boundaries:
Insert into SOURCE(Name, Age, Sex)
Select Name
, Age
, CHAR(ASCII(Sex)& 1) --extract and preserve the [Sex] field
From TAB
Join Tally ON Tally.Number <= Ascii(Sex)/2
Where Tally.Number Between 2 and 127
I use the expression Ascii(Sex)/2 to extract the Count field. This reverses the "Count * 2" that we did originally to map it into the [Sex] column.
And I use the expression "ASCII(Sex) & 1" to extract the original Sex field from the encoded [Sex] column. The "&" is a bitwise AND operator, so "& 1" acts as a bit-mask, masking out all of the bits except the lowest one (i.e., all of the upper bits are zeroed and the lowest bit is preserved). For example, an encoded value of CHAR(7) (three F's) gives 7 & 1 = 1, an F.
Step 5: Clean-up
Now we are almost done, but we still have a little cleanup to do. First we have to remove all of the counted Group By records that are still there in the upper keyspace:
Delete From SOURCE
Where Ascii(Sex) > 1
This is the complement to STEP 3 where we deleted all of the records in the lower keyspace. As you might expect, all we have to do is just reverse the logical test.
Finally, I need to undo my clever encoding scheme and change my 0's and 1's back to M's and F's:
Update SOURCE
Set Sex = CASE ASCII(Sex) When 0 Then 'M' Else 'F' End
Just as that delete reversed the logical test from step 3, this final update reverses step 1, though it does so by executing step 1's functional converse rather than by simply flipping a comparison.
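As a final sanity check (a verification query of my own), running the whole routine against the OP's sample data should leave exactly one ABC 24 M row and two LMN 27 M rows:

Select Name, Age, Sex, Count(*) as Remaining
From SOURCE
Group By Name, Age, Sex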
Conclusion
These are not new techniques by a long shot, and you are unlikely to need them every day. But I believe that it is good to know how such methods are done, and who knows, someday you might find yourself tightly squeezed for data space, and one of these techniques might just be your ticket out.
R. Barry Young is a Principal Consultant for Proactive Performance Solutions, Inc., a Microsoft Gold Certified Partner, located in northern Delaware. He has been programming for over 35 years, a computer professional for 30 years, a professional consultant for 25 years, a Systems Performance Analyst for 20 years and a Database Consultant for the last 15 years. He loves living in the Future and doesn't miss "the Good Old Days" at all.
(many thanks to Jeff Moden for his help and encouragement in preparing this article.)