February 14, 2020 at 12:00 am
Comments posted to this topic are about the item A quick search without an index
February 14, 2020 at 1:36 pm
Thanks for sharing! A successful developer is one who is skilled in the requisite tools AND is confident in his/her creative ability. A developer is a creator.
February 14, 2020 at 1:51 pm
5 Stars for realizing that you were stuck in a box and then for thinking outside the box. I wonder how many people would have actually thought of using a Binary Search like this? Nice job!
--Jeff Moden
Change is inevitable... Change for the better is not.
February 14, 2020 at 3:33 pm
Brilliant - thanks for sharing! That's a great example of creatively using an existing index rather than adding another index to the table. I'll definitely add the binary search technique to my SQL toolkit.
February 14, 2020 at 11:26 pm
Alex, just a couple things from an old-timer for you to consider.
First, if you are serious about that number of rows, you need to fix that. What are you doing, documenting the stars in the Milky Way? In my active years as a DBA I was always advocating for archiving data. You need to consider how old the history is and how often you need to access it, versus the time and cost to store it online, back it up, etc.
At least be sure it is NOT in a database with transactional data, preferably on a different server. Better, move anything more than the last year or so to something like a dis-mountable NAS drive and put it 'in the basement'. Break the data up into unique tables such as by year, get it organized and indexed and ready to access. Create summary data which might even suffice for lots of demands. On the rare occasion you need to access it, you can plug it in and query it at will with simple SQL code, instead of spending the time doing complex stuff just to look at it. In the old days of removable 'disk packs', we had archive disks that we could drop in and fire up as needed. Today I use a NAS device with internal drives that idle down when not active, and I'm thinking about getting a second one. I can archive my data to that, and it's great for backups too so they're not taking space on my active server.
It's very easy for companies to get carried away with demands for accessible data, and it's up to us to figure out how it gets done. Those of us who have been around for decades had to learn all the tricks because budgets were lots smaller then, both for hardware and personnel. Even back when we had to use 1/2 inch magnetic tape, we did weekly tape-to-tape merges of detail into archived data and only kept summary data online.
We even traded storage with other companies close by where we kept copies of our data archives for security and protection from disasters. Was that the original version of 'the cloud'? Before that, several of us would actually carry home backups and archives for off-site storage.
I have applications that use historical data that is as old as 35 years, with a small quantity of data going all the way back to 1944 but I only keep the most recent 5 or 6 years on my SQL Server, and the rest is on a RAID NAS device that I can access if and when I need to.
So, instead of indexing, consider summarizing and archiving.
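A bare-bones sketch of that idea against the article's MyTrans table; the archive and summary table names, and the Amount column, are made up for illustration:
-- Move one year of detail into its own archive table.
SELECT *
INTO MyTrans_2015
FROM MyTrans
WHERE [Date] >= '20150101' AND [Date] < '20160101';
-- Keep a compact monthly summary online in place of the detail.
SELECT DATEFROMPARTS(YEAR([Date]), MONTH([Date]), 1) AS YearMonth,
       COUNT(*) AS TranCount,
       SUM(Amount) AS TotalAmount -- Amount is an assumed column
INTO MyTransMonthlySummary
FROM MyTrans
WHERE [Date] >= '20150101' AND [Date] < '20160101'
GROUP BY DATEFROMPARTS(YEAR([Date]), MONTH([Date]), 1);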
Rick
Disaster Recovery = Backup ( Backup ( Your Backup ) )
February 15, 2020 at 12:01 am
BTW... You already know this but you didn't write it quite right in the article. To determine the max number of iterations in a binary search, you have to do a bit more than just taking the LOG of the number of values. The full formula is CEILING(LOG(N)/LOG(2)) where "N" is the number of rows in a gap-less situation or MAX(ID)-MIN(ID)+1 for something like an IDENTITY column.
Additional kudos for explaining that the IDENTITY column is probably not gap-less and what you did to compensate.
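A minimal illustration of that formula in T-SQL, reusing the article's MyTrans/ID names:
-- Max seeks for a binary search over a possibly gappy IDENTITY range.
DECLARE @N BIGINT = (SELECT MAX(ID) - MIN(ID) + 1 FROM MyTrans);
SELECT MaxSeeks = CEILING(LOG(@N) / LOG(2));
-- e.g. @N = 500,000,000 gives at most 29 seeks.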
--Jeff Moden
Change is inevitable... Change for the better is not.
February 15, 2020 at 1:51 am
Just a bit of caution for people reading this... if your DATETIME column only contains whole dates (no times) and you tell it to search for a date with no time or a 00:00:00.000 time, it'll find the correct ID. If, however, you give it any time larger than 12:00:00.000, it'll find the next day. I've not tested what happens if you tell it to find a whole date that's not there, but it seems like it will find whichever row contains the value closest to the target value. That means it might give you some extra rows you hadn't anticipated, and that's my point... once you've used a script like this, you need to double check the data you retrieve.
None of this is a fault of this code. It's just one of the devils in the data that you have to watch for.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 15, 2020 at 3:23 am
I believe the problem with times larger than 12:00:00.000 picking the next day is caused by the last part of the script, which selects the closest neighbor.
February 15, 2020 at 4:17 am
Yep... I got that. It's not a problem if you know about it. I just wanted to be sure that people knew about it. Perhaps stripping the time would eliminate that problem if one were looking for whole date ranges, which I believe would be the case for the original task presented in the article.
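A quick sketch of that time-stripping idea; the variable name is just for illustration:
-- Normalize the target to midnight so times after noon can't land on the next day.
DECLARE @Target DATETIME = '2015-02-10 13:45:00';
SET @Target = CAST(CAST(@Target AS DATE) AS DATETIME); -- drops the time portion
SELECT StrippedTarget = @Target; -- 2015-02-10 00:00:00.000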
--Jeff Moden
Change is inevitable... Change for the better is not.
February 17, 2020 at 1:39 pm
I remember doing this many years ago, but it didn't get coded up at the time. The same in principle but not quite the same code. Just for S&G I coded up what I think it would have looked like:
DECLARE
    @DateToFind DATE = '20150210',
    @DateFound DATE = '21000101', -- seeded high so the first probe moves downward
    @ID BIGINT = NULL,
    @MAXID BIGINT = (SELECT MAX(ID) FROM MyTrans) / 67136, -- ID step per factor unit
    @Factor INT = 67136, -- 16 seeks max
    @Bitty INT = 67136 / 2,
    @SeekCount SMALLINT = 0
WHILE @Bitty >= 1
BEGIN
    -- Move the probe point up or down by the current half-interval.
    SET @Factor = CASE WHEN @DateFound < @DateToFind THEN @Factor + @Bitty ELSE @Factor - @Bitty END
    SET @ID = @Factor * @MAXID
    -- Seek to the first row at or beyond the probe point (tolerates ID gaps).
    SELECT TOP(1) @ID = ID, @DateFound = [Date] FROM MyTrans WHERE ID >= @ID ORDER BY ID
    IF @DateToFind = @DateFound BREAK
    SELECT @Bitty = @Bitty / 2, @SeekCount = @SeekCount + 1;
END
SELECT DateToFind = @DateToFind, DateFound = @DateFound, SeekCount = @SeekCount
SELECT * FROM MyTrans WHERE ID = @ID
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
March 10, 2020 at 1:44 pm
Thank you VERY MUCH for this article! I have already used the algorithm to greatly improve the performance of a poorly written query (from another developer).
Interestingly, this developer needed to be able to apply the query from within Tableau, which apparently does not allow the use of DECLARE statements from its interface. Given that we (the owners of the target database) very rarely write custom views for external access (it's an access control system), I decided to rewrite your logic as a series of CTEs, with a recursive CTE serving as the binary search engine in place of the iterative loop (also not allowed in Tableau). I know recursion can get expensive and slow over time; however, given that most cases were found in fewer than 30 calls, I wasn't too worried about it.
Here is the result:
-- Sample from Tbl
;WITH CurrentDate AS (
SELECT CONVERT(DATE,GETDATE()) AS Today
), StartOfMonth AS (
SELECT CONVERT(DATETIME,DATEADD(dd,-(DAY(cd.Today)-1),cd.Today)) AS [Value]
FROM CurrentDate cd
), TargetDate AS (
SELECT DATEADD(mm, -13, som.[Value]) AS [Value]
FROM StartOfMonth som
), initSplit AS (
SELECT MIN(tbl.TblIDColumn) AS low, MAX(tbl.TblIDColumn) AS high,
(SELECT TOP 1 [Value] FROM TargetDate td) AS tgtDate
FROM Tbl tbl
), firstLvl AS (
SELECT isp.low, split.mid, --tbl.CreateDate,
isp.high, split.CreateDate,
CASE WHEN split.CreateDate < isp.tgtDate
THEN (SELECT ntbl.TblIDColumn FROM Tbl ntbl WHERE ntbl.TblIDColumn >= split.mid + 1
ORDER BY ntbl.TblIDColumn OFFSET 0 ROWS FETCH FIRST 1 ROWS ONLY)
ELSE isp.low END AS newLow,
CASE WHEN split.CreateDate >= isp.tgtDate
THEN split.mid
ELSE isp.high END AS newHigh,
isp.tgtDate, lvl = 0
FROM initSplit isp
CROSS APPLY (SELECT TOP 1 midtbl.TblIDColumn AS mid, midtbl.CreateDate
FROM Tbl midtbl
WHERE isp.low + FLOOR( (isp.high - isp.low)/2 ) <= midtbl.TblIDColumn
ORDER BY midtbl.TblIDColumn) split -- first row at or past the midpoint
UNION ALL
SELECT fl1.newLow, split.mid,
fl1.newHigh AS high, split.CreateDate,
CASE WHEN split.CreateDate < fl1.tgtDate
THEN (SELECT ntbl.TblIDColumn
FROM Tbl ntbl
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY ftbl.TblIDColumn) AS [rank], ftbl.TblIDColumn
FROM Tbl ftbl
WHERE ftbl.TblIDColumn >= split.mid + 1) nextId ON ntbl.TblIDColumn = nextId.TblIDColumn
AND nextId.[rank] = 1
)
ELSE fl1.newLow END AS newLow,
CASE WHEN split.CreateDate >= fl1.tgtDate
THEN split.mid
ELSE fl1.newHigh END AS newHigh,
fl1.tgtDate, fl1.lvl + 1 AS lvl
FROM firstLvl fl1
CROSS APPLY (SELECT midtbl.TblIDColumn AS mid, midtbl.CreateDate
FROM Tbl midtbl
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY tmp.TblIDColumn) AS [rank], tmp.TblIDColumn
FROM Tbl tmp
WHERE tmp.TblIDColumn >= (fl1.newLow + FLOOR( (fl1.newHigh- fl1.newLow)/2 ))
) sub ON midtbl.TblIDColumn = sub.TblIDColumn
AND sub.[rank] = 1
) split
WHERE 1 = 1
AND (split.mid <> fl1.mid)
), tgtId AS (
SELECT fl.mid AS TblIDColumn
FROM firstLvl fl
WHERE fl.low = fl.high
)
SELECT *
FROM firstLvl fl
/*
SELECT *
FROM tgtId ti
*/
March 10, 2020 at 2:19 pm
There is a small error in the code above in the tgtId CTE. The fl.low will not always match fl.high. Changing the WHERE clause to the following should fix that issue:
WHERE fl.newLow = fl.mid
AND fl.newHigh = fl.mid
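With that correction applied, the full CTE would read:
tgtId AS (
    SELECT fl.mid AS TblIDColumn
    FROM firstLvl fl
    WHERE fl.newLow = fl.mid
    AND fl.newHigh = fl.mid
)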