October 8, 2010 at 4:22 pm
craig-404139 (10/8/2010)
I am getting weird results from the #contiguous table. It looks like:PPCI412903540
PPCI7102903235
PPCI1932903562
PPCI2122853875
PPCI1022903952
PPCI131402902866
PPCI8592903560
PPCI1452903559
PPCI5492903970
PPCI11452900794
PPCI632903558
PPCI12542903556
PPCI33962903208
PPCI902903539
That is seriously wierd. You are running the code end to end, right? Only change the name of the table near the top. The reason I ask is because it's obviously avoiding the order by clause on #contiguous already.
From what I've deduced change (near the top)
(SELECT * FROM gaps UNION ALL ...
to
(SELECT id, controlnumber, endno FROM document UNION ALL ...
The clustered index utterly controls the quirky update. This all has to happen in the #Midpoint table.
My results, btw:
Pattern GapStart GapEnd
-------- -------------------- --------------------
PPCI 737005 737005
PPCI 737009 737009
PPCI 999798 999798
PPCI 1346238 1346238
PPCI 1350958 1350958
PPCI 1462905 1462905
PPCI 1620877 1620877
PPCI 1620882 1620882
PPCI 1620891 1620891
PPCI 1620899 1620899
PPCI 1620909 1620909
PPCI 1620957 1620957
PPCI 1620962 1620962
(13 row(s) affected)
and
Pattern Series BeginSeries EndSeries
---------------------------------------------------------------------------------------------------- ----------- -------------------- --------------------
PPCI 1 0 737004
PPCI 2 737006 737008
PPCI 3 737010 999797
PPCI 4 999799 1346237
PPCI 5 1346239 1350957
PPCI 6 1350959 1462904
PPCI 7 1462906 1620876
PPCI 8 1620878 1620881
PPCI 9 1620883 1620890
PPCI 10 1620892 1620898
PPCI 11 1620900 1620908
PPCI 12 1620910 1620956
PPCI 13 1620958 1620961
PPCI 14 1620963 2903970
(14 row(s) affected)
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 8, 2010 at 4:26 pm
For the prepended PPCI parts, change the select into for #Contiguous to this:
SELECT
--Pattern,
--Series,
--MIN( StartNum) AS BeginSeries,
--Max( EndNum) AS EndSeries
Pattern + CAST( MIN( StartNum) AS VARCHAR(100)) AS BeginSeries,
Pattern + CAST( MAX( EndNum) AS VARCHAR(100)) AS EndSeries
INTO
#Contiguous
FROM
#MidPoint
GROUP BY
Pattern,
Series
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 8, 2010 at 4:39 pm
Same weird results. Ideally it would return one window with the ranges of sequential numbers:
PPCI000000001 - PPCI000737004
PPCI000737006 - PPCI000737008
PPCI000737010 - PPCI000999797
PPCI000999799 - PPCI001346237
PPCI001346239 - PPCI001350957
PPCI001350959 - PPCI001462904
PPCI001462906 - PPCI001620876
PPCI001620878 - PPCI001620881
PPCI001620883 - PPCI001620890
PPCI001620892 - PPCI001620898
PPCI001620900 - PPCI001620908
PPCI001620910 - PPCI001620956
PPCI001620958 - PPCI001620961
PPCI001620963 - PPCI002903970
And one window with the gap range(s). Not sure why the the query to identify the gaps is stripping off the prefix and leading zeros. Ideal output would be:
PPCI00737005 - PPCI000737005
PPCI000737009 - PPCI000737009
PPCI000 999798 - PPCI000 999798
PPCI0001346238 - PPCI0001346238
PPCI0001350958 - PPCI0001350958
etc...
Technically this works for the sequential ranges and only take 7 seconds on my 261,000+ row table.
select d.controlnumber as start
,substring(d.endno,1,4) + (select substring(min(a.endno),5,9) as id
from document as a
left outer join document as b
on substring(a.endno,5,9) = substring(b.controlnumber,5,9) - 1
where b.controlnumber is null
and substring(a.endno,5,9) >= substring(d.endno,5,9)
) as ending
from document as d
left outer join document as d1
on substring(d1.endno,5,9) = substring(d.controlnumber,5,9) - 1
where d1.endno is null
order by d.controlnumber
October 8, 2010 at 4:53 pm
craig-404139 (10/8/2010)
And one window with the gap range(s). Not sure why the the query to identify the gaps is stripping off the prefix and leading zeros. Ideal output would be:
Well, the reason it's stripping off the leading 0s is because we're converting everything to numeric so that it's quicker for comparisons. To get the specific results you want, it needs to have the leading zeros put back in, using something along the lines of:
Pattern + REPLICATE( '0', 9-len(StartNum)) + CAST( StartNum AS VARCHAR(100)),
Pattern + REPLICATE( '0', 9-len(EndNum)) + CAST( EndNum AS VARCHAR(100))
That said, if the code you have is working for you and you understand it, I'd go with that. It won't deal with multiple beginning patterns (IE: PPCI and PPCIXA and BCDA), and the PPCI vs. PPCIXA will especially end up problematic, but in this instance it may be your best choice for future maintenance.
My guess is if you reviewed the other code above, you'd see where the differences lie. I'll be happy to explain any specific piece that's not making a lot of sense if there's an issue there.
However, in the end, always go with the code that you understand and gives you the correct answers for the task at hand.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 8, 2010 at 5:01 pm
Craig Farrell (10/8/2010)
craig-404139 (10/8/2010)
And one window with the gap range(s). Not sure why the the query to identify the gaps is stripping off the prefix and leading zeros. Ideal output would be:
Well, the reason it's stripping off the leading 0s is because we're converting everything to numeric so that it's quicker for comparisons. To get the specific results you want, it needs to have the leading zeros put back in, using something along the lines of:
Pattern + REPLICATE( '0', 9-len(StartNum)) + CAST( StartNum AS VARCHAR(100)),
Pattern + REPLICATE( '0', 9-len(EndNum)) + CAST( EndNum AS VARCHAR(100))
That said, if the code you have is working for you and you understand it, I'd go with that. It won't deal with multiple beginning patterns (IE: PPCI and PPCIXA and BCDA), and the PPCI vs. PPCIXA will especially end up problematic, but in this instance it may be your best choice for future maintenance.
My guess is if you reviewed the other code above, you'd see where the differences lie. I'll be happy to explain any specific piece that's not making a lot of sense if there's an issue there.
However, in the end, always go with the code that you understand and gives you the correct answers for the task at hand.
Thanks for all the time you put into this. I'll continue working on it on Monday.
October 8, 2010 at 8:29 pm
Okay Craig and Craig-4 - I've been off all day, just getting back to look at this. It seems like things are going good. Is there anything else that is needed here that I can help with?
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
October 9, 2010 at 9:36 am
I've been waiting because it seems that the requirements for this problem have been modified several times thoughout this thread. Soooooo.... let's make sure the requirements are actually very clear here...
[font="Arial Black"]Data Description:[/font]
1. You have a table that looks something like this...
CREATE TABLE dbo.YourTable
(
ID INT,
ControlNumber VARCHAR(255),
EndNo VARCHAR(255)
)
2. The "ControlNumber" and "EndNo" data both look like the following...
PREFIXnnnnnnnnn
... where PREFIX is some alpha-only prefix and "nnnnnnnnn" is a zero padded sequence number.
3. The "EndNo" will always be greater than or equal to the "ControlNumber.
4. The values of "ControlNumber" and "EndNo" will form a range that will NEVER overlap the range formed on any other row.
The final sample of data is actually located in the zip file at the following post in this thread with the understanding that the prefix in the data could change...
http://www.sqlservercentral.com/Forums/Topic999745-392-1.aspx#BM1001543
[font="Arial Black"]Required Output:[/font]
1. Aggregated ranges of "islands" of ranges. For example, no matter how they appear according to the data specs above, if docs 40-53, 57-102, and 115-197 are present for a given prefix, the "island" output should look like this...
StartControlNumber EndControlNumber
PREFIX000000040 PREFIX000000053
PREFIX000000057 PREFIX000000102
PREFIX000000115 PREFIX000000197
2. Aggregated ranges of "gaps" of ranges. For example, no matter how they appear according to the data specs above, if docs 40-53, 57-102, and 115-197 are present for a given prefix, the "gaps" output should look like this...
StartControlNumber EndControlNumber
PREFIX000000054 PREFIX000000056
PREFIX000000103 PREFIX000000114
[font="Arial Black"]Is all that correct and are those the final requirements????[/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2010 at 9:59 am
[edit] Just an FYI... It would appear that the Zip file still contains the original 13 nulls.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 9, 2010 at 1:21 pm
{edit} My apologies... I had posted some code here and then think I found a bug later. I took the code down until I can determine if it's actually a bug and, if it is, some repaired code.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 10, 2010 at 9:01 am
As usual, Jeff has an excellent solution.
I did notice one, well, issue with it. It is assuming that the control numbers will have a prefix, followed by 9 zero-padded digits. If my understanding is correct, each prefix can have a different length for the numbers following the prefix.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
October 10, 2010 at 9:21 am
WayneS (10/10/2010)
As usual, Jeff has an excellent solution.I did notice one, well, issue with it. It is assuming that the control numbers will have a prefix, followed by 9 zero-padded digits. If my understanding is correct, each prefix can have a different length for the numbers following the prefix.
I got lazy. I should have had it calculate the correct length for each prefix.
That notwithstanding, I believe I did find a bug with the code so I took the post down until I can figure it out.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 10, 2010 at 10:22 am
Heh... I'm paranoid but I'm not sure that I'm paranoid enough. 😛 It turns out that I changed the test data and had done some formatting for the output of the previous post and stuff was simply being cutoff in the result set. The code I posted was fine. That notwithstanding, I was also creating a test data generator for this problem and thought I'd share the wealth instead of just reposting what I had deleted. With the kind of data this one makes, it may be worthwhile to revisit the "Quirky Update" method.
As a sidebar, this test will, for sure, blow most Tally Table solutions right out of the water because it creates a small "Island" up around the 100 million row range.
Here's the test data generator. It takes only scant seconds to create the ~million rows of test data...
--=====================================================================================================================
-- Build a million row test table and populate it with data.
-- This is NOT a part of the solution.
-- This takes less than 10 seconds to run.
--=====================================================================================================================
--===== Conditionally drop the test table to make reruns easier.
IF OBJECT_ID('TempDB..#JBMTest','U') IS NOT NULL
DROP TABLE #JBMTest
;
GO
--===== This builds a million rows of data where the ControlNumber and EndNo are all the same.
WITH
cteTally AS
(
SELECT TOP (1000000)
N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
)
SELECT RowNum = t.N,
ControlNumber = 'PREFIX'+ RIGHT('0000000000'+CAST(t.N AS VARCHAR(9)),9),
EndNo = 'PREFIX'+ RIGHT('0000000000'+CAST(t.N AS VARCHAR(9)),9)
INTO #JBMTest
FROM cteTally t
;
--===== Create some GAPS
DELETE FROM #JBMTest
WHERE RowNum BETWEEN 1 AND 100 --First 100 numbers
OR RowNum % 100000 = 0 --Numbers that are evenly divisiable by 100,000
OR RowNum BETWEEN 500000 AND 599999 --Big Gap
;
--===== Create a row with a larger span by first removing the singleton rows
-- and then adding in a single row replacement.
DELETE FROM #JBMTest
WHERE RowNum BETWEEN 50000 AND 60000
INSERT INTO #JBMTest
(ControlNumber,EndNo)
SELECT ControlNumber = 'PREFIX'+ RIGHT('0000000000'+CAST(50000 AS VARCHAR(9)),9),
EndNo = 'PREFIX'+ RIGHT('0000000000'+CAST(60000 AS VARCHAR(9)),9)
;
--===== Change some rows to "bad" rows. This will also create some larger gaps
UPDATE #JBMTest
SET EndNo = 'NULL'
WHERE RowNum BETWEEN 800000 AND 800010
;
UPDATE #JBMTest
SET EndNo = Null
WHERE RowNum BETWEEN 800020 AND 800030
;
UPDATE #JBMTest
SET ControlNumber = 'NULL'
WHERE RowNum BETWEEN 800040 AND 800050
;
UPDATE #JBMTest
SET ControlNumber = Null
WHERE RowNum BETWEEN 800060 AND 800070
;
--===== This is a "Tally Table Buster". It's a small island way up at the 100,000,000 million row range
INSERT INTO #JBMTest
(ControlNumber,EndNo)
SELECT ControlNumber = 'PREFIX'+ RIGHT('0000000000'+CAST(100000000 AS VARCHAR(9)),9),
EndNo = 'PREFIX'+ RIGHT('0000000000'+CAST(100000005 AS VARCHAR(9)),9)
;
And here's the solution, again. Nothing has changed since the last time I posted it. The first part isolates the "Start/End" rows of each "island" and does so without needing either a "Quirky Update" or a self-join of any type. The second part creates the desired output (I unioned it, this time). Note that the second part is where we'd probably take the time to figure out how many zero-places to use in the zero-padded field within the document numbers but I'm going to let someone else do that. Maybe I can get to it later today but I've got a pretty full dance card today.
Again... operational details are always in comments in the code.
--=====================================================================================================================
-- Find only the items that are a real "start" or "end" number. Start and End numbers ALWAYS occur in pairs
-- just due to the nature of the data.
-- On my 8 year old, single 1.8GHz CPU desktop box, this takes about a minute and 6 seconds to run.
-- I might have to try the "Quirky Update" after all.
--=====================================================================================================================
--===== Do the testing in a nice, safe spot that everyone has.
USE TempDB
;
--===== Conditionally drop the test table(s) to make reruns easier.
IF OBJECT_ID('TempDB..#StartEnd','U') IS NOT NULL
DROP TABLE #StartEnd
;
WITH
cteSeparation AS
( --=== Separates the Prefix from the numeric values and adds 1 to the EndNo for calculations to come
-- by causing an overlap with the next start number if there is one. We'll remove the extra one
-- when we're all done with the calculations (which are actually VERY simple).
-- PATINDEX is used to find the first digit after the non-numeric prefix and THAT is super important...
-- For reasons of performance, it's assumed that ALL PREFIXES ARE ALPHA ONLY!!!
-- This also filters out some of the known bad data.
SELECT Prefix = LEFT(ControlNumber,PATINDEX('%[0-9]%',ControlNumber)-1),
ControlNumber = CAST(SUBSTRING(ControlNumber,PATINDEX('%[0-9]%',ControlNumber),8000) AS INT),
EndNoPlus1 = CAST(SUBSTRING(EndNo ,PATINDEX('%[0-9]%',ControlNumber),8000) AS INT)+1
FROM #JBMTest
WHERE ControlNumber <= EndNo --ControlNumber is in correct order with EndNo
AND ControlNumber > '' --ControlNumber is not null or blank
AND EndNo > '' --EndNo is not null or blank
AND ControlNumber <> 'NULL' --ControlNumber is not the word "NULL"
AND EndNo <> 'NULL' --EndNo is not the word "NULL"
)
,
cteFindStartEnd AS
( --=== This unpivot columnizes (Single Column) the Start and End numbers and finds only those
-- rows that occur once. Those will be either non overlapping Start or End times.
-- Even numbered rows are a Start, Odd Numbered rows are an End if we start count at "0".
-- THE REAL KEY TO ALL OF THIS IS TO UNDERSTAND THAT EVEN THOUGH IT'S NOT LISTED
-- IN THE UNPIVOT, THE PREFIX COLUMN IS AVAILABLE IN THE UNPIVOT SUBQUERY.
SELECT unpvt.Prefix,
unpvt.RealStartEndNo,
EvenOdd = ROW_NUMBER() OVER (PARTITION BY unpvt.Prefix ORDER BY unpvt.RealStartEndNo)-1 --Start Count At zero
FROM cteSeparation sepdat
UNPIVOT (RealStartEndNo
FOR theCol IN (sepdat.ControlNumber, sepdat.EndNoPlus1)
) unpvt
GROUP BY unpvt.Prefix, unpvt.RealStartEndNo
HAVING COUNT(*) = 1 --Is a start or end value
)
--===== This just removes the 1 from "EndNo" that we added in before. This leaves the data
-- as a "single column" of data so we can easily do different things with it.
SELECT Prefix,
RealStartEndNo = CASE -- If this is an "EndNo", remove the 1 we added before
WHEN EvenOdd%2 = 1
THEN RealStartEndNo-1
ELSE RealStartEndNo
END,
EvenOdd
INTO #StartEnd
FROM cteFindStartEnd
;
--=====================================================================================================================
-- Now that we have the data in the form we want it, lets use it to easily find islands and gaps.
-- Note how the EvenOdd and OddEven columns come into play to make life so easy.
-- The reason why we put the data into a temp table is because we need to revisit the "aggregated" data more than
-- once to find the "islands" and the "gaps". If we do that as a CTE, the CTE is executed twice as if it were
-- a view and the code takes twice as long to run.
-- This stuff is nearly instantaneous because of what the previous seciton did.
--=====================================================================================================================
--===== Show all the islands
SELECT ControlNumber = Prefix + RIGHT('000000000'+CAST(MIN(RealStartEndNo) AS VARCHAR(9)),9),
EndNo = Prefix + RIGHT('000000000'+CAST(MAX(RealStartEndNo) AS VARCHAR(9)),9),
Status = 'Island'
FROM #StartEnd
GROUP BY Prefix, EvenOdd/2
UNION ALL
--===== Show all the gaps that are between the islands
SELECT ControlNumber = Prefix + RIGHT('000000000'+CAST(MIN(RealStartEndNo)+1 AS VARCHAR(9)),9),
EndNo = Prefix + RIGHT('000000000'+CAST(MAX(RealStartEndNo)-1 AS VARCHAR(9)),9),
Status = 'Gap'
FROM #StartEnd
WHERE EvenOdd > 0 AND EvenOdd NOT IN (SELECT MAX(EvenOdd) FROM #StartEnd)
GROUP BY Prefix, (EvenOdd+1)/2
UNION ALL
--===== Show the bad stuff that we ignored
SELECT ControlNumber = ControlNumber,
EndNo = EndNo,
Status = 'Bad'
FROM #JBMTest
WHERE ControlNumber > EndNo
OR ISNULL(ControlNumber,'') = ''
OR ISNULL(EndNo,'') = ''
OR ControlNumber = 'NULL'
OR EndNo = 'NULL'
ORDER BY ControlNumber
;
--Jeff Moden
Change is inevitable... Change for the better is not.
October 13, 2010 at 8:51 am
I really appreciate the help everyone. I was able to figure out the code I need. Couldn't have done it without you guys.
FYI this is my final working code using what I understood from previous posts.
On a 800,000+ row table this takes 3 minutes to run which is acceptable for us. To be fair though the table has a ton of sequence gaps and text prefixes.
--Drop temp table if it exists
IF OBJECT_ID('TempDB..#midpoint','U') IS NOT NULL
DROP TABLE #midpoint
--Create temp table to store text prefix and numerial values of both the control number and endno
CREATE TABLE #MidPoint
(Pattern VARCHAR(100)
,controlnumber varchar(100)
,endno varchar(100)
primary key (pattern,controlnumber))
create index idx_control
on #midpoint (endno)
--Inserting numerical values without text prefix into temp table
INSERT INTO #Midpoint
select left(controlnumber, PatIndex('%[0-9]%', controlnumber)-1) AS Pattern --Strip out text prefix
,SubString(controlnumber, PatIndex('%[0-9]%', controlnumber), 20) AS controlnumber
,ISNULL(SubString(endno, PatIndex('%[0-9]%', endno), 20) --First part of isnull check
,SubString(controlnumber, PatIndex('%[0-9]%', controlnumber), 20)) AS endno --Second part of isnull check
from gaps as d
--Identify ranges of sequential numbers
select d.pattern + d.controlnumber as 'Start of Sequence'
,d.pattern + (select min(a.endno) as id
from #midpoint as a
left outer join #midpoint as b
on a.endno = b.controlnumber - 1
and a.Pattern = b.Pattern
where b.controlnumber is null
and a.endno >= d.endno
and a.Pattern = (select distinct Pattern
from #MidPoint as m2
where d.Pattern= m2.pattern)) as 'End of Sequence'
from #midpoint as d
left outer join #midpoint as d1
on d.controlnumber - 1 = d1.endno
and d.pattern = d1.Pattern
where d1.endno is null
order by d.pattern
,d.controlnumber asc
October 13, 2010 at 8:03 pm
Very cool, Joe. I'm thinking that you didn't test it...
[font="Courier New"]Msg 207, Level 16, State 1, Line 18
Invalid column name 'part_id'.
Msg 245, Level 16, State 1, Line 2
Conversion failed when converting the varchar value 'NULL' to data type int. [/font]
... but it runs quite fast. The other problem is that your code misses the 50,000 to 60,000 single entry because you only check the first column of the data, but I think there might be a workaround for that.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 13, 2010 at 10:26 pm
Craig Farrell (10/8/2010)
--Quirky Update here. It SHOULD be ran with OPTION( MAXDOP 1). Only needs it for this query (...)
UPDATE p
SET
@Series = Series = CASE WHEN p.StartNum <> p2.EndNum + 1 THEN @Series + 1 ELSE @Series END
FROM
#Midpoint AS p
LEFT JOIN-- Link to the prior row
#Midpoint AS p2
ONp.PrevRowID = p2.RowID
-- ORDER BY p.RowID
-- The clustered index deals with this for us.
OPTION( MAXDOP 1)
Craig,
You can't use joins in a Quirky Update.
Paul
Viewing 15 posts - 61 through 75 (of 91 total)
You must be logged in to reply to this topic. Login to reply