October 6, 2010 at 12:10 pm
Please skip to http://www.sqlservercentral.com/Forums/FindPost1000643.aspx for updated requirements.
I have a request from management and am at a complete roadblock. I am a DBA and this is much more developer/programmer logic than I am used to.
Scenario:
Using the below numbers I need to find each unique sequence sets. The data is a unique column in my actual db stored as a varchar (255) field. (Can't change that)
Below is some sample data.
create table gaps
(id varchar(50))
insert into gaps
SELECT 'PPCI000000001' UNION ALL
SELECT 'PPCI000000002' UNION ALL
SELECT 'PPCI000000003' UNION ALL
SELECT 'PPCI000000005' UNION ALL
SELECT 'PPCI000000010' UNION ALL
SELECT 'PPCI000000011' UNION ALL
SELECT 'PPCI000000012' UNION ALL
SELECT 'PPCI000000013' UNION ALL
SELECT 'PPCI000000014' UNION ALL
SELECT 'PPCI000000015' UNION ALL
SELECT 'PPCI000000016' UNION ALL
SELECT 'PPCI000000017' UNION ALL
SELECT 'PPCI000000100' UNION ALL
SELECT 'PPCI000000101' UNION ALL
SELECT 'PPCI000000102' UNION ALL
SELECT 'PPCI000000103' UNION ALL
SELECT 'PPCI000000104' UNION ALL
SELECT 'PPCI000000105' UNION ALL
SELECT 'PPCI000000106' UNION ALL
SELECT 'PPCI000000167' UNION ALL
SELECT 'PPCI000000197'
What I need to do is identify the sequences. For example an ideal output using the sample data would be
PPCI000000001 - PPCI000000003
PPCI000000005
PPCI000000010 - PPCI000000017
PPCI000000100 - PPCI000000106
PPCI000000167
PPCI000000197
*It would also be nice to know the 'missing' numbers in the sequence. for example an ideal output would be
PPCI000000004
PPCI000000006 - PPCI000000009
PPCI000000018 - PPCI000000099
PPCI000000107 - PPCI0000000166
PPCI000000168 - PPCI0000000196
Ideally I would be able to do this with a single query.
***In this example I only included text with a fixed length character prefix (PPCI) our actual data may have a prefix of 1,2,3,4,5, or 6 characters. The column is a unique field so there will not be any duplicates.
Any help is much appreciated, I dont even know where to start as this is a bit more programming logic than I am used too.
October 6, 2010 at 1:23 pm
So far this works using the substring statement. But has alot of hard coding. Although this solves my current problem, it doesn't help with other future needs.
select 'PPCI' + substring(l.id,5,9) as start
,'PPCI' +
(
select substring(min(a.id),5,9) as id
from gaps as a
left outer join gaps as b
on substring(a.id,5,9) = substring(b.id,5,9) - 1
where b.id is null
and substring(a.id,5,9) >= substring(l.id,5,9)
) as [end]
from gaps as l
left outer join gaps as r
on substring(r.id,5,9) = substring(l.id,5,9) - 1
where r.id is null;
But it has 1 big problem.
1. The text can be a variable amount from 1-6+ characters. I have it hardcoded for this example, but it needs to be dynamic.
Any way to make those sections dynamic?
October 6, 2010 at 1:54 pm
Craig,
Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 6, 2010 at 1:59 pm
There are multiple articles at SSC that may be of your interest:
- Generating Missing Dates and Numbers By Jacob Sebastian http://www.sqlservercentral.com/articles/Datetime+Manipulation/61822/
- The "Numbers" or "Tally" Table: What it is and how it replaces a loop. By Jeff Moden http://www.sqlservercentral.com/articles/T-SQL/62867/
Maybe a computed column may help out , if this action is more that a single shot solution.
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
October 6, 2010 at 4:04 pm
Craig Farrell (10/6/2010)
Craig,Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.
Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.
October 6, 2010 at 4:15 pm
craig-404139 (10/6/2010)
Craig Farrell (10/6/2010)
Craig,Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.
Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.
260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂
One last confirmation before I build said harness: We're looking at records like:
ABC0001
ABCDEF00002
ABCD003
ABCDEF000006
AB7
...
ETC
Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?
Sorry, just trying to nail down the exact data situation.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 6, 2010 at 4:25 pm
Craig Farrell (10/6/2010)
craig-404139 (10/6/2010)
Craig Farrell (10/6/2010)
Craig,Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.
Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.
260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂
One last confirmation before I build said harness: We're looking at records like:
ABC0001
ABCDEF00002
ABCD003
ABCDEF000006
AB7
...
ETC
Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?
Sorry, just trying to nail down the exact data situation.
In theory yes. These are received legal documents and are bates stamped (identifier we are referring too) using a different prefix depending on the producing party. So your above example is quit possible. The total length of the numbers can/will vary as well. In this particular case they are all 13 total characters (4 prefix text + 9 numerals) and all have a PPCI prefix. But I am trying to write this script so I can run it against all cases. Some of which have a uniform prefix and uniform character count throughout, others which have a variable text prefix and variable numerical count. One thing though, all identifiers with the same prefix will have the same total character count.
Some tables may have 1 million+ records. I understand whatever solution we find will tax the system since it's checking each id, but it's a necessary process to identify ranges of produced documents for auditing purposes.
I appreciate your time with this. I feel I made some good progress with my above script, but am having trouble moving away from so much hard coding. FYI it's taking almost 30 minutes to run on my system.
October 6, 2010 at 4:29 pm
I've made some assumptions in my possible solution. If my assumptions are wrong, you may well be able to tweak the query to get it working for you.
I have assumed:
1) that the prefix never contains numeric characters [0-9].
2) that the numeric part of the id column always contains exactly 9 numeric characters [0-9] and can always be converted to an int.
3) that you want the results partitioned by the non-numeric prefix
If there is a clustered index on the id field this query should be quite performant.
;WITH cteParse(id, prefix, num, rn) AS (
SELECT
id,
SUBSTRING(id, 1, PATINDEX('%[0-9]%', id) - 1),
CAST(SUBSTRING(id, PATINDEX('%[0-9]%', id), 9) AS int),
ROW_NUMBER() OVER (ORDER BY id)
FROM gaps
)
SELECT
prefix + REPLACE(STR(MIN(num), 9), ' ', '0') AS [From],
CASE WHEN (COUNT(*) > 1) THEN
prefix + REPLACE(STR(MAX(num), 9), ' ', '0')
ELSE '' END AS [To]
FROM cteParse
GROUP BY prefix, num - rn
ORDER BY prefix, num - rn
October 6, 2010 at 4:51 pm
Edit: Sorry for the multi-edit. Something didn't play nicely with the code= statements when I tried to first post this.
The Test Harness:
SELECT TOP 1000000
IDENTITY(INT, 1,1) AS N
INTOTally
FROM
Master.dbo.SysColumns sc1,
Master.dbo.syscolumns sc2
ALTER TABLE Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED(N) with FILLFACTOR = 100
GRANT SELECT, REFERENCES on TALLY TO PUBLIC
select max(n) from Tally
CREATE TABLE #TestHarness
(TextIDVARCHAR(200) NOT NULL)
--TRUNCATE TABLE #TestHarness
INSERT INTO #TestHarness
SELECT
REPLICATE( 'A', FLOOR( (N/10)%5) + 1) -- VAriable length starter
+ REPLICATE( '0', N%7+1 ) -- Variable # of 0's
+ CAST( N AS VARCHAR(15))
FROM
Tally
WHERE
N%5 >1
AND N < 500000
The PatternFinder simplified: (I get 2 second results from this on a desktop server)
select --top 200
DISTINCT
LEFT( TextID, MAX( N) ) AS Pattern
FROM
#TestHarness AS th,
Tally AS T
where
N < LEN(TextID) AND
isnumeric( SUBSTRING( TextID, N, 1)) = 0
GROUP BY
TextID
The patternfinder used to strip to Patterns and Numerics (8 seconds with full row return):
select
TextID,
LEFT( TextID, MAX( N) ) AS Pattern,
CAST( RIGHT( TextID, LEN( TextID) - MAX( N)) AS BIGINT) AS NumPiece
FROM
#TestHarness AS th,
Tally AS T
where
N < LEN(TextID) AND
isnumeric( SUBSTRING( TextID, N, 1)) = 0
GROUP BY
TextID
Now, back to finding the gapping, which I'm not sure of directly. Hopefully anyone else working on this will find this useful.
Of note: The patternfinder backs up into the pattern, so it locates the last letter, and won't break on numerics in the pattern. You've also now got an integer to work from. This happens very quickly.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 6, 2010 at 4:58 pm
^^ Thanks. I'll start chewing through that and seeing if it fits. At first glance I have no idea what it's doing. Like I mentioned I've never had to deal with this detail of coding. Appreciate your time.
October 6, 2010 at 5:12 pm
craig-404139 (10/6/2010)
^^ Thanks. I'll start chewing through that and seeing if it fits. At first glance I have no idea what it's doing. Like I mentioned I've never had to deal with this detail of coding. Appreciate your time.
Heh, whoops, sorry, I'll explain a little more.
The test harness does nothing more then build a million row tally table and then create a bunch of records with variable length varchar patterns at the beginning and then variable #'s of 0s before a number.
The Patternfinder uses the same tally table to locate the largest (MAX(N)) non-numeric character position in the TextID, or your codes. This works because anything else after it is numeric.
So, now that we know the final position of the pattern, or varchar prefix, we take that off (Pattern column), and we convert the rest to numeric (NumPiece).
The thing that speeds this up is the tally table. It's an amazing Anti-RBAR component that sped up an old method I had. The Test Harness, btw, builds out a 300,000 row test case to work against.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 6, 2010 at 6:17 pm
And for my last trick...
This ran in about 3:15 on my system
-- First, separate the numbers and their prefixes into usable data.
select
TextID,
LEFT( TextID, MAX( N) ) AS Pattern,
CAST( RIGHT( TextID, LEN( TextID) - MAX( N)) AS BIGINT) AS NumPiece
INTO
#FindGaps
FROM
#TestHarness AS th,
Tally AS T
where
N < LEN(TextID) AND
isnumeric( SUBSTRING( TextID, N, 1)) = 0
GROUP BY
TextID
-- Give us a fighting chance for optimization...
CREATE CLUSTERED INDEX idx_FindGap ON #FindGaps ( Pattern, NumPiece)
-- Gaptable will store the list of Gaps and their start/end #s
CREATE TABLE #GapTable
(PatternVARCHAR(100),
GapStartBIGINT,
GapEndBIGINT)
CREATE CLUSTERED INDEX idx_GapTable ON #GapTable ( Pattern, GapStart)
-- This is an optimization table to take over work from a subquery
CREATE TABLE #EndGap
(Pattern VARCHAR(100),
GapEnd BIGINT)
CREATE CLUSTERED INDEX idx_EndGap ON #EndGap ( Pattern, GapEnd)
-- Why the cursor? Because I couldn't figure out a way to go against each
-- pattern without making the triangle join even worse...
DECLARE @Pattern VARCHAR(100)
DECLARE PLoop CURSOR
FOR SELECT DISTINCT Pattern
FROM #FindGaps
OPEN PLoop
FETCH NEXT FROM PLoop INTO @Pattern
WHILE @@FETCH_STATUS = 0
BEGIN
-- Find the beginning of each gap. It needs an entry to exist before it
-- considers a gap to have started, so if the first record is 5, it doesn't
-- care about 1-4.
INSERT INTO #GapTable
(Pattern, GapStart)
SELECT
@Pattern,
N AS GapStartNum
FROM
Tally AS t
LEFT JOIN
#FindGaps AS fg
ONt.N = fg.NumPiece
AND Pattern = @Pattern
WHERE
fg.NumPiece IS NULL
AND EXISTS( SELECT 1 FROM #FindGaps WHERE Pattern = @Pattern and NumPiece = N-1)
ORDER BY
t.N
-- This could be used as a subquery, but I got better performance as a separate table.
-- It works like above, identifying the end of each gap.
INSERT INTO #EndGap
(Pattern, GapEnd)
SELECT
@Pattern AS Pattern,
N AS GapEnd
FROM
Tally AS t
LEFT JOIN
#FindGaps AS fg
ONt.N = fg.NumPiece
AND Pattern = @Pattern
WHERE
fg.NumPiece IS NULL
AND EXISTS( SELECT 1 FROM #FindGaps WHERE Pattern = @Pattern and NumPiece = N+1)
FETCH NEXT FROM PLoop INTO @Pattern
END
-- Cleanup
CLOSE PLoop
DEALLOCATE PLoop
-- Inbound triangle join, I know... I know... It's the time eater, too, just can't see a way out.
-- The logic here will find the nearest ending gap to the start point, by pattern/prefix.
UPDATE gt
SETGapEnd = drvEnds2.GapEnd
FROM
#GapTable AS gt
JOIN
(SELECT
gt.Pattern,
gt.GapStart,
MIN( eg.GapEnd) AS GapEnd
FROM
#EndGap AS eg
JOIN
#GapTable AS gt
ON eg.Pattern = gt.Pattern
WHERE
gt.GapStart < eg.GapEnd
GROUP BY
gt.Pattern,
gt.GapStart
) AS drvEnds2
ONgt.Pattern = drvEnds2.Pattern
AND gt.GapStart = drvEnds2.GapStart
SELECT * FROM #GapTable
DROP TABLE #FindGaps
DROP TABLE #GapTable
DROP TABLE #EndGap
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 6, 2010 at 6:25 pm
craig-404139 (10/6/2010)
Craig Farrell (10/6/2010)
craig-404139 (10/6/2010)
Craig Farrell (10/6/2010)
Craig,Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.
Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.
260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂
One last confirmation before I build said harness: We're looking at records like:
ABC0001
ABCDEF00002
ABCD003
ABCDEF000006
AB7
...
ETC
Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?
Sorry, just trying to nail down the exact data situation.
In theory yes. These are received legal documents and are bates stamped (identifier we are referring too) using a different prefix depending on the producing party. So your above example is quit possible. The total length of the numbers can/will vary as well. In this particular case they are all 13 total characters (4 prefix text + 9 numerals) and all have a PPCI prefix. But I am trying to write this script so I can run it against all cases. Some of which have a uniform prefix and uniform character count throughout, others which have a variable text prefix and variable numerical count. One thing though, all identifiers with the same prefix will have the same total character count.
Some tables may have 1 million+ records. I understand whatever solution we find will tax the system since it's checking each id, but it's a necessary process to identify ranges of produced documents for auditing purposes.
I appreciate your time with this. I feel I made some good progress with my above script, but am having trouble moving away from so much hard coding. FYI it's taking almost 30 minutes to run on my system.
Question: when a prefix changes, do the numbers restart at 1? Or do they continue?
(I've got some real nifty, high-speed code (thanks to Jeff!) to quickly identify gaps - but it's designed to work on numbers. Will take some mods to work with the prefix, but should be simple.)
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
October 6, 2010 at 6:28 pm
WayneS (10/6/2010)
craig-404139 (10/6/2010)
Question: when a prefix changes, do the numbers restart at 1? Or do they continue?(I've got some real nifty, high-speed code (thanks to Jeff!) to quickly identify gaps - but it's designed to work on numbers. Will take some mods to work with the prefix, but should be simple.)
Hey Wayne, if you can modify to match the Pattern/NumPiece code above I'd be curious to see that result. The Pattern splitter is pretty zippy and will return a group/numeric resultset.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
October 6, 2010 at 6:54 pm
Craig Farrell (10/6/2010)
Edit: Sorry for the multi-edit. Something didn't play nicely with the code= statements when I tried to first post this.The Test Harness:
SELECT TOP 1000000
IDENTITY(INT, 1,1) AS N
INTOTally
FROM
Master.dbo.SysColumns sc1,
Master.dbo.syscolumns sc2
ALTER TABLE Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED(N) with FILLFACTOR = 100
GRANT SELECT, REFERENCES on TALLY TO PUBLIC
select max(n) from Tally
CREATE TABLE #TestHarness
(TextIDVARCHAR(200) NOT NULL)
--TRUNCATE TABLE #TestHarness
INSERT INTO #TestHarness
SELECT
REPLICATE( 'A', FLOOR( (N/10)%5) + 1) -- VAriable length starter
+ REPLICATE( '0', N%7+1 ) -- Variable # of 0's
+ CAST( N AS VARCHAR(15))
FROM
Tally
WHERE
N%5 >1
AND N < 500000
Craig Farrell - can you modify your TestHarness to meet this stipulation:
craig-404139 (10/6/2010)
One thing though, all identifiers with the same prefix will have the same total character count.
Craig Farrell (10/6/2010)
The PatternFinder simplified: (I get 2 second results from this on a desktop server)
select --top 200
DISTINCT
LEFT( TextID, MAX( N) ) AS Pattern
FROM
#TestHarness AS th,
Tally AS T
where
N < LEN(TextID) AND
isnumeric( SUBSTRING( TextID, N, 1)) = 0
GROUP BY
TextID
The fastest gap detection code that I've ever seen.... doesn't use a tally table. (Though it is an obvious thing to use!)
If you can make the TestHarness be the same for each prefix... I've got something good for you! 😉
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
Viewing 15 posts - 1 through 15 (of 91 total)
You must be logged in to reply to this topic. Login to reply