Finding Sequences (Need help ASAP)

Question

Finding Sequences (Need help ASAP)

craig-404139

Hall of Fame

Points: 3274
More actions
October 6, 2010 at 12:10 pm

#227599

Please skip to http://www.sqlservercentral.com/Forums/FindPost1000643.aspx for updated requirements.
I have a request from management and am at a complete roadblock. I am a DBA and this is much more developer/programmer logic than I am used to.
Scenario:
Using the below numbers I need to find each unique sequence sets. The data is a unique column in my actual db stored as a varchar (255) field. (Can't change that)
Below is some sample data.
create table gaps
(id varchar(50))
insert into gaps
SELECT 'PPCI000000001' UNION ALL
SELECT 'PPCI000000002' UNION ALL
SELECT 'PPCI000000003' UNION ALL
SELECT 'PPCI000000005' UNION ALL
SELECT 'PPCI000000010' UNION ALL
SELECT 'PPCI000000011' UNION ALL
SELECT 'PPCI000000012' UNION ALL
SELECT 'PPCI000000013' UNION ALL
SELECT 'PPCI000000014' UNION ALL
SELECT 'PPCI000000015' UNION ALL
SELECT 'PPCI000000016' UNION ALL
SELECT 'PPCI000000017' UNION ALL
SELECT 'PPCI000000100' UNION ALL
SELECT 'PPCI000000101' UNION ALL
SELECT 'PPCI000000102' UNION ALL
SELECT 'PPCI000000103' UNION ALL
SELECT 'PPCI000000104' UNION ALL
SELECT 'PPCI000000105' UNION ALL
SELECT 'PPCI000000106' UNION ALL
SELECT 'PPCI000000167' UNION ALL
SELECT 'PPCI000000197'
What I need to do is identify the sequences. For example an ideal output using the sample data would be
PPCI000000001 - PPCI000000003
PPCI000000005
PPCI000000010 - PPCI000000017
PPCI000000100 - PPCI000000106
PPCI000000167
PPCI000000197
*It would also be nice to know the 'missing' numbers in the sequence. for example an ideal output would be
PPCI000000004
PPCI000000006 - PPCI000000009
PPCI000000018 - PPCI000000099
PPCI000000107 - PPCI0000000166
PPCI000000168 - PPCI0000000196
Ideally I would be able to do this with a single query.
***In this example I only included text with a fixed length character prefix (PPCI) our actual data may have a prefix of 1,2,3,4,5, or 6 characters. The column is a unique field so there will not be any duplicates.
Any help is much appreciated, I dont even know where to start as this is a bit more programming logic than I am used too.

Viewing 15 posts - 1 through 15 (of 92 total)

You must be logged in to reply to this topic. Login to reply

craig-404139 Hall of Fame Points: 3274 More actions · Answer 1

So far this works using the substring statement. But has alot of hard coding. Although this solves my current problem, it doesn't help with other future needs.

select 'PPCI' + substring(l.id,5,9) as start

,'PPCI' +

(

select substring(min(a.id),5,9) as id

from gaps as a

left outer join gaps as b

on substring(a.id,5,9) = substring(b.id,5,9) - 1

where b.id is null

and substring(a.id,5,9) >= substring(l.id,5,9)

) as [end]

from gaps as l

left outer join gaps as r

on substring(r.id,5,9) = substring(l.id,5,9) - 1

where r.id is null;

But it has 1 big problem.

1. The text can be a variable amount from 1-6+ characters. I have it hardcoded for this example, but it needs to be dynamic.

Any way to make those sections dynamic?

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 2

Craig,

Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Johan Bijnens SSC Guru Points: 135662 More actions · Answer 3

There are multiple articles at SSC that may be of your interest:

- Generating Missing Dates and Numbers By Jacob Sebastian http://www.sqlservercentral.com/articles/Datetime+Manipulation/61822/

- The "Numbers" or "Tally" Table: What it is and how it replaces a loop. By Jeff Moden http://www.sqlservercentral.com/articles/T-SQL/62867/

Maybe a computed column may help out , if this action is more that a single shot solution.

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data and code to get the best help

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Who am I ? Sometimes this is me but most of the time this is me

craig-404139 Hall of Fame Points: 3274 More actions · Answer 4

Craig Farrell (10/6/2010)
Craig,
Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.

Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 5

craig-404139 (10/6/2010)
Craig Farrell (10/6/2010)
Craig,
Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.
Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.

260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂

One last confirmation before I build said harness: We're looking at records like:

ABC0001

ABCDEF00002

ABCD003

ABCDEF000006

AB7

...

ETC

Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?

Sorry, just trying to nail down the exact data situation.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

craig-404139 Hall of Fame Points: 3274 More actions · Answer 6

Craig Farrell (10/6/2010)
craig-404139 (10/6/2010)
Craig Farrell (10/6/2010)
Craig,
Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.
Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.
260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂
One last confirmation before I build said harness: We're looking at records like:
ABC0001
ABCDEF00002
ABCD003
ABCDEF000006
AB7
...
ETC
Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?
Sorry, just trying to nail down the exact data situation.

In theory yes. These are received legal documents and are bates stamped (identifier we are referring too) using a different prefix depending on the producing party. So your above example is quit possible. The total length of the numbers can/will vary as well. In this particular case they are all 13 total characters (4 prefix text + 9 numerals) and all have a PPCI prefix. But I am trying to write this script so I can run it against all cases. Some of which have a uniform prefix and uniform character count throughout, others which have a variable text prefix and variable numerical count. One thing though, all identifiers with the same prefix will have the same total character count.

Some tables may have 1 million+ records. I understand whatever solution we find will tax the system since it's checking each id, but it's a necessary process to identify ranges of produced documents for auditing purposes.

I appreciate your time with this. I feel I made some good progress with my above script, but am having trouble moving away from so much hard coding. FYI it's taking almost 30 minutes to run on my system.

andrewd.smith SSCrazy Eights Points: 8444 More actions · Answer 7

I've made some assumptions in my possible solution. If my assumptions are wrong, you may well be able to tweak the query to get it working for you.

I have assumed:

1) that the prefix never contains numeric characters [0-9].

2) that the numeric part of the id column always contains exactly 9 numeric characters [0-9] and can always be converted to an int.

3) that you want the results partitioned by the non-numeric prefix

If there is a clustered index on the id field this query should be quite performant.

;WITH cteParse(id, prefix, num, rn) AS (

SELECT

id,

SUBSTRING(id, 1, PATINDEX('%[0-9]%', id) - 1),

CAST(SUBSTRING(id, PATINDEX('%[0-9]%', id), 9) AS int),

ROW_NUMBER() OVER (ORDER BY id)

FROM gaps

)

SELECT

prefix + REPLACE(STR(MIN(num), 9), ' ', '0') AS [From],

CASE WHEN (COUNT(*) > 1) THEN

prefix + REPLACE(STR(MAX(num), 9), ' ', '0')

ELSE '' END AS [To]

FROM cteParse

GROUP BY prefix, num - rn

ORDER BY prefix, num - rn

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 8

Edit: Sorry for the multi-edit. Something didn't play nicely with the code= statements when I tried to first post this.

The Test Harness:

SELECT TOP 1000000

IDENTITY(INT, 1,1) AS N

INTOTally

FROM

Master.dbo.SysColumns sc1,

Master.dbo.syscolumns sc2

ALTER TABLE Tally

ADD CONSTRAINT PK_Tally_N

PRIMARY KEY CLUSTERED(N) with FILLFACTOR = 100

GRANT SELECT, REFERENCES on TALLY TO PUBLIC

select max(n) from Tally

CREATE TABLE #TestHarness

(TextIDVARCHAR(200) NOT NULL)

--TRUNCATE TABLE #TestHarness

INSERT INTO #TestHarness

SELECT

REPLICATE( 'A', FLOOR( (N/10)%5) + 1) -- VAriable length starter

+ REPLICATE( '0', N%7+1 ) -- Variable # of 0's

+ CAST( N AS VARCHAR(15))

FROM

Tally

WHERE

N%5 >1

AND N < 500000

The PatternFinder simplified: (I get 2 second results from this on a desktop server)

select --top 200

DISTINCT

LEFT( TextID, MAX( N) ) AS Pattern

FROM

#TestHarness AS th,

Tally AS T

where

N < LEN(TextID) AND

isnumeric( SUBSTRING( TextID, N, 1)) = 0

GROUP BY

TextID

The patternfinder used to strip to Patterns and Numerics (8 seconds with full row return):

select

TextID,

LEFT( TextID, MAX( N) ) AS Pattern,

CAST( RIGHT( TextID, LEN( TextID) - MAX( N)) AS BIGINT) AS NumPiece

FROM

#TestHarness AS th,

Tally AS T

where

N < LEN(TextID) AND

isnumeric( SUBSTRING( TextID, N, 1)) = 0

GROUP BY

TextID

Now, back to finding the gapping, which I'm not sure of directly. Hopefully anyone else working on this will find this useful.

Of note: The patternfinder backs up into the pattern, so it locates the last letter, and won't break on numerics in the pattern. You've also now got an integer to work from. This happens very quickly.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

craig-404139 Hall of Fame Points: 3274 More actions · Answer 9

^^ Thanks. I'll start chewing through that and seeing if it fits. At first glance I have no idea what it's doing. Like I mentioned I've never had to deal with this detail of coding. Appreciate your time.

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 10

craig-404139 (10/6/2010)
^^ Thanks. I'll start chewing through that and seeing if it fits. At first glance I have no idea what it's doing. Like I mentioned I've never had to deal with this detail of coding. Appreciate your time.

Heh, whoops, sorry, I'll explain a little more.

The test harness does nothing more then build a million row tally table and then create a bunch of records with variable length varchar patterns at the beginning and then variable #'s of 0s before a number.

The Patternfinder uses the same tally table to locate the largest (MAX(N)) non-numeric character position in the TextID, or your codes. This works because anything else after it is numeric.

So, now that we know the final position of the pattern, or varchar prefix, we take that off (Pattern column), and we convert the rest to numeric (NumPiece).

The thing that speeds this up is the tally table. It's an amazing Anti-RBAR component that sped up an old method I had. The Test Harness, btw, builds out a 300,000 row test case to work against.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 11

And for my last trick...

This ran in about 3:15 on my system

-- First, separate the numbers and their prefixes into usable data.

select

TextID,

LEFT( TextID, MAX( N) ) AS Pattern,

CAST( RIGHT( TextID, LEN( TextID) - MAX( N)) AS BIGINT) AS NumPiece

INTO

#FindGaps

FROM

#TestHarness AS th,

Tally AS T

where

N < LEN(TextID) AND

isnumeric( SUBSTRING( TextID, N, 1)) = 0

GROUP BY

TextID

-- Give us a fighting chance for optimization...

CREATE CLUSTERED INDEX idx_FindGap ON #FindGaps ( Pattern, NumPiece)

-- Gaptable will store the list of Gaps and their start/end #s

CREATE TABLE #GapTable

(PatternVARCHAR(100),

GapStartBIGINT,

GapEndBIGINT)

CREATE CLUSTERED INDEX idx_GapTable ON #GapTable ( Pattern, GapStart)

-- This is an optimization table to take over work from a subquery

CREATE TABLE #EndGap

(Pattern VARCHAR(100),

GapEnd BIGINT)

CREATE CLUSTERED INDEX idx_EndGap ON #EndGap ( Pattern, GapEnd)

-- Why the cursor? Because I couldn't figure out a way to go against each

-- pattern without making the triangle join even worse...

DECLARE @Pattern VARCHAR(100)

DECLARE PLoop CURSOR

FOR SELECT DISTINCT Pattern

FROM #FindGaps

OPEN PLoop

FETCH NEXT FROM PLoop INTO @Pattern

WHILE @@FETCH_STATUS = 0

BEGIN

-- Find the beginning of each gap. It needs an entry to exist before it

-- considers a gap to have started, so if the first record is 5, it doesn't

-- care about 1-4.

INSERT INTO #GapTable

(Pattern, GapStart)

SELECT

@Pattern,

N AS GapStartNum

FROM

Tally AS t

LEFT JOIN

#FindGaps AS fg

ONt.N = fg.NumPiece

AND Pattern = @Pattern

WHERE

fg.NumPiece IS NULL

AND EXISTS( SELECT 1 FROM #FindGaps WHERE Pattern = @Pattern and NumPiece = N-1)

ORDER BY

t.N

-- This could be used as a subquery, but I got better performance as a separate table.

-- It works like above, identifying the end of each gap.

INSERT INTO #EndGap

(Pattern, GapEnd)

SELECT

@Pattern AS Pattern,

N AS GapEnd

FROM

Tally AS t

LEFT JOIN

#FindGaps AS fg

ONt.N = fg.NumPiece

AND Pattern = @Pattern

WHERE

fg.NumPiece IS NULL

AND EXISTS( SELECT 1 FROM #FindGaps WHERE Pattern = @Pattern and NumPiece = N+1)

FETCH NEXT FROM PLoop INTO @Pattern

END

-- Cleanup

CLOSE PLoop

DEALLOCATE PLoop

-- Inbound triangle join, I know... I know... It's the time eater, too, just can't see a way out.

-- The logic here will find the nearest ending gap to the start point, by pattern/prefix.

UPDATE gt

SETGapEnd = drvEnds2.GapEnd

FROM

#GapTable AS gt

JOIN

(SELECT

gt.Pattern,

gt.GapStart,

MIN( eg.GapEnd) AS GapEnd

FROM

#EndGap AS eg

JOIN

#GapTable AS gt

ON eg.Pattern = gt.Pattern

WHERE

gt.GapStart < eg.GapEnd

GROUP BY

gt.Pattern,

gt.GapStart

) AS drvEnds2

ONgt.Pattern = drvEnds2.Pattern

AND gt.GapStart = drvEnds2.GapStart

SELECT * FROM #GapTable

DROP TABLE #FindGaps

DROP TABLE #GapTable

DROP TABLE #EndGap

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

WayneS SSC Guru Points: 95461 More actions · Answer 12

craig-404139 (10/6/2010)
Craig Farrell (10/6/2010)
craig-404139 (10/6/2010)
Craig Farrell (10/6/2010)
Craig,
Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.
Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.
260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂
One last confirmation before I build said harness: We're looking at records like:
ABC0001
ABCDEF00002
ABCD003
ABCDEF000006
AB7
...
ETC
Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?
Sorry, just trying to nail down the exact data situation.
In theory yes. These are received legal documents and are bates stamped (identifier we are referring too) using a different prefix depending on the producing party. So your above example is quit possible. The total length of the numbers can/will vary as well. In this particular case they are all 13 total characters (4 prefix text + 9 numerals) and all have a PPCI prefix. But I am trying to write this script so I can run it against all cases. Some of which have a uniform prefix and uniform character count throughout, others which have a variable text prefix and variable numerical count. One thing though, all identifiers with the same prefix will have the same total character count.
Some tables may have 1 million+ records. I understand whatever solution we find will tax the system since it's checking each id, but it's a necessary process to identify ranges of produced documents for auditing purposes.
I appreciate your time with this. I feel I made some good progress with my above script, but am having trouble moving away from so much hard coding. FYI it's taking almost 30 minutes to run on my system.

Question: when a prefix changes, do the numbers restart at 1? Or do they continue?

(I've got some real nifty, high-speed code (thanks to Jeff!) to quickly identify gaps - but it's designed to work on numbers. Will take some mods to work with the prefix, but should be simple.)

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 13

WayneS (10/6/2010)

craig-404139 (10/6/2010)
Question: when a prefix changes, do the numbers restart at 1? Or do they continue?
(I've got some real nifty, high-speed code (thanks to Jeff!) to quickly identify gaps - but it's designed to work on numbers. Will take some mods to work with the prefix, but should be simple.)

Hey Wayne, if you can modify to match the Pattern/NumPiece code above I'd be curious to see that result. The Pattern splitter is pretty zippy and will return a group/numeric resultset.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

WayneS SSC Guru Points: 95461 More actions · Answer 14

Craig Farrell (10/6/2010)
Edit: Sorry for the multi-edit. Something didn't play nicely with the code= statements when I tried to first post this.
The Test Harness:
SELECT TOP 1000000
IDENTITY(INT, 1,1) AS N
INTOTally
FROM
Master.dbo.SysColumns sc1,
Master.dbo.syscolumns sc2
ALTER TABLE Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED(N) with FILLFACTOR = 100
GRANT SELECT, REFERENCES on TALLY TO PUBLIC
select max(n) from Tally
CREATE TABLE #TestHarness
(TextIDVARCHAR(200) NOT NULL)
--TRUNCATE TABLE #TestHarness
INSERT INTO #TestHarness
SELECT
REPLICATE( 'A', FLOOR( (N/10)%5) + 1) -- VAriable length starter
+ REPLICATE( '0', N%7+1 ) -- Variable # of 0's
+ CAST( N AS VARCHAR(15))
FROM
Tally
WHERE
N%5 >1
AND N < 500000

Craig Farrell - can you modify your TestHarness to meet this stipulation:

craig-404139 (10/6/2010)
One thing though, all identifiers with the same prefix will have the same total character count.

Craig Farrell (10/6/2010)
The PatternFinder simplified: (I get 2 second results from this on a desktop server)
select --top 200
DISTINCT
LEFT( TextID, MAX( N) ) AS Pattern
FROM
#TestHarness AS th,
Tally AS T
where
N < LEN(TextID) AND
isnumeric( SUBSTRING( TextID, N, 1)) = 0
GROUP BY
TextID

The fastest gap detection code that I've ever seen.... doesn't use a tally table. (Though it is an obvious thing to use!)

If you can make the TestHarness be the same for each prefix... I've got something good for you! 😉

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2