Hi y'all
I have some 3rd-party data with a variable format, as follows:
Type-Country-Company-Code-Identifier
TT-CCC-AA-99-0000
I am not interested in the Type, Country, or Company parts; only the Code and Identifier are needed. However, the string is frequently missing some of its parts, so it could be any of the following:
CCC-AA-99-0000
-CCC-AA-99-0000
TT-CCC-AA-0000
-CCC-AA-0000
The Identifier is always present and is numeric, convertible to BIGINT.
The Code is sometimes present and is numeric, convertible to SMALLINT (max 32767).
I want to extract the Code and the Identifier from the string.
I have been experimenting with STRING_SPLIT, but that does not guarantee the order of the splits,
so I moved on to the DelimitedSplit8K function (from this site - all credit to those who made and shared it - thank you).
Here is what I have so far. It looks close:
DROP TABLE IF EXISTS #TmpData;
CREATE TABLE #TmpData (
    Id INT NOT NULL IDENTITY(1,1),
    Name VARCHAR(20) NOT NULL,
    IdentificationString VARCHAR(30)
);
INSERT INTO #TmpData (Name, IdentificationString)
VALUES
('Product1', 'PP-UK-MS-123-123456'),
('Product2', '-UK-MS-345-234567'),
('Product3', 'UK-MS-678-345678'),
('Product4', 'PP-UK-MS-123459'),
('Product5', '-UK-MS-234569'),
('Product6', 'UK-MS-234560'),
('Product7', '');
SELECT T.Id, T.Name AS Product, Split.Item AS Identifier, MAX(Split.ItemNumber) AS MaxNum, Split.ItemNumber
FROM #TmpData AS T
CROSS APPLY dbo.DelimitedSplit8K(T.IdentificationString, '-') Split
WHERE ISNUMERIC(Split.Item) = 1
GROUP BY T.Id, T.Name, Split.Item, Split.ItemNumber
HAVING MAX(Split.ItemNumber) = Split.ItemNumber
ORDER BY T.Id
-- Desired Output
Id  Product   Code  Identifier
1   Product1  123   123456
2   Product2  345   234567
3   Product3  678   345678
4   Product4  NULL  123459
5   Product5  NULL  234569
6   Product6  NULL  234560
7   Product7  NULL  NULL
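For reference, on SQL Server 2022 and Azure SQL Database, STRING_SPLIT accepts a third enable_ordinal argument that does guarantee the split order, which removes the original objection to it:

```sql
-- Requires SQL Server 2022+ or Azure SQL Database:
-- the third argument (enable_ordinal = 1) adds an ordinal column.
SELECT T.Id, T.Name AS Product, S.value AS Item, S.ordinal
FROM #TmpData AS T
CROSS APPLY STRING_SPLIT(T.IdentificationString, '-', 1) AS S
ORDER BY T.Id, S.ordinal;
```

On earlier versions, an ordinal-aware splitter such as DelimitedSplit8K remains the usual workaround.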
A bit crude, but essentially you need a way to reverse the item numbering.
DROP TABLE IF EXISTS #TmpData;
CREATE TABLE #TmpData (
    Id INT NOT NULL IDENTITY(1,1),
    Name VARCHAR(20) NOT NULL,
    IdentificationString VARCHAR(30)
);
INSERT INTO #TmpData (Name, IdentificationString)
VALUES
('Product1', 'PP-UK-MS-123-123456'),
('Product2', '-UK-MS-345-234567'),
('Product3', 'UK-MS-678-345678'),
('Product4', 'PP-UK-MS-123459'),
('Product5', '-UK-MS-234569'),
('Product6', 'UK-MS-234560'),
('Product7', '');
;WITH cte AS (
    SELECT T.Id, T.Name AS Product, Split.Item AS Identifier,
           MAX(Split.ItemNumber) AS MaxNum, Split.ItemNumber
    FROM #TmpData AS T
    CROSS APPLY dbo.DelimitedSplit8K(T.IdentificationString, '-') Split
    WHERE ISNUMERIC(Split.Item) = 1
    GROUP BY T.Id, T.Name, Split.Item, Split.ItemNumber
    HAVING MAX(Split.ItemNumber) = Split.ItemNumber
), cte2 AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Id ORDER BY MaxNum DESC) AS rn -- number in descending order, so the Identifier gets 1 and the Code gets 2
    FROM cte
)
SELECT tmp.Id, tmp.Name AS Product,
       MAX(CASE WHEN rn = 2 THEN Identifier END) AS Code,
       MAX(CASE WHEN rn = 1 THEN Identifier END) AS Identifier
FROM #TmpData tmp
LEFT JOIN cte2 c
    ON tmp.Id = c.Id
GROUP BY tmp.Id, tmp.Name;
/*
-- Desired Output
Id  Product   Code  Identifier
1   Product1  123   123456
2   Product2  345   234567
3   Product3  678   345678
4   Product4  NULL  123459
5   Product5  NULL  234569
6   Product6  NULL  234560
7   Product7  NULL  NULL
*/
December 14, 2020 at 2:03 pm
This may work. Didn't really test it thoroughly.
SELECT
T.Id,
T.Name AS Product,
REVERSE(Split.ItemValue) AS Identifier,
Split.ItemNumber AS MaxNum,
Split.ItemNumber
FROM #TmpData AS T
CROSS APPLY dbo.DelimitedSplit8K(REVERSE(T.IdentificationString), '-') Split
WHERE Split.ItemNumber IN (1,2)
AND Split.ItemValue LIKE '%[0-9]%'
ORDER BY T.Id
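To get from those reversed rows to the two typed columns in the desired output, the split can be pivoted with conditional aggregates. A sketch (untested; it assumes the splitter returns ItemNumber/ItemValue columns as in the query above, and uses TRY_CAST, which needs SQL Server 2012+, so anything non-numeric or out of range comes back NULL):

```sql
SELECT T.Id, T.Name AS Product,
       -- second item from the end, only if it is all digits
       MAX(CASE WHEN Split.ItemNumber = 2
                 AND Split.ItemValue LIKE '[0-9]%'
                 AND Split.ItemValue NOT LIKE '%[^0-9]%'
                THEN TRY_CAST(REVERSE(Split.ItemValue) AS SMALLINT) END) AS Code,
       -- last item, only if it is all digits
       MAX(CASE WHEN Split.ItemNumber = 1
                 AND Split.ItemValue LIKE '[0-9]%'
                 AND Split.ItemValue NOT LIKE '%[^0-9]%'
                THEN TRY_CAST(REVERSE(Split.ItemValue) AS BIGINT) END) AS Identifier
FROM #TmpData AS T
CROSS APPLY dbo.DelimitedSplit8K(REVERSE(T.IdentificationString), '-') Split
GROUP BY T.Id, T.Name
ORDER BY T.Id;
```

The LIKE '[0-9]%' guard also excludes the empty string, so Product7 should come back with NULLs rather than zeros.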
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
December 14, 2020 at 3:57 pm
This has less overhead, which may or may not be a concern to you for this task, but I thought I'd post it just in case.
;WITH cte_find_last_2_dashes AS (
SELECT *, CHARINDEX('-', REVERSE(IdentificationString)) AS last_dash,
CHARINDEX('-', REVERSE(IdentificationString), CHARINDEX('-', REVERSE(IdentificationString)) + 1) AS next_to_last_dash
FROM #TmpData
),
cte_get_last_2_strings AS (
SELECT *, CASE WHEN next_to_last_dash = 0 THEN NULL ELSE SUBSTRING(IdentificationString, LEN(IdentificationString) -
next_to_last_dash + 2, next_to_last_dash - last_dash - 1) END AS next_to_last_string,
CASE WHEN last_dash = 0 THEN NULL ELSE SUBSTRING(IdentificationString, LEN(IdentificationString) -
last_dash + 2, 10) END AS last_string
FROM cte_find_last_2_dashes
)
SELECT
Id, Name AS Product,
CASE WHEN next_to_last_string LIKE '%[^0-9]%' THEN NULL ELSE next_to_last_string END AS Code,
CASE WHEN last_string LIKE '%[^0-9]%' THEN NULL ELSE last_string END AS Identifier
FROM cte_get_last_2_strings
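If typed columns are wanted (SMALLINT/BIGINT, per the original post), the final SELECT above could wrap the same CASE expressions in TRY_CAST. A sketch, meant to replace only that final SELECT and assuming SQL Server 2012+ for TRY_CAST:

```sql
SELECT
    Id, Name AS Product,
    TRY_CAST(CASE WHEN next_to_last_string LIKE '%[^0-9]%' THEN NULL
                  ELSE next_to_last_string END AS SMALLINT) AS Code,
    TRY_CAST(CASE WHEN last_string LIKE '%[^0-9]%' THEN NULL
                  ELSE last_string END AS BIGINT) AS Identifier
FROM cte_get_last_2_strings;
```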
SQL DBA, SQL Server MVP (07, 08, 09)
"It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." - "Norm", from "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
December 14, 2020 at 4:45 pm
ScottPletcher wrote: This has less overhead...
Just curious, Scott... how are you measuring "overhead" for this?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2020 at 4:59 pm
ScottPletcher wrote: This has less overhead...
Just curious, Scott... how are you measuring "overhead" for this?
Actually, for this one, I just looked at the query plans. The other code is doing a lot of operations, including at least one sort. The query plan for my code shows a single table scan with some computations.
Edit: btw, I was comparing to the post marked as "Answer", not the subsequent post. Although I think any function invocation would be more overhead than just some simple calcs.
SQL DBA, SQL Server MVP (07, 08, 09)
December 14, 2020 at 10:54 pm
Jeff Moden wrote: ScottPletcher wrote: This has less overhead...
Just curious, Scott... how are you measuring "overhead" for this?
Actually, for this one, I just looked at the query plans. The other code is doing a lot of operations, including at least one sort. The query plan for my code shows a single table scan with some computations.
Edit: btw, I was comparing to the post marked as "Answer", not the subsequent post. Although I think any function invocation would be more overhead than just some simple calcs.
On this type of thing, I totally agree. I'm playing with some things (including your code) on a typical "Million Row Table" I built to test performance. My personal opinion is that the code you wrote is going to be very difficult to beat for performance and can probably only be tied at best. Things like DelimitedSplit8k don't stand a chance.
I'm also in the process of finding out just how expensive CROSS APPLY and ISNULL can be. I knew there was a little cost but had no idea that (for example) ISNULL was so bloody expensive (just adding two doubled the CPU usage).
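For anyone who wants to repeat that kind of test, a minimal "Million Row Table" harness might look like this. This is only a sketch - the table name and sample-data shape are invented for illustration, not Jeff's actual test rig:

```sql
-- Build a throwaway table of one million random sample strings.
DROP TABLE IF EXISTS #MillionRows;

SELECT TOP (1000000)
       IDENTITY(INT, 1, 1) AS Id,
       'PP-UK-MS-'
       + CAST(ABS(CHECKSUM(NEWID())) % 1000 AS VARCHAR(10)) + '-'
       + CAST(ABS(CHECKSUM(NEWID())) % 1000000 AS VARCHAR(10)) AS IdentificationString
INTO #MillionRows
FROM sys.all_columns AC1
CROSS JOIN sys.all_columns AC2;

-- Time each candidate query:
SET STATISTICS TIME, IO ON;
-- ... run each candidate query against #MillionRows here ...
SET STATISTICS TIME, IO OFF;
```

Comparing the CPU and elapsed times (and the actual plans) between runs is usually enough to separate the splitter-based and CHARINDEX-based approaches.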
--Jeff Moden
Change is inevitable... Change for the better is not.
December 15, 2020 at 1:00 pm
Thanks everyone.
As I'm developing under a deadline, I went with the first answer. And I'm developing with limited data, and we are expecting no more than a few thousand at a time, and only a few times per week, so there is no performance issue - yet. But I've put a link to Scott's answer as a comment in the code, so if it becomes a problem, a faster alternative is there.
December 16, 2020 at 5:33 am
Thanks everyone.
As I'm developing under a deadline, I went with the first answer. And I'm developing with limited data, and we are expecting no more than a few thousand at a time, and only a few times per week, so there is no performance issue - yet. But I've put a link to Scott's answer as a comment in the code, so if it becomes a problem, a faster alternative is there.
I hope that the performance issues you push off because of "deadlines" don't end up causing you more deadlines. It's a common issue known as "technical debt", and it's a vicious circle that can only be broken one way: do it right the first time.
Didn't mean for that to be a lecture, and certainly not a slam, especially since you probably weren't the one who came up with the "deadline". Just wanted you to know. In one shop I worked in, everything was a struggle. Once we adopted the philosophy that if anything could go wrong, it needed to be fixed even before it went to QA, rework plummeted to nearly zero. Oddly enough, taking the little bit of time to "do it right the first time" made us a whole lot more productive because, it turned out, rework took 8 times longer than doing it right and about a hundred times longer than the little bit of extra time we spent up front.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2020 at 7:19 am
WITH cteParse(Id, Name, IdentificationString)
AS (
SELECT Id,
Name,
REPLACE(RIGHT(IdentificationString, CHARINDEX('-', REVERSE(IdentificationString), CHARINDEX('-', REVERSE(IdentificationString)) + 1)), '-', '.') AS IdentificationString
FROM #TmpData
)
SELECT Id,
Name,
TRY_PARSE(PARSENAME(IdentificationString, 2) AS SMALLINT) AS Code,
TRY_PARSE(PARSENAME(IdentificationString, 1) AS BIGINT) AS Identifier
FROM cteParse;
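One possible tweak, untested: TRY_PARSE routes through the .NET CLR and is comparatively expensive, while TRY_CAST is a plain engine conversion. For digit-only strings like these, swapping the final SELECT for the following should behave the same for less cost (the NULLIF guards the empty string, which TRY_CAST would otherwise turn into 0 instead of NULL):

```sql
SELECT Id,
       Name,
       TRY_CAST(NULLIF(PARSENAME(IdentificationString, 2), '') AS SMALLINT) AS Code,
       TRY_CAST(NULLIF(PARSENAME(IdentificationString, 1), '') AS BIGINT) AS Identifier
FROM cteParse;
```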
N 56°04'39.16"
E 12°55'05.25"