Counting occurrences of strings in a table column

  • Hi,

    I have what I thought was a simple problem... now I am not so sure, though perhaps my brain is on vacation.

    I have a large-ish table (circa 1.5 million records) with a varchar(max) field containing text for each record.

    I have another 'lookup' table that contains product identification strings - not that many, a few hundred.

    The task is to produce an SQL report totalling the number of times each product occurs in the text field of the main table. If a code appears multiple times in the text field then each occurrence is to be counted.

    The report will simply be:

    Product A - 100

    Product B - 90

    Product C - 88

    Product D - 123

    etc

    My brain is failing to cope with how to do this with a join onto the lookup table. I have a function that may do the counting bit (from here: http://www.naterice.com/articles/52) - but I am not sure I really need that.

    No cursors please!... unless that is the last resort.

    Cheers

  • Here's how you get the count.

    (DATALENGTH(REPLACE(text, product, '')) - DATALENGTH(text) ) / DATALENGTH(product) AS Occurences

    Then SELECT Whatever, dtCNt.* FROM base OUTER APPLY (SELECT Product, count formula FROM Lookup) dtCNt
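
    For illustration, a minimal sketch of that pattern with made-up names (a base table dbo.MainTable with a Txt column and a lookup table dbo.Lookup with a Product column); note that, with the operands in this order, the count comes out negative, which is easy enough to flip:

    -- Sketch only: placeholder table/column names throughout.
    -- REPLACE strips every occurrence of the product from the text, so the change in
    -- DATALENGTH divided by DATALENGTH(product) is the occurrence count
    -- (negative with this operand order).
    SELECT m.*,
           dtCNt.*
    FROM dbo.MainTable AS m
    OUTER APPLY
    (
        SELECT l.Product,
               (DATALENGTH(REPLACE(m.Txt, l.Product, '')) - DATALENGTH(m.Txt))
                   / DATALENGTH(l.Product) AS Occurences
        FROM dbo.Lookup AS l
    ) AS dtCNt;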

  • That could well be a work of art - many thanks! Truly marvellous for one looking so young.

    It certainly does the job on my test data - I hope it scales OK to my real table.

  • dwill (5/10/2011)

    That could well be a work of art - many thanks! Truly marvellous for one looking so young.

    It certainly does the job on my test data - I hope it scales OK to my real table.

    That's never going to be really fast. You're actually going to process 300-500 million row/lookup combinations, and then you have to do some fairly heavy string manipulation on each one to figure out the totals.

    I suggest you run that query during OFF hours and then save the results... 'cause you might not want to run it too often!

  • Psst, don't tell anyone but that's not my real picture :w00t:.

    I'm almost on the 3rd of a century mark (only 1000 days off)! :w00t:

    Ninja/Anakin

  • Hmmm, well the out of hours bit is OK as I am actually running it as a part of a daily load - so I am collating daily totals and saving them in an MI table. I'd already figured that doing this on the fly was not really going to work.

    For my test I simply used your solution as a derived table and then grouped on the Product to get SUMs for the totals. Worked well on a small subset - still waiting to see if it is practical on the real thing.

  • PS - ha, I don't even remember the third of a century mark!

  • Keep in mind that this method will give false results if product codes can contain other product codes as a substring.

    If, for example, you have a product code "A100" and another "A1000", then when it goes to count all the "A100" occurrences, it will count "A1000" as well. If you can be sure there are no partial overlaps like that, it should work. If you do have overlaps, it will give incorrect results.
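
    A quick way to see it, with made-up values: counting "A100" inside the string 'A1000 A100' using the REPLACE/DATALENGTH trick reports 2 occurrences, even though the product "A100" appears on its own only once.

    -- Made-up values: 'A100' is counted twice because it also matches inside 'A1000'.
    SELECT (DATALENGTH('A1000 A100') - DATALENGTH(REPLACE('A1000 A100', 'A100', '')))
               / DATALENGTH('A100') AS Occurences ;   -- returns 2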

    Also, the math there looks like it will give negative numbers. Easy enough to correct for, but also easy enough to put the original length first and the replaced length second.

    And be sure you do all the math on the same datatype. If you replace ASCII strings inside Unicode strings, the final piece (dividing by the length of the ASCII string) could give wrong results. I haven't tested that, but it's worth checking.

    The big one to be absolutely certain of is the substring thing.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Good point on the strings in strings... that may bring me some more unhappiness 🙁

    I did spot the negatives - but thought I'd let that one go 🙂

    I'm girding my loins to do some tests with a cursor... not a happy place.

  • You probably don't need to use a cursor.

    Can you provide sample data? If so, optimized code without overlap errors, et al, is possible.

    (Don't get me wrong, Ninja's code is great for counting occurrences of a string within another string, but it assumes there are no substring matches.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Well, the initial test took half an hour... probably quite acceptable as it will be run offline, but there are overlaps in the lookup strings - so that is an issue.

    I can't supply data as the only meaningful data is live data and not for the public domain. The tables are very simple though - the main table's 1-million-plus rows are in a staging table very similar to:

    CREATE TABLE [dbo].[tbl_audit_trail_import_staging](
        int_device_id INT NULL,
        dte_start_timestamp DATETIME NULL,
        dte_stop_timestamp DATETIME NULL,
        vchr_customer_id VARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        vchr_audit_trail VARCHAR(MAX) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
    ) ON [PRIMARY]

    The lookup table, with 1,300 actual entries, is:

    CREATE TABLE [dbo].[tbl_screens](
        [vchr_screen_id] [varchar](20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
        [vchr_description] [varchar](200) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        CONSTRAINT [PK_tbl_screens] PRIMARY KEY CLUSTERED
        (
            [vchr_screen_id] ASC
        ) WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
    ) ON [PRIMARY]

    The vchr_screen_id is the lookup ID, and it occurs in the vchr_audit_trail column.

    The "working" code adapted from Mr Ninja is:

    SELECT
        a.vchr_screen_id,
        SUM(a.Occurences) AS Occurences
    FROM
    (
        SELECT x.*
        FROM dbo.tbl_audit_trail_import_staging
        OUTER APPLY
        (
            SELECT
                vchr_screen_id,
                (DATALENGTH(REPLACE(vchr_audit_trail, vchr_screen_id, '')) - DATALENGTH(vchr_audit_trail)) / DATALENGTH(vchr_screen_id) AS Occurences
            FROM
                dbo.tbl_screens
        ) x
    ) a
    GROUP BY
        a.vchr_screen_id

    Not sure if any of that helps - but any pointers to solve the overlaps and/or make the query more efficient (without a cursor!) much appreciated!

  • Part of why I was asking about sample data is that I'm hoping there's some format to it, like the product codes being separated by spaces or by some other character or ASCII code, or something like that. If that's the case, then a string parser could be applied to break the data up and normalize it into rows, which would then be easy to query. A runtime parser wouldn't require altering any actual data, just querying it, so it works as if it were normalized. (Jeff Moden has a recent article on this site on the subject of parsing string data into rows, which would be applicable in this case.)

    Or, if the data has some other formatting, such as "fixed width" or even XML strings, it will be easy enough to do the same thing.

    But I really can't get more specific without sample data.

    Do those generalities help?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Here's a sample of how a string parser might look:

    IF OBJECT_ID(N'tempdb..#T') IS NOT NULL
        DROP TABLE #T ;

    CREATE TABLE #T
        (ID INT IDENTITY PRIMARY KEY,
         Col1 VARCHAR(100)) ;

    INSERT INTO #T (Col1)
    VALUES ('A1,A10,B1,B12,A1') ;

    INSERT INTO #T (Col1)
    VALUES ('A2,A1,B23,A10') ;

    SELECT *
    FROM #T ;

    ;
    WITH SmallNumbers (Number)
    AS (SELECT ROW_NUMBER() OVER (ORDER BY OBJECT_ID) - 1
        FROM sys.columns)
    SELECT Sub,
           COUNT(*)
    FROM #T
    CROSS APPLY (SELECT SUBSTRING(Col1, Number,
                        ISNULL(NULLIF(CHARINDEX(',', Col1, Number), 0) - Number, 10)) AS Sub
                 FROM SmallNumbers
                 WHERE Number = 0
                    OR SUBSTRING(Col1, Number - 1, 1) = ',') AS S1
    GROUP BY Sub
    ORDER BY Sub ;
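
    For the two sample rows above, that should come back as: A1 = 3, A10 = 2, A2 = 1, B1 = 1, B12 = 1, B23 = 1.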

    You'll note that, despite "A1" being a substring of "A10", it returns the correct values for both counts. And so on for the other substrings.

    On a large table, such as you have, it won't be as fast as Ninja's method, but it does correctly handle substrings.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks again for your replies.

    I see what you mean about the data and its format. Sorry about not providing specifics. The varchar(max) field that I am searching through is the result of another application's output (over which I have no control) and is essentially a load of audit data from various data entry screens all dumped into one field. The field contents will vary considerably from NULL to many hundreds of characters, depending on which of the very many sequences of events and screens the data has come from.

    To be honest I hadn't considered looking for any specific delimiter around my search text - but now that I have looked a little more at the raw data, the codes I need to locate seem to be prefixed with "Screen ". So each will appear as "Screen XXXXXX" - always a discrete piece of text with a line break following. (No doubt there will be exceptions to this - as ever - but it's a promising start!)

    The "normalising" idea sounds interesting - though performance may be an issue from what you say. I will have a closer look tomorrow and see what I can muster!

    Cheers!

  • With that, you don't need to split into separate records (unless any of the product strings contains "Screen " followed by another product string, for example a product "1000" and also a product "99 Screen 1000123", which seems unlikely). You can use the separator "Screen " directly as part of your match string: just replace the count-extraction bit of Ninja's code with something like

    ...

    ( DATALENGTH(text) - DATALENGTH(REPLACE(text, 'Screen '+product, '')) ) / DATALENGTH('Screen '+product) AS Occurences

    ...

    (if your strings are nvarchar rather than varchar, use N'Screen ' instead of 'Screen ')

    and use Outer Apply as before.
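
    Plugged into the query posted earlier, that would look something like the sketch below (untested, using the staging and screens table names from above, with the subtraction ordered so the counts come out positive):

    SELECT a.vchr_screen_id,
           SUM(a.Occurences) AS Occurences
    FROM
    (
        SELECT x.*
        FROM dbo.tbl_audit_trail_import_staging AS t
        OUTER APPLY
        (
            -- Prefixing each screen id with 'Screen ' avoids most partial matches on other ids
            SELECT s.vchr_screen_id,
                   ( DATALENGTH(t.vchr_audit_trail)
                     - DATALENGTH(REPLACE(t.vchr_audit_trail, 'Screen ' + s.vchr_screen_id, '')) )
                       / DATALENGTH('Screen ' + s.vchr_screen_id) AS Occurences
            FROM dbo.tbl_screens AS s
        ) x
    ) a
    GROUP BY a.vchr_screen_id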

    Tom
