Deduplicating rows by choosing the row with the shortest string

Question

Deduplicating rows by choosing the row with the shortest string

Mark Dalley

SSCrazy

Points: 2981
More actions
November 7, 2024 at 6:06 pm

#4478567

Hello T-SQL experts
I have a table containing team codes and descriptions. Unfortunately, many of the team codes are duplicated, often with different descriptions.
I want to deduplicate the table (in T-SQL) by selecting the shortest description (in number of characters) for each team code. How do I do this in a non-manual fashion?
I could obviously group in the team code, but there is no suitable aggregate function that I can think of for pulling out the shortest description in each group. MIN() won’t work as it sorts (e.g.) ‘aa’ before ‘b’. Also, T-SQL has no FIRST() or LAST() aggregate function.
Any ideas anyone?
Yours hopefully
Mark Dalley

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

ScottPletcher SSC Guru Points: 101123 More actions · Answer 1

If your version of SQL has use of FIRST_VALUE, you could try this:

SELECT DISTINCT team_code, /* , ...*/
    FIRST_VALUE(description) OVER(PARTITION BY team_code ORDER BY LEN(description)) AS description
FROM dbo.table_name

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

kaj SSCommitted Points: 1607 More actions · Answer 2

Maybe something like this?

declare @Teams table (
    TeamCode varchar(10),
    Description varchar(30)
    )

INSERT INTO @Teams (TeamCode, Description)
    VALUES
        ('AC', 'Action Committee'),
        ('AC', 'Active Cowards'),
        ('AC', 'Attentive Collegeboys'),
        ('AC', 'Act. Com.'),
        ('WC', 'World-class Citizens'),
        ('WC', 'World Citizens'),
        ('WC', 'Wretched Cowards'),
        ('WC', 'Wicked Consultants')
        ;

WITH TeamData AS (
SELECT
    TeamCode,
    Description,
    MIN(RIGHT(SPACE(10)+CAST(LEN(Description) AS VARCHAR(10)),10)+Description) OVER(PARTITION BY TeamCode) AS NewDescription
FROM @Teams)
select
    TeamCode,
    Description,
    substring(NewDescription,11,255) AS NewDescription
from TeamData

TeamCode   Description                    NewDescription
---------- ------------------------------ ----------------------------------------
AC         Action Committee               Act. Com.
AC         Active Cowards                 Act. Com.
AC         Attentive Collegeboys          Act. Com.
AC         Act. Com.                      Act. Com.
WC         World-class Citizens           World Citizens
WC         World Citizens                 World Citizens
WC         Wretched Cowards               World Citizens
WC         Wicked Consultants             World Citizens

(8 rows affected)

SoCal_DBD Hall of Fame Points: 3383 More actions · Answer 3

I love a good reason to use any of the widow functions. Thank you to Kaj for some sample data. I can't tell you how many times I've had to dedupe and have leveraged this.

DECLARE @Teams TABLE (
    TeamCode VARCHAR(10),
    Description VARCHAR(30)
    )

INSERT INTO @Teams (TeamCode, Description)
    VALUES
        ('AC', 'Action Committee'),
        ('AC', 'Active Cowards'),
        ('AC', 'Attentive Collegeboys'),
        ('AC', 'Act. Com.'),
        ('WC', 'World-class Citizens'),
        ('WC', 'World Citizens'),
        ('WC', 'Wretched Cowards'),
        ('WC', 'Wicked Consultants')
        ;

-- Show how the data is grouped and numbered
SELECT *,
       RowNum = ROW_NUMBER() OVER (PARTITION BY TeamCode ORDER BY LEN([Description]))
FROM @Teams

-- Remove anything with RowNum > 1
; WITH Dupes AS
(SELECT *,
        RowNum = ROW_NUMBER() OVER (PARTITION BY TeamCode ORDER BY LEN([Description]))
 FROM @Teams
)

DELETE FROM Dupes
WHERE RowNum > 1

-- Final result
SELECT *
FROM @Teams

Ed B Hall of Fame Points: 3669 More actions · Answer 4

If you have any empty strings you need to ensure they are not chosen as the shortest description.

SELECT  TeamCode, Description,
        RowNum = ROW_NUMBER() OVER (PARTITION BY TeamCode 
                 ORDER BY IIF(TRIM(ISNULL(TeamCode,'')) = '', 1000,LEN(TeamCode)))
FROM    @Teams

Mark Dalley SSCrazy Points: 2981 More actions · Answer 5

Ha, I thought that the answer might involve window functions. I have seen them used but hardly used them myself.

After posting this question I came across Kathi Kellenberger's article here. Following her lead I arrived at a bit of code exactly analogous to the approach used by SoCal_DBD of deleting duplicates via a CTE. (Though I think you meant to call them window functions).

Delightful bit of test data there, kaj! And thanks Scott for flagging up FIRST_VALUE. I have a recent SQL Server version so it works fine. Also good shout Ed for your comment about empty strings. Fortunately I didn't have any, but next time, who knows.

Here's to the great community on SSC -- cheers!

MarkD

Eirikur Eiriksson SSC Guru Points: 182916 More actions · Answer 6

Hi @mark-3,

my concern here is that this exercise will render the data nonsensical and ambiguous.

😎

This free text approach is YASA, yet another SQL antipattern, a better approach is to introduce a lookup table to prevent the problem.

Mark Dalley SSCrazy Points: 2981 More actions · Answer 7

As it happens, a lookup table is precisely what I am creating.

The descriptions are supposed to be unique. If the originating organisation puts in duplicates because they cannot decide on a single description for their team, and I choose the wrong one, that is on them.

Fortunately I am in touch with said organisation, albeit indirectly. I just needed an interim fix for what I hope is an interim issue.

I have a vision of perfect data sometime in the indefinite future.

MarkD

Eirikur Eiriksson SSC Guru Points: 182916 More actions · Answer 8

Mark Dalley wrote:

As it happens, a lookup table is precisely what I am creating.
The descriptions are supposed to be unique. If the originating organisation puts in duplicates because they cannot decide on a single description for their team, and I choose the wrong one, that is on them.
Fortunately I am in touch with said organisation, albeit indirectly. I just needed an interim fix for what I hope is an interim issue.
I have a vision of perfect data sometime in the indefinite future.
MarkD

Good luck mate, still think that there will be some manual input required.

😎