select just one url from each domain in url_list

  • Hi

    i have a database with the following format:

    Table: URL_LIST

    Format: Id,url

    content:

    1 | http://www.google.com/nice_page

    2 | http://www.google.com/good/

    3 | http://www.google.com/bad/

    4 | http://www.google.com/well/

    5 | http://www.yahoo.com/test/

    6 | http://www.yahoo.com/more/test/

    7 | http://www.yahoo.com/example/

    8 | http://www.bing.com/example.html

    9 | http://www.bing.com/example.php?variable=34234&hit="no"

    my first question is how can i get just one url of each domain?

    second: how can i get a random url out of each domain?

    so my result set should looks like:

    2 | http://www.google.com/good/

    5 | http://www.yahoo.com/test/

    9 | http://www.bing.com/example.php?variable=34234&hit="no"

    Thank You

  • ORDER BY newid() is the way to get randomized data.

    you've got to do a few things: first, extract the domain from the urls...in my example, as long is everything is .com, mine wors...add .net/.org/edu, and you've got more work to do.

    to get one form each, you need to use row_number and partition by.

    ;With URL_LIST(Id,url)

    AS

    (

    SELECT '1',' http://www.google.com/nice_page' UNION ALL

    SELECT '2',' http://www.google.com/good/' UNION ALL

    SELECT '3',' http://www.google.com/bad/' UNION ALL

    SELECT '4',' http://www.google.com/well/' UNION ALL

    SELECT '5',' http://www.yahoo.com/test/' UNION ALL

    SELECT '6',' http://www.yahoo.com/more/test/' UNION ALL

    SELECT '7',' http://www.yahoo.com/example/' UNION ALL

    SELECT '8',' http://www.bing.com/example.html' UNION ALL

    SELECT '9',' http://www.bing.com/example.php?variable=34234&hit="no"'

    )

    , FormattedDomain as

    (

    SELECT SUBSTRING(url,1,charindex('.com',url)+4) as thedomain,*

    from URL_LIST)

    SELECT * FROM (

    SELECT ROW_NUMBER() OVER(PARTITION BY thedomain order by newid()) AS RW,*

    FROM FormattedDomain)MyAlias

    WHERE MyAlias.RW = 1

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • If the list of top-level domains in your table is small, you can do something like this easily:

    select top 1 ID, URL from URL_LIST where URL like '%google.com%' order by NEWID()

    select top 1 ID, URL from URL_LIST where URL like '%yahoo.com%' order by NEWID()

    select top 1 ID, URL from URL_LIST where URL like '%bing.com%' order by NEWID()

    _________________________________
    seth delconte
    http://sqlkeys.com

  • Too late now but I was going with Jeff's splitter.

    Now that I think of it just using left (patindex) makes more sense to find the end character.

    You can't use .com because there are way more extensions than that!

    BEGIN TRAN

    DECLARE @Table TABLE (ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, URL VARCHAR(500) NOT NULL)

    INSERT INTO @Table (URL)

    SELECT 'http://www.google.com/nice_page'

    UNION ALL

    SELECT 'http://www.google.com/good/'

    UNION ALL

    SELECT 'http://www.google.com/bad/'

    UNION ALL

    SELECT 'http://www.google.com/well/'

    UNION ALL

    SELECT 'http://www.yahoo.com/test/'

    UNION ALL

    SELECT 'http://www.yahoo.com/more/test/'

    UNION ALL

    SELECT 'http://www.yahoo.com/example/'

    UNION ALL

    SELECT 'http://www.bing.com/example.html'

    UNION ALL

    SELECT 'http://www.bing.com/example.php?variable=34234&hit="no"'

    SELECT dtRandom.ID, dtRandom.Domain, dtRandom.URL FROM (

    SELECT

    t.ID

    , t.URL

    , dtSplit.ItemNumber

    , dtSplit.Item AS Domain

    , ROW_NUMBER() OVER(PARTITION BY dtSplit.Item ORDER BY NEWID()) AS TieBreaker

    FROM@Table t

    OUTER APPLY (SELECT ItemNumber, Item FROM dbo.DelimitedSplit8K(REPLACE(t.URL, 'http://', ''), '/')) dtSplit

    WHEREdtSplit.ItemNumber = 1

    ) dtRandom

    WHERE dtRandom.TieBreaker = 1

    ORDER BY dtRandom.Domain

    ROLLBACK

    Splitter function http://www.sqlservercentral.com/articles/Tally+Table/72993/

  • Lowell (11/11/2011)


    ORDER BY newid() is the way to get randomized data.

    you've got to do a few things: first, extract the domain from the urls...in my example, as long is everything is .com, mine wors...add .net/.org/edu, and you've got more work to do.

    to get one form each, you need to use row_number and partition by.

    ;With URL_LIST(Id,url)

    AS

    (

    SELECT '1',' http://www.google.com/nice_page' UNION ALL

    SELECT '2',' http://www.google.com/good/' UNION ALL

    SELECT '3',' http://www.google.com/bad/' UNION ALL

    SELECT '4',' http://www.google.com/well/' UNION ALL

    SELECT '5',' http://www.yahoo.com/test/' UNION ALL

    SELECT '6',' http://www.yahoo.com/more/test/' UNION ALL

    SELECT '7',' http://www.yahoo.com/example/' UNION ALL

    SELECT '8',' http://www.bing.com/example.html' UNION ALL

    SELECT '9',' http://www.bing.com/example.php?variable=34234&hit="no"'

    )

    , FormattedDomain as

    (

    SELECT SUBSTRING(url,1,charindex('.com',url)+4) as thedomain,*

    from URL_LIST)

    SELECT * FROM (

    SELECT ROW_NUMBER() OVER(PARTITION BY thedomain order by newid()) AS RW,*

    FROM FormattedDomain)MyAlias

    WHERE MyAlias.RW = 1

    Nice!

    _________________________________
    seth delconte
    http://sqlkeys.com

  • thank's Ninja; your example made me realize the third slash ends the domain....

    here's my example again tweaked with that realization: this will work for any domain extention now

    ;With URL_LIST(Id,url)

    AS

    (

    SELECT '1',' http://www.google.com/nice_page' UNION ALL

    SELECT '2',' http://www.google.com/good/' UNION ALL

    SELECT '3',' http://www.google.com/bad/' UNION ALL

    SELECT '4',' http://www.google.com/well/' UNION ALL

    SELECT '5',' http://www.yahoo.com/test/' UNION ALL

    SELECT '6',' http://www.yahoo.com/more/test/' UNION ALL

    SELECT '7',' http://www.yahoo.com/example/' UNION ALL

    SELECT '8',' http://www.bing.com/example.html' UNION ALL

    SELECT '9',' http://www.bing.com/example.php?variable=34234&hit="no"'

    )

    , FormattedDomain as

    (

    SELECT SUBSTRING(REPLACE(url,'http://',''),1,charindex('/',REPLACE(url,'http://',''))) as thedomain,*

    from URL_LIST)

    SELECT * FROM (

    SELECT ROW_NUMBER() OVER(PARTITION BY thedomain order by newid()) AS RW,*

    FROM FormattedDomain)MyAlias

    WHERE MyAlias.RW = 1

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Yes but now we have to make it work when there's no slash!!!

    As always... string manipulation gets complicated real fast!

    BEGIN TRAN

    DECLARE @Table TABLE (ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, URL VARCHAR(500) NOT NULL)

    INSERT INTO @Table (URL)

    SELECT 'http://www.google.com/nice_page' UNION ALL

    SELECT 'http://www.google.com/good/' UNION ALL

    SELECT 'http://www.google.com/bad/' UNION ALL

    SELECT 'http://www.google.com/well/' UNION ALL

    SELECT 'http://www.yahoo.com/test/' UNION ALL

    SELECT 'http://www.yahoo.com/more/test/' UNION ALL

    SELECT 'http://www.yahoo.com/example/' UNION ALL

    SELECT 'http://www.bing.com/example.html' UNION ALL

    SELECT 'http://www.bing.com/example.php?variable=34234&hit="no"' UNION ALL

    SELECT 'http://www.failThis.com'

    ;WITH CTE (ID, URL, shortURL)

    AS

    (

    SELECT

    t.ID

    , t.URL

    , REPLACE(t.URL, 'http://', '') AS shortURL

    FROM@Table t

    )

    , DOMAIN (ID, URL, Domain)

    AS

    (

    SELECT

    t.ID

    , t.URL

    , LEFT(t.shortURL, ISNULL(NULLIF(PATINDEX('%/%', t.shortURL), 0), LEN(t.shortURL))) AS Domain

    FROMCTE t

    )

    SELECT * FROM

    (

    SELECT

    ID

    , URL

    , Domain

    , ROW_NUMBER() OVER(PARTITION BY Domain ORDER BY NEWID()) AS TieBreaker

    FROMDOMAIN

    ) dtRandom

    WHERE dtRandom.TieBreaker = 1

    ORDER BY dtRandom.Domain

    ROLLBACK

  • the table contains more than 100.000 urls....with every ending that is possible...

    maybe its better to make a new column with the domains ?

    like

    id,url,domain

    so would that be easier to do a easy select=?

  • schumichen (11/11/2011)


    the table contains more than 100.000 urls....with every ending that is possible...

    maybe its better to make a new column with the domains ?

    like

    id,url,domain

    so would that be easier to do a easy select=?

    Definitely. You can pull DISTINCT domains in first part & then join with table back to get TOP 1 url for each domain.

  • Dev (11/11/2011)


    schumichen (11/11/2011)


    the table contains more than 100.000 urls....with every ending that is possible...

    maybe its better to make a new column with the domains ?

    like

    id,url,domain

    so would that be easier to do a easy select=?

    Definitely. You can pull DISTINCT domains in first part & then join with table back to get TOP 1 url for each domain.

    Can't, he needs random URL.

  • schumichen (11/11/2011)


    the table contains more than 100.000 urls....with every ending that is possible...

    maybe its better to make a new column with the domains ?

    like

    id,url,domain

    so would that be easier to do a easy select=?

    What's wrong with the code I posted? Just replace the demo table with your base table and give it a go. It shouldn't be too long to run.

    If you want us to do perf tuning as well we need to know how many top level domains you have so we can build a sensible test dataset.

  • If you don't specify ORDER BY with TOP , the resultset order would be uncertain (random) ONLY.

  • Dev (11/11/2011)


    If you don't specify ORDER BY with TOP , the resultset order would be uncertain (random) ONLY.

    They won't be ramdom either.

    You want RAMDOM order by you need to ORDER BY RAMDOM.

  • Ninja's_RGR'us (11/11/2011)


    schumichen (11/11/2011)


    the table contains more than 100.000 urls....with every ending that is possible...

    maybe its better to make a new column with the domains ?

    like

    id,url,domain

    so would that be easier to do a easy select=?

    What's wrong with the code I posted? Just replace the demo table with your base table and give it a go. It shouldn't be too long to run.

    If you want us to do perf tuning as well we need to know how many top level domains you have so we can build a sensible test dataset.

    100,000 urls & use of string functions. It would be slow (as compared to new column approach).

  • Dev (11/11/2011)


    Ninja's_RGR'us (11/11/2011)


    schumichen (11/11/2011)


    the table contains more than 100.000 urls....with every ending that is possible...

    maybe its better to make a new column with the domains ?

    like

    id,url,domain

    so would that be easier to do a easy select=?

    What's wrong with the code I posted? Just replace the demo table with your base table and give it a go. It shouldn't be too long to run.

    If you want us to do perf tuning as well we need to know how many top level domains you have so we can build a sensible test dataset.

    100,000 urls & use of string functions. It would be slow (as compared to new column approach).

    Based on your extensive testing of this case?

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply