November 11, 2011 at 5:49 am
Hi
i have a database with the following format:
Table: URL_LIST
Format: Id,url
content:
1 | http://www.google.com/nice_page
2 | http://www.google.com/good/
3 | http://www.google.com/bad/
4 | http://www.google.com/well/
5 | http://www.yahoo.com/test/
6 | http://www.yahoo.com/more/test/
7 | http://www.yahoo.com/example/
8 | http://www.bing.com/example.html
9 | http://www.bing.com/example.php?variable=34234&hit="no"
my first question is how can i get just one url of each domain?
second: how can i get a random url out of each domain?
so my result set should looks like:
2 | http://www.google.com/good/
5 | http://www.yahoo.com/test/
9 | http://www.bing.com/example.php?variable=34234&hit="no"
Thank You
November 11, 2011 at 6:35 am
ORDER BY newid() is the way to get randomized data.
you've got to do a few things: first, extract the domain from the urls...in my example, as long is everything is .com, mine wors...add .net/.org/edu, and you've got more work to do.
to get one form each, you need to use row_number and partition by.
;With URL_LIST(Id,url)
AS
(
SELECT '1',' http://www.google.com/nice_page' UNION ALL
SELECT '2',' http://www.google.com/good/' UNION ALL
SELECT '3',' http://www.google.com/bad/' UNION ALL
SELECT '4',' http://www.google.com/well/' UNION ALL
SELECT '5',' http://www.yahoo.com/test/' UNION ALL
SELECT '6',' http://www.yahoo.com/more/test/' UNION ALL
SELECT '7',' http://www.yahoo.com/example/' UNION ALL
SELECT '8',' http://www.bing.com/example.html' UNION ALL
SELECT '9',' http://www.bing.com/example.php?variable=34234&hit="no"'
)
, FormattedDomain as
(
SELECT SUBSTRING(url,1,charindex('.com',url)+4) as thedomain,*
from URL_LIST)
SELECT * FROM (
SELECT ROW_NUMBER() OVER(PARTITION BY thedomain order by newid()) AS RW,*
FROM FormattedDomain)MyAlias
WHERE MyAlias.RW = 1
Lowell
November 11, 2011 at 6:37 am
If the list of top-level domains in your table is small, you can do something like this easily:
select top 1 ID, URL from URL_LIST where URL like '%google.com%' order by NEWID()
select top 1 ID, URL from URL_LIST where URL like '%yahoo.com%' order by NEWID()
select top 1 ID, URL from URL_LIST where URL like '%bing.com%' order by NEWID()
_________________________________
seth delconte
http://sqlkeys.com
November 11, 2011 at 6:38 am
Too late now but I was going with Jeff's splitter.
Now that I think of it just using left (patindex) makes more sense to find the end character.
You can't use .com because there are way more extensions than that!
BEGIN TRAN
DECLARE @Table TABLE (ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, URL VARCHAR(500) NOT NULL)
INSERT INTO @Table (URL)
SELECT 'http://www.google.com/nice_page'
UNION ALL
SELECT 'http://www.google.com/good/'
UNION ALL
SELECT 'http://www.google.com/bad/'
UNION ALL
SELECT 'http://www.google.com/well/'
UNION ALL
SELECT 'http://www.yahoo.com/test/'
UNION ALL
SELECT 'http://www.yahoo.com/more/test/'
UNION ALL
SELECT 'http://www.yahoo.com/example/'
UNION ALL
SELECT 'http://www.bing.com/example.html'
UNION ALL
SELECT 'http://www.bing.com/example.php?variable=34234&hit="no"'
SELECT dtRandom.ID, dtRandom.Domain, dtRandom.URL FROM (
SELECT
t.ID
, t.URL
, dtSplit.ItemNumber
, dtSplit.Item AS Domain
, ROW_NUMBER() OVER(PARTITION BY dtSplit.Item ORDER BY NEWID()) AS TieBreaker
FROM@Table t
OUTER APPLY (SELECT ItemNumber, Item FROM dbo.DelimitedSplit8K(REPLACE(t.URL, 'http://', ''), '/')) dtSplit
WHEREdtSplit.ItemNumber = 1
) dtRandom
WHERE dtRandom.TieBreaker = 1
ORDER BY dtRandom.Domain
ROLLBACK
Splitter function http://www.sqlservercentral.com/articles/Tally+Table/72993/
November 11, 2011 at 6:42 am
Lowell (11/11/2011)
ORDER BY newid() is the way to get randomized data.you've got to do a few things: first, extract the domain from the urls...in my example, as long is everything is .com, mine wors...add .net/.org/edu, and you've got more work to do.
to get one form each, you need to use row_number and partition by.
;With URL_LIST(Id,url)
AS
(
SELECT '1',' http://www.google.com/nice_page' UNION ALL
SELECT '2',' http://www.google.com/good/' UNION ALL
SELECT '3',' http://www.google.com/bad/' UNION ALL
SELECT '4',' http://www.google.com/well/' UNION ALL
SELECT '5',' http://www.yahoo.com/test/' UNION ALL
SELECT '6',' http://www.yahoo.com/more/test/' UNION ALL
SELECT '7',' http://www.yahoo.com/example/' UNION ALL
SELECT '8',' http://www.bing.com/example.html' UNION ALL
SELECT '9',' http://www.bing.com/example.php?variable=34234&hit="no"'
)
, FormattedDomain as
(
SELECT SUBSTRING(url,1,charindex('.com',url)+4) as thedomain,*
from URL_LIST)
SELECT * FROM (
SELECT ROW_NUMBER() OVER(PARTITION BY thedomain order by newid()) AS RW,*
FROM FormattedDomain)MyAlias
WHERE MyAlias.RW = 1
Nice!
_________________________________
seth delconte
http://sqlkeys.com
November 11, 2011 at 6:51 am
thank's Ninja; your example made me realize the third slash ends the domain....
here's my example again tweaked with that realization: this will work for any domain extention now
;With URL_LIST(Id,url)
AS
(
SELECT '1',' http://www.google.com/nice_page' UNION ALL
SELECT '2',' http://www.google.com/good/' UNION ALL
SELECT '3',' http://www.google.com/bad/' UNION ALL
SELECT '4',' http://www.google.com/well/' UNION ALL
SELECT '5',' http://www.yahoo.com/test/' UNION ALL
SELECT '6',' http://www.yahoo.com/more/test/' UNION ALL
SELECT '7',' http://www.yahoo.com/example/' UNION ALL
SELECT '8',' http://www.bing.com/example.html' UNION ALL
SELECT '9',' http://www.bing.com/example.php?variable=34234&hit="no"'
)
, FormattedDomain as
(
SELECT SUBSTRING(REPLACE(url,'http://',''),1,charindex('/',REPLACE(url,'http://',''))) as thedomain,*
from URL_LIST)
SELECT * FROM (
SELECT ROW_NUMBER() OVER(PARTITION BY thedomain order by newid()) AS RW,*
FROM FormattedDomain)MyAlias
WHERE MyAlias.RW = 1
Lowell
November 11, 2011 at 6:54 am
Yes but now we have to make it work when there's no slash!!!
As always... string manipulation gets complicated real fast!
BEGIN TRAN
DECLARE @Table TABLE (ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, URL VARCHAR(500) NOT NULL)
INSERT INTO @Table (URL)
SELECT 'http://www.google.com/nice_page' UNION ALL
SELECT 'http://www.google.com/good/' UNION ALL
SELECT 'http://www.google.com/bad/' UNION ALL
SELECT 'http://www.google.com/well/' UNION ALL
SELECT 'http://www.yahoo.com/test/' UNION ALL
SELECT 'http://www.yahoo.com/more/test/' UNION ALL
SELECT 'http://www.yahoo.com/example/' UNION ALL
SELECT 'http://www.bing.com/example.html' UNION ALL
SELECT 'http://www.bing.com/example.php?variable=34234&hit="no"' UNION ALL
SELECT 'http://www.failThis.com'
;WITH CTE (ID, URL, shortURL)
AS
(
SELECT
t.ID
, t.URL
, REPLACE(t.URL, 'http://', '') AS shortURL
FROM@Table t
)
, DOMAIN (ID, URL, Domain)
AS
(
SELECT
t.ID
, t.URL
, LEFT(t.shortURL, ISNULL(NULLIF(PATINDEX('%/%', t.shortURL), 0), LEN(t.shortURL))) AS Domain
FROMCTE t
)
SELECT * FROM
(
SELECT
ID
, URL
, Domain
, ROW_NUMBER() OVER(PARTITION BY Domain ORDER BY NEWID()) AS TieBreaker
FROMDOMAIN
) dtRandom
WHERE dtRandom.TieBreaker = 1
ORDER BY dtRandom.Domain
ROLLBACK
November 11, 2011 at 7:21 am
the table contains more than 100.000 urls....with every ending that is possible...
maybe its better to make a new column with the domains ?
like
id,url,domain
so would that be easier to do a easy select=?
November 11, 2011 at 7:28 am
schumichen (11/11/2011)
the table contains more than 100.000 urls....with every ending that is possible...maybe its better to make a new column with the domains ?
like
id,url,domain
so would that be easier to do a easy select=?
Definitely. You can pull DISTINCT domains in first part & then join with table back to get TOP 1 url for each domain.
November 11, 2011 at 7:30 am
Dev (11/11/2011)
schumichen (11/11/2011)
the table contains more than 100.000 urls....with every ending that is possible...maybe its better to make a new column with the domains ?
like
id,url,domain
so would that be easier to do a easy select=?
Definitely. You can pull DISTINCT domains in first part & then join with table back to get TOP 1 url for each domain.
Can't, he needs random URL.
November 11, 2011 at 7:31 am
schumichen (11/11/2011)
the table contains more than 100.000 urls....with every ending that is possible...maybe its better to make a new column with the domains ?
like
id,url,domain
so would that be easier to do a easy select=?
What's wrong with the code I posted? Just replace the demo table with your base table and give it a go. It shouldn't be too long to run.
If you want us to do perf tuning as well we need to know how many top level domains you have so we can build a sensible test dataset.
November 11, 2011 at 7:34 am
If you don't specify ORDER BY with TOP , the resultset order would be uncertain (random) ONLY.
November 11, 2011 at 7:36 am
Dev (11/11/2011)
If you don't specify ORDER BY with TOP , the resultset order would be uncertain (random) ONLY.
They won't be ramdom either.
You want RAMDOM order by you need to ORDER BY RAMDOM.
November 11, 2011 at 7:38 am
Ninja's_RGR'us (11/11/2011)
schumichen (11/11/2011)
the table contains more than 100.000 urls....with every ending that is possible...maybe its better to make a new column with the domains ?
like
id,url,domain
so would that be easier to do a easy select=?
What's wrong with the code I posted? Just replace the demo table with your base table and give it a go. It shouldn't be too long to run.
If you want us to do perf tuning as well we need to know how many top level domains you have so we can build a sensible test dataset.
100,000 urls & use of string functions. It would be slow (as compared to new column approach).
November 11, 2011 at 7:42 am
Dev (11/11/2011)
Ninja's_RGR'us (11/11/2011)
schumichen (11/11/2011)
the table contains more than 100.000 urls....with every ending that is possible...maybe its better to make a new column with the domains ?
like
id,url,domain
so would that be easier to do a easy select=?
What's wrong with the code I posted? Just replace the demo table with your base table and give it a go. It shouldn't be too long to run.
If you want us to do perf tuning as well we need to know how many top level domains you have so we can build a sensible test dataset.
100,000 urls & use of string functions. It would be slow (as compared to new column approach).
Based on your extensive testing of this case?
Viewing 15 posts - 1 through 15 (of 18 total)
You must be logged in to reply to this topic. Login to reply