July 22, 2005 at 3:56 pm
If this is not the correct forum, my apologies.
I have a table that stores log data of file downloads. Each record includes a user IP and a date/time stamp. I want to eliminate all duplicate records that occur within a 5 minute period. So far, I have figured out how to declare a cursor and cycle through the records (using FETCH and a WHILE loop), storing the fields I need to compare and successfully performing the comparison.
My question is this: I can use the print function to return results, but how do I return the data as a recordset to the calling script?
Here is my code - be warned, I am a relative newbie & may be doing something stupid.
Use encor5
Go

declare @OldTime smalldatetime
declare @OldIP nvarchar(50)
declare @OldSong nvarchar(255)
declare @CurrTime smalldatetime
declare @CurrIP nvarchar(50)
declare @CurrSong nvarchar(255)
declare @CurrFile nvarchar(255)
declare @CurrComposer nvarchar(100)
declare @CurrArtist nvarchar(100)
declare @CurrEntertainer nvarchar(100)

declare SongLogs cursor for
SELECT tblSongLogs.Request_Date_Time, tblSongLogs.User_IP, tblSongs.name,
    tblSongs.[file], tblSongs.Composer, tblSongs.[Original Artist], tblEntertainers.name
FROM tblEntertainers RIGHT OUTER JOIN
    tblSongs ON tblEntertainers.id = tblSongs.entertainer_id RIGHT OUTER JOIN
    tblSongLogs ON tblSongs.id = tblSongLogs.Song_id
ORDER BY tblSongLogs.User_IP, tblSongLogs.Request_Date_Time

Open SongLogs
FETCH NEXT FROM SongLogs INTO @CurrTime, @CurrIP, @CurrSong, @CurrFile, @CurrComposer, @CurrArtist, @CurrEntertainer

WHILE @@FETCH_STATUS = 0
BEGIN
    -- A duplicate is the same song requested by the same IP within 5 minutes of the previous row
    If ((@OldSong = @CurrSong) and (@OldIP = @CurrIP) and DateDiff(mi, @OldTime, @CurrTime) < 5)
    BEGIN
        print '***** Duplicate Found of ' + @CurrSong + '. *****'
    END
    Else
    BEGIN
        print @CurrSong + ' , ' + CAST(DateDiff(mi, @OldTime, @CurrTime) AS nvarchar(100))
    END

    -- Remember this row for comparison with the next one
    Set @OldSong = @CurrSong
    Set @OldIP = @CurrIP
    Set @OldTime = @CurrTime

    FETCH NEXT FROM SongLogs INTO @CurrTime, @CurrIP, @CurrSong, @CurrFile, @CurrComposer, @CurrArtist, @CurrEntertainer
END

CLOSE SongLogs
DEALLOCATE SongLogs
Am I even on the right track?
Matthew K. Lindley
July 22, 2005 at 11:47 pm
To make things efficient, try rewriting to not use a cursor. I'll have a quick shot at it now...
But: when you say "eliminate all duplicate records that occur within a 5 minute period", it appears that a duplicate record is one where the same IP address has requested the same song within a 5 minute period. Do you want these runs to be removed entirely, collapsed down to the first request in each run, or collapsed down to the last request in each run?
If you want the third option (forgive me - I'm tired, but on review of this message I think that's what you want), then, starting from your SQL
SELECT tblSongLogs.Request_Date_Time, tblSongLogs.User_IP, tblSongs.name,
tblSongs.[file], tblSongs.Composer, tblSongs.[Original Artist], tblEntertainers.name
FROM tblEntertainers RIGHT OUTER JOIN
tblSongs ON tblEntertainers.id = tblSongs.entertainer_id RIGHT OUTER JOIN
tblSongLogs ON tblSongs.id = tblSongLogs.Song_id
ORDER BY tblSongLogs.User_IP, tblSongLogs.Request_Date_Time
we replace the tblSongLogs table with a derived table (a parenthesised SELECT that the server will treat as a table reference)..
select log.request_date_time, log.user_ip, log.song_id
from tblSongLogs log
left join tblSongLogs logFuture
    on log.user_ip = logFuture.user_ip
    and log.song_id = logFuture.song_id
    and logFuture.request_date_time between log.request_date_time and DateAdd(minute, 5, log.request_date_time)
where logFuture.request_date_time is null
This essentially takes the log table and joins it to itself, with the join condition being to join with records containing the same user, same song within a period between the log's time and the log's time plus 5 minutes.
The where clause then limits the rows returned to those that did not have a join (ie, no songs in the future 5 minute period).
This is probably the most efficient way that I can think of to perform the filtering you require. Make sure that you have an index on your song log table that covers user_ip, song_id and request_date_time (in that order).
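As a sketch, that index could be created like this (the index name is just an illustration):

```sql
-- Hypothetical index name; the two equality columns from the self-join
-- come first, then the range-scanned datetime column.
CREATE INDEX IX_tblSongLogs_UserSongTime
    ON tblSongLogs (User_IP, Song_id, Request_Date_Time)
```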
You could recode this using "not exists" or even "not in", but past experience has usually shown that although the outer join (here it is a left join) tends to look a bit stranger, because of the need for the WHERE [fieldname] IS NULL clause, it also tends to perform faster.
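For comparison, a NOT EXISTS version of the same filter might look like this (a sketch only, reusing the table and column names from above; it expresses exactly the same condition as the left join / IS NULL form):

```sql
-- Keep only log rows with no repeat of the same song from the same IP
-- within the following 5 minutes (same predicate as the anti-join form)
select log.request_date_time, log.user_ip, log.song_id
from tblSongLogs log
where not exists (
    select *
    from tblSongLogs logFuture
    where logFuture.user_ip = log.user_ip
      and logFuture.song_id = log.song_id
      and logFuture.request_date_time between log.request_date_time
          and DateAdd(minute, 5, log.request_date_time)
)
```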
In any case, this is the entire SQL statement that you should run...
SELECT tblSongLogs.Request_Date_Time, tblSongLogs.User_IP, tblSongs.name,
    tblSongs.[file], tblSongs.Composer, tblSongs.[Original Artist], tblEntertainers.name
FROM tblEntertainers RIGHT OUTER JOIN
    tblSongs ON tblEntertainers.id = tblSongs.entertainer_id RIGHT OUTER JOIN
    (
        select log.request_date_time, log.user_ip, log.song_id
        from tblSongLogs log
        left join tblSongLogs logFuture
            on log.user_ip = logFuture.user_ip
            and log.song_id = logFuture.song_id
            and logFuture.request_date_time between log.request_date_time and DateAdd(minute, 5, log.request_date_time)
        where logFuture.request_date_time is null
    ) tblSongLogs ON tblSongs.id = tblSongLogs.Song_id
ORDER BY tblSongLogs.User_IP, tblSongLogs.Request_Date_Time
Then you needn't worry about cursors, etc. By the way, to return the data using your cursor, you would need to insert the desired rows one by one into a temporary table and then select all data from that table...
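A minimal sketch of that temp-table pattern, assuming the same variables as your cursor (the table and column names here are illustrative):

```sql
-- Hypothetical sketch: collect rows in a temp table instead of PRINTing,
-- then return them to the caller as a single recordset.
CREATE TABLE #Results (
    Request_Date_Time smalldatetime,
    User_IP nvarchar(50),
    Song nvarchar(255)
)

-- ...inside the WHILE loop, instead of print, do:
INSERT INTO #Results (Request_Date_Time, User_IP, Song)
VALUES (@CurrTime, @CurrIP, @CurrSong)

-- ...then after CLOSE/DEALLOCATE, the final SELECT is what the
-- calling script receives as its recordset:
SELECT * FROM #Results
DROP TABLE #Results
```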
Hope this helps!!
July 24, 2005 at 11:48 am
Ok, I am trying this again. Yesterday, I tried to make a post, but took so long that when I went to preview it, I lost everything I had done. No, I was not pleased.
Anyway, thank you for your help. Unfortunately, it did not work - it returns an empty set. However, it seems that you were close, so I started working with the portion of the query that finds all of the duplicate entries.
Also, to simplify things, I created a view that combines the data from the relevant tables. Just for reference, here is the view vw_SongViews:
SELECT tblSongLogs.Request_Date_Time, tblSongLogs.User_IP, tblSongs.name AS [Song Name],
    tblSongs.[file] AS [Song File], tblSongs.Composer, tblSongs.[Original Artist],
    tblEntertainers.name AS Entertainer, tblSongLogs.Song_id, tblSongLogs.id AS [Log ID]
FROM tblEntertainers RIGHT OUTER JOIN
    tblSongs ON tblEntertainers.id = tblSongs.entertainer_id RIGHT OUTER JOIN
    tblSongLogs ON tblSongs.id = tblSongLogs.Song_id
ORDER BY tblSongLogs.User_IP, tblSongLogs.Request_Date_Time
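(For completeness: SQL Server will not accept an ORDER BY inside a view definition unless TOP is also present, so the actual CREATE VIEW statement presumably looks roughly like this, with the TOP 100 PERCENT that Enterprise Manager adds automatically:)

```sql
-- Sketch of the view wrapper; TOP 100 PERCENT is the usual idiom that
-- makes the ORDER BY legal inside a view definition.
CREATE VIEW vw_SongViews
AS
SELECT TOP 100 PERCENT
    tblSongLogs.Request_Date_Time, tblSongLogs.User_IP,
    tblSongs.name AS [Song Name], tblSongs.[file] AS [Song File],
    tblSongs.Composer, tblSongs.[Original Artist],
    tblEntertainers.name AS Entertainer,
    tblSongLogs.Song_id, tblSongLogs.id AS [Log ID]
FROM tblEntertainers RIGHT OUTER JOIN
    tblSongs ON tblEntertainers.id = tblSongs.entertainer_id RIGHT OUTER JOIN
    tblSongLogs ON tblSongs.id = tblSongLogs.Song_id
ORDER BY tblSongLogs.User_IP, tblSongLogs.Request_Date_Time
```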
Ok, then I tried adapting the "duplicates" portion of the query to use this view. What I came up with is this:
SELECT logCurrent.*
FROM vw_SongViews logCurrent
LEFT JOIN vw_SongViews logFuture
    ON logCurrent.User_IP = logFuture.User_IP
    AND logCurrent.Song_id = logFuture.Song_id
    AND logFuture.Request_Date_Time BETWEEN logCurrent.Request_Date_Time AND DateAdd(minute, 5, logCurrent.Request_Date_Time)
My expectation was that if I could get a set of repeat song requests from the same IP returned, I would be a little closer to my goal. Removing the "WHERE" clause gave me some results. The problem appeared to be that this query returned every row of the log file, whether or not it had any duplicate entries within the specified time span. Looking a little closer, I found that this query did produce duplicate records for certain log records. By that I mean that some Log IDs would appear once, some twice, some ten times, etc.
By cross checking this with the raw log data, I found that it was producing duplicate log entries for all log files that had duplicates within the 5 minute time span. Or so it seemed.
For example, if I have log data for the following records:
Request_Date_Time    User_IP         Song Name    Song File    Composer      Original Artist  Entertainer   Song_id  Log ID
2005-06-14 13:17:00  12.119.150.134  City Lights  cityligh.rm  Emerald City  Emerlad City     Emerald City  261      950
2005-06-14 13:17:00  12.119.150.134  City Lights  cityligh.rm  Emerald City  Emerlad City     Emerald City  261      951
2005-06-14 13:20:00  12.119.150.134  City Lights  cityligh.rm  Emerald City  Emerlad City     Emerald City  261      952
2005-06-14 13:21:00  12.119.150.134  Love Shack   rrshack.mp3  NULL          NULL             Roof Raisers  323      953
Then the query would return records (Log ID's) 950 and 951 three times each, and would return record 953 only one time. This is because for each of the two log records (950-951) there are three matches for the query - each record matches itself, and the other two are also within the specified time range. Record 953 was not a duplicate request for the City Lights song, so it only appears one time, as you would expect. The oddball here is record 952. This record was returned only one time.
Try as I might, I cannot puzzle that out. I had thought that if the query was returning these duplicates in a predictable manner, I might be able to use that somehow. But this behavior does not seem to follow the logic of the query. I would have expected to see all three records duplicated 3 times, since each of them occurs within a 5 minute period. Indeed, in other places, where there are three requests from the same IP for the same song, all in less than 1 minute, each Log ID is returned 3 times. The strange behavior only pops up when there is at least 1 minute separating the log entries.
I would not have thought that 1 minute would make a difference, since the criteria specify a 5 minute time span. If I remove the time portion of the query, then I get predictable results - well, more predictable to me. Each duplicate request log is returned as many times as there are duplicates (i.e. 5 requests from the same IP for the same song will return 25 records, 5 duplicates of each of the 5 log records). Since I need to count repeated downloads when they were not within a 5 minute period, this does not help me.
I think I was closer with the cursor solution. I hate that, because it seems clumsy. Maybe I would be better off just comparing the records in my ASP. I'll research creating a temporary table while using the cursor. Since I have that logic working (I think), it might be the easiest solution for me at this point.
Thank you for your help, anyway. I knew this was a tough question before I even started working on it. Now that I am thinking about it, I wonder why I thought it would be easier to do these computations on the SQL server, rather than in my own ASP. It would not necessarily be faster, would it?
Matthew K. Lindley