June 24, 2003 at 10:33 am
Hi there,
I have the following tables:
UserActivity (userId, date, col1, col2, col3)
UserImpression (userId, date, cola, colb, colc)
UserDate (userId, latestDate)
UserActivity has 10 million rows.
UserImpression has 100 million rows.
UserDate has 70 million rows.
UserImpression and UserActivity store information keyed on the composite primary key of (userId, date).
UserDate stores the latest date on which any information for a particular userId was stored, hence a primary key on userId alone.
These tables need to be joined together as follows:
select I.*
from UserActivity A
join UserDate D on D.userId = A.userId
and D.date = A.date
join UserImpression I on D.userId = I.userId and D.latestDate = I.date
If this query returns 70 million rows, what is a reasonable expected runtime?
I'm currently running this on a machine with 2 GB of memory and four PIII Xeon 700 MHz processors.
Due to the enormous size and frequency of data turnover in UserImpression, I have created that table as a partitioned view with a check constraint on the date column. Each day holds approximately 30 million rows, and 30 days of data are stored.
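For reference, a minimal sketch of that kind of setup (the member table name UserImpression_20030601 and the column datatypes are assumptions; the real view unions one member table per day):

create table UserImpression_20030601 (
    userId int not null,
    [date] datetime not null
        check ([date] >= '20030601' and [date] < '20030602'),
    cola int, colb int, colc int,
    constraint PK_UI_20030601 primary key (userId, [date])
)
-- ...29 more member tables, one per day, with non-overlapping check constraints...

create view UserImpression
as
select * from UserImpression_20030601
union all
select * from UserImpression_20030602
-- ...union all over the remaining daily tables...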
Please help.
June 24, 2003 at 10:42 am
No idea. Kinda overwhelmed at the idea of actually returning 70 million rows - what do you do with them? You could try just returning a subset and extrapolating from there.
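If you want to try that, something along these lines would time a fixed-size slice you could extrapolate from (the TOP value is arbitrary, and I'm assuming the UserDate join column is latestDate as described above):

select top 1000000 I.*
from UserActivity A
join UserDate D on D.userId = A.userId and D.latestDate = A.date
join UserImpression I on D.userId = I.userId and D.latestDate = I.date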
Andy
June 24, 2003 at 11:23 am
Actually, I didn't express that correctly: I would return the 70 million rows into a new table and eventually run some aggregating queries on that.
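For what it's worth, a sketch of that pattern (the target table name UserImpressionLatest is made up; select ... into is minimally logged under the simple or bulk-logged recovery model, which matters at this volume):

select I.*
into UserImpressionLatest
from UserActivity A
join UserDate D on D.userId = A.userId and D.latestDate = A.date
join UserImpression I on D.userId = I.userId and D.latestDate = I.date

-- later, aggregate from the materialized copy, for example:
select userId, count(*) as impressions
from UserImpressionLatest
group by userId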
June 25, 2003 at 12:54 am
The ultimate wet dream would be getting this one sub-second. I'd recommend matching the indexes (clustered?) and trying to get them in RAM (more than 2 GB). Of course you'll be disk dependent because of massive I/O operations (at least for UserImpression).
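By "matching" I mean clustered indexes that line up with the join columns, something like this sketch (assuming the existing primary keys are nonclustered or can be rebuilt; the member table name is hypothetical since UserImpression is a partitioned view):

create unique clustered index IX_UserDate on UserDate (userId, latestDate)
create unique clustered index IX_UserActivity on UserActivity (userId, [date])
-- cluster each daily member table of the UserImpression view, e.g.:
create unique clustered index IX_UI_20030601 on UserImpression_20030601 (userId, [date])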
Johan
June 25, 2003 at 4:01 am
Normally such operations are needed for data-warehousing sorts of applications. But even there, the information can often be better represented with statistical summaries such as sums or frequency tables. If you can post the exact requirement, I think many here will be able to help you out.
Instead of a join, if you go for EXISTS in the WHERE clause, it will not block the tables as a whole, since the subquery runs row by row (even if it then takes one more hour, joke).
Regards
John
select I.*
from UserImpression I
where exists (
    select *
    from UserActivity A, UserDate D
    where D.userId = I.userId
      and D.latestDate = I.date
      and D.userId = A.userId
      and D.latestDate = A.date
)
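And as a sketch of the summarizing idea above (the aggregates chosen here are arbitrary examples):

select A.userId, count(*) as activityRows, max(A.[date]) as lastSeen
from UserActivity A
group by A.userId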
June 25, 2003 at 4:05 am
First off, based on
quote:
UserDate (userId, latestDate)
should this line
quote:
join UserDate D on D.userId = A.userId and D.date = A.date
actually be
join UserDate D on D.userId = A.userId
and D.latestDate = A.date
and if so, then why not just do
SELECT I.*
FROM UserActivity A
INNER JOIN UserImpression I
    ON A.userId = I.userId
    AND A.[date] = I.[date]
Even so, based on this
quote:
UserImpression (userId, date, cola, colb, colc)
70 million rows may or may not fit in 2 GB. I don't know the widths of userId, date, cola, colb, and colc, so I can't tell you whether you're running into any memory bottlenecks here.
But you come back and say
quote:
I would return the 70 million rows into a new table
To get maximum insert performance, make sure all constraints, triggers, and indexes are disabled or removed from this table, then add or enable them after the insert is done.
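A sketch of that pattern (table and index names are hypothetical; on this version of SQL Server, nonclustered indexes are dropped and recreated rather than disabled):

-- before the load
drop index UserImpressionLatest.IX_UserImpressionLatest
alter table UserImpressionLatest nocheck constraint all
alter table UserImpressionLatest disable trigger all

-- ...run the big insert...

-- after the load
alter table UserImpressionLatest check constraint all
alter table UserImpressionLatest enable trigger all
create index IX_UserImpressionLatest on UserImpressionLatest (userId, [date])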
Also I assume your indexes would be
UserDate (userId, latestDate) CLUSTERED
UserImpression (userId, date) CLUSTERED
UserActivity (userId, date) CLUSTERED
To maximize join performance, if you always use userId and latestDate/date to join these tables and work with them, I would also consider making (datefield, userId) the order of the columns in the index. It is best to have your most selective column first in a composite (multicolumn) index, because only the first column is sampled to build the statistics.
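For example (hypothetical; WITH DROP_EXISTING assumes an index of that name already exists):

-- recluster UserActivity with the date column leading
create unique clustered index IX_UserActivity on UserActivity ([date], userId)
    with drop_existing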
Another thing, which I assume you have already handled: the columns used in the join work fastest when they are the same datatype. Otherwise SQL Server will convert implicitly, or you will need to perform an explicit CAST (which is better than letting SQL Server convert implicitly).
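For instance, if userId were an int in one table and a varchar in the other (a hypothetical mismatch):

select I.*
from UserDate D
join UserImpression I
    on cast(D.userId as int) = I.userId  -- explicit cast instead of an implicit conversion
    and D.latestDate = I.date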
Other than that it is a matter of execution plan and tweaking to see what can be done.
June 25, 2003 at 5:28 am
If you do run into performance problems due to a lack of memory, consider breaking up the whole insert into smaller chunks.
Should be fairly easy, based on the description you gave of the UserDate table.
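For example, one batch per day, which also lines up with the partitioned view's daily slices (the target table name and the date window are assumptions):

declare @d datetime
select @d = '20030601'  -- hypothetical first day of the 30-day window
while @d < '20030701'
begin
    insert into UserImpressionLatest
    select I.*
    from UserActivity A
    join UserDate D on D.userId = A.userId and D.latestDate = A.date
    join UserImpression I on D.userId = I.userId and D.latestDate = I.date
    where D.latestDate >= @d and D.latestDate < dateadd(day, 1, @d)
    select @d = dateadd(day, 1, @d)
end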