January 7, 2011 at 11:35 am
I'm working on correlations, so I have, say, 10 time series, from which I need to generate 10 x 10 single values. At the moment I join the table to itself and calculate 100 values, which works. The query plan shows all the time is spent in a hash match inner join (~15%), a hash match partial aggregate (~45%), and a hash match aggregate (~25%). When I have more than 10 series - say 10,000 - I get 100,000,000 rows to work with, which takes a long time...
Is there a smarter way? At present I batch up the calculation in a WHILE loop so it does, say, 50 x 10,000, then another 50 x 10,000, and so on. This keeps my 4-core CPU running flat out during batches, with some IO between batches, which seems OK to me, but I'm wondering if SQL Server is the right tool for this job. I build a JOIN that produces this:
A A
A B
A C
A D
A E
B A
B B
B C
B D
B E
C A
C B
etc
... and that seems to be what the hash matches are doing. Am I missing any obvious solution to a problem of this type?
Is this a candidate for a CLR C# approach at all?
Many thanks,
Alistair
------
Here's the key query:
INSERT INTO SumProd
SELECT Diff.Id, Diff2.Id, SUM(Diff.DiffFromAvg * Diff2.DiffFromAvg)
FROM Diff
INNER JOIN Diff Diff2
    ON Diff.Dt = Diff2.Dt
    AND Diff.Id >= Diff2.Id -- I only need half the results, so this >= produces that
WHERE Diff.Id >= @FirstId AND Diff.Id < @LastId -- Only some Ids are used
GROUP BY Diff.Id, Diff2.Id
I have structured things so that as little as possible happens in this mega-join stage. Diff looks like (Id smallint, Dt date, DiffFromAvg real, AnotherCol real) with a clustered index on (Id, Dt). SumProd has no index, to avoid a big index insert. Is there anything else I could do?
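For reference, the batching loop around that query looks roughly like this - a sketch rather than my exact code, assuming contiguous Ids for simplicity, with illustrative names for @BatchSize and @MaxId:

DECLARE @FirstId smallint, @LastId smallint, @BatchSize smallint, @MaxId smallint;
SET @FirstId = 1; -- illustrative starting point
SET @BatchSize = 50; -- series per batch
SELECT @MaxId = MAX(Id) FROM Diff;

WHILE @FirstId <= @MaxId
BEGIN
    SET @LastId = @FirstId + @BatchSize;

    INSERT INTO SumProd
    SELECT Diff.Id, Diff2.Id, SUM(Diff.DiffFromAvg * Diff2.DiffFromAvg)
    FROM Diff
    INNER JOIN Diff Diff2
        ON Diff.Dt = Diff2.Dt
        AND Diff.Id >= Diff2.Id
    WHERE Diff.Id >= @FirstId AND Diff.Id < @LastId
    GROUP BY Diff.Id, Diff2.Id;

    SET @FirstId = @LastId;
END;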
January 7, 2011 at 12:06 pm
Hi,
I don't quite understand what you want to accomplish, but to get cartesian products, you need to use CROSS JOIN.
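For example (just a sketch, borrowing the Diff table from your post), this produces every pair of Ids:

SELECT a.Id, b.Id
FROM (SELECT DISTINCT Id FROM Diff) AS a
CROSS JOIN (SELECT DISTINCT Id FROM Diff) AS b
ORDER BY a.Id, b.Id;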
Cheers
January 7, 2011 at 12:45 pm
Thanks - I just gave that a try to see if the query plan was any different, but it seems identical to what I'm doing. Running a longer test now, though...
January 7, 2011 at 12:47 pm
It's not actually a cartesian product once the criteria are applied, so an INNER JOIN and a CROSS JOIN with the same criteria will produce the same plan and the same results.
On truly huge datasets, like what you're talking about, the solution you're using is pretty standard.
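For example, these two forms should compile to the same plan - a sketch based on the query you posted:

-- INNER JOIN with the predicate in the ON clause:
SELECT d1.Id, d2.Id, SUM(d1.DiffFromAvg * d2.DiffFromAvg)
FROM Diff d1
INNER JOIN Diff d2
    ON d1.Dt = d2.Dt AND d1.Id >= d2.Id
GROUP BY d1.Id, d2.Id;

-- CROSS JOIN with the same predicate moved to the WHERE clause:
SELECT d1.Id, d2.Id, SUM(d1.DiffFromAvg * d2.DiffFromAvg)
FROM Diff d1
CROSS JOIN Diff d2
WHERE d1.Dt = d2.Dt AND d1.Id >= d2.Id
GROUP BY d1.Id, d2.Id;

The optimizer normalizes both to the same join before it builds the plan, which is why you saw no difference.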
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 7, 2011 at 1:02 pm
Thanks - that's reassuring... so it's not worthwhile handing this off to some procedural language that's good at loops?
January 7, 2011 at 1:38 pm
You could probably pass the looping to one, but the gains will almost certainly be microscopic in the scope of the code.
A simple T-SQL WHILE loop will probably be good enough. It'll be milliseconds slower than a .NET loop, but I'd bet that shaving milliseconds off this process isn't what you're looking for.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
January 7, 2011 at 1:57 pm
I'd probably only consider a .NET approach if I could somehow get savings of 30% or so. My WHILE loop only has a few iterations, so I'm wondering about the table join I'm doing inside it - presumably SQL Server is doing a bunch of clever stuff in there. Is any of it redundant, and likely to be bettered by a single-purpose routine that just reads the array of 10,000 series in a nested loop and passes back the 100,000,000 results?
Seems to me like there would be an awful lot of overhead in that, and a whole new set of memory issues, but I don't know anything about using the CLR from within the DB - maybe it's awesome?
January 10, 2011 at 7:18 am
CLR that accesses the database uses the same query engine as anything else that accesses the database. It won't query any faster, insert any faster, etc. What it does do is give you access to code that isn't part of T-SQL, like regexes, more efficient loops and cursors, access to DLLs and web services outside the scope of the database, and so on.
Since what you're doing is just DML, I don't see any advantage to moving it to CLR.
What you might try is different batch sizes in your loop. See if you can optimize that way.
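Something along these lines, as a rough timing harness - the names here are illustrative, not from your code:

DECLARE @BatchSize smallint;
SET @BatchSize = 100; -- try e.g. 25, 50, 100, 250 in separate runs

DECLARE @Start datetime;
SET @Start = GETDATE();

-- ... your existing WHILE loop goes here, stepping @FirstId by @BatchSize ...

SELECT DATEDIFF(ms, @Start, GETDATE()) AS ElapsedMs;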
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon