November 7, 2006 at 2:15 pm
Hi all,
I'm working on a side project that does a tremendous amount of pure data crunching. The workhorse table is 100M rows, can't be altered/parsed/simplified/etc., and the pertinent SQL uses a few self-joins. Yes, it's ugly, but that's just life for this project...
So, does anyone know of an on-line service that has some serious computational horsepower that can be rented for the short-term? Ideally this would simply let me upload a backup of the database (SQL Server 2000), run some SQL to load a results table, and then download the results. I'm really fishing for some solid references, especially for sites that you have done business with in the past.
Much Thanks!
November 8, 2006 at 11:34 am
How about giving us a peek at those self-joins.
What kind of hardware are you running on right now?
Do you think just raw CPU power will solve your problems? Maybe it's the disk or not enough RAM.
How long do you estimate it will take on your actual hardware? 1 hour? 1 day? 1 month? 1 year?
Maybe doing some procedural code in a foreign language would solve the problem in no time.
What is the problem anyway?
November 8, 2006 at 11:37 am
All great questions. 100M rows on SQL Server today is not that huge. Depending on what the problem is, maybe Analysis Services could be of assistance as well.
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."
November 8, 2006 at 12:03 pm
Thanks for the input! The project is... Netflix of course!
The issue is that all I have is a single desktop running SQL Server, and it's going to take on the order of weeks to calculate a result set. This is totally a play project, so I'm not too interested in sinking more money into hardware upgrades that could help with Netflix but would be completely unnecessary for my day-to-day needs.
The theory is pretty simple. I'm given a user and a movie, and need to estimate what rating the user will give to the movie. This estimate is based purely on a 100M-row dataset of other user\movie\rating records.
I've done some preliminary work to optimize this, but it still requires churning through a lot of data. So I'm really just looking for a quick and simple way to wade through this calculation.
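To give a flavour of the calculation (the schema here is made up for illustration - Ratings(CustomerId, MovieId, Rating, RatingDate) - not my real tables), a naive baseline would blend the movie's average rating with how far this user typically rates above or below the movie averages, something like:

-- Illustrative only: placeholder table/column names and example ids
DECLARE @CustomerId int, @MovieId int
SET @CustomerId = 12345
SET @MovieId = 678

SELECT
    (SELECT AVG(CAST(Rating AS float))
       FROM Ratings WHERE MovieId = @MovieId)
  + (SELECT AVG(CAST(r.Rating AS float) - m.AvgRating)
       FROM Ratings r
       JOIN (SELECT MovieId, AVG(CAST(Rating AS float)) AS AvgRating
               FROM Ratings GROUP BY MovieId) m
         ON m.MovieId = r.MovieId
      WHERE r.CustomerId = @CustomerId) AS PredictedRating

The real query is uglier than that, but it gives an idea of why everything keeps re-scanning the same 100M-row table.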
Thanks!
November 8, 2006 at 1:03 pm
I seem to have seen this problem of movie ratings somewhere on this forum. Are you sure you haven't discussed this elsewhere?
In any case, you have piqued my curiosity. So spill the beans on the problem.
By the way, on a freshly started SQL Server service, how long does it take (in Query Analyzer) to do
select * from netflixdata where 0<>1
How long does the 2nd and 3rd execution take?
I happened to dig up a 1-million-record table, and on my very unimpressive notebook it took 37, 28 and 29 seconds respectively. Maybe this way we can get a (very indirect) idea of what kind of box you're running on.
November 8, 2006 at 2:07 pm
OK, checking out http://www.netflixprize.com/
Will get back to you.
November 8, 2006 at 2:12 pm
Key Question:
--> Will the posters in this forum share in the proceeds if you win ???
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."
November 8, 2006 at 2:17 pm
When (LOL) I win I'll gladly share my proceeds with the folks who were most helpful in cementing my victory.
Also, I have a bridge for sale......
November 8, 2006 at 2:24 pm
I prefer land in Florida or swamp in Arizona ...
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."
November 9, 2006 at 9:08 am
If people can get past sulking over the whole NetFlix prize money thing, I'd be interested in hearing an answer to the original question. Is there any good offsite hosting for SQL Server processing out there? Obviously, the host would have to be reliable and give assurances as to data confidentiality.
November 9, 2006 at 12:24 pm
I don't know of the existence of this kind of service. I certainly cannot imagine a business model that would make something like this viable. Hardware is getting cheaper by the second.
As for NetFlix, I'm not sulking over it. In fact, it is an interesting project. I am currently loading the data, estimating approximately 23 hours to get it into a SQL Server database on a P4/1500. There has to be a quicker way to load 100 million records.
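I suspect the answer is BULK INSERT (or bcp) with TABLOCK so the load can be minimally logged. A rough sketch, assuming I first flatten the ratings into one comma-delimited file and using placeholder table and file names:

-- Assumes SIMPLE or BULK_LOGGED recovery and a flat file of
-- CustomerId,MovieId,Rating,RatingDate rows (all names are placeholders)
BULK INSERT dbo.Ratings
FROM 'C:\netflix\ratings.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK,
    BATCHSIZE = 500000
)

Loading into a heap with no indexes and building the indexes afterwards should also shave a lot off that 23 hours.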
My guess is that initially you should work with a subset of the data to tune your algorithms and approaches to the problem, and only at the end apply them to the full data set.
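For example, something as crude as carving off every hundredth customer would give a manageable slice to experiment on (again, placeholder names):

-- Roughly 1% of customers, with all of their ratings, copied into a working table
SELECT *
INTO dbo.RatingsSample
FROM dbo.Ratings
WHERE CustomerId % 100 = 0

Sampling whole customers rather than random rows keeps each user's full rating history intact, which matters for this kind of prediction.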
The legal language is a bit overwhelming but not insurmountable.
One aspect I still need to get clear on: the data shows that people rate movies using integer values from 1 to 5, yet the readme file says we can submit values like 3.2 or 4.8.
So I'm hopelessly hooked. Anyone care to join?
November 11, 2006 at 6:30 am
A long, long time ago, I seem to recall it was possible to take part of your RAM and assign it to a disk drive letter. Does anyone know if this is still possible with today's Windows XP and all this RAM we now get thrown at us?
The Netflix data expands to a 2.3GB SQL Server MDF file. If I had 4GB of RAM I could get it all into RAM and have lightning performance.
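Before hunting for a RAM disk driver I may first try SQL Server 2000's DBCC PINTABLE, which asks the server to keep a table's pages in the buffer cache once they have been read (table name is a placeholder, and the box still needs enough RAM to hold all the pages):

-- SQL Server 2000 only: pin the ratings table's pages in the buffer pool
DECLARE @dbid int, @objid int
SET @dbid = DB_ID()
SET @objid = OBJECT_ID('dbo.Ratings')
DBCC PINTABLE (@dbid, @objid)

-- Scan the table once so every page gets read into (and stays in) memory
SELECT COUNT(*) FROM dbo.Ratings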
On the other hand, I'm playing with the idea of having a process read the data into RAM and keep it there. This should take 4 bytes for the customer, 2 bytes for the movie, 4 bytes for the date and 1 byte for the rating, for a total of 11 x 100 million bytes, or 1.1GB. I would then send 'commands' to the process to 'quickly' get information out of the data through an ActiveX interface. Has anyone ever done this sort of data access optimization? Is this maybe how Google does it? Or maybe they are a Unix shop where ActiveX is Greek.
Someone, somewhere said that 100 million records is not that much. Believe me, it makes everything come to a grinding halt.
November 11, 2006 at 2:29 pm
Predictive queries are the domain of data mining. Analysis Services and DMX are the keys here, not a multiple self-join of a large data set.
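To give a flavour of it (the model name here is made up, and you'd first have to define and train a mining model over the ratings), a singleton DMX prediction query looks roughly like:

SELECT Predict([Rating])
FROM [RatingPredictionModel]
NATURAL PREDICTION JOIN
(SELECT 12345 AS [CustomerId], 678 AS [MovieId]) AS t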
As far as digging up a computing host, how much are you willing to spend? Training that mining model in a way the experts haven't already tried is going to take several passes and a lot of storage. As you can only submit one prediction resultset per day, and then wait to see whether it was a good one, you're looking at renting a lot of space for a long time, and quite a number of computational passes.
I think a better approach would be to list your analytical cred and troll here for a few DBAs with access to underused/recently taken out of production/not yet assigned hardware who would want to work with you.
Heck, going to the netflix site and reading up on it so I could respond to the post got me thinking it's a great way to demonstrate data mining to the rest of my team... it's a data set ready-made to start playing with. Just designing the clustering algorithm alone is a full workshop...
Hey, anybody going to the PASS Summit next week interested in pursuing this from a training angle? I'm half serious at this point... SSAS data mining is an awesome tool; it's just unknown to so many people because it's not exposed as something most people would find useful. A starter kit (with a possible $1 million result) might get more people playing with it.
Maybe I should download the data and rifle through it a bit before I get too excited. I've got two idle 8-way DL-585's, a SAN vendor who wants us to try out their stuff, and an SSAS initiative I'm kicking off over the next month. This could get interesting.
-Eddie
Eddie Wuerch
MCM: SQL
November 13, 2006 at 2:21 am
One of these days I'll get around to Analysis Services and DMX. I'm not sure how much they will help with this sort of problem. You need to find a way to get large quantities of data into RAM quickly and run your algorithms and tests.
One approach I'm seriously considering is having VB6 read and write very large arrays to a disk file very, very fast - a bit like memory-mapped files. You will need the physical RAM though. Getting the 100 million Netflix records into arrays would require 2 bytes for the movie id, 4 bytes for the customer id, 1 byte for the rating and 2 bytes for the date. That's a total of 900MB. Reading the full SQL Server table using ADO takes about an hour; reading and writing these 900MB arrays takes about 10 seconds.
Try this:
Dim lngArray(20000, 10000) As Long    ' ~200 million Longs x 4 bytes = roughly 800MB

' Touch a few corners of the array so it isn't all zeros
lngArray(1, 1) = 1
lngArray(1, 5000) = 2
lngArray(20000, 1) = 3
lngArray(20000, 10000) = 4

' Dump the whole array to disk in a single binary write
Open "c:\temp\test.dat" For Binary As #1
Put #1, , lngArray
Close #1
As I said, don't do this if you don't have the physical RAM.