November 7, 2006 at 2:15 pm
Hi all,
I'm working on a side project that does a tremendous amount of pure data crunching. The workhorse table is 100M rows, can't be altered/parsed/simplified/etc., and the pertinent SQL uses a few self-joins. Yes, it's ugly, but that's just life for this project...
So, does anyone know of an on-line service that has some serious computational horsepower that can be rented for the short-term? Ideally this would simply let me upload a backup of the database (SQL Server 2000), run some SQL to load a results table, and then download the results. I'm really fishing for some solid references, especially for sites that you have done business with in the past.
Much Thanks!
November 8, 2006 at 11:34 am
How about giving us a peek at those self-joins.
What kind of hardware are you running on right now?
Do you think just raw CPU power will solve your problems? Maybe it's the disk or not enough RAM.
How long do you estimate it will take on your actual hardware? 1 hour? 1 day? 1 month? 1 year?
Maybe doing some procedural code in a foreign language would solve the problem in no time.
What is the problem anyway?
November 8, 2006 at 11:37 am
All great questions. 100M rows on SQL Server today is not that huge. Depending on what the problem is, maybe Analysis Services could be of assistance as well.
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."
November 8, 2006 at 12:03 pm
Thanks for the input! The project is... Netflix of course!
The issue is that all I have is a single desktop running SQL Server, and it's going to take on the order of weeks to calculate a result set. This is totally a play project, so I'm not too interested in sinking more money into hardware upgrades that could help with Netflix but would be completely unnecessary for my day-to-day needs.
The theory is pretty simple. I'm given a user and a movie, and need to estimate what rating the user will give to the movie. This estimate is based purely on a 100M-row dataset of other user\movie\rating records.
I've done some preliminary work to optimize this, but it still requires churning through a lot of data. So I'm really just looking for a quick and simple way to wade through this calculation.
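To give a flavour of the calculation (the schema here is made up for illustration - Ratings(CustomerId, MovieId, Rating, RatingDate) - not my real tables), a naive baseline would blend the movie's average rating with how far this user typically rates above or below the movie averages, something like:

-- Illustrative only: placeholder table/column names and example ids
DECLARE @CustomerId int, @MovieId int
SET @CustomerId = 12345
SET @MovieId = 678

SELECT
    (SELECT AVG(CAST(Rating AS float))
       FROM Ratings WHERE MovieId = @MovieId)
  + (SELECT AVG(CAST(r.Rating AS float) - m.AvgRating)
       FROM Ratings r
       JOIN (SELECT MovieId, AVG(CAST(Rating AS float)) AS AvgRating
               FROM Ratings GROUP BY MovieId) m
         ON m.MovieId = r.MovieId
      WHERE r.CustomerId = @CustomerId) AS PredictedRating

The real query is uglier than that, but it gives an idea of why everything keeps re-scanning the same 100M-row table.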
Thanks!
November 8, 2006 at 1:03 pm
I seem to have seen this problem of movie ratings somewhere on this forum. Are you sure you haven't discussed this elsewhere?
In any case, you have piqued my curiosity. So spill the beans on the problem.
By the way, on a freshly started SQL Server service, how long does it take (in Query Analyzer) to do
select * from netflixdata where 0<>1
How long does the 2nd and 3rd execution take?
I happened to dig up a 1-million-record table, and on my very unimpressive notebook it took 37, 28 and 29 seconds respectively. Maybe this way we can get a (very indirect) idea of what kind of box you're running on.
November 8, 2006 at 2:07 pm
OK, checking out http://www.netflixprize.com/
Will get back to you.
November 8, 2006 at 2:12 pm
Key Question:
--> Will the posters in this forum share in the proceeds if you win ???
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."
November 8, 2006 at 2:17 pm
When (LOL) I win I'll gladly share my proceeds with the folks who were most helpful in cementing my victory.
Also, I have a bridge for sale......
November 8, 2006 at 2:24 pm
I prefer land in Florida or swamp in Arizona ...
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."
November 9, 2006 at 9:08 am
If people can get past sulking over the whole NetFlix prize money thing, I'd be interested in hearing an answer to the original question. Is there any good offsite hosting for SQL Server processing out there? Obviously, the host would have to be reliable and give assurances as to data confidentiality.
November 9, 2006 at 12:24 pm
I don't know of the existence of this kind of service. I certainly cannot imagine a business model that would make something like this viable. Hardware is getting cheaper by the second.
As for NetFlix, I'm not sulking over it. In fact, it is an interesting project. I am currently loading the data, estimating approximately 23 hours to get it into a SQL Server database on a P4/1500. There has to be a quicker way to load 100 million records.
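I suspect the answer is BULK INSERT (or bcp) with TABLOCK so the load can be minimally logged. A rough sketch, assuming I first flatten the ratings into one comma-delimited file and using placeholder table and file names:

-- Assumes SIMPLE or BULK_LOGGED recovery and a flat file of
-- CustomerId,MovieId,Rating,RatingDate rows (all names are placeholders)
BULK INSERT dbo.Ratings
FROM 'C:\netflix\ratings.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK,
    BATCHSIZE = 500000
)

Loading into a heap with no indexes and building the indexes afterwards should also shave a lot off that 23 hours.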
My guess is that initially you should work with a subset of the data to tune your algorithms and approaches to the problem, and only at the end apply them to the full data set.
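For example, something as crude as carving off every hundredth customer would give a manageable slice to experiment on (again, placeholder names):

-- Roughly 1% of customers, with all of their ratings, copied into a working table
SELECT *
INTO dbo.RatingsSample
FROM dbo.Ratings
WHERE CustomerId % 100 = 0

Sampling whole customers rather than random rows keeps each user's full rating history intact, which matters for this kind of prediction.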
The legal language is a bit overwhelming but not insurmountable.
One aspect I still need to get clear on: the data shows that people rate movies using integer values from 1 to 5, yet the readme file says we can submit values like 3.2 or 4.8.
So I'm hopelessly hooked. Anyone care to join?
November 11, 2006 at 6:30 am
A long, long time ago, I seem to recall it was possible to take part of your RAM and assign it to a disk drive letter. Does anyone know if this is still possible with today's Windows XP and all this RAM we now get thrown at us?
The Netflix data expands to a 2.3GB SQL Server MDF file. If I had 4GB of RAM I could get it all into RAM and have lightning performance.
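Before hunting for a RAM disk driver I may first try SQL Server 2000's DBCC PINTABLE, which asks the server to keep a table's pages in the buffer cache once they have been read (table name is a placeholder, and the box still needs enough RAM to hold all the pages):

-- SQL Server 2000 only: pin the ratings table's pages in the buffer pool
DECLARE @dbid int, @objid int
SET @dbid = DB_ID()
SET @objid = OBJECT_ID('dbo.Ratings')
DBCC PINTABLE (@dbid, @objid)

-- Scan the table once so every page gets read into (and stays in) memory
SELECT COUNT(*) FROM dbo.Ratings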
On the other hand, I'm playing with the idea of having a process read the data into RAM and keep it there. This should take 4 bytes for the customer, 2 bytes for the movie, 4 bytes for the date and 1 byte for the rating, for a total of 11 x 100 million bytes, or 1.1GB. I would then send 'commands' to the process to 'quickly' get information out of the data through an ActiveX interface. Has anyone ever done this sort of data access optimization? Is this maybe how Google does it? Or maybe they are a Unix shop where ActiveX is Greek.
Someone, somewhere said that 100 million records is not that much. Believe me, it makes everything come to a grinding halt.
November 11, 2006 at 2:29 pm
Predictive queries are the domain of data mining. Analysis Services and DMX are the keys here, not a multiple self-join of a large data set.
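To give a flavour of it (the model name here is made up, and you'd first have to define and train a mining model over the ratings), a singleton DMX prediction query looks roughly like:

SELECT Predict([Rating])
FROM [RatingPredictionModel]
NATURAL PREDICTION JOIN
(SELECT 12345 AS [CustomerId], 678 AS [MovieId]) AS t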
As far as digging up a computing host, how much are you willing to spend? Training that mining model in a way the experts haven't already tried is going to take several passes and a lot of storage. As you can only submit one prediction resultset per day, and then wait to see whether it was a good one, you're looking at renting a lot of space for a long time, and quite a number of computational passes.
I think a better approach would be to list your analytical cred and troll here for a few DBAs with access to underused/recently taken out of production/not yet assigned hardware who would want to work with you.
Heck, going to the netflix site and reading up on it so I could respond to the post got me thinking it's a great way to demonstrate data mining to the rest of my team... it's a data set ready-made to start playing with. Just designing the clustering algorithm alone is a full workshop...
Hey, anybody going to the PASS Summit next week interested in pursuing this from a training angle? I'm half serious at this point... SSAS data mining is an awesome tool; it's just unknown to so many people because it's not exposed as something most people would find useful. A starter kit (with a possible $1 million result) might get more people playing with it.
Maybe I should download the data and rifle through it a bit before I get too excited. I've got two idle 8-way DL-585's, a SAN vendor who wants us to try out their stuff, and an SSAS initiative I'm kicking off over the next month. This could get interesting.
-Eddie
Eddie Wuerch
MCM: SQL
November 13, 2006 at 2:21 am
One of these days I'll get around to Analysis Services and DMX. I'm not sure how much they will help with this sort of problem. You need to find a way to get large quantities of data into RAM quickly and run your algorithms and tests.
One approach I'm seriously considering is having VB6 read and write very large arrays to a disk file very, very fast - a bit like memory-mapped files. You will need the physical RAM though. Getting the 100 million Netflix records into arrays would require 2 bytes for the movie id, 4 bytes for the customer id, 1 byte for the rating and 2 bytes for the date. That's a total of 900MB. Reading the full SQL Server table using ADO takes about an hour; reading and writing these 900MB arrays takes about 10 seconds.
Try this:
Dim lngArray(20000, 10000) As Long    ' ~200 million Longs x 4 bytes = roughly 800MB

' Touch a few corners of the array so it isn't all zeros
lngArray(1, 1) = 1
lngArray(1, 5000) = 2
lngArray(20000, 1) = 3
lngArray(20000, 10000) = 4

' Dump the whole array to disk in a single binary write
Open "c:\temp\test.dat" For Binary As #1
Put #1, , lngArray
Close #1
As I said, don't do this if you don't have the physical RAM.