I haven’t gotten too involved in the whole “Big Data” movement; I’ve kept an arm’s-length watch and played around with a few of the technologies, but never really jumped in and got down and dirty. However, with the public preview of Azure Data Lake, I thought it was probably time to get past the simple parrot-style examples and tutorials and try something from scratch.
What is compelling for me about ADL over other technologies is that I don’t have to learn a platform, and I never have to commission or provision any hardware. As a data guy, just to get to the point of creating and setting up a data storage area in other tech there are any number of hoops to be jumped through, software to be installed and VMs to be built. With ADL all I have to do is give my card details to Microsoft (YMMV on whether that is a good or bad thing) and I can start playing.
In a more practical sense, ADL also provides massive scale (on my pay-as-you-go subscription I’m limited to 50 AUs, each of which I believe is effectively 2 cores, and many, many more are available if required) with none of the hardship of managing the machines. Not only do you have access to that power, you also have control over where to use it. If you have a single job that requires 2,000 processors, you only have to pay for them while that job is running; for every other job you can pay for just the 20 that are needed.
The integration with Visual Studio is neat, and there is now no longer any need to submit jobs to Azure simply to try them out, as a local emulator is used instead. The PowerShell functionality also looks to be fairly wide-ranging and will help in automating tasks.
What follows is an overview of my experiments, which I have published to a GitHub repo. The “Examples” folder contains what I would term “simple learnings”, and “Full Scripts” contains scripts that, to a lesser or greater extent, do something “useful”. I’m also not suggesting that anything here is “best practice”, or that method A performs better than method B; I simply do not have the required size of data to make that call. My aim was to learn the language.
TLDR: Check out the script MovieLens09-CosineSimilarityFromCSVWithMax.usql for a U-SQL movie recommender.
The movie recommender is somewhat the “Hello World” of a big data system, and it is also an interesting subject that everyone can relate to, so here are my attempts at exploring Azure Data Lake and the Data Lake Store with this final goal in mind.
The first matter is to find a decent data set, and for this I’ll be using the MovieLens 1m dataset. This presents an interesting challenge in simply consuming the data, as it is not in a CSV format that ADL reads natively. ADL does support the use of “custom extractors”, so that was the logical place to start, and I had a few interesting experiments defining an extractor using a few different approaches, i.e. using a ‘hardcoded’ string output or using the metadata of the EXTRACT statement itself. The latter is done by simply interrogating the output.Schema array:
var s = output.Schema[x];
if (s.Type.Name.Equals("String", StringComparison.InvariantCultureIgnoreCase))
{
    output.Set(x, strings[x].ToString());
}
if (s.Type.Name.Equals("Int32", StringComparison.InvariantCultureIgnoreCase))
{
    try
    {
        output.Set(x, Int32.Parse(strings[x].ToString()));
    }
    catch
    {
        // Default any unparsable integer to -1 rather than failing the row.
        output.Set(x, -1);
    }
}
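For context, here is roughly where that sort of snippet sits. The class name and the line-reading loop below are my own illustration rather than code from the repo, but they show the general shape of a U-SQL custom extractor that splits on “::” and uses the EXTRACT schema to decide how to convert each column; on the U-SQL side it would be invoked with USING new DoubleColonExtractor() in place of a built-in extractor.

using Microsoft.Analytics.Interfaces;
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative only: a minimal '::'-delimited extractor driven by the EXTRACT schema.
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class DoubleColonExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var strings = line.Split(new string[] { "::" }, StringSplitOptions.None);

                // Per-column type dispatch, as in the snippet above.
                for (int x = 0; x < output.Schema.Count; x++)
                {
                    var s = output.Schema[x];
                    if (s.Type.Name.Equals("String", StringComparison.InvariantCultureIgnoreCase))
                    {
                        output.Set(x, strings[x]);
                    }
                    if (s.Type.Name.Equals("Int32", StringComparison.InvariantCultureIgnoreCase))
                    {
                        int value;
                        output.Set(x, Int32.TryParse(strings[x], out value) ? value : -1);
                    }
                }

                yield return output.AsReadOnly();
            }
        }
    }
}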
In the end, though, I considered an extractor a bit ‘cumbersome’ and opaque for my purposes here, but in a productionised system I can certainly see the use of one, especially if you are consuming more complex data types.
Instead I reverted to a string-split method, again useful for understanding how to call the integrated .Net functionality, as this requires setting up a string array inline to pass in the two-character separator used by the dataset. This necessitates a three-part operation: read the file as a series of long strings (@UserLine), split each line on ‘::’ into an array (@UserSplit), and reform the arrays into a properly formed, typed and named rowset (@Users).
@UserLine =
    EXTRACT UserLine string
    FROM @USERSInp
    USING Extractors.Tsv(rowDelimiter: "\n");

@UserSplit =
    SELECT new SQL.ARRAY<string>(UserLine.Split(new string[]{"::"}, StringSplitOptions.None)) AS UserArr
    FROM @UserLine;

@Users =
    SELECT int.Parse(UserArr[0]) AS UserID,
           UserArr[1] AS Gender,
           int.Parse(UserArr[2]) AS Age,
           int.Parse(UserArr[3]) AS Occupation,
           UserArr[4] AS Zip
    FROM @UserSplit;
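The ratings file is not shown above, but the @Ratings rowset used below gets exactly the same treatment. A minimal sketch, assuming the usual ratings.dat layout of UserID::MovieID::Rating::Timestamp and an input path variable I have called @RatingsInp:

@RatingLine =
    EXTRACT RatingLine string
    FROM @RatingsInp
    USING Extractors.Tsv(rowDelimiter: "\n");

@RatingSplit =
    SELECT new SQL.ARRAY<string>(RatingLine.Split(new string[]{"::"}, StringSplitOptions.None)) AS RatingArr
    FROM @RatingLine;

// The fourth field (the timestamp) is not needed here.
@Ratings =
    SELECT int.Parse(RatingArr[0]) AS UserID,
           int.Parse(RatingArr[1]) AS MovieID,
           int.Parse(RatingArr[2]) AS Rating
    FROM @RatingSplit;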
This is functionality that has long been overdue in T-SQL, and it is good that we now have STRING_SPLIT to do it; joining a string back together, however, is not (well) supported.
In U-SQL it’s a snap: first use ARRAY_AGG, then use .Net (string.Join) to form the string.
@UserWithMovies =
    SELECT u.*,
           ARRAY_AGG(r.MovieID) AS MoviesArr
    FROM @Users AS u
         JOIN @Ratings AS r
         ON u.UserID == r.UserID
    GROUP BY u.UserID,
             u.Gender,
             u.Age,
             u.Occupation,
             u.Zip;

@outer =
    SELECT UserID,
           MoviesArr.Count AS MoviesCount,
           string.Join(":", MoviesArr) AS MovieList
    FROM @UserWithMovies;
For the maths behind the recommender itself I chose cosine similarity to match the people who are most similar to one another, as it is good enough for my purposes and simple enough to understand. I also match people on Age and Occupation, and filter out ‘exact’ similarities.
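To give a flavour of the shape of that calculation, here is a stripped-down sketch (the rowset names are illustrative rather than lifted from the repo script, and the Age/Occupation matching and ‘exact’ similarity filtering are left out): pair users up through the movies they have both rated, then divide the dot product of their ratings by the product of the two vector lengths.

// Sketch only: pair users through co-rated movies and aggregate the
// pieces of the cosine similarity (dot product and squared norms).
@pairs =
    SELECT ra.UserID AS UserA,
           rb.UserID AS UserB,
           SUM(ra.Rating * rb.Rating) AS DotProduct,
           SUM(ra.Rating * ra.Rating) AS NormASq,
           SUM(rb.Rating * rb.Rating) AS NormBSq
    FROM @Ratings AS ra
         JOIN @Ratings AS rb
         ON ra.MovieID == rb.MovieID
    WHERE ra.UserID < rb.UserID
    GROUP BY ra.UserID, rb.UserID;

// cos(A,B) = A.B / (|A| * |B|)
@similarity =
    SELECT UserA,
           UserB,
           DotProduct / (Math.Sqrt((double) NormASq) * Math.Sqrt((double) NormBSq)) AS CosineSim
    FROM @pairs;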
I won’t repeat the full code here verbatim, but I’ve added comments, and if you have an ADL account you should simply be able to download the MovieLens dataset, cut and paste the script, and run it.
Please take a look over the other scripts in the repo; they may not be as well commented, but the intention should be clear.
In terms of cost, how much did it cost me to get this far? Peanuts: £1.50 was my bill. The next step is to take the 20m dataset and repeat this exercise; hopefully with data of that size I will see some measurable differences between using flat CSVs, tables, etc., and be able to experiment with the best way to increase the number of vertices.