In October 2017, I completed the capstone project of the Microsoft Professional Program for Data Science. I’ve blogged about this program before:
- Microsoft Professional Program in Data Science – The Road So Far
- Microsoft Professional Program in Data Science – Continued
I’m happy to announce I got the certificate.
How did the capstone project go? It was a very interesting experience. The project consists of three parts:
- Answering some questions on edX about the data set.
- Creating a predictive model, which is then scored against a test data set.
- Writing a report with all your findings.
You log into a website hosting a “data science competition”. There you get your assignment; in my case, predicting the average income of students in the USA a couple of years after graduating from an institution of higher education. For the first part, there are some questions on the edX website under the header “data analysis”. These questions force you to take a look at your data, create a couple of charts and histograms, and calculate some statistics in order to find the answers. All in all, the questions were pretty easy, so I got the maximum score.
The second part is creating a predictive model. The competition website gives you a training data set, a test data set and a data glossary. You can choose whatever method you please to create the model. You can use R, Python, Azure ML or any combination of these. It’s totally up to you. You score the test data set with your model and upload the .csv file with the predictions to the competition website. It evaluates your predictions by calculating the RMSE (root mean squared error) against the actual values. The lower the RMSE, the higher your score. This score is published on a leaderboard, so you can compare yourself against your peers.
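To make the scoring a bit more concrete, here is a minimal sketch of how an RMSE is calculated. This is not the competition website's actual code; the function name `rmse` and the income numbers are made up purely for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    # Square each prediction error, average the squares, take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical incomes in thousands of dollars (illustrative values only).
actual_income = [30.0, 42.0, 55.0]
predicted_income = [33.0, 38.0, 55.0]

print(rmse(actual_income, predicted_income))
```

Because the errors are squared before averaging, a few predictions that are far off hurt the score more than many predictions that are slightly off.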
Personally I found the challenge to be … ehrm … challenging. The main problem is the huge gap between the skills needed to finish all the courses (where you can easily score over 90%) and the skills needed to get over 80% in the capstone project. The training set had a lot of features (over 200 columns) and many columns had missing values. The missing values in particular could mess with the predictive abilities of your model. Another problem is that you can only do 3 submissions a day to calculate your score. If you’re like me (and probably most people who do the capstone project), you have a day job and need to do this project in your free time. This means I could only make time for this project on some evenings or on the weekend. If you like to work in many iterations, too bad: you can only score 3 of them per day.
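As an illustration of the missing-values problem, below is a small sketch in plain Python that measures how much of a column is missing and fills the gaps with the column median. It is not what I actually built (I worked in Azure ML), median imputation is only one of several possible strategies, and the column names and values are hypothetical.

```python
from statistics import median


def missing_share(rows, column):
    """Fraction of rows where `column` is None."""
    return sum(1 for row in rows if row[column] is None) / len(rows)


def impute_median(rows, column):
    """Return new rows with None in `column` replaced by the column median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical training rows; the real data set had over 200 columns.
rows = [
    {"tuition": 10.0, "income": 30.0},
    {"tuition": None, "income": 42.0},
    {"tuition": 20.0, "income": 55.0},
    {"tuition": 40.0, "income": 61.0},
]

print(missing_share(rows, "tuition"))
print(impute_median(rows, "tuition")[1]["tuition"])
```

Checking the share of missing values first helps to decide whether a column is worth imputing at all or is better dropped.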
I used Azure ML to do this project and it worked fine most of the time. I had a standard workspace, not the free version, so it’s a bit faster. But with such a large training set (many features, and about 17,000 rows) Azure ML could be really slow sometimes, especially if you are tuning the hyperparameters over the entire grid. In the end, my model easily ran for over 1 hour. At one point, I made a change and the model kept running for hours without any obvious reason (I asked a question about this on MSDN and the Azure ML engineers still have to respond). Scoring your test data set against your model in Azure ML wasn’t easy either. Using the Excel add-in and a web service was problematic in most cases, and a lot of students struggled with this. I ended up creating a predictive experiment, but instead of publishing it as a web service I modified it to use the test data set as its input.
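Why does tuning over the entire grid take so long? Every extra hyperparameter multiplies the number of models that have to be trained. The sketch below just counts and lists the combinations; the parameter names and values are hypothetical and are not the ones from my Azure ML experiment.

```python
from itertools import product
from math import prod

# Hypothetical hyperparameter grid (names and values are illustrative only).
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "n_trees": [100, 500],
    "max_depth": [4, 8, 16],
}


def grid_size(grid):
    """Number of combinations in a full grid search."""
    return prod(len(values) for values in grid.values())


def grid_points(grid):
    """Yield every combination of the grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


print(grid_size(param_grid))
for point in grid_points(param_grid):
    print(point)
```

With only three parameters this small grid already means 18 models to train, and each one has to process the full training set.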
Aside from all of these issues, I definitely learned a lot about creating a model: data cleaning, feature engineering, feature selection and so on. I’m really glad I did the capstone project. Am I a professional data scientist now? Not really, but at least I know the basics now. If there’s a data science project at a client, at least I can chat along.
The last part, creating a report, wasn’t that hard; it’s just time-consuming to create a lot of charts. And because my writing style isn’t really condensed, I ended up with 30 pages (charts included). Whoops.
All in all, the Microsoft Professional Program for Data Science was a really nice experience. Would I recommend it to other people? Yes. Would I recommend paying for the certificate (about $1000 for all the courses and the project)? Probably not. I wouldn’t pay full price for it (at the beginning of the program we had a discount), unless my employer chips in.