I saw an announcement this week that CamGraph is one of the projects from Microsoft Research that deals with "big data" sets. This is supposed to be an appliance for scalable machine learning, essentially allowing a computer system, or a cluster in this case, to make guesses based on the data set by applying mathematical and statistical algorithms to a problem. It sounds interesting, and the fact that it is built on the CamCube cluster, which scales from 3 to 8,000 nodes, suggests that it could have a lot of problem-solving power.
However, there was one interesting thing in the note about the architecture. It's based on CamCube, formerly called BorgCube, which makes it cool right away. This architecture is supposed to handle the core libraries, presumably the code distribution across nodes, and, of course, data persistence. The note that I found very interesting was this one:
"The platform also provides persistent data storage and management, in the form of a key value store." - from CamGraph's page.
Key-value store. That's what a lot of NoSQL platforms use, and it was the original limitation for SQL Server Data Services in the cloud a few years back. Now this is a joint project between a few Microsoft Research groups, and I like that they are investigating all sorts of technologies. It gives me hope that the researchers are pursuing interesting projects without being bound by current shipping technologies.
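For anyone who hasn't worked with one, here is a rough sketch of what a key-value store looks like from the caller's side. This is a toy, in-memory version in Python. The class name, method names, and sample records are my own invention for illustration and have nothing to do with the actual CamGraph or CamCube interfaces, which I haven't seen.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical; not CamGraph's API)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Store or overwrite the value held under this key.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Direct lookup by key; returns None if the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # Remove the key if present; do nothing otherwise.
        self._data.pop(key, None)

    def scan(self) -> Iterator[Tuple[str, Any]]:
        # The only way to search by anything other than the key:
        # walk every pair and let the caller filter.
        return iter(self._data.items())


store = KeyValueStore()
store.put("node:1", {"rack": "A", "load": 0.42})  # made-up sample data
store.put("node:2", {"rack": "B", "load": 0.17})

# Lookup by key is a single call.
by_key = store.get("node:2")

# Lookup by an attribute inside the value needs a full scan,
# where a relational engine could use an index and a WHERE clause.
rack_a = [k for k, v in store.scan() if v["rack"] == "A"]
```

The interface is tiny, which is much of the appeal. The trade-off shows up in the last line: anything beyond "give me the value for this key" is left to the application.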
However, it does make me wonder whether key-value stores are actually better for large data sets, or just simpler to work with. Will we see a more relational framework, or perhaps even some streaming technology, used in the future for the processing of large data sets? I don't know, but it seems like we need more research into larger and larger data sets, both structured and unstructured, as the volume of data we capture and process continues to grow.