I recently had the following discussions with a number of data architects, in different communities, in particular (but not limited to) the TDWI group on LinkedIn. This is a summary of the discussion, featuring differences between data scientists and data architects, and how both can work together.
It shows some of the challenges that still need to be addressed before this new analytics revolution is complete. Following are several questions asked by data architects and database administrators, and my answers. The discussion is about optimizing joins in SQL queries, or just moving away from SQL altogether. Several modern databases now offer many of the features discussed here, including hash table joins and fine-tuning the query optimizer by end users. The discussion illustrates the conflicts between data scientists, data architects, and also business analysts. It also touches on many innovative concepts.
Question: You say that one of the bottlenecks with SQL is users writing queries with (say) three joins, when these queries could be split into two queries each with two joins. Can you elaborate?
Answer: Typically, the way I write SQL code is to embed it into a programming language such as Python, and store all lookup tables that I need as hash tables in memory. So I rarely have to do a join, and when I do, it’s just two tables at most.