November 12, 2020 at 12:00 am
Comments posted to this topic are about the item Extreme SQL
November 12, 2020 at 7:03 am
These extreme systems are extremely fascinating to me.
The company I'm working at still works with cubes. We have dimensions with sometimes around a million entries and a lot of MDX, so some of the reports take a while to calculate. But those are rookie numbers compared to these extreme systems with billions of rows. Even though they (I assume?) work in memory, I am still impressed and curious about how instantaneously you get results from something like Google, YouTube, Facebook, etc.
I want to be the very best
Like no one ever was
November 12, 2020 at 2:36 pm
They just rely on the power of more and more nodes to distribute the workload, of course: taking that larger problem and breaking it up into a bunch of smaller problems. At least that's my assumption, and it's how I handle the billions and trillions of records I have to query in my own datasets.
But there is the time element too. Querying directly through something like PolyBase is extremely slow compared to spending a little time up front loading the data through PolyBase into an internal table that you can hash-distribute and optimize. For me, it's worth that bit of extra time to make sure analytical queries run against an internal table, rather than having no time and querying straight from an external (Hive-like) table sitting on a data store.
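A minimal sketch of that load step, assuming an Azure Synapse dedicated SQL pool and a PolyBase external table that already exists (all object and column names here, like ext.SalesRaw, dbo.SalesStaged, and CustomerId, are hypothetical):

-- One pass over the external source lands the data in a hash-distributed,
-- columnstore-compressed internal table that analytical queries can hit directly.
CREATE TABLE dbo.SalesStaged
WITH
(
    DISTRIBUTION = HASH(CustomerId),    -- spread rows evenly across the nodes
    CLUSTERED COLUMNSTORE INDEX         -- compress and optimize for aggregates
)
AS
SELECT *
FROM ext.SalesRaw;

-- Subsequent analytical queries read the local columnstore, not the external table.
SELECT CustomerId, SUM(SaleAmount) AS TotalSales
FROM dbo.SalesStaged
GROUP BY CustomerId;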
Even at billions of rows, which is lower than Facebook's scale, Azure, Google, and AWS have great options that will comfortably return results within 10 seconds as long as you know how to query the data right. The only time I've ever run into issues doing ad-hoc queries with any of these three is when I broke Google BigQuery with a many-to-many relationship query over a billion records. The intermediate result set was in the trillions before it aggregated down to a few dozen rows, so it's understandable how I crashed their engine; it's not a usual use case to run into.
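For what it's worth, the shape of that blow-up looks roughly like the hypothetical query below: when neither side of the join is unique on the join key, the intermediate row count is the product of the matching rows per key, and that explosion happens before the GROUP BY collapses anything (table and column names are made up):

-- Hypothetical many-to-many join: neither table is unique on UserId, so the join
-- fans out to (rows in A per user) x (rows in B per user) intermediate rows,
-- all of which exist before the GROUP BY aggregates them away.
SELECT a.CategoryId,
       COUNT(*) AS PairCount
FROM dbo.EventsA AS a
JOIN dbo.EventsB AS b
    ON b.UserId = a.UserId
GROUP BY a.CategoryId;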
November 12, 2020 at 6:11 pm
It's what xsevensinzx said, but there's also likely some cheating: pre-fetching, caching, and probably some pre-calculations or other tricks that make certain things appear really fast to the user. Trading space for time is something we do often.
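As one hypothetical example of trading space for time in SQL Server, an indexed view materializes an aggregate once so user-facing queries read a small summary instead of scanning the detail rows (this assumes a dbo.Sales table with a non-nullable Amount column; the names are invented):

-- Materialize a daily aggregate so reporting queries read a few thousand summary
-- rows instead of billions of detail rows. Assumes Amount is NOT NULL.
CREATE VIEW dbo.vw_DailySales
WITH SCHEMABINDING
AS
SELECT SaleDate,
       COUNT_BIG(*) AS RowCnt,        -- required for an indexed view with GROUP BY
       SUM(Amount)  AS TotalAmount
FROM dbo.Sales
GROUP BY SaleDate;
GO

CREATE UNIQUE CLUSTERED INDEX cix_vw_DailySales
    ON dbo.vw_DailySales (SaleDate);  -- this index is what persists the aggregate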
November 13, 2020 at 6:48 pm
What's essential is knowing what data access patterns are being used by each application and then partitioning, structuring, and indexing the data optimally. For ad-hoc aggregate reporting on millions or billions of records, use a columnstore table, as in SQL Server DW, HBase, or Amazon Redshift. For something like an online banking app, each user accesses a wide range of data elements but only for their personal slice of the pie. What you need there is a near real-time data mart, maybe something like Cosmos DB or MongoDB, where all the recent transactional and profile data for a customer is contained in something like a JSON document that can be fetched within 10 ms.
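To make that per-customer document idea concrete, here's a hedged T-SQL sketch of the kind of shape such a data mart would serve; the table and column names are invented, and FOR JSON PATH is just one way to assemble the nested document:

-- Assemble one customer's profile plus recent transactions into a single JSON
-- document, the shape a document store would serve in ~10 ms.
-- dbo.Customer and dbo.AccountTransaction (and their columns) are assumed names.
DECLARE @CustomerId int = 42;

SELECT c.CustomerId,
       c.DisplayName,
       (SELECT TOP (50) t.TranDate, t.Amount, t.Description
        FROM dbo.AccountTransaction AS t
        WHERE t.CustomerId = c.CustomerId
        ORDER BY t.TranDate DESC
        FOR JSON PATH) AS RecentTransactions    -- nested array of recent activity
FROM dbo.Customer AS c
WHERE c.CustomerId = @CustomerId
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER;           -- one document per customer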
RDBMS engines like SQL Server are good for transactional read/write applications and for a highly constrained single version of the truth, but even in the hands of an expert-level SQL coder, they're only so-so when it comes to TB-scale aggregate reporting and high-volume queries. If you're willing to work outside the box, there are better options for the use cases described above.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
November 17, 2020 at 3:12 pm
I've worked in extreme environments before, but not on the scale that Steve is talking about.