Introduction and overview of the MPP market place.
Over the past few months I have been looking at data warehouse appliances such as
Neteeza and Teradata. Teradata and
Neteeza have been around for a long time and are used by some of the largest
commercial companies in the world. Green Plum is a relatively new kid on the block but shows incredible promise and
as they have recently been acquired by EMC2 they look like an interesting
alternative. In fact the Gartner report,
Magic Quadrant for Data Warehouse Database Management Systems, is available
on the Green Plum web site.
Dataupia have an interesting proposition
in that their product parallelises customer existing platforms whether it is SQL
Server, Oracle, DB2, etc. DATAllegro was acquired by Microsoft which resulted in Project Madison which we
now know as SQL2008R2 Parallel Processing Edition.
So the Massive Parallel Processing market is looking very interesting at the moment.
Microsoft's acquisition of DATAllegro and the subsequent release of SQL2008R2
Parallel Processing Edition is potentially a very significant step in the data warehousing
world. The main thrust of development for SQL2008 was the BI stack and the SQL2008R2
MPP has to be viewed as a weighty foundation stone underpinning that stack.
What are Massive Parallel Processing appliances?
In a nutshell the end user sees a single database server but under the hood it is
many database servers operating in tandem (parallel) to deliver blistering performance
and massive scalability. Products like Kognitio WX2 and Neteeza are delivered as a populated rack of blade servers.
The different appliances use a variety of methods to achieve the same ends.
- Some offer shared-nothing architecture where each "node" within the appliance
is unaware of the other nodes participating in the appliance.
- Some offer nodes that are fully aware of each other.
- Others offer a system where there are master nodes that parcel out the work to other
worker nodes.
In fact, when I first read about soft Numa (non uniform memory access) within SQL2008
my first reaction was that MPP appliances do this at the macro level. The appliance is not simply a bunch of servers co-operating with each other. They also manage their own storage so in some respects they act a bit like a SAN.
What are MPP appliances used for?
It would be a mistake to regard these devices as big boys toys for those organisations
with more money than sense. I think in the next 5 years we can expect these devices
to filter down into medium sized enterprises.
In a nutshell these appliance are aimed at heavy duty analytical work rather than
fast production of stock reports. They obviously can support such stock reports
but that would be like buying a Ferrari and only driving 5 miles to the shops.
Speed
Throughout my career a common bug-bear has been the user who fires off an horrendous
query against the database and while the DB server sinks to its knees that user
goes off on their lunch break. When they come back from said lunch break they take
one look at the results and cannot remember why they wrote the query in the first
place!
These people have been the bane of a DBA's existence. As a DBA I have been in
that position but now realise that it was me that was at fault and not the user
writing the query.
What that user was trying to do, and what MPP boxes make it possible to do, is to
fire off a train of thought query
- Query => I wonder what is going on in product line 'x'
- Answer leads to further query =>Ah so that mean that activity 'y' must
be happening
- Answer leads to further query =>Not quite but this unexpected behaviour reveals
a number of data hot spots
- Answer leads to further query =>Are their enough customers in those hots spots
to represent a business opportunity?
- Answer leads to significant business insight, increase in profits, bonuses and job
security all round!
The speed at which the MPP boxes bring back the answer to these queries allows these
question and answers to be chained together. In a traditional database the end user
might not get beyond stage 2.
Another point to bear in mind is that these train of thought queries are one-off
queries. They will probably not be run again which means the following
- The queries will never be optimised because they are run only once
- A cached execution plan is next to useless because they are run only once
Size
We have gone from the age of data being generated by human beings and is therefore
limited to the speed and capacity that humans generate data to an age where data
is generated from machines.
Any serious web site will be analysing its web traffic possibly in real time. At
the SQL Bits conference in London one of the presenters was demonstrating the use
of Stream Insight to produce cubes of data that are there to reveal the performance
of data cubes used by their business. Their business is incredibly time sensitive
so if the data cubes are not available millions of pounds can be lost in a few minutes.
This sort of telemetry represents a massive jump in the volumes of data so systems
that scale to the petabyte scale rapidly become necessary.
Analytics Integration
Teradata, Neteeza & HP Neoview have the facility to accept pass-thru queries
from SAS Analytics. This means that SAS can recognise that these platforms support
its core functionality and therefore leverage the intrinsic power of the MPP platform.
SAS can throw some rather nasty queries at SQL Server so formal support for SAS
pass-thru queries in the Microsoft stack is something I would very much like to
see.
Green Plum supports the R open source analytics
language.
What are the criteria for choosing an MPP appliance?
Requirements, requirements, requirements
You really need to be clear on what business problem you are trying to solve.
- What data volumes do you have?
- Is this data volume made up of well structured, high quality data or a mash of disparate
data imported from legacy systems?
- What is the projected growth of the data?
- How up-to-date does the data need to be?
- What query response times are needed and/or desired?
In short, know your business, know your data. Once you have a clear idea of this
you have a firm basis to get meaningful information out of the vendors.
Integration
Investing in an MPP platform is a major investment for the business in terms of
people, processes and technology.
My personal opinion is that, while such devices will eventually lead to investing
in new reporting, analytics and ETL tools they should initially be capable of integrating
with what the organisation already has.
MPP platforms do require a change of mindset and it is just as possible to implement
then badly as it is with any other piece of technology. Although most MPP vendors
are very good with regard to support if something doesn't work you do not want
to be caught between two vendors arguing that each other's technology is the
cause.
Strategic Partnerships
This could come under the banner of integration however it is worth asking if the
MPP Vendor has any strategic partnerships with toolset or analytics vendors. If you do find that you need to replace part of your BI stack then knowing that the vendor works well with a known range of tools is of major importance.
Physical footprint
These devices are in reality more than one physical server. One vendor highlighted
their biggest client as having a 192 blade appliance supporting 40PB of data! Questions
have to be asked as to the physical requirements for the chosen MPP.
- Will your chosen appliance fit in your data centre?
- What are its power requirements and can your data centre meet them?
- What are its cooling requirements and can your cooling systems meet them?
- What other connectivity is required for the appliance?
Security
The whole purpose of a data warehouse is to allow access to data but there are going
to be pools of data that are sensitive and need protecting. There are a number of
security questions that need to be asked.
- Does the appliance support encryption?
- Does the appliance integrate with LDAP, Active Directory etc or does it have its
own security scheme?
- Does it have object level security?
- Does it support role based security?
- Does it expose users passwords in any way? WARNING, some of them do expose users
passwords to sys admins.
Auditing
There are a variety of system auditing tasks that I would expect an appliance to
perform
- Query usage?
- Security access?
- System performance?
- Dependency tracking?
Licensing
The different vendors follow different licensing models and you need to be clear on precisely what you are buying into. For some the appliance includes software and hardware and ANY change to the hardware requires an upgrade license. Others are more casual and allow upgrades to RAM and disk without licensing consequence. Others regard only the production environment for licensing purposes.
One other thing you might consider asking: does the vendor have a model that supports data warehousing as a service?
Programmability
Sooner or later you are going to want to do more involved that running straight
queries against the MPP. A few things you might want to know:
- Does it have the equivalent of stored procedures? Not all of them do.
- Does it have a scripting language?
- Does it support any embedded language such as Java?
- Does it have an API that allows extensibility?
- Do you have the skills in house to use the API, scripting or embedded language?
- Does it have any form of scheduler or something comparable to SQLSERVERAGENT?
Recoverability
It is never a nice experience to have to go through disaster recovery on a multi
terabyte system.
- Does the appliance support a full backup strategy
- Differential backups
- Point in time recovery
- Object level backups
- How does the system react to hardware failures?
- RAM
- Disk
- An entire blade going down?
- Does it support hot swappable hardware?
- Can the system operate in the event of hardware failure albeit in a reduced mode?
- Does it have the equivalent of rollback/undo logs?
Support
- Does the vendor offer a support contract or SLA agreement?
- Is there a user group?
- Are there user forums?
- Does the vendor supply spares kit?
Scalability
Most appliances will operate into the high terabyte range and some well into the
petabyte range. The question is not "can they scale" but the nature of how they should scale.
- What is the recommended ratio of RAM to disk?
- Is there a case for adding a lot of slower RAM rather than a smaller amount of ultra-high
speed RAM?
- When is it more prudent to add blades rather than upgrade other blades?
- How much disk should each blade address?
- Are there bottlenecks in the appliance design? For example, work may be performed
across many blades but if they are managed by a single blade then that single blade
is the bottleneck.
- What user concurrency can they support? Some solutions do not major on massive user
concurrency.
One point I should like to make is that while in OLTP systems growth estimates tend
to be in the 2x or 3x over a planning cycle data warehouse growth tends to be in
the 10x or higher growth range.
Training
Teradata has a full certification scheme. Do not underestimate the training required for these appliances.
Given the cost of MPP appliances it is worth negotiating training as part of the
price. Also consider who needs training
- DBAs?
- Infrastructure specialists?
- Analytics users?
- Disaster recovery specialists?
Affordability
The purchase price is just the start although Dataupia have made "the price
you see is the price you pay" part of their selling proposition.
When talking to vendors don't make any assumptions about what the price includes.
Always be explicit in what is expected.
- What are the ongoing licensing costs?
- What are the maintenance/support costs?
- Does it have any proprietary hardware requirements?
- Are there additional modules or tools to be bought?
Some vendors will be remarkably flexible on the purchase price but ongoing costs
(total cost of ownership) is high.
Flexibility
Some MPP appliances are designed to be used a specific way.
- Is the appliance suitable for near real-time data warehousing?
- Can it run on commodity hardware or does it require specialist kit?
- Can it run on heterogeneous hardware or does it require all hardware to be configured
the same?
- Can it handle online data loading or is loading and querying to be treated as two
separate operations that must not coincide?
Flexibility can also include the way in which the appliance is used
- If licensed on processor power is it possible to rent temporary license extensions?
For example 11 months of the year the appliance is quite capable of coping with
however many blades are in the appliance but for 2 weeks of the year it really needs
another couple of blades.
- How does the vendor treat the installation of any client tools? For example, does
uninstalling a tool from one machine and installing it on another require any additional
service and support?
Design considerations
One thing that is very important in the design of a database to be put onto an MPP
is the distribution of the data. Most appliances have some form of hashing algorithm
to distribute the data so that the maximum number of nodes can be brought to bear
on a given query.
One thing to watch out for is where you have data that has hotspots in it. For example,
if you have data for the past 12 months you do not want the most recent month to
be confined to a single blade otherwise that single blade will be attempting to
do the lion's share of the work!
My experience has been that new technologies require a new mindset in order to get
the best out of them. The vendor should be able to advise on any key considerations
that are required in order to get the best out of their appliance.
In conclusion
Before Microsoft bundled DTS in with SQL Server ETL tools were a specialist niche with a niche price to match. Potentially SQL2008R2 Parallel Processing edition could have the same effect although it has to be said that there is already healthy competition in the MPP market place.
From what I have seen so far each of these appliances have their strengths and weaknesses.
I have seen an appliance perform at "I still don't believe it" speeds
when configured and used properly and in other cases far worse than SQL Server 2000!
The next 12-24 months should prove to be very interesting for those of us interested
in the world of MPP platforms!