Big Data: An Introduction with Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
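As an illustration of the custom mapper idea, here is a minimal Python sketch of a streaming script of the kind Hive can call. It assumes rows arrive as tab-separated text, one row per line, with two hypothetical columns (a user name and a comment text), and it emits the user name with the number of words in the comment. The function name and the sample rows are made up for this sketch.

```python
def transform(lines):
    """Turn 'user<TAB>comment' rows into 'user<TAB>word_count' rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the comment survive.
        user, _, comment = line.partition("\t")
        yield f"{user}\t{len(comment.split())}"


if __name__ == "__main__":
    # Under Hive the rows would be read from sys.stdin; a small
    # in-memory sample (hypothetical data) stands in for it here.
    sample = ["alice\tNice photo from the trip\n", "bob\tCongrats\n"]
    for row in transform(sample):
        print(row)
```

In a real deployment, a script like this would typically be registered with ADD FILE and invoked through HiveQL's TRANSFORM ... USING clause, reading from standard input instead of the sample list.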

As with any database management system (DBMS), you can run Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in the middle tier of a 3-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).

The above sections are taken from http://manishku.blogspot.in/2012/12/big-data-and-hadoop-part-2.html

Now you might ask: we already have Pig, a powerful and simple language, so why should we look at Hive? The downside of Pig is that it is a new language that we need to learn and then master.

Facebook developed a runtime support structure for Hadoop called Hive, which allows anyone who is already familiar with SQL to work with the Hadoop platform right out of the gate. Hive lets SQL developers write Hive Query Language (HQL) statements, which are similar to standard SQL statements. HQL is limited in the commands it understands, but it is still pretty useful.

HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. We can run Hive queries in several ways:

Hive Shell (a command line interface)
Java Database Connectivity (JDBC)
Open Database Connectivity (ODBC)
Hive Thrift Client (much like any database client that gets installed on a user’s client machine; it can be used within applications written in C++, Java, PHP, Python, or Ruby)

Using the Hive Thrift Client from these client-side languages is much like using them with embedded SQL to access a database such as DB2 or Informix.

Consider the simple example below. Here we create an FBComments table, populate it, and then query it using Hive:

CREATE TABLE FBComments(from_user STRING, userid BIGINT, commenttext STRING, recomments INT)
COMMENT 'This is Facebook comments table and a simple example'
STORED AS SEQUENCEFILE;

LOAD DATA INPATH 'hdfs://node/fbcommentdata' INTO TABLE FBComments;

SELECT from_user, SUM(recomments)
FROM FBComments
GROUP BY from_user;

Looking at the code above, Hive (HQL) looks very similar to traditional database SQL code. There are small differences, which any SQL developer can point out. Because Hive is based on Hadoop and MapReduce operations, there are also a few key differences:

High latency - Because Hadoop is intended for long sequential scans, we can expect queries to have a very high latency (many minutes). Hive is therefore not appropriate for applications that need very fast response times.

Not suitable for transaction processing - Hive is read-based and therefore not suitable for transaction processing, where you expect a high percentage of write operations.
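To picture what it means for an HQL statement to be broken down into a MapReduce job, the GROUP BY query from the FBComments example can be imitated in a few lines of Python. This is only a sketch under stated assumptions, not Hive's actual execution plan: the sample rows are invented, everything runs in memory on one machine, and the function names (map_phase, shuffle_phase, reduce_phase, run_query) are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (from_user, userid, commenttext, recomments)
ROWS = [
    ("alice", 1, "Nice photo", 3),
    ("bob", 2, "Congrats!", 1),
    ("alice", 1, "See you there", 2),
]


def map_phase(rows):
    """Emit (key, value) pairs: the GROUP BY column and the summed column."""
    for from_user, _userid, _commenttext, recomments in rows:
        yield from_user, recomments


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply SUM to the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Imitate: SELECT from_user, SUM(recomments) ... GROUP BY from_user."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for user, total in sorted(run_query(ROWS).items()):
        print(user, total)
```

On a real cluster the map and reduce steps run in parallel on many nodes over data read from disk, which is one reason even a simple query like this carries the high latency described above.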

