Blog Post

Introduction To Hive’s Partitioning #bigdata #hive

,

A simple query in Hive reads the entire dataset even if we have where clause filter. This becomes a bottleneck for running MapReduce jobs over a large table. We can overcome this issue by implementing partitions in Hive. Hive makes it very easy to implement partitions by using the automatic partition scheme when the table is created.

In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried, thereby reducing the I/O and time required by the query.

Today we are going to see how we can load a csv file to a partitioned table. For this we are going to use Airline OnTime dataset. Loading csv data to a partitioned table involves below mentioned two steps:

  1. Load csv file to a non-partitioned table.
  2. Load non-partitioned table data to partitioned table.

See more at the source @ Big Data/BI Zone http://ift.tt/1c5MJJO

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating