Apache Hive Beyond Basics: Part I

akhil anand · Published in DataDrivenInvestor · Oct 24, 2023 · 7 min read


Overview

Welcome back! I hope you enjoyed the previous article. Let's delve deeper into Apache Hive. One question on everyone's mind is why Hive, with its SQL-like functionality, is given such significant importance within the big data community. Apache Hive is indeed an exceptional tool, and it possesses special features that set it apart from others: UDFs, partitioning, and bucketing. Don't worry, by the end of this article you will understand all of this technical jargon.

Bucketing and Partitioning

Bucketing

When dealing with big data, a significant challenge is data skewness. If you're processing a large dataset, say 100 GB, skewed data can leave specific nodes bearing a disproportionate share of the load. This causes two problems:

  1. The overloaded node takes too long to process its data.
  2. The overloaded node might fail to handle the excessive data.

To address this issue, bucketing is introduced. It clusters data in a manner that minimizes skewness, ensuring a more even data distribution among the nodes. The key to achieving this even distribution is hashing.

How Does Hashing Work?
Suppose you have an employee table consisting of columns like Employee ID, Employee Name, State, and Age. Let's see how bucketing works in this case.

Image 1

Let's say I select the State column for bucketing (you can choose any column based on your use case or observations). Suppose I create buckets by taking the length of each state name modulo 4. The hashing/bucketing outcome, segmented into the different buckets, is shown in the image below. Don't worry: in Hive, once you write the bucketing syntax, Hive performs all of these operations for you automatically.

Image 2

Now, based on the provided data, Hive has successfully created four distinct clusters of data. These clusters are nearly evenly distributed, so each node receives almost an equal amount of data to process. Processing times will therefore be similar across all nodes, and the system will be less prone to faults or failures.
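You can preview such a bucket assignment yourself. The query below is a minimal sketch that mimics the length-modulo-4 scheme described above on a hypothetical employee table; note that Hive's real bucketing applies its own internal hash function to the column value, so this is purely illustrative:

-- Illustrative only: mimics the length-modulo-4 scheme described above.
-- Hive's actual bucketing hash is computed internally from the value.
SELECT state,
       pmod(length(state), 4) AS bucket_id  -- bucket number in the range 0..3
FROM employee;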

Bucketing helps improve join operations on larger datasets and ensures an even distribution of data. It does not store any metadata about the buckets; rows are assigned directly to a bucket by the hashing function.
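In HiveQL, bucketing is declared when the table is created. Here is a minimal sketch based on the hypothetical employee table from the images above (table and column names are assumptions):

-- Rows are hashed on `state` into 4 buckets at write time
CREATE TABLE employee_bucketed (
  emp_id   INT,
  emp_name STRING,
  state    STRING,
  age      INT
)
CLUSTERED BY (state) INTO 4 BUCKETS
STORED AS ORC;

-- On Hive versions before 2.0, also run: SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE employee_bucketed
SELECT emp_id, emp_name, state, age FROM employee;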

Partitioning

Suppose you are working at an e-commerce company and have a dataset that contains day-wise sales data for each state over the past 7 years, and you need to extract the information for April 2022. With traditional filtering, the entire table would be scanned to retrieve the desired results. To optimize this process and avoid the overhead of scanning the entire dataset, you can implement partitioning.

Partitioning is a technique used to organize and structure your data in a way that allows for faster and more efficient data retrieval. By partitioning the data, you can segregate it into distinct subsets, often based on a specific attribute like date or state in this case. With this approach, when you need data for April 2022, you can directly access the partition corresponding to that timeframe, rather than scanning the entire dataset, resulting in significantly improved query performance and reduced processing time.
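As a concrete sketch, the sales table from this scenario could be partitioned by year and month; the table and column names below are assumptions for illustration:

-- Day-wise sales data, partitioned by year and month
CREATE TABLE sales (
  order_id  BIGINT,
  state     STRING,
  sale_date DATE,
  amount    DECIMAL(10,2)
)
PARTITIONED BY (yr INT, mth INT)
STORED AS ORC;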

Dos and Don'ts of Partitioning

Please consider the following common practices when implementing data partitioning:

  1. Avoid using Primary/Incremental IDs for partitioning: Primary or incremental keys are designed to store unique records. If you have 10 million rows of data and create 10 million partitions based on these keys, it can lead to performance issues.
  2. Prefer columns with repeated values: In the example mentioned (please refer to Image 2), the ‘state’ column is an excellent choice for partitioning. This is because it contains values that are repeated multiple times, making it suitable for organizing and optimizing data storage.

How does partitioning retrieve data quickly?

  1. Partition Column: Choose the column that you want to use as the partitioning column.
  2. Subdirectory Creation: A separate subdirectory is created for each partition.
  3. Metadata Information: The metastore keeps track of the location and timestamp of each partition. When a query filters on the partition column, Hive uses this metadata to skip irrelevant partitions and read only the subdirectory where the desired partition is stored. This pruning reduces the time and I/O required to fetch and process the data; a sketch follows this list.
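Using the hypothetical sales table sketched earlier, a filter on the partition columns lets Hive prune everything except the April 2022 subdirectory:

-- Hive reads only the yr=2022/mth=4 subdirectory, not the whole table
SELECT state, SUM(amount) AS total_sales
FROM sales
WHERE yr = 2022 AND mth = 4
GROUP BY state;

-- EXPLAIN EXTENDED shows which partitions the planner actually selected
EXPLAIN EXTENDED
SELECT state, SUM(amount) FROM sales WHERE yr = 2022 AND mth = 4 GROUP BY state;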

Partitioning in Hive

Static Partition:
i. Static partitioning is performed manually, where you define the partitions explicitly.
ii. It is generally faster than dynamic partitioning.
iii. It is suitable for ad hoc use cases where you have control over the partition values.

Dynamic Partition:
i. Dynamic partitioning is done automatically by Hive.
ii. You provide the partition column name, and Hive analyzes the data, extracts the unique values from that column, and creates the partitions accordingly.
iii. It is slower in execution than static partitioning. Both styles are sketched below.
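A minimal sketch of both styles, reusing the hypothetical sales table; the staging table sales_staging is also an assumption, and the two SET properties are standard Hive settings required for dynamic partitioning:

-- Static partition: the target partition is named explicitly
INSERT OVERWRITE TABLE sales PARTITION (yr = 2022, mth = 4)
SELECT order_id, state, sale_date, amount
FROM sales_staging
WHERE year(sale_date) = 2022 AND month(sale_date) = 4;

-- Dynamic partition: Hive derives the partitions from the data itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE sales PARTITION (yr, mth)
SELECT order_id, state, sale_date, amount,
       year(sale_date)  AS yr,   -- partition columns must come last
       month(sale_date) AS mth
FROM sales_staging;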

UDFs (User-Defined Functions)

Every programming language has its own built-in functions for common operations; familiar aggregate functions include MIN(), MAX(), COUNT(), SUM(), etc. The major concern is: can these built-in functions solve all of our business use cases? The answer is a big no. We need the flexibility to build custom functions that fulfill ad hoc requirements. This is where Hive UDFs play a pivotal role: they give you the flexibility to write custom functions in multiple languages (Java, Python, etc.), which can then be used in Hive queries to get the desired result.
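As a sketch, this is how a Java UDF is typically wired into a Hive session; the jar path, class name, and function name here are hypothetical:

-- Register a custom Java UDF packaged in a jar (path and names are examples)
ADD JAR /tmp/my_hive_udfs.jar;
CREATE TEMPORARY FUNCTION normalize_state
  AS 'com.example.hive.udf.NormalizeState';

-- Use it like any built-in function
SELECT normalize_state(state), COUNT(*)
FROM employee
GROUP BY normalize_state(state);

(Python scripts can be plugged in similarly through Hive's SELECT TRANSFORM ... USING syntax.)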


Joins

Joins help combine two or more tables. Suppose you are working at an e-commerce organization, and management wants to see the current month's Gross Merchandise Value (GMV) at the buyer and state levels. You have been provided with three tables: order_master, user_master, and state_master. You can achieve the desired result by combining all three tables using joins.
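A sketch of such a query; the three table names come from the scenario above, while the column names are assumptions:

-- Current-month GMV per buyer and state (column names are assumed)
SELECT u.user_id,
       s.state_name,
       SUM(o.order_value) AS gmv
FROM order_master o
JOIN user_master  u ON o.user_id  = u.user_id
JOIN state_master s ON u.state_id = s.state_id
WHERE year(o.order_date)  = year(current_date)
  AND month(o.order_date) = month(current_date)
GROUP BY u.user_id, s.state_name;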

Reduce Side Join

Image 3
  1. Mapper: In the map phase, the records of each dataset are converted into key-value pairs. Say you have Dataset A and Dataset B; each and every element of both datasets is converted into a key-value pair, and the key is what joins the datasets. Please refer to the image below.
Image 4

2. Shuffling: The mapper emits each dataset as key-value pairs, but the system needs to bring all pairs with the same key together. This is where shuffling comes into the picture: it groups the pairs so that every unique key forms one group. Please refer to the image below.

Image 5

3. Reducer: With the key-value pairs grouped by key, the reducer combines the values from both datasets for every key that is present, producing the joined records.

What happens if some keys are not present in dataset A and some keys are not present in dataset B?

Image 6

Please refer to Image 6: Key 1 is present in dataset A but not in dataset B, and vice versa for Key 5. Let's see step by step how the join works:

  1. Keys 2, 3 & 4 match in both datasets, hence they will be combined together.
Image 7
  2. Key 1 is not present in dataset B, so an NA/NULL value is emitted for Key 1 on dataset B's side. Likewise, Key 5 is not present in dataset A, so an NA/NULL value is emitted for Key 5 on dataset A's side.

After the NULL values are emitted for the unmatched keys, the two datasets are combined using the reduce-side join, and the final outcome looks like the image below:

Image 8
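In Hive, this unmatched-key behavior corresponds to a full outer join, which runs as a reduce-side (shuffle) join unless Hive can convert it to a map-side join for small tables. A minimal sketch, assuming hypothetical tables dataset_a and dataset_b with columns id and value:

-- Unmatched keys on either side come back with NULLs, as in Images 6-8
SELECT COALESCE(a.id, b.id) AS id,
       a.value AS value_a,   -- NULL where the key exists only in B
       b.value AS value_b    -- NULL where the key exists only in A
FROM dataset_a a
FULL OUTER JOIN dataset_b b
  ON a.id = b.id;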

Advantages & Disadvantages

  1. It handles large datasets where out-of-memory issues would otherwise occur.
  2. It is fault tolerant in nature: if a reducer fails during the reduce phase, the operation is re-executed on another node.
  3. It handles complex and non-equi joins (the keys of the two datasets do not match exactly, yet the datasets are still combined, as explained above).
  4. Reduce-side joins are not a good choice for smaller datasets, where the shuffle overhead outweighs the benefit.

Ending Note

Apache Hive, originally developed at Facebook (now Meta), addresses coding-level challenges faced by analytics professionals processing batches of data. Many companies, including key players like Paytm and Amex, rely on Hive for their analytics tasks.

Thank you for reading this article. Happy learning🏫.
