An Introduction to Hadoop

akhil anand
5 min read · Dec 4, 2022
Figure 1

Introduction

When we deal with Big Data, our systems have to handle very large data sets, from gigabytes up to petabytes. To deal with these large chunks of data, data engineers nowadays use Hadoop (clusters of multiple systems). Hadoop helps analyze massive chunks of data in parallel and therefore much more quickly.

Suppose you are a YouTuber and you have a laptop that can store 1 TB of data, and assume you post content regularly. After a certain period of time you will have uploaded 1 TB of content to YouTube, and you now need an additional storage system for backup. You can do this in two ways: either you replace your existing system with a more powerful one (say, a system with 5 TB of storage), which is called scaling up, or you purchase an additional hard disk to store more data, which is called scaling out.

Scaling up (replacing the old system with a new one) is costlier than scaling out (adding commodity hardware). Hadoop also uses the scale-out approach to process large chunks of data.

HDFS 1.x

HDFS :- Hadoop Distributed File System.

HDFS 1.x follows a master-slave architecture. It consists of a single master and multiple slaves. Suppose you are working in an organization: you have one manager and 5–10 team members, and the main task of the manager is to allocate work among the team members. HDFS 1.x works the same way: the manager is the master and the team members are the slaves.

Now the question is: what does Hadoop Distributed File System actually mean?

Suppose you want to process 1 GB of data and you have 4 systems available to process it. The data is divided equally (250 MB each) among the four systems (assuming all the systems are available at their full capacity). This parallel processing and distribution of data is what the distributed file system takes care of.
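In practice HDFS does not divide a file by the number of machines; it splits the file into fixed-size blocks (64 MB by default in Hadoop 1.x) and spreads those blocks across the data nodes. Here is a minimal sketch of that arithmetic, assuming the default block size and the four machines from the example above:

```java
// A minimal sketch (not Hadoop code) of how a 1 GB file is split into
// fixed-size HDFS blocks; 64 MB is the default block size in Hadoop 1.x.
public class BlockSplitSketch {
    public static void main(String[] args) {
        long fileSizeMb = 1024;   // 1 GB file
        long blockSizeMb = 64;    // assumed default HDFS 1.x block size
        int dataNodes = 4;        // the four machines from the example above

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
        System.out.println("Number of blocks: " + blocks);                           // 16
        System.out.println("Blocks per data node (roughly): " + blocks / dataNodes); // ~4
    }
}
```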
Let’s understand the Hadoop 1.x architecture.

Figure 2:- Hadoop 1.X

The Master Node

The master node consists of the Name Node and Job tracker.

a. The Name Node :

1. The main task of the Name Node is to store metadata.

What does metadata mean?
It is information about the available systems (the data nodes), for example:
i> Availability of the system
ii> System configuration (RAM, storage)
iii> Processing power of the system
iv> Checkpointing
v> Work assigned to the slave nodes

Note:- The Name Node does not store the actual data; it only stores metadata about the data nodes.
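To make the idea of metadata concrete, here is a purely illustrative sketch (not the real NameNode internals; the class and field names are invented for this example) of the kind of per-node information the Name Node keeps track of:

```java
// Purely illustrative sketch of metadata a name node might keep about a data node;
// the real NameNode data structures are different and far more detailed.
import java.util.ArrayList;
import java.util.List;

class DataNodeInfo {                 // hypothetical class, for illustration only
    String hostName;                 // which slave machine this is
    boolean available;               // i>   availability of the system
    int ramGb;                       // ii>  system configuration (RAM)
    long storageGb;                  //      system configuration (storage)
    int cpuCores;                    // iii> processing power of the system
    List<String> assignedBlocks = new ArrayList<>(); // v> work assigned to this slave
}

public class NameNodeMetadataSketch {
    public static void main(String[] args) {
        DataNodeInfo node = new DataNodeInfo();
        node.hostName = "datanode-01";
        node.available = true;
        node.ramGb = 16;
        node.storageGb = 1024;
        node.cpuCores = 8;
        node.assignedBlocks.add("blk_0001");
        System.out.println(node.hostName + " holds " + node.assignedBlocks.size() + " block(s)");
    }
}
```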

Figure 3

The Name Node is further divided into two parts: the Primary Name Node and the Secondary Name Node.

Figure 4

i.> Primary Name Node:- When the system starts, it stores the metadata about the available systems at the 0th second; this snapshot is known as the fs_image (file system image). It gives the first response to any kind of execution.
fs_image :- The file system image stores information about the device, such as its configuration, its processing power, and the kind of data the system will process.

ii.> Secondary Name Node:- Suppose you commute regularly from your home to your office. You leave home at a particular time, say 8:30 AM, you have lunch at 2:30 PM, and you leave the office at 7:00 PM; these timings are the logs of your life. In the same way, when a machine performs any kind of operation it records logs, and with the help of these logs we can tell whether a task succeeded. Those logs are recorded in the Secondary Name Node. Suppose the admin has set the log to be updated every 50 seconds; then at every 50th second an edit_log file is created, and the fs_image and edit_log are combined to form the updated fs_image_2.0, which is sent back to the Primary Name Node. Once the fs_image information has been updated in the Secondary Name Node, the Secondary Name Node sends a flag to the PNN (Primary Name Node), and once the PNN is free it stores the updated fs_image file. This process repeats until the task is done.

The process of producing the updated fs_image_2.0 is known as checkpointing, and remember that checkpointing happens only in the Secondary Name Node.
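Here is a toy sketch of the checkpointing idea described above (replaying the edit log on top of the fs_image to get an updated image). It is only the analogy expressed in code, not how the Secondary Name Node is actually implemented:

```java
// Toy illustration of checkpointing: fs_image + edit_log -> updated fs_image.
// This is an analogy only; the real Secondary Name Node merges binary fsimage
// and edits files, it does not work on in-memory maps like this.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointSketch {
    // fs_image: the snapshot of metadata taken at the 0th second
    static Map<String, String> fsImage = new HashMap<>(Map.of("fileA", "datanode-01"));

    // edit_log: operations recorded since the last snapshot
    static List<String[]> editLog = List.of(
            new String[]{"fileB", "datanode-02"},
            new String[]{"fileC", "datanode-03"});

    public static void main(String[] args) {
        // "Checkpointing": replay the edit log on top of the old image
        Map<String, String> updatedImage = new HashMap<>(fsImage);
        for (String[] entry : editLog) {
            updatedImage.put(entry[0], entry[1]);
        }
        // The updated image ("fs_image_2.0") would now be sent back to the Primary Name Node
        System.out.println("Updated fs_image: " + updatedImage);
    }
}
```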

b. Job Tracker:

It works at the cluster level. When a client requests data, the Job Tracker starts pinging the task to the different slave nodes in the cluster, and whichever node responds first is assigned the task.
The Job Tracker chooses nodes on a first-come, first-served basis. Below are some important tasks the Job Tracker performs:
i> Job Scheduling
ii> Resource Management
iii> Job Monitoring.
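To see where the Job Tracker fits in from a developer's point of view, here is a small sketch of submitting a job with the classic Hadoop 1.x mapred API; when JobClient.runJob is called, it is the Job Tracker that schedules the map and reduce tasks onto the slave nodes. The input and output paths are placeholders, and the identity mapper/reducer are used only to keep the sketch short:

```java
// Sketch of submitting a MapReduce job with the classic Hadoop 1.x "mapred" API.
// "/input" and "/output" are placeholder HDFS paths for this example.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJobSketch {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SubmitJobSketch.class);
        conf.setJobName("pass-through-sketch");

        // Identity mapper/reducer simply pass records through; a real job
        // would plug in its own Mapper and Reducer classes here.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path("/input"));   // placeholder path
        FileOutputFormat.setOutputPath(conf, new Path("/output")); // placeholder path

        // Submits the job to the Job Tracker, which schedules tasks on the Task Trackers
        JobClient.runJob(conf);
    }
}
```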

The Slave Node :

One master node can have multiple slave nodes. Each slave node consists of a Task Tracker and a Data Node.

a. Task Tracker

The Task Tracker and the Job Tracker do similar work; the only major difference is that the Job Tracker works at the cluster level while the Task Tracker works at the node level.
The Task Tracker reports back to the Job Tracker about the resources available on its node for the task.

b. Data Node

It initiates the task and is responsible for storing the actual data in HDFS. It takes up a lot of hardware (disk) space, because the actual data-level work happens in the Data Node.
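Although a client never touches a Data Node's disks directly, Hadoop's FileSystem client API shows the flow: metadata requests go through the Name Node, while the file bytes themselves are written to and read from the Data Nodes. A small sketch (the file path is a placeholder for this example):

```java
// Small sketch of writing and reading a file in HDFS with the FileSystem API.
// The Name Node supplies the metadata; the actual bytes live on the Data Nodes.
// "/user/demo/hello.txt" is a placeholder path for this example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write: the client asks the Name Node where to place the blocks,
        // then streams the bytes to the chosen Data Nodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the Name Node returns the block locations,
        // and the bytes are read back from the Data Nodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```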

Conclusion:-
If you have any questions, kindly comment below. Keep learning, keep exploring.
