Big Data Notes

All Topics (16)

  • 1. What is Big Data?
  • 2. Big Data Characteristics
  • 3. Types of Big Data
  • 4. Traditional Data vs Big Data
  • 5. Evolution of Big Data
  • 6. Challenges with Big Data
  • 7. Technologies Available for Big Data
  • 8. Infrastructure for Big Data
  • 9. Uses of Data Analytics
  • 10. Hadoop
  • 11. Hadoop Core Components
  • 12. Hadoop Ecosystem
  • 13. Hive Physical Architecture
  • 14. Hadoop Limitations
  • 15. RDBMS vs Hadoop
  • 16. Hadoop Distributed File System (HDFS)

16. Hadoop Distributed File System (HDFS)

HDFS is the primary storage system used in Apache Hadoop.

It is specially designed to:

  • store very large datasets
  • work across multiple machines
  • provide fault tolerance
  • support Big Data processing

HDFS is inspired by:

Google File System (GFS)

and is one of the most important components of the Hadoop ecosystem.

What is HDFS?

HDFS stands for:

Hadoop Distributed File System

It stores huge files by:

  1. Splitting files into blocks
  2. Distributing blocks across many computers (DataNodes)

This makes storage:

  • scalable
  • reliable
  • fault tolerant

Key Features of HDFS

1. Distributed Storage

Explanation

HDFS divides large files into smaller blocks and stores them on different nodes in the cluster.

Example

Suppose a file size is:

1 TB

HDFS splits it into:

128 MB blocks

and distributes blocks across many machines.
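The splitting step can be sketched in Python (illustrative only, not Hadoop code): given a file size and the default 128 MB block size, compute the byte range each block covers.

```python
# Sketch (not Hadoop code): splitting a file into fixed-size blocks,
# mirroring how HDFS divides files using its default 128 MB block size.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (start, end) byte ranges, one per block."""
    blocks = []
    start = 0
    while start < file_size:
        end = min(start + block_size, file_size)
        blocks.append((start, end))
        start = end
    return blocks

one_tb = 1024 ** 4                 # 1 TB (binary) in bytes
blocks = split_into_blocks(one_tb)
print(len(blocks))                 # 8192 blocks of 128 MB each
```

Each `(start, end)` range would be stored as a separate block on some DataNode.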

2. Fault Tolerance

Explanation

Each block is replicated multiple times.

Default replication factor:

3 copies

If one node fails, data is still available from another node.

Example

Suppose:

  • Block A stored on Node1
  • Replica stored on Node2 and Node3

If Node1 crashes:

  • Hadoop reads block from Node2 or Node3.
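The failover above can be pictured as a replica lookup that skips dead nodes. This is a sketch with made-up names, not a Hadoop API.

```python
# Sketch: reading a block when its primary node is down.
# Node and block names are illustrative.

replicas = {"BlockA": ["Node1", "Node2", "Node3"]}  # replication factor 3
alive = {"Node2", "Node3"}                          # Node1 has crashed

def read_block(block_id):
    for node in replicas[block_id]:
        if node in alive:
            return f"read {block_id} from {node}"
    raise IOError(f"all replicas of {block_id} are unavailable")

print(read_block("BlockA"))  # read BlockA from Node2
```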

3. Scalability

Explanation

HDFS supports horizontal scaling.

You can increase storage by simply adding more DataNodes.

Example

A company storing:

5 PB customer data

can expand storage by adding more machines to the cluster.

4. High Throughput

Explanation

HDFS is optimized for:

  • large sequential reads
  • large batch writes

It is not optimized for:

  • small random reads/writes

Example

Processing:

100 TB log files

is very efficient in HDFS.

5. Cost-Effective

Explanation

HDFS works on:

commodity hardware

which means low-cost ordinary servers can be used.

This reduces infrastructure cost.

6. Flexibility

Explanation

HDFS can store:

  • structured data
  • semi-structured data
  • unstructured data

Examples

HDFS can store:

  • CSV files
  • JSON logs
  • videos
  • images
  • social media data

HDFS Architecture

HDFS follows:

Master-Slave Architecture

Main Components:

  1. NameNode (Master)
  2. DataNode (Slave)
  3. Secondary NameNode

HDFS Architecture Diagram (Conceptual)

                 Client
                    |
                NameNode
              /     |     \
      DataNode  DataNode  DataNode

1. NameNode

Role

The NameNode is the:

Master Server

It manages metadata of HDFS.

Responsibilities of NameNode

The NameNode stores:

  • file names
  • directory structure
  • block locations
  • permissions

It also:

  • tracks DataNodes
  • manages cluster health
  • handles file operations

Example

Suppose file:

sales_data.csv

is divided into blocks.

NameNode stores information like:

Block1 → DataNode1
Block2 → DataNode5
Block3 → DataNode2
 
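The mapping above is essentially a lookup table held in the NameNode's memory. A minimal Python sketch (names are illustrative):

```python
# Sketch: NameNode-style metadata for sales_data.csv. The NameNode keeps
# only this mapping in memory; the actual bytes live on the DataNodes.

metadata = {
    "sales_data.csv": {
        "Block1": ["DataNode1"],
        "Block2": ["DataNode5"],
        "Block3": ["DataNode2"],
    }
}

def locate(filename, block):
    """Return the DataNodes holding a given block."""
    return metadata[filename][block]

print(locate("sales_data.csv", "Block2"))  # ['DataNode5']
```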

Important Note

If NameNode fails:

HDFS becomes inaccessible

because metadata is unavailable.

Modern Hadoop uses:

  • NameNode High Availability (HA)
  • a Standby NameNode that can take over

to reduce this problem.

2. DataNode

Role

DataNodes are:

Slave Nodes

that store actual data blocks.

Responsibilities

DataNodes:

  • store data blocks
  • read/write data
  • send heartbeat signals
  • perform replication

Heartbeat Mechanism

Each DataNode regularly sends:

heartbeat messages

to NameNode.

If heartbeat stops:

  • NameNode assumes node failure.

Example

Suppose:

  • DataNode2 crashes

NameNode automatically creates another replica on another node.
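The heartbeat check can be sketched as follows. The timeout value is an assumption chosen for illustration; Hadoop's actual dead-node interval is configurable.

```python
# Sketch: the NameNode marks a DataNode dead when its last heartbeat is
# older than a timeout. TIMEOUT is an illustrative value, not Hadoop's
# exact default.

TIMEOUT = 630  # seconds (assumption for this sketch)

last_heartbeat = {"DataNode1": 1000.0, "DataNode2": 100.0}

def dead_nodes(now):
    """Return nodes whose heartbeat has not arrived within TIMEOUT."""
    return [n for n, t in last_heartbeat.items() if now - t > TIMEOUT]

print(dead_nodes(now=1005.0))  # ['DataNode2'] — silent for 905 s
```

Once a node lands in this list, the NameNode schedules re-replication of the blocks it held.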

3. Secondary NameNode

Role

The Secondary NameNode assists the NameNode.

Important:

It is NOT a backup NameNode.

Functions

It:

  • merges edit logs
  • creates checkpoints
  • reduces NameNode restart time

Example

Over time:

  • NameNode metadata grows large

Secondary NameNode periodically merges:

  • FSImage
  • Edit Logs

to optimize metadata management.

HDFS File Storage Mechanism

Step 1: File Splitting

Large files are divided into blocks.

Default block size:

128 MB
 

Step 2: Replication

Each block is copied multiple times.

Default replication:

3 replicas
 

Step 3: Metadata Management

NameNode stores:

  • block information
  • DataNode locations

Step 4: Data Storage

Actual blocks are stored on DataNodes.

Example

Suppose file size:

1 TB

Number of blocks:

≈ 8,192 blocks (1 TB ÷ 128 MB)

With replication factor 3:

Total storage needed ≈ 3 TB

distributed across cluster nodes.
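The arithmetic above can be checked directly (a sketch; 1 TB is taken as 1024⁴ bytes):

```python
# Block count and raw storage for a 1 TB file with HDFS's default
# 128 MB block size and replication factor of 3.

TB = 1024 ** 4
BLOCK = 128 * 1024 ** 2
REPLICATION = 3

file_size = 1 * TB
num_blocks = -(-file_size // BLOCK)   # ceiling division
raw_storage = file_size * REPLICATION

print(num_blocks)          # 8192
print(raw_storage // TB)   # 3 (TB of cluster capacity consumed)
```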

HDFS Read Process

Step-by-Step

Step 1

Client requests file from NameNode.

Step 2

NameNode provides block locations.

Example:

Block1 → DataNode3
Block2 → DataNode7
 

Step 3

Client directly reads blocks from DataNodes.

Blocks can be read in parallel.

Example

Reading:

100 GB video file

becomes faster because multiple nodes serve data simultaneously.
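The parallel read can be sketched with a thread pool, where `fetch()` stands in for a network read from one DataNode. Names are illustrative, not Hadoop client APIs.

```python
# Sketch: a client fetching blocks from several DataNodes at once.

from concurrent.futures import ThreadPoolExecutor

locations = {"Block1": "DataNode3", "Block2": "DataNode7"}

def fetch(item):
    """Stand-in for reading one block over the network."""
    block, node = item
    return f"{block}<-{node}"

with ThreadPoolExecutor() as pool:
    results = list(pool.map(fetch, locations.items()))

print(results)  # ['Block1<-DataNode3', 'Block2<-DataNode7']
```

Because each block lives on a different node, the reads proceed concurrently instead of queuing on one server.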

HDFS Write Process

Step-by-Step

Step 1

Client requests write operation from NameNode.

Step 2

NameNode selects DataNodes.

Step 3

Client writes block to first DataNode.

Step 4

Block is replicated to other DataNodes.

Step 5

DataNodes confirm successful storage.

Example

Suppose replication factor:

3

Data flow:

Client → DataNode1 → DataNode2 → DataNode3
 
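The pipeline above can be modelled as a chain where each node stores its copy and then forwards the block downstream. This is a conceptual sketch, not the real HDFS write protocol.

```python
# Sketch of the HDFS write pipeline: the client hands the block to the
# first DataNode, and each node stores a replica before forwarding the
# block to the next node in the chain. Names are illustrative.

pipeline = ["DataNode1", "DataNode2", "DataNode3"]  # replication factor 3

def forward(block, nodes, stored=None):
    stored = [] if stored is None else stored
    if not nodes:
        return stored                  # pipeline done; acks flow back
    stored.append((nodes[0], block))   # this node persists its replica
    return forward(block, nodes[1:], stored)

acks = forward("Block1", pipeline)
print(acks)
```

Only after the last node in the chain stores its copy do the acknowledgements travel back to the client.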

Advantages of HDFS

1. Fault Tolerance

Automatic replication protects against node failure.

2. High Throughput

Efficient for Big Data batch processing.

3. Scalability

Storage grows by adding more nodes.

4. Cost-Effective

Uses low-cost hardware.

5. Supports Multiple Data Types

Handles:

  • structured
  • semi-structured
  • unstructured data

Limitations of HDFS

1. Not Suitable for Real-Time Processing

HDFS is batch-oriented.

Not ideal for:

  • low latency
  • instant querying

2. Small File Problem

Millions of small files overload NameNode memory.

Example

Suppose:

10 million 1 KB files

The NameNode must hold metadata for all 10 million files in memory, which becomes a serious bottleneck.
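A back-of-envelope estimate makes this concrete. It uses the commonly cited rule of thumb that each HDFS object (file, directory, or block) costs roughly 150 bytes of NameNode heap; treat that figure as an approximation.

```python
# Rough NameNode memory cost for 10 million tiny files.
# BYTES_PER_OBJECT is the widely quoted ~150-byte rule of thumb
# (an approximation, not an exact Hadoop constant).

BYTES_PER_OBJECT = 150
files = 10_000_000
objects = files * 2                # each tiny file = 1 file entry + 1 block
heap_bytes = objects * BYTES_PER_OBJECT

print(heap_bytes / 1024 ** 3)      # ≈ 2.8 GB of NameNode RAM
```

Nearly 3 GB of NameNode heap just to index about 10 GB of actual data, which is why HDFS prefers a few large files over millions of small ones.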

3. Single NameNode Dependency

Traditional HDFS depends heavily on NameNode.

Failure can stop the cluster.

4. High Storage Usage

Replication consumes extra storage.

Example:

1 TB data with replication factor 3 = 3 TB storage

5. Limited Transaction Support

HDFS is not fully ACID compliant.

Not suitable for:

  • banking systems
  • OLTP applications

Real-Life Example of HDFS

Suppose YouTube stores:

  • videos
  • user logs
  • comments

Data size:

Petabytes of data

HDFS helps by:

  • distributing files across many servers
  • replicating data
  • enabling parallel processing

If one server fails:

  • videos are still available from replicas.

Key Takeaways

Important Points

  • HDFS is the backbone of Hadoop storage.
  • Uses distributed architecture.
  • NameNode manages metadata.
  • DataNodes store actual blocks.
  • Files are split into blocks and replicated.
  • Optimized for:
    • scalability
    • fault tolerance
    • high throughput