Big Data Notes

All Topics (16)

  • 1. What is Big Data?
  • 2. Big Data Characteristics
  • 3. Types of Big Data
  • 4. Traditional Data vs Big Data
  • 5. Evolution of Big Data
  • 6. Challenges with Big Data
  • 7. Technologies Available for Big Data
  • 8. Infrastructure for Big Data
  • 9. Uses of Data Analytics
  • 10. Hadoop
  • 11. Hadoop Core Components
  • 12. Hadoop Ecosystem
  • 13. Hive Physical Architecture
  • 14. Hadoop Limitations
  • 15. RDBMS vs Hadoop
  • 16. Hadoop Distributed File System (HDFS)

16. Hadoop Distributed File System (HDFS)

HDFS is the primary storage system used in Apache Hadoop.

It is specially designed to:

  • store very large datasets
  • work across multiple machines
  • provide fault tolerance
  • support Big Data processing

HDFS is inspired by:

Google File System (GFS)

and is one of the most important components of the Hadoop ecosystem.

What is HDFS?

HDFS stands for:

Hadoop Distributed File System

It stores huge files by:

  1. Splitting files into blocks
  2. Distributing blocks across many computers (DataNodes)

This makes storage:

  • scalable
  • reliable
  • fault tolerant

Key Features of HDFS

1. Distributed Storage

Explanation

HDFS divides large files into smaller blocks and stores them on different nodes in the cluster.

Example

Suppose a file size is:

1 TB

HDFS splits it into:

128 MB blocks

and distributes blocks across many machines.
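The splitting step can be sketched in Python (illustrative only, not Hadoop code): given a file size and the default 128 MB block size, compute the byte range each block covers.

```python
# Sketch (not Hadoop code): splitting a file into fixed-size blocks,
# mirroring how HDFS divides files using its default 128 MB block size.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (start, end) byte ranges, one per block."""
    blocks = []
    start = 0
    while start < file_size:
        end = min(start + block_size, file_size)
        blocks.append((start, end))
        start = end
    return blocks

one_tb = 1024 ** 4                 # 1 TB (binary) in bytes
blocks = split_into_blocks(one_tb)
print(len(blocks))                 # 8192 blocks of 128 MB each
```

Each `(start, end)` range would be stored as a separate block on some DataNode.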

2. Fault Tolerance

Explanation

Each block is replicated multiple times.

Default replication factor:

3 copies

If one node fails, data is still available from another node.

Example

Suppose:

  • Block A stored on Node1
  • Replica stored on Node2 and Node3

If Node1 crashes:

  • Hadoop reads block from Node2 or Node3.
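The failover above can be pictured as a replica lookup that skips dead nodes. This is a sketch with made-up names, not a Hadoop API.

```python
# Sketch: reading a block when its primary node is down.
# Node and block names are illustrative.

replicas = {"BlockA": ["Node1", "Node2", "Node3"]}  # replication factor 3
alive = {"Node2", "Node3"}                          # Node1 has crashed

def read_block(block_id):
    for node in replicas[block_id]:
        if node in alive:
            return f"read {block_id} from {node}"
    raise IOError(f"all replicas of {block_id} are unavailable")

print(read_block("BlockA"))  # read BlockA from Node2
```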

3. Scalability

Explanation

HDFS supports horizontal scaling.

You can increase storage by simply adding more DataNodes.

Example

A company storing:

5 PB customer data

can expand storage by adding more machines to the cluster.

4. High Throughput

Explanation

HDFS is optimized for:

  • large sequential reads
  • large batch writes

It is not optimized for:

  • small random reads/writes

Example

Processing:

100 TB log files

is very efficient in HDFS.

5. Cost-Effective

Explanation

HDFS works on:

commodity hardware

which means low-cost ordinary servers can be used.

This reduces infrastructure cost.

6. Flexibility

Explanation

HDFS can store:

  • structured data
  • semi-structured data
  • unstructured data

Examples

HDFS can store:

  • CSV files
  • JSON logs
  • videos
  • images
  • social media data

HDFS Architecture

HDFS follows:

Master-Slave Architecture

Main Components:

  1. NameNode (Master)
  2. DataNode (Slave)
  3. Secondary NameNode

HDFS Architecture Diagram (Conceptual)

                 Client
                    |
                NameNode
              /     |     \
      DataNode  DataNode  DataNode

1. NameNode

Role

The NameNode is the:

Master Server

It manages metadata of HDFS.

Responsibilities of NameNode

The NameNode stores:

  • file names
  • directory structure
  • block locations
  • permissions

It also:

  • tracks DataNodes
  • manages cluster health
  • handles file operations

Example

Suppose file:

sales_data.csv

is divided into blocks.

NameNode stores information like:

Block1 → DataNode1
Block2 → DataNode5
Block3 → DataNode2
 
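The mapping above is essentially a lookup table held in the NameNode's memory. A minimal Python sketch (names are illustrative):

```python
# Sketch: NameNode-style metadata for sales_data.csv. The NameNode keeps
# only this mapping in memory; the actual bytes live on the DataNodes.

metadata = {
    "sales_data.csv": {
        "Block1": ["DataNode1"],
        "Block2": ["DataNode5"],
        "Block3": ["DataNode2"],
    }
}

def locate(filename, block):
    """Return the DataNodes holding a given block."""
    return metadata[filename][block]

print(locate("sales_data.csv", "Block2"))  # ['DataNode5']
```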

Important Note

If NameNode fails:

HDFS becomes inaccessible

because metadata is unavailable.

Modern Hadoop uses:

  • NameNode High Availability (HA)
  • a Standby NameNode that can take over

to reduce this problem.

2. DataNode

Role

DataNodes are:

Slave Nodes

that store actual data blocks.

Responsibilities

DataNodes:

  • store data blocks
  • read/write data
  • send heartbeat signals
  • perform replication

Heartbeat Mechanism

Each DataNode regularly sends:

heartbeat messages

to NameNode.

If heartbeat stops:

  • NameNode assumes node failure.

Example

Suppose:

  • DataNode2 crashes

NameNode automatically creates another replica on another node.
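The heartbeat check can be sketched as follows. The timeout value is an assumption chosen for illustration; Hadoop's actual dead-node interval is configurable.

```python
# Sketch: the NameNode marks a DataNode dead when its last heartbeat is
# older than a timeout. TIMEOUT is an illustrative value, not Hadoop's
# exact default.

TIMEOUT = 630  # seconds (assumption for this sketch)

last_heartbeat = {"DataNode1": 1000.0, "DataNode2": 100.0}

def dead_nodes(now):
    """Return nodes whose heartbeat has not arrived within TIMEOUT."""
    return [n for n, t in last_heartbeat.items() if now - t > TIMEOUT]

print(dead_nodes(now=1005.0))  # ['DataNode2'] — silent for 905 s
```

Once a node lands in this list, the NameNode schedules re-replication of the blocks it held.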

3. Secondary NameNode

Role

The Secondary NameNode assists the NameNode.

Important:

It is NOT a backup NameNode.

Functions

It:

  • merges edit logs
  • creates checkpoints
  • reduces NameNode restart time

Example

Over time:

  • NameNode metadata grows large

Secondary NameNode periodically merges:

  • FSImage
  • Edit Logs

to optimize metadata management.

HDFS File Storage Mechanism

Step 1: File Splitting

Large files are divided into blocks.

Default block size:

128 MB
 

Step 2: Replication

Each block is copied multiple times.

Default replication:

3 replicas
 

Step 3: Metadata Management

NameNode stores:

  • block information
  • DataNode locations

Step 4: Data Storage

Actual blocks are stored on DataNodes.

Example

Suppose file size:

1 TB

Number of blocks:

≈ 8,192 blocks (1 TB ÷ 128 MB)

With replication factor 3:

Total storage needed ≈ 3 TB

distributed across cluster nodes.
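The arithmetic above can be checked directly (a sketch; 1 TB is taken as 1024⁴ bytes):

```python
# Block count and raw storage for a 1 TB file with HDFS's default
# 128 MB block size and replication factor of 3.

TB = 1024 ** 4
BLOCK = 128 * 1024 ** 2
REPLICATION = 3

file_size = 1 * TB
num_blocks = -(-file_size // BLOCK)   # ceiling division
raw_storage = file_size * REPLICATION

print(num_blocks)          # 8192
print(raw_storage // TB)   # 3 (TB of cluster capacity consumed)
```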

HDFS Read Process

Step-by-Step

Step 1

Client requests file from NameNode.

Step 2

NameNode provides block locations.

Example:

Block1 → DataNode3
Block2 → DataNode7
 

Step 3

Client directly reads blocks from DataNodes.

Blocks can be read in parallel.

Example

Reading:

100 GB video file

becomes faster because multiple nodes serve data simultaneously.
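The parallel read can be sketched with a thread pool, where `fetch()` stands in for a network read from one DataNode. Names are illustrative, not Hadoop client APIs.

```python
# Sketch: a client fetching blocks from several DataNodes at once.

from concurrent.futures import ThreadPoolExecutor

locations = {"Block1": "DataNode3", "Block2": "DataNode7"}

def fetch(item):
    """Stand-in for reading one block over the network."""
    block, node = item
    return f"{block}<-{node}"

with ThreadPoolExecutor() as pool:
    results = list(pool.map(fetch, locations.items()))

print(results)  # ['Block1<-DataNode3', 'Block2<-DataNode7']
```

Because each block lives on a different node, the reads proceed concurrently instead of queuing on one server.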

HDFS Write Process

Step-by-Step

Step 1

Client requests write operation from NameNode.

Step 2

NameNode selects DataNodes.

Step 3

Client writes block to first DataNode.

Step 4

Block is replicated to other DataNodes.

Step 5

DataNodes confirm successful storage.

Example

Suppose replication factor:

3

Data flow:

Client → DataNode1 → DataNode2 → DataNode3
 
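The pipeline above can be modelled as a chain where each node stores its copy and then forwards the block downstream. This is a conceptual sketch, not the real HDFS write protocol.

```python
# Sketch of the HDFS write pipeline: the client hands the block to the
# first DataNode, and each node stores a replica before forwarding the
# block to the next node in the chain. Names are illustrative.

pipeline = ["DataNode1", "DataNode2", "DataNode3"]  # replication factor 3

def forward(block, nodes, stored=None):
    stored = [] if stored is None else stored
    if not nodes:
        return stored                  # pipeline done; acks flow back
    stored.append((nodes[0], block))   # this node persists its replica
    return forward(block, nodes[1:], stored)

acks = forward("Block1", pipeline)
print(acks)
```

Only after the last node in the chain stores its copy do the acknowledgements travel back to the client.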

Advantages of HDFS

1. Fault Tolerance

Automatic replication protects against node failure.

2. High Throughput

Efficient for Big Data batch processing.

3. Scalability

Storage grows by adding more nodes.

4. Cost-Effective

Uses low-cost hardware.

5. Supports Multiple Data Types

Handles:

  • structured
  • semi-structured
  • unstructured data

Limitations of HDFS

1. Not Suitable for Real-Time Processing

HDFS is batch-oriented.

Not ideal for:

  • low latency
  • instant querying

2. Small File Problem

Millions of small files overload NameNode memory.

Example

Suppose:

10 million 1 KB files

The NameNode must hold metadata for all 10 million files in memory, which becomes a serious bottleneck.
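A back-of-envelope estimate makes this concrete. It uses the commonly cited rule of thumb that each HDFS object (file, directory, or block) costs roughly 150 bytes of NameNode heap; treat that figure as an approximation.

```python
# Rough NameNode memory cost for 10 million tiny files.
# BYTES_PER_OBJECT is the widely quoted ~150-byte rule of thumb
# (an approximation, not an exact Hadoop constant).

BYTES_PER_OBJECT = 150
files = 10_000_000
objects = files * 2                # each tiny file = 1 file entry + 1 block
heap_bytes = objects * BYTES_PER_OBJECT

print(heap_bytes / 1024 ** 3)      # ≈ 2.8 GB of NameNode RAM
```

Nearly 3 GB of NameNode heap just to index about 10 GB of actual data, which is why HDFS prefers a few large files over millions of small ones.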

3. Single NameNode Dependency

Traditional HDFS depends heavily on NameNode.

Failure can stop the cluster.

4. High Storage Usage

Replication consumes extra storage.

Example:

1 TB data with replication factor 3 = 3 TB storage

5. Limited Transaction Support

HDFS is not fully ACID compliant.

Not suitable for:

  • banking systems
  • OLTP applications

Real-Life Example of HDFS

Suppose YouTube stores:

  • videos
  • user logs
  • comments

Data size:

Petabytes of data

HDFS helps by:

  • distributing files across many servers
  • replicating data
  • enabling parallel processing

If one server fails:

  • videos are still available from replicas.

Key Takeaways

Important Points

  • HDFS is the backbone of Hadoop storage.
  • Uses distributed architecture.
  • NameNode manages metadata.
  • DataNodes store actual blocks.
  • Files are split into blocks and replicated.
  • Optimized for:
    • scalability
    • fault tolerance
    • high throughput