Big Data Notes
All Topics (16)
- 1. What is Big Data?
- 2. Big Data Characteristics
- 3. Types of Big Data
- 4. Traditional Data vs Big Data
- 5. Evolution of Big Data
- 6. Challenges with Big Data
- 7. Technologies Available for Big Data
- 8. Infrastructure for Big Data
- 9. Uses of Data Analytics
- 10. Hadoop
- 11. Hadoop Core Components
- 12. Hadoop Ecosystem
- 13. Hive Physical Architecture
- 14. Hadoop Limitations
- 15. RDBMS vs Hadoop
- 16. Hadoop Distributed File System (HDFS)
16. Hadoop Distributed File System (HDFS)
HDFS is the primary storage system used in Apache Hadoop.
It is designed specifically to:
- store very large datasets
- work across multiple machines
- provide fault tolerance
- support Big Data processing
HDFS is inspired by:
Google File System (GFS)
and is one of the most important components of the Hadoop ecosystem.
What is HDFS?
HDFS stands for:
Hadoop Distributed File System
It stores huge files by:
- Splitting files into blocks
- Distributing blocks across many computers (DataNodes)
This makes storage:
- scalable
- reliable
- fault tolerant
Key Features of HDFS
1. Distributed Storage
Explanation
HDFS divides large files into smaller blocks and stores them on different nodes in the cluster.
Example
Suppose a file size is:
1 TB
HDFS splits it into:
128 MB blocks
and distributes blocks across many machines.
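The splitting step above can be sketched in a few lines of Python. This is an illustration of the idea only (the function name and sizes are hypothetical), not how HDFS is actually implemented:

```python
# Minimal sketch of HDFS-style block splitting.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

one_tb = 1024 ** 4
blocks = split_into_blocks(one_tb)
print(len(blocks))  # 8192 blocks (1 TB / 128 MB)
```

Each block is then handed to a different DataNode, which is what makes the storage distributed.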
2. Fault Tolerance
Explanation
Each block is replicated multiple times.
Default replication factor:
3 copies
If one node fails, data is still available from another node.
Example
Suppose:
- Block A stored on Node1
- Replica stored on Node2 and Node3
If Node1 crashes:
- Hadoop reads block from Node2 or Node3.
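The failover read above can be sketched as a simple fallback loop. The replica map and node names are hypothetical; a real client gets this information from the NameNode:

```python
# Sketch: reading a block when its primary replica's node is down.
replicas = {"BlockA": ["Node1", "Node2", "Node3"]}
alive = {"Node1": False, "Node2": True, "Node3": True}  # Node1 crashed

def read_block(block_id):
    # Try each replica location in turn; use the first live node.
    for node in replicas[block_id]:
        if alive[node]:
            return f"read {block_id} from {node}"
    raise IOError(f"all replicas of {block_id} unavailable")

print(read_block("BlockA"))  # falls through to Node2
```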
3. Scalability
Explanation
HDFS supports horizontal scaling.
You can increase storage by simply adding more DataNodes.
Example
A company storing:
5 PB customer data
can expand storage by adding more machines to the cluster.
4. High Throughput
Explanation
HDFS is optimized for:
- large sequential reads
- large batch writes
It is not optimized for:
- small random reads/writes
Example
Processing:
100 TB log files
is very efficient in HDFS.
5. Cost-Effective
Explanation
HDFS works on:
commodity hardware
which means low-cost ordinary servers can be used.
This reduces infrastructure cost.
6. Flexibility
Explanation
HDFS can store:
- structured data
- semi-structured data
- unstructured data
Examples
HDFS can store:
- CSV files
- JSON logs
- videos
- images
- social media data
HDFS Architecture
HDFS follows:
Master-Slave Architecture
Main Components:
- NameNode (Master)
- DataNode (Slave)
- Secondary NameNode
HDFS Architecture Diagram (Conceptual)
Client
|
NameNode
/ | \
DataNode DataNode DataNode
1. NameNode
Role
The NameNode is the:
Master Server
It manages metadata of HDFS.
Responsibilities of NameNode
The NameNode stores:
- file names
- directory structure
- block locations
- permissions
It also:
- tracks DataNodes
- manages cluster health
- handles file operations
Example
Suppose file:
sales_data.csv
is divided into blocks.
NameNode stores information like:
Block1 → DataNode1
Block2 → DataNode5
Block3 → DataNode2
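The metadata mapping above can be pictured as two in-memory lookup tables, roughly as below. This is a conceptual sketch (the structures and names are hypothetical), not the NameNode's real data model:

```python
# Sketch of the kind of metadata the NameNode keeps in memory.
namespace = {
    "/sales_data.csv": ["Block1", "Block2", "Block3"],  # file -> its blocks
}
block_locations = {
    "Block1": ["DataNode1"],
    "Block2": ["DataNode5"],
    "Block3": ["DataNode2"],
}

def locate(path):
    """Answer a client's open() request: which nodes hold each block?"""
    return [(b, block_locations[b]) for b in namespace[path]]

print(locate("/sales_data.csv"))
```

Note that only metadata lives here; the block contents themselves are on the DataNodes.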
Important Note
If NameNode fails:
HDFS becomes inaccessible
because metadata is unavailable.
Modern Hadoop reduces this risk with:
- High Availability (HA)
- a Standby NameNode that takes over if the active NameNode fails
2. DataNode
Role
DataNodes are:
Slave Nodes
that store actual data blocks.
Responsibilities
DataNodes:
- store data blocks
- read/write data
- send heartbeat signals
- perform replication
Heartbeat Mechanism
Each DataNode regularly sends:
heartbeat messages
to NameNode.
If heartbeat stops:
- NameNode assumes node failure.
Example
Suppose:
- DataNode2 crashes
NameNode automatically creates another replica on another node.
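The heartbeat check and automatic re-replication above can be sketched as follows. The timeout, data structures, and replacement-node choice are simplified assumptions (real HDFS waits on the order of ten minutes before declaring a node dead):

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; simplified assumption

last_heartbeat = {"DataNode1": time.time(),
                  "DataNode2": time.time() - 120}  # DataNode2 went silent
replicas = {"BlockA": {"DataNode1", "DataNode2"}}

def dead_nodes(now):
    # A node whose heartbeat is overdue is presumed failed.
    return {n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT}

def re_replicate(now, spare_nodes):
    """Copy blocks that lived on dead nodes onto healthy spares."""
    dead = dead_nodes(now)
    for block, nodes in replicas.items():
        lost = nodes & dead
        if lost:
            nodes -= lost
            nodes.add(spare_nodes.pop())  # hypothetical target choice
    return replicas

print(re_replicate(time.time(), ["DataNode3"]))
```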
3. Secondary NameNode
Role
The Secondary NameNode assists the NameNode.
Important:
It is NOT a backup NameNode
Functions
It:
- merges edit logs
- creates checkpoints
- reduces NameNode restart time
Example
Over time:
- NameNode metadata grows large
Secondary NameNode periodically merges:
- FSImage
- Edit Logs
to optimize metadata management.
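The checkpoint merge above amounts to replaying the edit log on top of the last FSImage snapshot. A toy sketch, with hypothetical operations and paths:

```python
# Sketch of a checkpoint: FSImage + edit log -> new FSImage, empty log.
fsimage = {"/a.csv": 3}  # path -> block count at the last checkpoint
edit_log = [("create", "/b.csv", 2), ("delete", "/a.csv", None)]

def checkpoint(image, edits):
    image = dict(image)  # work on a copy of the snapshot
    for op, path, blocks in edits:
        if op == "create":
            image[path] = blocks
        elif op == "delete":
            image.pop(path, None)
    return image, []  # merged FSImage, truncated edit log

new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image)  # {'/b.csv': 2}
```

Because the log is folded into the image periodically, a restarting NameNode replays only a short log instead of every edit since cluster start.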
HDFS File Storage Mechanism
Step 1: File Splitting
Large files are divided into blocks.
Default block size:
128 MB
Step 2: Replication
Each block is copied multiple times.
Default replication:
3 replicas
Step 3: Metadata Management
NameNode stores:
- block information
- DataNode locations
Step 4: Data Storage
Actual blocks are stored on DataNodes.
Example
Suppose file size:
1 TB
Number of blocks:
8192 blocks (1 TB ÷ 128 MB)
With replication factor 3:
Total storage needed ≈ 3 TB
distributed across cluster nodes.
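The arithmetic above, worked through at the default block size and replication factor:

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2   # 128 MB
REPLICATION = 3
file_size = 1024 ** 4          # 1 TB

num_blocks = math.ceil(file_size / BLOCK_SIZE)
raw_storage_tb = file_size * REPLICATION / 1024 ** 4

print(num_blocks)       # 8192 blocks
print(raw_storage_tb)   # 3.0 TB of raw cluster capacity consumed
```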
HDFS Read Process
Step-by-Step
Step 1
Client requests file from NameNode.
Step 2
NameNode provides block locations.
Example:
Block1 → DataNode3
Block2 → DataNode7
Step 3
Client directly reads blocks from DataNodes.
Blocks can be read in parallel.
Example
Reading:
100 GB video file
becomes faster because multiple nodes serve data simultaneously.
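The parallel read in Step 3 can be sketched with a thread pool. The node names and the `fetch` function are hypothetical stand-ins for real network reads:

```python
from concurrent.futures import ThreadPoolExecutor

# Block locations as returned by the NameNode in Step 2.
block_locations = {"Block1": "DataNode3", "Block2": "DataNode7"}

def fetch(block_id):
    node = block_locations[block_id]
    return f"{block_id} bytes from {node}"  # a real client streams data here

# The client contacts the DataNodes directly and in parallel.
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(fetch, ["Block1", "Block2"]))

print(parts)
```

Note that `pool.map` returns results in request order, so the blocks can be reassembled into the original file.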
HDFS Write Process
Step-by-Step
Step 1
Client requests write operation from NameNode.
Step 2
NameNode selects DataNodes.
Step 3
Client writes block to first DataNode.
Step 4
Block is replicated to other DataNodes.
Step 5
DataNodes confirm successful storage.
Example
Suppose replication factor:
3
Data flow:
Client → DataNode1 → DataNode2 → DataNode3
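The pipeline above can be sketched as chain replication: each node stores its copy and forwards the block downstream, and the acknowledgement travels back up the chain. Node names are hypothetical:

```python
# Sketch of the HDFS write pipeline for one block.
pipeline = ["DataNode1", "DataNode2", "DataNode3"]  # replication factor 3
stored = {}

def write_block(block_id, data, nodes):
    if not nodes:
        return True  # end of chain: the ack travels back up
    head, rest = nodes[0], nodes[1:]
    stored.setdefault(head, {})[block_id] = data  # head stores its copy
    return write_block(block_id, data, rest)      # then forwards downstream

ok = write_block("Block1", b"...", pipeline)
print(ok, sorted(stored))  # True, and all three nodes hold the block
```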
Advantages of HDFS
1. Fault Tolerance
Automatic replication protects against node failure.
2. High Throughput
Efficient for Big Data batch processing.
3. Scalability
Storage grows by adding more nodes.
4. Cost-Effective
Uses low-cost hardware.
5. Supports Multiple Data Types
Handles:
- structured
- semi-structured
- unstructured data
Limitations of HDFS
1. Not Suitable for Real-Time Processing
HDFS is batch-oriented.
Not ideal for:
- low latency
- instant querying
2. Small File Problem
Millions of small files overload NameNode memory.
Example
Suppose:
10 million 1 KB files
Metadata storage becomes a problem.
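A rough estimate makes the problem concrete. Using the commonly cited rule of thumb of about 150 bytes of NameNode heap per namespace object (one object for the file entry, one for its block):

```python
# Rough NameNode heap estimate for the small-file scenario above.
BYTES_PER_OBJECT = 150          # rule-of-thumb assumption, not an exact figure
files = 10_000_000              # 10 million 1 KB files
objects = files * 2             # each file contributes a file + a block object

heap_gb = objects * BYTES_PER_OBJECT / 1024 ** 3
print(round(heap_gb, 2))  # ~2.79 GB of metadata for only ~10 GB of data
```

The same 10 GB stored as a handful of large files would need only a few hundred namespace objects.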
3. Single NameNode Dependency
Traditional HDFS depends heavily on NameNode.
Failure can stop the cluster.
4. High Storage Usage
Replication consumes extra storage.
Example:
1 TB data with replication factor 3 = 3 TB storage
5. Limited Transaction Support
HDFS is a file system, not a database: it does not provide transactional (ACID) guarantees.
Not suitable for:
- banking systems
- OLTP applications
Real-Life Example of HDFS
Suppose YouTube stores:
- videos
- user logs
- comments
Data size:
Petabytes of data
HDFS helps by:
- distributing files across many servers
- replicating data
- enabling parallel processing
If one server fails:
- videos are still available from replicas.
Key Takeaways
Important Points
- HDFS is the backbone of Hadoop storage.
- Uses distributed architecture.
- NameNode manages metadata.
- DataNodes store actual blocks.
- Files are split into blocks and replicated.
- Optimized for:
- scalability
- fault tolerance
- high throughput