Big Data Notes
All Topics (19)
- 1. What is Big Data?
- 2. Big Data Characteristics
- 3. Types of Big Data
- 4. Traditional Data vs Big Data
- 5. Evolution of Big Data
- 6. Challenges with Big Data
- 7. Technologies Available for Big Data
- 8. Infrastructure for Big Data
- 9. Uses of Data Analytics
- 10. Hadoop
- 11. Hadoop Core Components
- 12. Hadoop Ecosystem
- 13. Hive Physical Architecture
- 14. Hadoop Limitations
- 15. RDBMS vs Hadoop
- 16. Hadoop Distributed File System (HDFS)
- 17. Processing Data with Hadoop
- 18. Hadoop YARN
- 19. MapReduce Programming
16. Hadoop Distributed File System (HDFS)
HDFS is the primary storage system used in Apache Hadoop.
It is specially designed to:
- store very large datasets
- work across multiple machines
- provide fault tolerance
- support Big Data processing
HDFS is inspired by:
Google File System (GFS)
and is one of the most important components of the Hadoop ecosystem.
What is HDFS?
HDFS stands for:
Hadoop Distributed File System
It stores huge files by:
- Splitting files into blocks
- Distributing blocks across many computers (DataNodes)
This makes storage:
- scalable
- reliable
- fault tolerant
Key Features of HDFS
1. Distributed Storage
Explanation
HDFS divides large files into smaller blocks and stores them on different nodes in the cluster.
Example
Suppose a file size is:
1 TB
HDFS splits it into:
128 MB blocks
and distributes blocks across many machines.
2. Fault Tolerance
Explanation
Each block is replicated multiple times.
Default replication factor:
3 copies
If one node fails, data is still available from another node.
Example
Suppose:
- Block A stored on Node1
- Replica stored on Node2 and Node3
If Node1 crashes:
- Hadoop reads block from Node2 or Node3.
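The replication factor can also be tuned per file through the HDFS Java API. Below is a minimal sketch, assuming the Hadoop client libraries are on the classpath and a reachable cluster; the file name sales_data.csv is illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        // Connects using the cluster settings in core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(new Configuration());

        // Raise one file's replication from the default 3 to 4;
        // the NameNode schedules the extra copy in the background
        boolean accepted = fs.setReplication(new Path("sales_data.csv"), (short) 4);
        System.out.println("Replication change accepted: " + accepted);
    }
}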
3. Scalability
Explanation
HDFS supports horizontal scaling.
You can increase storage by simply adding more DataNodes.
Example
A company storing:
5 PB customer data
can expand storage by adding more machines to the cluster.
4. High Throughput
Explanation
HDFS is optimized for:
- large sequential reads
- large batch writes
It is not optimized for:
- small random reads/writes
Example
Processing:
100 TB log files
is very efficient in HDFS.
5. Cost-Effective
Explanation
HDFS works on:
commodity hardware
which means low-cost ordinary servers can be used.
This reduces infrastructure cost.
6. Flexibility
Explanation
HDFS can store:
- structured data
- semi-structured data
- unstructured data
Examples
HDFS can store:
- CSV files
- JSON logs
- videos
- images
- social media data
HDFS Architecture
HDFS follows:
Master-Slave Architecture
Main Components:
- NameNode (Master)
- DataNode (Slave)
- Secondary NameNode
HDFS Architecture Diagram (Conceptual)
        Client
           |
        NameNode
       /   |   \
DataNode DataNode DataNode
1. NameNode
Role
The NameNode is the:
Master Server
It manages metadata of HDFS.
Responsibilities of NameNode
The NameNode stores:
- file names
- directory structure
- block locations
- permissions
It also:
- tracks DataNodes
- manages cluster health
- handles file operations
Example
Suppose file:
sales_data.csv
is divided into blocks.
NameNode stores information like:
Block1 → DataNode1
Block2 → DataNode5
Block3 → DataNode2
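This block-to-DataNode mapping can be inspected from a client program. A minimal Java sketch, assuming the Hadoop client libraries, a reachable cluster, and that sales_data.csv exists in the user's HDFS home directory:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("sales_data.csv"));

        // Ask the NameNode where every block of the file is stored
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block" + (i + 1) + " -> "
                + String.join(", ", blocks[i].getHosts()));
        }
    }
}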
Important Note
If NameNode fails:
HDFS becomes inaccessible
because metadata is unavailable.
Modern Hadoop uses:
- High Availability (HA)
- a Standby NameNode that can take over
to reduce this problem.
2. DataNode
Role
DataNodes are:
Slave Nodes
that store actual data blocks.
Responsibilities
DataNodes:
- store data blocks
- read/write data
- send heartbeat signals
- perform replication
Heartbeat Mechanism
Each DataNode regularly sends:
heartbeat messages
to NameNode.
If heartbeat stops:
- NameNode assumes node failure.
Example
Suppose:
- DataNode2 crashes
NameNode automatically creates another replica on another node.
3. Secondary NameNode
Role
The Secondary NameNode assists the NameNode.
Important:
It is NOT a backup NameNode
Functions
It:
- merges edit logs
- creates checkpoints
- reduces NameNode restart time
Example
Over time:
- NameNode metadata grows large
Secondary NameNode periodically merges:
- FSImage
- Edit Logs
to optimize metadata management.
HDFS File Storage Mechanism
Step 1: File Splitting
Large files are divided into blocks.
Default block size:
128 MB
Step 2: Replication
Each block is copied multiple times.
Default replication:
3 replicas
Step 3: Metadata Management
NameNode stores:
- block information
- DataNode locations
Step 4: Data Storage
Actual blocks are stored on DataNodes.
Example
Suppose file size:
1 TB
Number of blocks:
1 TB ÷ 128 MB = 8,192 blocks
With replication factor 3:
Total storage needed ≈ 3 TB
distributed across cluster nodes.
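The same arithmetic as a small, self-contained Java sketch (1 TB file, 128 MB blocks, replication factor 3, all as assumed above):
public class StorageEstimate {
    public static void main(String[] args) {
        long fileBytes = 1L << 40;        // 1 TB
        long blockBytes = 128L << 20;     // 128 MB default block size
        int replication = 3;              // default replication factor

        long blocks = (fileBytes + blockBytes - 1) / blockBytes; // round up
        double rawTb = fileBytes * replication / (double) (1L << 40);

        System.out.println("Blocks: " + blocks);              // 8192
        System.out.println("Raw storage: " + rawTb + " TB");  // 3.0 TB
    }
}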
HDFS Read Process
Step-by-Step
Step 1
Client requests file from NameNode.
Step 2
NameNode provides block locations.
Example:
Block1 → DataNode3
Block2 → DataNode7
Step 3
Client directly reads blocks from DataNodes.
Blocks can be read in parallel.
Example
Reading:
100 GB video file
becomes faster because multiple nodes serve data simultaneously.
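A minimal read sketch using the HDFS Java API (assuming a reachable cluster; the file path comes from the command line). Note that the client only asks the NameNode for locations; the bytes flow straight from the DataNodes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() fetches block locations from the NameNode;
        // the stream then reads blocks directly from DataNodes
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}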
HDFS Write Process
Step-by-Step
Step 1
Client requests write operation from NameNode.
Step 2
NameNode selects DataNodes.
Step 3
Client writes block to first DataNode.
Step 4
Block is replicated to other DataNodes.
Step 5
DataNodes confirm successful storage.
Example
Suppose replication factor:
3
Data flow:
Client → DataNode1 → DataNode2 → DataNode3
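A minimal write sketch (assuming a reachable cluster; the file name demo.txt is illustrative). The replication pipeline in Step 4 happens behind this one create() call:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() asks the NameNode to pick DataNodes; the client streams
        // the data to the first DataNode, which forwards it down the pipeline
        try (FSDataOutputStream out = fs.create(new Path("demo.txt"))) {
            out.writeUTF("hello HDFS");
        }
    }
}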
Advantages of HDFS
1. Fault Tolerance
Automatic replication protects against node failure.
2. High Throughput
Efficient for Big Data batch processing.
3. Scalability
Storage grows by adding more nodes.
4. Cost-Effective
Uses low-cost hardware.
5. Supports Multiple Data Types
Handles:
- structured
- semi-structured
- unstructured data
Limitations of HDFS
1. Not Suitable for Real-Time Processing
HDFS is batch-oriented.
Not ideal for:
- low latency
- instant querying
2. Small File Problem
Millions of small files overload NameNode memory.
Example
Suppose:
10 million 1 KB files
Metadata storage becomes a problem.
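A rough back-of-the-envelope sketch of why this hurts. The ~150 bytes of NameNode heap per file/block object is a commonly cited rule of thumb, not an exact figure:
public class SmallFileEstimate {
    public static void main(String[] args) {
        long files = 10_000_000L;   // 10 million small files
        long objectsPerFile = 2;    // roughly: one file object + one block object
        long bytesPerObject = 150;  // rule-of-thumb NameNode heap cost

        double heapGb = files * objectsPerFile * bytesPerObject
            / (double) (1L << 30);
        // ~2.8 GB of NameNode heap just to track ~10 GB of actual data
        System.out.println("Estimated NameNode heap: " + heapGb + " GB");
    }
}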
3. Single NameNode Dependency
Traditional HDFS depends heavily on NameNode.
Failure can stop the cluster.
4. High Storage Usage
Replication consumes extra storage.
Example:
1 TB data with replication factor 3 = 3 TB storage
5. Limited Transaction Support
HDFS is a file system, not a database, and it does not provide ACID transactions.
Not suitable for:
- banking systems
- OLTP applications
Real-Life Example of HDFS
Suppose YouTube stores:
- videos
- user logs
- comments
Data size:
Petabytes of data
HDFS helps by:
- distributing files across many servers
- replicating data
- enabling parallel processing
If one server fails:
- videos are still available from replicas.
Key Takeaways
Important Points
- HDFS is the backbone of Hadoop storage.
- Uses distributed architecture.
- NameNode manages metadata.
- DataNodes store actual blocks.
- Files are split into blocks and replicated.
- Optimized for:
- scalability
- fault tolerance
- high throughput
17. Processing Data with Hadoop
Hadoop is not just the HDFS storage layer; it is also a complete platform for processing very large amounts of data.
It can process data across many computers together, which makes it:
- Fast
- Scalable
- Fault-tolerant
Hadoop uses different tools/frameworks for different types of processing.
Hadoop Data Processing Flow
General Flow
Data Source → Ingestion → HDFS Storage → Processing Engine → Output/Analysis
Example
Suppose a company collects:
- Website logs
- Customer data
- Sales records
- Social media data
This data is first stored in HDFS, then processed using tools like:
- MapReduce
- Hive
- Pig
- Spark
1. MapReduce – Core Hadoop Processing Framework
MapReduce is the original processing engine of Hadoop.
It processes huge datasets in parallel across many machines.
How MapReduce Works
Step 1: Input Splitting
HDFS divides large files into blocks (usually 128 MB).
Example
A 1 GB file is divided into:
8 blocks of 128 MB each
Each block is processed separately.
Step 2: Map Phase
Mapper reads data and produces key-value pairs.
Example: Word Count
Input text:
Hadoop is big data
Big data is powerful
Mapper Output:
(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Big,1)
(data,1)
(is,1)
(powerful,1)
Step 3: Shuffle and Sort
Hadoop groups the values of each key together and sorts the keys.
Output becomes:
(Big,[1])
(Hadoop,[1])
(big,[1])
(data,[1,1])
(is,[1,1])
(powerful,[1])
Note: "Big" and "big" stay separate because keys are case-sensitive.
Step 4: Reduce Phase
Reducer combines values.
Final Output:
(Big,1)
(Hadoop,1)
(big,1)
(data,2)
(is,2)
(powerful,1)
Step 5: Output Storage
Final result is stored back in HDFS.
Advantages of MapReduce
a. Highly scalable
b. Fault tolerant
c. Handles huge data
Disadvantages
a. Slow for real-time work
b. Complex coding
2. Apache Hive – SQL-Like Processing
Hive allows users to write SQL-like queries on Hadoop data.
Instead of writing Java MapReduce code, analysts can use simple queries.
Hive Workflow
- User writes HiveQL query
- Hive converts it into MapReduce/Spark jobs
- Hadoop executes the job
- Results are returned
Example of Hive Query
Suppose we have a table named employees.
Query
SELECT department, COUNT(*)
FROM employees
GROUP BY department;
What Hive Does
Hive automatically:
- Reads data from HDFS
- Converts query into processing jobs
- Executes on cluster
- Returns result
Real-Life Example
An e-commerce company wants:
Total sales by city
Using Hive:
SELECT city, SUM(sales)
FROM orders
GROUP BY city;
Very easy compared to MapReduce coding.
Advantages of Hive
a. Easy for SQL users
b. No need to write Java code
c. Good for reporting
Disadvantages
a. Slower for real-time applications
3. Apache Pig – ETL and Data Transformation
Pig uses a scripting language called Pig Latin.
It is mainly used for:
- ETL (Extract, Transform, Load)
- Cleaning data
- Transforming data
Pig Workflow
- Write Pig script
- Pig converts it into MapReduce jobs
- Hadoop executes jobs
- Results stored in HDFS
Example of Pig Script
Suppose sales data contains:
id,name,sales
1,Ram,5000
2,Shyam,7000
Pig Script
-- assumes sales.csv contains only data rows (no header line)
A = LOAD 'sales.csv' USING PigStorage(',')
        AS (id:int, name:chararray, sales:int);
B = FILTER A BY sales > 6000;
STORE B INTO 'output' USING PigStorage(',');
Result
2,Shyam,7000
Advantages of Pig
a. Easy data transformation
b. Less coding
c. Good for ETL jobs
Disadvantages
a. Not suitable for real-time processing
4. Apache Spark – Fast In-Memory Processing
Spark is much faster than MapReduce because it processes data in memory.
It supports:
- Batch processing
- Real-time processing
- Machine learning
- Graph processing
Spark Workflow
- Load data from HDFS/Hive/Kafka
- Apply transformations
- Execute tasks in parallel
- Save output
Example Using Spark
Suppose we want to count words.
PySpark Example
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

data = spark.read.text("input.txt")
words = data.selectExpr("explode(split(value, ' ')) as word")
result = words.groupBy("word").count()
result.show()
Why Spark is Fast
MapReduce writes intermediate results to disk.
Spark keeps data in RAM (memory).
So processing becomes much faster.
Advantages of Spark
a. Very fast
b. Real-time processing
c. Supports Python, Java, Scala, R
d. Good for Machine Learning
Disadvantages
a. Requires more RAM
5. Real-Time and Streaming Tools
Hadoop ecosystem also supports streaming data.
Apache Flume
Used to collect log data and move it into HDFS.
Example
Web server logs → Flume → HDFS
Apache Kafka
Distributed messaging system for streaming data.
Example
Twitter messages streamed into Hadoop.
Spark Streaming / Storm
Processes live streaming data.
Example Use Cases
- Fraud detection
- IoT sensor monitoring
- Live analytics
- Stock market analysis
Complete Hadoop Ecosystem Example
Example: Online Shopping Website
Step 1: Data Generation
Users generate:
- Search logs
- Purchase records
- Clickstream data
Step 2: Data Ingestion
Tools like:
- Flume
- Kafka
collect data.
Step 3: Storage
Data stored in HDFS.
Step 4: Processing
Different tools used:
| Task | Tool |
|---|---|
| Batch report | MapReduce |
| SQL analytics | Hive |
| Data cleaning | Pig |
| Real-time analytics | Spark |
Step 5: Output
Results used for:
- Dashboards
- Reports
- Recommendations
- Alerts
Best Practices in Hadoop Processing
1. Use Large Files
Small files increase NameNode overhead.
Good practice:
Use files larger than 128 MB
2. Partition Hive Tables
Improves query speed.
Example
Partition by date:
sales/year=2026/month=05/
3. Choose Correct Tool
| Requirement | Best Tool |
|---|---|
| Batch Processing | MapReduce |
| SQL Queries | Hive |
| ETL/Data Cleaning | Pig |
| Real-Time Processing | Spark |
4. Compress Data
Compression improves:
- Storage efficiency
- I/O speed
Common choices:
- Snappy (compression codec)
- Gzip (compression codec)
- Parquet (columnar file format with built-in compression)
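For MapReduce jobs, output compression can be switched on in the driver. A minimal sketch, assuming a Job object like the one built in topic 19 and that the native Snappy library is installed on the cluster:
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
    // Call from a driver before submitting the job
    public static void enableSnappyOutput(Job job) {
        // Compress the final job output files with the Snappy codec
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    }
}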
5. Monitor with YARN
YARN manages cluster resources.
It helps avoid:
- Memory issues
- Task failures
Simple Comparison Table
| Feature | MapReduce | Hive | Pig | Spark |
|---|---|---|---|---|
| Type | Processing Engine | SQL Tool | Scripting Tool | Fast Processing Engine |
| Coding | Complex | Easy SQL | Simple Scripts | Moderate |
| Speed | Slow | Medium | Medium | Fast |
| Real-Time | No | No | No | Yes |
| Best Use | Batch Jobs | Data Analysis | ETL | Real-Time + ML |
18. Hadoop YARN
YARN (Yet Another Resource Negotiator) is the resource management and job scheduling system in Hadoop.
It was introduced in Hadoop 2.x to solve the limitations of Hadoop 1.x.
Before YARN (Hadoop 1.x)
In Hadoop 1.x:
- MapReduce handled:
- Data processing
- Resource management
Because of this:
a. Only MapReduce jobs could run
b. Cluster scalability was limited
c. Resource usage was inefficient
After YARN (Hadoop 2.x)
YARN separated:
Resource Management from Data Processing
Now different frameworks can run together:
- MapReduce
- Spark
- Tez
- Flink
on the same Hadoop cluster.
Simple Definition of YARN
YARN is the brain of Hadoop cluster management.
It manages:
- CPU
- Memory
- Job scheduling
- Resource allocation
Real-Life Analogy
Imagine a company:
| Hadoop Component | Real-Life Example |
|---|---|
| ResourceManager | Company Manager |
| NodeManager | Employees |
| ApplicationMaster | Team Leader |
| Containers | Work desks/resources |
The manager assigns resources, employees do work, and team leaders manage specific projects.
Purpose of YARN
YARN is used to:
a. Manage cluster resources efficiently
b. Schedule applications/jobs
c. Support multiple processing engines
d. Improve scalability and performance
YARN Architecture
YARN has 4 main components:
- ResourceManager (RM)
- NodeManager (NM)
- ApplicationMaster (AM)
- Containers
1. ResourceManager (RM)
ResourceManager is the master daemon of YARN.
It manages the entire cluster.
Responsibilities of ResourceManager
- Tracks available resources in cluster
- Allocates CPU and memory
- Schedules jobs
- Monitors applications
Internal Components of RM
A. Scheduler
Allocates resources to jobs.
Scheduling methods:
- FIFO Scheduler
- Capacity Scheduler
- Fair Scheduler
B. ApplicationManager
- Accepts job submissions
- Starts ApplicationMaster
- Monitors application status
Simple Example
Suppose:
- Cluster has 100 GB RAM
- Job A needs 20 GB
- Job B needs 30 GB
ResourceManager decides:
Job A → 20 GB
Job B → 30 GB
and allocates resources accordingly.
Analogy
ResourceManager is like:
Office Boss
who assigns work and resources.
2. NodeManager (NM)
NodeManager runs on every worker node.
It manages resources of that specific machine.
Responsibilities of NodeManager
- Manages CPU and memory of node
- Launches containers
- Monitors task execution
- Sends reports to ResourceManager
Example
Suppose one node has:
16 GB RAM
8 CPU cores
NodeManager tracks how much is used and available.
Analogy
NodeManager is like:
Employee/Supervisor on each machine
3. ApplicationMaster (AM)
Each application/job gets its own ApplicationMaster.
It manages that specific job.
Responsibilities of ApplicationMaster
- Requests resources from RM
- Monitors tasks
- Handles retries if task fails
- Tracks job progress
Important Point
Every application has a separate AM.
Examples:
- Spark job → Spark AM
- MapReduce job → MR AM
Example
Suppose a Spark application needs:
- 5 containers
- 10 GB memory
ApplicationMaster requests these resources from RM.
Analogy
ApplicationMaster is like:
Project Team Leader
4. Containers
Containers are resource units allocated by YARN.
A container includes:
- CPU
- Memory
- Disk resources
Example
A task may require:
2 CPU cores + 4 GB RAM
YARN creates a container with these resources.
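In code, this request is expressed with the YARN client API. A minimal sketch, assuming the YARN client libraries are on the classpath (the priority value is illustrative; a real ApplicationMaster would hand this request to its AMRMClient):
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerAsk {
    public static void main(String[] args) {
        // 4 GB of memory (in MB) and 2 virtual cores, as in the example above
        Resource capability = Resource.newInstance(4096, 2);

        // No node/rack preference (nulls), priority 1
        ContainerRequest ask = new ContainerRequest(
            capability, null, null, Priority.newInstance(1));
        System.out.println("Container request created: " + ask);
    }
}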
Purpose of Containers
Containers run:
- Map tasks
- Reduce tasks
- Spark executors
- Other application tasks
Analogy
Container is like:
A workspace or desk given to an employee
Complete YARN Workflow
Step 1: User Submits Job
Example:
Run Spark job
Job is submitted to ResourceManager.
Step 2: RM Starts ApplicationMaster
ResourceManager launches ApplicationMaster for that job.
Step 3: AM Requests Resources
AM asks RM:
I need 5 containers
Step 4: RM Allocates Containers
RM allocates resources across nodes.
Step 5: NodeManagers Launch Containers
NodeManagers start containers and execute tasks.
Step 6: Progress Monitoring
Containers report progress to ApplicationMaster.
Step 7: Job Completion
After completion:
- Resources are released
- Containers stop
- RM updates cluster status
Full Workflow Diagram
User
↓
ResourceManager
↓
ApplicationMaster
↓
NodeManagers
↓
Containers Execute Tasks
↓
Results Returned
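From the client side, Step 1 looks roughly like this. A minimal sketch, assuming a running cluster and the YARN client libraries; a real submission would go on to fill in the ApplicationMaster's container launch specification:
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml
        yarnClient.start();

        // Ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        System.out.println("New application id: "
            + app.getNewApplicationResponse().getApplicationId());

        yarnClient.stop();
    }
}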
Practical Example of YARN
Example: Video Processing Company
Suppose a YouTube-like company processes videos.
Cluster Resources
| Node | RAM | CPU |
|---|---|---|
| Node1 | 32 GB | 16 Cores |
| Node2 | 32 GB | 16 Cores |
| Node3 | 64 GB | 32 Cores |
User Runs Spark Job
Task:
Process 1 million videos
What Happens?
Step 1
ResourceManager receives job.
Step 2
ApplicationMaster starts.
Step 3
AM requests:
20 containers
Step 4
RM allocates resources across nodes.
Step 5
NodeManagers launch containers.
Step 6
Spark tasks execute in parallel.
Step 7
Output stored in HDFS.
YARN Advantages
1. Scalability
Supports thousands of nodes.
Very large clusters can work efficiently.
2. Better Resource Utilization
Resources allocated dynamically.
No waste of CPU or memory.
3. Multi-Framework Support
Can run:
- Spark
- MapReduce
- Tez
- Flink
together.
4. Fault Tolerance
If a node fails:
a. Tasks restart automatically
b. Jobs continue running
5. Better Performance
Separates:
Resource Management
AND
Job Execution
which removes bottlenecks.
YARN vs Hadoop 1.x
| Feature | Hadoop 1.x | Hadoop 2.x (YARN) |
|---|---|---|
| Resource Management | Part of MapReduce | Separate via YARN |
| Framework Support | Only MapReduce | Spark, Tez, Flink, etc. |
| Scalability | Limited | Very High |
| Resource Allocation | Fixed Slots | Dynamic Containers |
| Fault Tolerance | Limited | Better Automatic Recovery |
Important Terms Summary
| Term | Meaning |
|---|---|
| YARN | Resource manager of Hadoop |
| RM | Master resource controller |
| NM | Worker node manager |
| AM | Per-application manager |
| Container | CPU + Memory resource unit |
Easy Memory Trick
YARN Components
RM → Gives resources
NM → Runs resources
AM → Manages application
Container → Executes tasks
19. MapReduce Programming
MapReduce is the main data processing model in Hadoop.
It is used to process very large datasets across many computers in a Hadoop cluster.
MapReduce works in a:
- Parallel way
- Fault-tolerant way
- Scalable way
Simple Definition
MapReduce divides a big job into two parts:
- Map → Filtering, transforming, sorting
- Reduce → Combining and summarizing
Real-Life Example
Imagine a teacher wants to count how many times each word appears in 10,000 exam papers.
Instead of counting them alone:
- Different students count words on different papers (Map phase)
- Final counts are combined (Reduce phase)
This is exactly how MapReduce works.
Main Components of MapReduce
There are 3 important stages:
- Map Function
- Shuffle and Sort
- Reduce Function
1. Map Function
Purpose
The Mapper reads input data and produces intermediate key-value pairs.
Input and Output
| Input | Output |
|---|---|
| Raw data | Key-value pairs |
Example: Word Count Problem
Suppose input file contains:
Hadoop is big data
Hadoop is scalable
Mapper Processing
The Mapper reads the input line by line, splits it into words, and emits:
(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Hadoop,1)
(is,1)
(scalable,1)
Explanation
Each occurrence of a word becomes:
(word,1)
because the Mapper emits a count of 1 for every occurrence.
2. Shuffle and Sort
This is the middle phase of MapReduce.
Purpose
Shuffle and Sort:
a. Groups same keys together
b. Sorts data for reducer
Example
Input from Mapper:
(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Hadoop,1)
(is,1)
(scalable,1)
After Shuffle and Sort:
(Hadoop,[1,1])
(big,[1])
(data,[1])
(is,[1,1])
(scalable,[1])
Meaning
All values of the same key are grouped together.
3. Reduce Function
Reducer combines all values for each key.
Example
Reducer receives:
(Hadoop,[1,1])
Reducer adds:
1 + 1 = 2
Final Output:
(Hadoop,2)
(big,1)
(data,1)
(is,2)
(scalable,1)
Complete MapReduce Workflow
Input Data
↓
Map Phase
↓
Shuffle & Sort
↓
Reduce Phase
↓
Final Output in HDFS
Step-by-Step Workflow
Step 1: User Submits Job
Example:
Run Word Count Program
Step 2: Hadoop Splits Input Data
Large files are divided into blocks.
Example:
1 GB file → 8 blocks of 128 MB each
Each block goes to different Mapper.
Step 3: Mapper Executes
Each mapper processes records independently.
Produces intermediate key-value pairs.
Step 4: Shuffle and Sort
Hadoop automatically:
- Groups same keys
- Sorts keys
Step 5: Reducer Executes
Reducer aggregates values.
Example:
(Hadoop,[1,1,1]) → (Hadoop,3)
Step 6: Output Stored
Final results stored in HDFS.
Complete Real-Life Example
Example: Counting Product Sales
Suppose an e-commerce company has sales records:
Laptop
Mobile
Laptop
Tablet
Mobile
Laptop
Mapper Output
(Laptop,1)
(Mobile,1)
(Laptop,1)
(Tablet,1)
(Mobile,1)
(Laptop,1)
Shuffle and Sort
(Laptop,[1,1,1])
(Mobile,[1,1])
(Tablet,[1])
Reducer Output
(Laptop,3)
(Mobile,2)
(Tablet,1)
Java MapReduce Program Structure
A MapReduce program mainly has:
- Mapper Class
- Reducer Class
- Driver Class
1. Mapper Class
Mapper defines Map logic.
Example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit (word, 1) for each one
        String[] words = value.toString().split(" ");
        for (String w : words) {
            word.set(w);
            context.write(word, one);
        }
    }
}
What This Mapper Does
For every word:
(word,1)
is generated.
2. Reducer Class
Reducer combines counts.
Example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s emitted for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
What Reducer Does
Adds all counts for each word.
Example:
(Hadoop,[1,1,1]) → (Hadoop,3)
3. Driver Class
Driver configures and starts the job.
Example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] = input path in HDFS, args[1] = output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
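Usage note: the three classes are typically packaged into one jar and launched from the command line, for example (jar name and paths are illustrative):
hadoop jar wordcount.jar WordCountDriver /input /output
Hadoop then runs the Mapper on each input split, shuffles and sorts, and runs the Reducer, exactly as described in the workflow above.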
Features of MapReduce
1. Parallel Processing
Multiple tasks run simultaneously on different nodes.
2. Fault Tolerance
If one node fails:
Hadoop automatically reruns task on another node.
3. Scalability
Can process:
TBs → PBs of data
using thousands of machines.
4. Data Locality
Mapper runs near the data block.
This reduces network traffic.
5. High Throughput
Optimized for large-scale batch processing.
Advantages of MapReduce
a. Handles massive data efficiently
b. Highly scalable
c. Reliable and fault tolerant
d. Works well with HDFS
Limitations of MapReduce
a. High latency
b. Slow for real-time processing
c. Complex Java coding
d. Not efficient for iterative ML algorithms
e. Poor handling of many small files
Why Spark Became Popular
MapReduce writes intermediate data to disk repeatedly.
Spark keeps data in memory.
So Spark is much faster for:
- Machine Learning
- Real-time analytics
- Graph processing
MapReduce Use Cases
| Use Case | Example |
|---|---|
| Word Count | Text analysis |
| Log Analysis | Website logs |
| ETL | Data transformation |
| Analytics | Aggregation and grouping |
| Preprocessing | Machine learning datasets |
Hadoop Ecosystem Relation
| Tool | Purpose |
|---|---|
| HDFS | Storage |
| YARN | Resource management |
| MapReduce | Processing |
| Hive | SQL queries |
| Pig | ETL scripting |
| Spark | Fast processing |
Simple Memory Trick
MapReduce Formula
Map → Break & Transform
Shuffle → Group
Reduce → Combine