Big Data Notes


All Topics (19)

  • 1. What is Big Data?
  • 2. Big Data Characteristics
  • 3. Types of Big Data
  • 4. Traditional Data vs Big Data
  • 5. Evolution of Big Data
  • 6. Challenges with Big Data
  • 7. Technologies Available for Big Data
  • 8. Infrastructure for Big Data
  • 9. Uses of Data Analytics
  • 10. Hadoop
  • 11. Hadoop Core Components
  • 12. Hadoop Ecosystem
  • 13. Hive Physical Architecture
  • 14. Hadoop Limitations
  • 15. RDBMS vs Hadoop
  • 16. Hadoop Distributed File System (HDFS)
  • 17. Processing Data with Hadoop
  • 18. Hadoop YARN
  • 19. MapReduce Programming

16. Hadoop Distributed File System (HDFS)

HDFS is the primary storage system used in Apache Hadoop.

It is specially designed to:

  • store very large datasets
  • work across multiple machines
  • provide fault tolerance
  • support Big Data processing

HDFS is inspired by:

Google File System (GFS)

and is one of the most important components of the Hadoop ecosystem.

What is HDFS?

HDFS stands for:

Hadoop Distributed File System

It stores huge files by:

  1. Splitting files into blocks
  2. Distributing blocks across many computers (DataNodes)

This makes storage:

  • scalable
  • reliable
  • fault tolerant

Key Features of HDFS

1. Distributed Storage

Explanation

HDFS divides large files into smaller blocks and stores them on different nodes in the cluster.

Example

Suppose a file size is:

1 TB

HDFS splits it into:

128 MB blocks

and distributes blocks across many machines.

2. Fault Tolerance

Explanation

Each block is replicated multiple times.

Default replication factor:

3 copies

If one node fails, data is still available from another node.

Example

Suppose:

  • Block A stored on Node1
  • Replica stored on Node2 and Node3

If Node1 crashes:

  • Hadoop reads block from Node2 or Node3.

3. Scalability

Explanation

HDFS supports horizontal scaling.

You can increase storage by simply adding more DataNodes.

Example

A company storing:

5 PB customer data

can expand storage by adding more machines to the cluster.

4. High Throughput

Explanation

HDFS is optimized for:

  • large sequential reads
  • large batch writes

It is not optimized for:

  • small random reads/writes

Example

Processing:

100 TB log files

is very efficient in HDFS.

5. Cost-Effective

Explanation

HDFS works on:

commodity hardware

which means low-cost ordinary servers can be used.

This reduces infrastructure cost.

6. Flexibility

Explanation

HDFS can store:

  • structured data
  • semi-structured data
  • unstructured data

Examples

HDFS can store:

  • CSV files
  • JSON logs
  • videos
  • images
  • social media data

HDFS Architecture

HDFS follows:

Master-Slave Architecture

Main Components:

  1. NameNode (Master)
  2. DataNode (Slave)
  3. Secondary NameNode

HDFS Architecture Diagram (Conceptual)

             Client
               |
            NameNode
           /   |   \
   DataNode DataNode DataNode
 

1. NameNode

Role

The NameNode is the:

Master Server

It manages metadata of HDFS.

Responsibilities of NameNode

The NameNode stores:

  • file names
  • directory structure
  • block locations
  • permissions

It also:

  • tracks DataNodes
  • manages cluster health
  • handles file operations

Example

Suppose file:

sales_data.csv

is divided into blocks.

NameNode stores information like:

Block1 → DataNode1
Block2 → DataNode5
Block3 → DataNode2
 

Important Note

If NameNode fails:

HDFS becomes inaccessible

because metadata is unavailable.

Modern Hadoop uses:

  • High Availability (HA)
  • Standby NameNode

to reduce this problem.

2. DataNode

Role

DataNodes are:

Slave Nodes

that store actual data blocks.

Responsibilities

DataNodes:

  • store data blocks
  • read/write data
  • send heartbeat signals
  • perform replication

Heartbeat Mechanism

Each DataNode regularly sends:

heartbeat messages

to NameNode.

If heartbeat stops:

  • NameNode assumes node failure.

Example

Suppose:

  • DataNode2 crashes

NameNode automatically creates another replica on another node.
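
The heartbeat check can be sketched conceptually in Python (an illustration only, not Hadoop code; the timeout value is an assumption loosely based on HDFS defaults):

import time

# Assumed timeout; real HDFS derives it from dfs.heartbeat.interval
# and dfs.namenode.heartbeat.recheck-interval (roughly 10.5 minutes).
HEARTBEAT_TIMEOUT = 630  # seconds

# Last heartbeat time reported by each DataNode (illustrative values)
last_heartbeat = {
    "DataNode1": time.time(),
    "DataNode2": time.time() - 700,
}

def find_dead_nodes(heartbeats, timeout=HEARTBEAT_TIMEOUT):
    """Return nodes whose last heartbeat is older than the timeout."""
    now = time.time()
    return [node for node, ts in heartbeats.items() if now - ts > timeout]

print(find_dead_nodes(last_heartbeat))
# ['DataNode2'] -> NameNode would schedule re-replication of its blocks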

3. Secondary NameNode

Role

The Secondary NameNode assists the NameNode.

Important:

It is NOT a backup NameNode
 

Functions

It:

  • merges edit logs
  • creates checkpoints
  • reduces NameNode restart time

Example

Over time:

  • NameNode metadata grows large

Secondary NameNode periodically merges:

  • FSImage
  • Edit Logs

to optimize metadata management.

HDFS File Storage Mechanism

Step 1: File Splitting

Large files are divided into blocks.

Default block size:

128 MB
 

Step 2: Replication

Each block is copied multiple times.

Default replication:

3 replicas
 

Step 3: Metadata Management

NameNode stores:

  • block information
  • DataNode locations

Step 4: Data Storage

Actual blocks are stored on DataNodes.

Example

Suppose file size:

1 TB

Number of blocks:

~8000 blocks

With replication factor 3:

Total storage needed ≈ 3 TB

distributed across cluster nodes.
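
A quick sanity check of these numbers (simple arithmetic in Python, assuming the default 128 MB block size and replication factor 3):

import math

file_size_mb = 1 * 1024 * 1024   # 1 TB expressed in MB
block_size_mb = 128              # default HDFS block size
replication = 3                  # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_tb = (file_size_mb * replication) / (1024 * 1024)

print(blocks)          # 8192 blocks (roughly 8000)
print(raw_storage_tb)  # 3.0 TB of raw cluster storage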

HDFS Read Process

Step-by-Step

Step 1

Client requests file from NameNode.

Step 2

NameNode provides block locations.

Example:

Block1 → DataNode3
Block2 → DataNode7
 

Step 3

Client directly reads blocks from DataNodes.

Blocks can be read in parallel.

Example

Reading:

100 GB video file

becomes faster because multiple nodes serve data simultaneously.
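
The parallel read idea can be illustrated with a small Python sketch (purely conceptual; fetch_block is a placeholder, not a real HDFS client call):

from concurrent.futures import ThreadPoolExecutor

# Block locations as returned by the NameNode (illustrative values)
block_locations = {
    "Block1": "DataNode3",
    "Block2": "DataNode7",
}

def fetch_block(item):
    """Placeholder for reading one block directly from a DataNode."""
    block, node = item
    return f"{block} read from {node}"

# Read all blocks in parallel, one thread per block
with ThreadPoolExecutor(max_workers=len(block_locations)) as pool:
    for result in pool.map(fetch_block, block_locations.items()):
        print(result)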

HDFS Write Process

Step-by-Step

Step 1

Client requests write operation from NameNode.

Step 2

NameNode selects DataNodes.

Step 3

Client writes block to first DataNode.

Step 4

Block is replicated to other DataNodes.

Step 5

DataNodes confirm successful storage.

Example

Suppose replication factor:

3

Data flow:

Client → DataNode1 → DataNode2 → DataNode3
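
The same pipeline can be sketched as a simple chain in Python (conceptual only; write_block stands in for the block transfer and acknowledgement flow):

def write_block(block, pipeline):
    """Store the block on the first node, then forward it down the chain."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    print(f"{head} stores {block}")
    # Acknowledgements travel back up the chain once all replicas are written
    return write_block(block, rest) + [head]

acks = write_block("Block1", ["DataNode1", "DataNode2", "DataNode3"])
print("Acks received (last replica first):", acks)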
 

Advantages of HDFS

1. Fault Tolerance

Automatic replication protects against node failure.

2. High Throughput

Efficient for Big Data batch processing.

3. Scalability

Storage grows by adding more nodes.

4. Cost-Effective

Uses low-cost hardware.

5. Supports Multiple Data Types

Handles:

  • structured
  • semi-structured
  • unstructured data

Limitations of HDFS

1. Not Suitable for Real-Time Processing

HDFS is batch-oriented.

Not ideal for:

  • low latency
  • instant querying

2. Small File Problem

Millions of small files overload NameNode memory.

Example

Suppose:

10 million 1 KB files

Metadata storage becomes a problem.
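
A rough back-of-the-envelope calculation shows why (it assumes the commonly quoted figure of about 150 bytes of NameNode heap per file or block object; the exact number varies by version):

files = 10_000_000        # 10 million small files
bytes_per_object = 150    # rough rule of thumb per metadata object

# Each tiny file costs at least one file object plus one block object
objects = files * 2
heap_gb = objects * bytes_per_object / (1024 ** 3)

print(round(heap_gb, 1))  # ~2.8 GB of NameNode heap just for metadata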

3. Single NameNode Dependency

Traditional HDFS depends heavily on NameNode.

Failure can stop the cluster.

4. High Storage Usage

Replication consumes extra storage.

Example:

1 TB data with replication factor 3 = 3 TB storage

5. Limited Transaction Support

HDFS is not fully ACID compliant.

Not suitable for:

  • banking systems
  • OLTP applications

Real-Life Example of HDFS

Suppose YouTube stores:

  • videos
  • user logs
  • comments

Data size:

Petabytes of data

HDFS helps by:

  • distributing files across many servers
  • replicating data
  • enabling parallel processing

If one server fails:

  • videos are still available from replicas.

Key Takeaways

Important Points

  • HDFS is the backbone of Hadoop storage.
  • Uses distributed architecture.
  • NameNode manages metadata.
  • DataNodes store actual blocks.
  • Files are split into blocks and replicated.
  • Optimized for:
    • scalability
    • fault tolerance
    • high throughput

17. Processing Data with Hadoop

Hadoop is not just a storage system (HDFS); it is also a complete platform for processing very large amounts of data.
It processes data across many computers in parallel, which makes it:

  • Fast
  • Scalable
  • Fault-tolerant

Hadoop uses different tools/frameworks for different types of processing.

Hadoop Data Processing Flow

General Flow

Data Source → Ingestion → HDFS Storage → Processing Engine → Output/Analysis

Example

Suppose a company collects:

  • Website logs
  • Customer data
  • Sales records
  • Social media data

This data is first stored in HDFS, then processed using tools like:

  • MapReduce
  • Hive
  • Pig
  • Spark

1. MapReduce – Core Hadoop Processing Framework

MapReduce is the original processing engine of Hadoop.

It processes huge datasets in parallel across many machines.

How MapReduce Works

Step 1: Input Splitting

HDFS divides large files into blocks (usually 128 MB).

Example

A 1 GB file is divided into:

128 MB + 128 MB + ... (8 blocks in total)

Each block is processed by a separate Mapper.

Step 2: Map Phase

Mapper reads data and produces key-value pairs.

Example: Word Count

Input text:

Hadoop is big data
Big data is powerful

Mapper Output:

(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Big,1)
(data,1)
(is,1)
(powerful,1)
 

Step 3: Shuffle and Sort

Hadoop groups same keys together.

Output becomes:

(Hadoop,[1])
(is,[1,1])
(data,[1,1])
(big,[1])
(Big,[1])
(powerful,[1])
 

Step 4: Reduce Phase

Reducer combines values.

Final Output:

(Hadoop,1)
(is,2)
(data,2)
(big,1)
(Big,1)
(powerful,1)
 

Step 5: Output Storage

Final result is stored back in HDFS.
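
The whole Map, Shuffle and Reduce pipeline can be simulated in a few lines of Python (a toy illustration of the model, not Hadoop code):

from collections import defaultdict

lines = ["Hadoop is big data", "Big data is powerful"]

# Map phase: emit (word, 1) for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group all values belonging to the same key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: combine the values for each key
reduced = {key: sum(values) for key, values in grouped.items()}

print(reduced)
# {'Hadoop': 1, 'is': 2, 'big': 1, 'data': 2, 'Big': 1, 'powerful': 1}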

Advantages of MapReduce

a. Highly scalable
b. Fault tolerant
c. Handles huge data

Disadvantages

a. Slow for real-time work
b. Complex coding

2. Apache Hive – SQL-Like Processing

Hive allows users to write SQL-like queries on Hadoop data.

Instead of writing Java MapReduce code, analysts can use simple queries.

Hive Workflow

  1. User writes HiveQL query
  2. Hive converts it into MapReduce/Spark jobs
  3. Hadoop executes the job
  4. Results are returned

Example of Hive Query

Suppose we have a table named employees.

Query

SELECT department, COUNT(*)
FROM employees
GROUP BY department;

What Hive Does

Hive automatically:

  • Reads data from HDFS
  • Converts query into processing jobs
  • Executes on cluster
  • Returns result

Real-Life Example

An e-commerce company wants:

Total sales by city

Using Hive:

SELECT city, SUM(sales)
FROM orders
GROUP BY city;

Very easy compared to MapReduce coding.

Advantages of Hive

a. Easy for SQL users
b. No need to write Java code
c. Good for reporting

Disadvantages

a. Slower for real-time applications

3. Apache Pig – ETL and Data Transformation

Pig uses a scripting language called Pig Latin.

It is mainly used for:

  • ETL (Extract, Transform, Load)
  • Cleaning data
  • Transforming data

Pig Workflow

  1. Write Pig script
  2. Pig converts it into MapReduce jobs
  3. Hadoop executes jobs
  4. Results stored in HDFS

Example of Pig Script

Suppose sales data contains:

id,name,sales
1,Ram,5000
2,Shyam,7000

Pig Script

A = LOAD 'sales.csv' USING PigStorage(',')
AS (id:int, name:chararray, sales:int);

B = FILTER A BY sales > 6000;

STORE B INTO 'output';

Result

2,Shyam,7000
 

Advantages of Pig

a. Easy data transformation
b. Less coding
c. Good for ETL jobs

Disadvantages

a. Not suitable for real-time processing

4. Apache Spark – Fast In-Memory Processing

Spark is much faster than MapReduce because it processes data in memory.

It supports:

  • Batch processing
  • Real-time processing
  • Machine learning
  • Graph processing

Spark Workflow

  1. Load data from HDFS/Hive/Kafka
  2. Apply transformations
  3. Execute tasks in parallel
  4. Save output

Example Using Spark

Suppose we want to count words.

PySpark Example

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Each line of the file becomes a row; split every line into one word per row
data = spark.read.text("input.txt")
words = data.selectExpr("explode(split(value, ' ')) as word")

# Count how many times each word appears
result = words.groupBy("word").count()
result.show()

Why Spark is Fast

MapReduce writes intermediate results to disk.

Spark keeps data in RAM (memory).

So processing becomes much faster.

Advantages of Spark

a. Very fast
b. Real-time processing
c. Supports Python, Java, Scala, R
d. Good for Machine Learning

Disadvantages

a. Requires more RAM

5. Real-Time and Streaming Tools

Hadoop ecosystem also supports streaming data.

Apache Flume

Used to collect log data and move it into HDFS.

Example

Web server logs → Flume → HDFS

Apache Kafka

Distributed messaging system for streaming data.

Example

Twitter messages streamed into Hadoop.

Spark Streaming / Storm

Processes live streaming data.
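
A minimal PySpark Structured Streaming sketch for this kind of pipeline (illustrative only; the broker address and topic name are assumptions, and the Kafka connector package must be available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamDemo").getOrCreate()

# Read a live stream of messages from a Kafka topic
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # assumed address
          .option("subscribe", "tweets")                     # assumed topic
          .load())

# Running count of messages per key, printed to the console
query = (stream.groupBy("key").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()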

Example Use Cases

  • Fraud detection
  • IoT sensor monitoring
  • Live analytics
  • Stock market analysis

Complete Hadoop Ecosystem Example

Example: Online Shopping Website

Step 1: Data Generation

Users generate:

  • Search logs
  • Purchase records
  • Clickstream data

Step 2: Data Ingestion

Tools like:

  • Flume
  • Kafka

collect data.

Step 3: Storage

Data stored in HDFS.

Step 4: Processing

Different tools used:

Task                | Tool
Batch report        | MapReduce
SQL analytics       | Hive
Data cleaning       | Pig
Real-time analytics | Spark

Step 5: Output

Results used for:

  • Dashboards
  • Reports
  • Recommendations
  • Alerts

Best Practices in Hadoop Processing

1. Use Large Files

Small files increase NameNode overhead.

Good practice:

Use files larger than 128 MB

2. Partition Hive Tables

Improves query speed.

Example

Partition by date:

sales/year=2026/month=05/
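
With Spark, for example, this layout can be produced when writing the data (a sketch; orders_df and the output path are assumed):

# Write sales data partitioned by year and month,
# creating directories like sales/year=2026/month=5/
(orders_df.write
    .partitionBy("year", "month")
    .mode("overwrite")
    .parquet("sales/"))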

3. Choose Correct Tool

Requirement          | Best Tool
Batch Processing     | MapReduce
SQL Queries          | Hive
ETL/Data Cleaning    | Pig
Real-Time Processing | Spark

4. Compress Data

Compression improves:

  • Storage efficiency
  • Faster I/O

Formats:

  • Snappy (compression codec)
  • Gzip (compression codec)
  • Parquet (columnar file format with built-in compression)
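
For example, Spark can write Snappy-compressed Parquet output in one line (a sketch; df and the output path are assumed):

# Snappy-compressed Parquet: smaller files, faster scans
df.write.option("compression", "snappy").parquet("output/compressed_data")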

5. Monitor with YARN

YARN manages cluster resources.

It helps avoid:

  • Memory issues
  • Task failures

Simple Comparison Table

Feature   | MapReduce         | Hive          | Pig            | Spark
Type      | Processing Engine | SQL Tool      | Scripting Tool | Fast Processing Engine
Coding    | Complex           | Easy SQL      | Simple Scripts | Moderate
Speed     | Slow              | Medium        | Medium         | Fast
Real-Time | No                | No            | No             | Yes
Best Use  | Batch Jobs        | Data Analysis | ETL            | Real-Time + ML

 

18. Hadoop YARN

YARN (Yet Another Resource Negotiator) is the resource management and job scheduling system in Hadoop.

It was introduced in Hadoop 2.x to solve the limitations of Hadoop 1.x.

Before YARN (Hadoop 1.x)

In Hadoop 1.x:

  • MapReduce handled:
    • Data processing
    • Resource management

Because of this:

a. Only MapReduce jobs could run
b. Cluster scalability was limited
c. Resource usage was inefficient

After YARN (Hadoop 2.x)

YARN separated:

Resource Management ≠ Data Processing

Now different frameworks can run together:

  • MapReduce
  • Spark
  • Tez
  • Flink

on the same Hadoop cluster.

Simple Definition of YARN

YARN is the brain of Hadoop cluster management.

It manages:

  • CPU
  • Memory
  • Job scheduling
  • Resource allocation

Real-Life Analogy

Imagine a company:

Hadoop Component  | Real-Life Example
ResourceManager   | Company Manager
NodeManager       | Employees
ApplicationMaster | Team Leader
Containers        | Work desks/resources

The manager assigns resources, employees do work, and team leaders manage specific projects.

Purpose of YARN

YARN is used to:

a. Manage cluster resources efficiently
b. Schedule applications/jobs
c. Support multiple processing engines
d. Improve scalability and performance

YARN Architecture

YARN has 4 main components:

  1. ResourceManager (RM)
  2. NodeManager (NM)
  3. ApplicationMaster (AM)
  4. Containers

1. ResourceManager (RM)

ResourceManager is the master daemon of YARN.

It manages the entire cluster.

Responsibilities of ResourceManager

  • Tracks available resources in cluster
  • Allocates CPU and memory
  • Schedules jobs
  • Monitors applications

Internal Components of RM

A. Scheduler

Allocates resources to jobs.

Scheduling methods:

  • FIFO Scheduler
  • Capacity Scheduler
  • Fair Scheduler

B. ApplicationManager

  • Accepts job submissions
  • Starts ApplicationMaster
  • Monitors application status

Simple Example

Suppose:

  • Cluster has 100 GB RAM
  • Job A needs 20 GB
  • Job B needs 30 GB

ResourceManager decides:

Job A → 20 GB
Job B → 30 GB

and allocates resources accordingly.

Analogy

ResourceManager is like:

Office Boss

who assigns work and resources.

2. NodeManager (NM)

NodeManager runs on every worker node.

It manages resources of that specific machine.

Responsibilities of NodeManager

  • Manages CPU and memory of node
  • Launches containers
  • Monitors task execution
  • Sends reports to ResourceManager

Example

Suppose one node has:

16 GB RAM
8 CPU cores

NodeManager tracks how much is used and available.

Analogy

NodeManager is like:

Employee/Supervisor on each machine

3. ApplicationMaster (AM)

Each application/job gets its own ApplicationMaster.

It manages that specific job.

Responsibilities of ApplicationMaster

  • Requests resources from RM
  • Monitors tasks
  • Handles retries if task fails
  • Tracks job progress

Important Point

Every application has a separate AM.

Examples:

  • Spark job → Spark AM
  • MapReduce job → MR AM

Example

Suppose a Spark application needs:

  • 5 containers
  • 10 GB memory

ApplicationMaster requests these resources from RM.

Analogy

ApplicationMaster is like:

Project Team Leader

4. Containers

Containers are resource units allocated by YARN.

A container includes:

  • CPU
  • Memory
  • Disk resources

Example

A task may require:

2 CPU cores + 4 GB RAM

YARN creates a container with these resources.

Purpose of Containers

Containers run:

  • Map tasks
  • Reduce tasks
  • Spark executors
  • Other application tasks

Analogy

Container is like:

A workspace or desk given to an employee
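
When a Spark job runs on YARN, for instance, the container size follows the application's resource settings (a sketch; the values mirror the example above and are assumptions):

from pyspark.sql import SparkSession

# Each Spark executor runs inside a YARN container
# sized according to these settings
spark = (SparkSession.builder
         .appName("ContainerDemo")
         .config("spark.executor.cores", "2")    # 2 CPU cores per container
         .config("spark.executor.memory", "4g")  # 4 GB RAM per container
         .getOrCreate())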

Complete YARN Workflow

Step 1: User Submits Job

Example:

Run Spark job

Job is submitted to ResourceManager.

Step 2: RM Starts ApplicationMaster

ResourceManager launches ApplicationMaster for that job.

Step 3: AM Requests Resources

AM asks RM:

I need 5 containers
 

Step 4: RM Allocates Containers

RM allocates resources across nodes.

Step 5: NodeManagers Launch Containers

NodeManagers start containers and execute tasks.

Step 6: Progress Monitoring

Containers report progress to ApplicationMaster.

Step 7: Job Completion

After completion:

  • Resources are released
  • Containers stop
  • RM updates cluster status

Full Workflow Diagram

User
  ↓
ResourceManager
  ↓
ApplicationMaster
  ↓
NodeManagers
  ↓
Containers Execute Tasks
  ↓
Results Returned
 

Practical Example of YARN

Example: Video Processing Company

Suppose YouTube-like company processes videos.

Cluster Resources

Node  | RAM   | CPU
Node1 | 32 GB | 16 Cores
Node2 | 32 GB | 16 Cores
Node3 | 64 GB | 32 Cores

User Runs Spark Job

Task:

Process 1 million videos

What Happens?

Step 1

ResourceManager receives job.

Step 2

ApplicationMaster starts.

Step 3

AM requests:

20 containers

Step 4

RM allocates resources across nodes.

Step 5

NodeManagers launch containers.

Step 6

Spark tasks execute in parallel.

Step 7

Output stored in HDFS.

YARN Advantages

1. Scalability

Supports thousands of nodes.

Very large clusters can work efficiently.

2. Better Resource Utilization

Resources allocated dynamically.

No waste of CPU or memory.

3. Multi-Framework Support

Can run:

  • Spark
  • MapReduce
  • Tez
  • Flink

together.

4. Fault Tolerance

If a node fails:

a. Tasks restart automatically
b. Jobs continue running

5. Better Performance

Separates:

Resource Management
AND
Job Execution

which removes bottlenecks.

YARN vs Hadoop 1.x

Feature             | Hadoop 1.x        | Hadoop 2.x (YARN)
Resource Management | Part of MapReduce | Separate via YARN
Framework Support   | Only MapReduce    | Spark, Tez, Flink, etc.
Scalability         | Limited           | Very High
Resource Allocation | Fixed Slots       | Dynamic Containers
Fault Tolerance     | Limited           | Better Automatic Recovery

Important Terms Summary

Term      | Meaning
YARN      | Resource manager of Hadoop
RM        | Master resource controller
NM        | Worker node manager
AM        | Per-application manager
Container | CPU + Memory resource unit

Easy Memory Trick

YARN Components

RM → Gives resources
NM → Runs resources
AM → Manages application
Container → Executes tasks

19. MapReduce Programming

MapReduce is the main data processing model in Hadoop.

It is used to process very large datasets across many computers in a Hadoop cluster.

MapReduce works in a:

  • Parallel way
  • Fault-tolerant way
  • Scalable way

Simple Definition

MapReduce divides a big job into two parts:

  1. Map → Filtering, transforming, sorting
  2. Reduce → Combining and summarizing

Real-Life Example

Imagine a teacher wants to count how many times each word appears in 10,000 exam papers.

Instead of checking alone:

  • Different students count words on different papers (Map phase)
  • Final counts are combined (Reduce phase)

This is exactly how MapReduce works.

Main Components of MapReduce

There are 3 important stages:

  1. Map Function
  2. Shuffle and Sort
  3. Reduce Function

1. Map Function

Purpose

The Mapper reads input data and produces intermediate key-value pairs.

Input and Output

Input    | Output
Raw data | Key-value pairs

Example: Word Count Problem

Suppose input file contains:

Hadoop is big data
Hadoop is scalable

Mapper Processing

The Mapper reads the input line by line and emits a key-value pair for each word:

(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Hadoop,1)
(is,1)
(scalable,1)

Explanation

Each word becomes:

(word,1)

because each occurrence of a word is counted as 1; repeated words are combined later in the Reduce phase.

2. Shuffle and Sort

This is the middle phase of MapReduce.

Purpose

Shuffle and Sort:

a. Groups same keys together
b. Sorts data for reducer

Example

Input from Mapper:

(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Hadoop,1)
(is,1)
(scalable,1)

After Shuffle and Sort:

(Hadoop,[1,1])
(is,[1,1])
(big,[1])
(data,[1])
(scalable,[1])
 

Meaning

All values of the same key are grouped together.

3. Reduce Function

Reducer combines all values for each key.

Example

Reducer receives:

(Hadoop,[1,1])

Reducer adds:

1 + 1 = 2

Final Output:

(Hadoop,2)
(is,2)
(big,1)
(data,1)
(scalable,1)
 

Complete MapReduce Workflow

Input Data

Map Phase

Shuffle & Sort

Reduce Phase

Final Output in HDFS
 

Step-by-Step Workflow

Step 1: User Submits Job

Example:

Run Word Count Program
 

Step 2: Hadoop Splits Input Data

Large files are divided into blocks.

Example:

1 GB file → 8 blocks of 128 MB each

Each block goes to a different Mapper.

Step 3: Mapper Executes

Each mapper processes records independently.

Produces intermediate key-value pairs.

Step 4: Shuffle and Sort

Hadoop automatically:

  • Groups same keys
  • Sorts keys

Step 5: Reducer Executes

Reducer aggregates values.

Example:

(Hadoop,[1,1,1]) → (Hadoop,3)
 

Step 6: Output Stored

Final results stored in HDFS.

Complete Real-Life Example

Example: Counting Product Sales

Suppose an e-commerce company has sales records:

Laptop
Mobile
Laptop
Tablet
Mobile
Laptop
 

Mapper Output

(Laptop,1)
(Mobile,1)
(Laptop,1)
(Tablet,1)
(Mobile,1)
(Laptop,1)
 

Shuffle and Sort

(Laptop,[1,1,1])
(Mobile,[1,1])
(Tablet,[1])
 

Reducer Output

(Laptop,3)
(Mobile,2)
(Tablet,1)
 

Java MapReduce Program Structure

A MapReduce program mainly has:

  1. Mapper Class
  2. Reducer Class
  3. Driver Class

1. Mapper Class

Mapper defines Map logic.

Example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit (word, 1) for each one
        String[] words = value.toString().split(" ");
        for (String w : words) {
            word.set(w);
            context.write(word, one);
        }
    }
}
 

What This Mapper Does

For every word:

(word,1)

is generated.

2. Reducer Class

Reducer combines counts.

Example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all counts received for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
 

What Reducer Does

Adds all counts for each word.

Example:

(Hadoop,[1,1,1]) → (Hadoop,3)
 

3. Driver Class

Driver configures and starts the job.

Example

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are passed as command-line arguments
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
 

Features of MapReduce

1. Parallel Processing

Multiple tasks run simultaneously on different nodes.

2. Fault Tolerance

If one node fails:

Hadoop automatically reruns the task on another node.

3. Scalability

Can process:

TBs → PBs of data

using thousands of machines.

4. Data Locality

Mapper runs near the data block.

This reduces network traffic.

5. High Throughput

Optimized for large-scale batch processing.

Advantages of MapReduce

a. Handles massive data efficiently
b. Highly scalable
c. Reliable and fault tolerant
d. Works well with HDFS

Limitations of MapReduce

a. High latency
b. Slow for real-time processing
c. Complex Java coding
d. Not efficient for iterative ML algorithms
e. Poor handling of many small files

Why Spark Became Popular

MapReduce writes intermediate data to disk repeatedly.

Spark keeps data in memory.

So Spark is much faster for:

  • Machine Learning
  • Real-time analytics
  • Graph processing

MapReduce Use Cases

Use Case      | Example
Word Count    | Text analysis
Log Analysis  | Website logs
ETL           | Data transformation
Analytics     | Aggregation and grouping
Preprocessing | Machine learning datasets

Hadoop Ecosystem Relation

Tool      | Purpose
HDFS      | Storage
YARN      | Resource management
MapReduce | Processing
Hive      | SQL queries
Pig       | ETL scripting
Spark     | Fast processing

Simple Memory Trick

MapReduce Formula

Map → Break & Transform
Shuffle → Group
Reduce → Combine