Big Data Notes


All Topics (19)

  • 1. What is Big Data?
  • 2. Big Data Characteristics
  • 3. Types of Big Data
  • 4. Traditional Data vs Big Data
  • 5. Evolution of Big Data
  • 6. Challenges with Big Data
  • 7. Technologies Available for Big Data
  • 8. Infrastructure for Big Data
  • 9. Uses of Data Analytics
  • 10. Hadoop
  • 11. Hadoop Core Components
  • 12. Hadoop Ecosystem
  • 13. Hive Physical Architecture
  • 14. Hadoop Limitations
  • 15. RDBMS vs Hadoop
  • 16. Hadoop Distributed File System (HDFS)
  • 17. Processing Data with Hadoop
  • 18. Hadoop YARN
  • 19. MapReduce Programming

16. Hadoop Distributed File System (HDFS)

HDFS is the primary storage system used in Apache Hadoop.

It is specially designed to:

  • store very large datasets
  • work across multiple machines
  • provide fault tolerance
  • support Big Data processing

HDFS is inspired by:

Google File System (GFS)

and is one of the most important components of the Hadoop ecosystem.

What is HDFS?

HDFS stands for:

Hadoop Distributed File System

It stores huge files by:

  1. Splitting files into blocks
  2. Distributing blocks across many computers (DataNodes)

This makes storage:

  • scalable
  • reliable
  • fault tolerant

Key Features of HDFS

1. Distributed Storage

Explanation

HDFS divides large files into smaller blocks and stores them on different nodes in the cluster.

Example

Suppose a file size is:

1 TB

HDFS splits it into:

128 MB blocks

and distributes blocks across many machines.

2. Fault Tolerance

Explanation

Each block is replicated multiple times.

Default replication factor:

3 copies

If one node fails, data is still available from another node.

Example

Suppose:

  • Block A stored on Node1
  • Replica stored on Node2 and Node3

If Node1 crashes:

  • Hadoop reads block from Node2 or Node3.

3. Scalability

Explanation

HDFS supports horizontal scaling.

You can increase storage by simply adding more DataNodes.

Example

A company storing:

5 PB customer data

can expand storage by adding more machines to the cluster.

4. High Throughput

Explanation

HDFS is optimized for:

  • large sequential reads
  • large batch writes

It is not optimized for:

  • small random reads/writes

Example

Processing:

100 TB log files

is very efficient in HDFS.

5. Cost-Effective

Explanation

HDFS works on:

commodity hardware

which means low-cost ordinary servers can be used.

This reduces infrastructure cost.

6. Flexibility

Explanation

HDFS can store:

  • structured data
  • semi-structured data
  • unstructured data

Examples

HDFS can store:

  • CSV files
  • JSON logs
  • videos
  • images
  • social media data

HDFS Architecture

HDFS follows:

Master-Slave Architecture

Main Components:

  1. NameNode (Master)
  2. DataNode (Slave)
  3. Secondary NameNode

HDFS Architecture Diagram (Conceptual)

             Client
               |
            NameNode
           /   |   \
   DataNode DataNode DataNode
 

1. NameNode

Role

The NameNode is the:

Master Server

It manages metadata of HDFS.

Responsibilities of NameNode

The NameNode stores:

  • file names
  • directory structure
  • block locations
  • permissions

It also:

  • tracks DataNodes
  • manages cluster health
  • handles file operations

Example

Suppose file:

sales_data.csv

is divided into blocks.

NameNode stores information like:

Block1 → DataNode1
Block2 → DataNode5
Block3 → DataNode2
 

Important Note

If NameNode fails:

HDFS becomes inaccessible

because metadata is unavailable.

Modern Hadoop uses:

  • High Availability (HA)
  • Standby NameNode

to reduce this problem.

2. DataNode

Role

DataNodes are:

Slave Nodes

that store actual data blocks.

Responsibilities

DataNodes:

  • store data blocks
  • read/write data
  • send heartbeat signals
  • perform replication

Heartbeat Mechanism

Each DataNode regularly sends:

heartbeat messages

to NameNode.

If heartbeat stops:

  • NameNode assumes node failure.

Example

Suppose:

  • DataNode2 crashes

NameNode automatically creates another replica on another node.
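
The heartbeat check can be sketched conceptually in Python (an illustration only, not Hadoop code; the timeout value is an assumption loosely based on HDFS defaults):

import time

# Assumed timeout; real HDFS derives it from dfs.heartbeat.interval
# and dfs.namenode.heartbeat.recheck-interval (roughly 10.5 minutes).
HEARTBEAT_TIMEOUT = 630  # seconds

# Last heartbeat time reported by each DataNode (illustrative values)
last_heartbeat = {
    "DataNode1": time.time(),
    "DataNode2": time.time() - 700,
}

def find_dead_nodes(heartbeats, timeout=HEARTBEAT_TIMEOUT):
    """Return nodes whose last heartbeat is older than the timeout."""
    now = time.time()
    return [node for node, ts in heartbeats.items() if now - ts > timeout]

print(find_dead_nodes(last_heartbeat))
# ['DataNode2'] -> NameNode would schedule re-replication of its blocks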

3. Secondary NameNode

Role

The Secondary NameNode assists the NameNode.

Important:

It is NOT a backup NameNode
 

Functions

It:

  • merges edit logs
  • creates checkpoints
  • reduces NameNode restart time

Example

Over time:

  • NameNode metadata grows large

Secondary NameNode periodically merges:

  • FSImage
  • Edit Logs

to optimize metadata management.

HDFS File Storage Mechanism

Step 1: File Splitting

Large files are divided into blocks.

Default block size:

128 MB
 

Step 2: Replication

Each block is copied multiple times.

Default replication:

3 replicas
 

Step 3: Metadata Management

NameNode stores:

  • block information
  • DataNode locations

Step 4: Data Storage

Actual blocks are stored on DataNodes.

Example

Suppose file size:

1 TB

Number of blocks:

~8000 blocks

With replication factor 3:

Total storage needed ≈ 3 TB

distributed across cluster nodes.
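
A quick sanity check of these numbers (simple arithmetic in Python, assuming the default 128 MB block size and replication factor 3):

import math

file_size_mb = 1 * 1024 * 1024   # 1 TB expressed in MB
block_size_mb = 128              # default HDFS block size
replication = 3                  # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_tb = (file_size_mb * replication) / (1024 * 1024)

print(blocks)          # 8192 blocks (roughly 8000)
print(raw_storage_tb)  # 3.0 TB of raw cluster storage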

HDFS Read Process

Step-by-Step

Step 1

Client requests file from NameNode.

Step 2

NameNode provides block locations.

Example:

Block1 → DataNode3
Block2 → DataNode7
 

Step 3

Client directly reads blocks from DataNodes.

Blocks can be read in parallel.

Example

Reading:

100 GB video file

becomes faster because multiple nodes serve data simultaneously.
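
The parallel read idea can be illustrated with a small Python sketch (purely conceptual; fetch_block is a placeholder, not a real HDFS client call):

from concurrent.futures import ThreadPoolExecutor

# Block locations as returned by the NameNode (illustrative values)
block_locations = {
    "Block1": "DataNode3",
    "Block2": "DataNode7",
}

def fetch_block(item):
    """Placeholder for reading one block directly from a DataNode."""
    block, node = item
    return f"{block} read from {node}"

# Read all blocks in parallel, one thread per block
with ThreadPoolExecutor(max_workers=len(block_locations)) as pool:
    for result in pool.map(fetch_block, block_locations.items()):
        print(result)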

HDFS Write Process

Step-by-Step

Step 1

Client requests write operation from NameNode.

Step 2

NameNode selects DataNodes.

Step 3

Client writes block to first DataNode.

Step 4

Block is replicated to other DataNodes.

Step 5

DataNodes confirm successful storage.

Example

Suppose replication factor:

3

Data flow:

Client → DataNode1 → DataNode2 → DataNode3
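
The same pipeline can be sketched as a simple chain in Python (conceptual only; write_block stands in for the block transfer and acknowledgement flow):

def write_block(block, pipeline):
    """Store the block on the first node, then forward it down the chain."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    print(f"{head} stores {block}")
    # Acknowledgements travel back up the chain once all replicas are written
    return write_block(block, rest) + [head]

acks = write_block("Block1", ["DataNode1", "DataNode2", "DataNode3"])
print("Acks received (last replica first):", acks)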
 

Advantages of HDFS

1. Fault Tolerance

Automatic replication protects against node failure.

2. High Throughput

Efficient for Big Data batch processing.

3. Scalability

Storage grows by adding more nodes.

4. Cost-Effective

Uses low-cost hardware.

5. Supports Multiple Data Types

Handles:

  • structured
  • semi-structured
  • unstructured data

Limitations of HDFS

1. Not Suitable for Real-Time Processing

HDFS is batch-oriented.

Not ideal for:

  • low latency
  • instant querying

2. Small File Problem

Millions of small files overload NameNode memory.

Example

Suppose:

10 million 1 KB files

Metadata storage becomes a problem.
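
A rough back-of-the-envelope calculation shows why (it assumes the commonly quoted figure of about 150 bytes of NameNode heap per file or block object; the exact number varies by version):

files = 10_000_000        # 10 million small files
bytes_per_object = 150    # rough rule of thumb per metadata object

# Each tiny file costs at least one file object plus one block object
objects = files * 2
heap_gb = objects * bytes_per_object / (1024 ** 3)

print(round(heap_gb, 1))  # ~2.8 GB of NameNode heap just for metadata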

3. Single NameNode Dependency

Traditional HDFS depends heavily on NameNode.

Failure can stop the cluster.

4. High Storage Usage

Replication consumes extra storage.

Example:

1 TB data with replication factor 3 = 3 TB storage

5. Limited Transaction Support

HDFS is not fully ACID compliant.

Not suitable for:

  • banking systems
  • OLTP applications

Real-Life Example of HDFS

Suppose YouTube stores:

  • videos
  • user logs
  • comments

Data size:

Petabytes of data

HDFS helps by:

  • distributing files across many servers
  • replicating data
  • enabling parallel processing

If one server fails:

  • videos are still available from replicas.

Key Takeaways

Important Points

  • HDFS is the backbone of Hadoop storage.
  • Uses distributed architecture.
  • NameNode manages metadata.
  • DataNodes store actual blocks.
  • Files are split into blocks and replicated.
  • Optimized for:
    • scalability
    • fault tolerance
    • high throughput

17. Processing Data with Hadoop

Hadoop is not just a storage system (HDFS); it is also a complete platform for processing very large amounts of data.
It processes data across many computers in parallel, which makes it:

  • Fast
  • Scalable
  • Fault-tolerant

Hadoop uses different tools/frameworks for different types of processing.

Hadoop Data Processing Flow

General Flow

Data Source → Ingestion → HDFS Storage → Processing Engine → Output/Analysis

Example

Suppose a company collects:

  • Website logs
  • Customer data
  • Sales records
  • Social media data

This data is first stored in HDFS, then processed using tools like:

  • MapReduce
  • Hive
  • Pig
  • Spark

1. MapReduce – Core Hadoop Processing Framework

MapReduce is the original processing engine of Hadoop.

It processes huge datasets in parallel across many machines.

How MapReduce Works

Step 1: Input Splitting

HDFS divides large files into blocks (usually 128 MB).

Example

A 1 GB file is divided into:

128 MB + 128 MB + ... (8 blocks in total)

Each block is processed by a separate Mapper.

Step 2: Map Phase

Mapper reads data and produces key-value pairs.

Example: Word Count

Input text:

Hadoop is big data
Big data is powerful

Mapper Output:

(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Big,1)
(data,1)
(is,1)
(powerful,1)
 

Step 3: Shuffle and Sort

Hadoop groups same keys together.

Output becomes:

(Hadoop,[1])
(is,[1,1])
(data,[1,1])
(big,[1])
(Big,[1])
(powerful,[1])
 

Step 4: Reduce Phase

Reducer combines values.

Final Output:

(Hadoop,1)
(is,2)
(data,2)
(big,1)
(Big,1)
(powerful,1)
 

Step 5: Output Storage

Final result is stored back in HDFS.
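
The whole Map, Shuffle and Reduce pipeline can be simulated in a few lines of Python (a toy illustration of the model, not Hadoop code):

from collections import defaultdict

lines = ["Hadoop is big data", "Big data is powerful"]

# Map phase: emit (word, 1) for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group all values belonging to the same key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: combine the values for each key
reduced = {key: sum(values) for key, values in grouped.items()}

print(reduced)
# {'Hadoop': 1, 'is': 2, 'big': 1, 'data': 2, 'Big': 1, 'powerful': 1}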

Advantages of MapReduce

a. Highly scalable
b. Fault tolerant
c. Handles huge data

Disadvantages

a. Slow for real-time work
b. Complex coding

2. Apache Hive – SQL-Like Processing

Hive allows users to write SQL-like queries on Hadoop data.

Instead of writing Java MapReduce code, analysts can use simple queries.

Hive Workflow

  1. User writes HiveQL query
  2. Hive converts it into MapReduce/Spark jobs
  3. Hadoop executes the job
  4. Results are returned

Example of Hive Query

Suppose we have a table named employees.

Query

SELECT department, COUNT(*)
FROM employees
GROUP BY department;

What Hive Does

Hive automatically:

  • Reads data from HDFS
  • Converts query into processing jobs
  • Executes on cluster
  • Returns result

Real-Life Example

An e-commerce company wants:

Total sales by city

Using Hive:

SELECT city, SUM(sales)
FROM orders
GROUP BY city;

Very easy compared to MapReduce coding.

Advantages of Hive

a. Easy for SQL users
b. No need to write Java code
c. Good for reporting

Disadvantages

a. Slower for real-time applications

3. Apache Pig – ETL and Data Transformation

Pig uses a scripting language called Pig Latin.

It is mainly used for:

  • ETL (Extract, Transform, Load)
  • Cleaning data
  • Transforming data

Pig Workflow

  1. Write Pig script
  2. Pig converts it into MapReduce jobs
  3. Hadoop executes jobs
  4. Results stored in HDFS

Example of Pig Script

Suppose sales data contains:

id,name,sales
1,Ram,5000
2,Shyam,7000

Pig Script

A = LOAD 'sales.csv' USING PigStorage(',')
AS (id:int, name:chararray, sales:int);

B = FILTER A BY sales > 6000;

STORE B INTO 'output';

Result

2,Shyam,7000
 

Advantages of Pig

a. Easy data transformation
b. Less coding
c. Good for ETL jobs

Disadvantages

a. Not suitable for real-time processing

4. Apache Spark – Fast In-Memory Processing

Spark is much faster than MapReduce because it processes data in memory.

It supports:

  • Batch processing
  • Real-time processing
  • Machine learning
  • Graph processing

Spark Workflow

  1. Load data from HDFS/Hive/Kafka
  2. Apply transformations
  3. Execute tasks in parallel
  4. Save output

Example Using Spark

Suppose we want to count words.

PySpark Example

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Each line of the file becomes a row; split every line into one word per row
data = spark.read.text("input.txt")
words = data.selectExpr("explode(split(value, ' ')) as word")

# Count how many times each word appears
result = words.groupBy("word").count()
result.show()

Why Spark is Fast

MapReduce writes intermediate results to disk.

Spark keeps data in RAM (memory).

So processing becomes much faster.

Advantages of Spark

a. Very fast
b. Real-time processing
c. Supports Python, Java, Scala, R
d. Good for Machine Learning

Disadvantages

a. Requires more RAM

5. Real-Time and Streaming Tools

Hadoop ecosystem also supports streaming data.

Apache Flume

Used to collect log data and move it into HDFS.

Example

Web server logs → Flume → HDFS

Apache Kafka

Distributed messaging system for streaming data.

Example

Twitter messages streamed into Hadoop.

Spark Streaming / Storm

Processes live streaming data.
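
A minimal PySpark Structured Streaming sketch for this kind of pipeline (illustrative only; the broker address and topic name are assumptions, and the Kafka connector package must be available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamDemo").getOrCreate()

# Read a live stream of messages from a Kafka topic
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # assumed address
          .option("subscribe", "tweets")                     # assumed topic
          .load())

# Running count of messages per key, printed to the console
query = (stream.groupBy("key").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()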

Example Use Cases

  • Fraud detection
  • IoT sensor monitoring
  • Live analytics
  • Stock market analysis

Complete Hadoop Ecosystem Example

Example: Online Shopping Website

Step 1: Data Generation

Users generate:

  • Search logs
  • Purchase records
  • Clickstream data

Step 2: Data Ingestion

Tools like:

  • Flume
  • Kafka

collect data.

Step 3: Storage

Data stored in HDFS.

Step 4: Processing

Different tools used:

Task                | Tool
Batch report        | MapReduce
SQL analytics       | Hive
Data cleaning       | Pig
Real-time analytics | Spark

Step 5: Output

Results used for:

  • Dashboards
  • Reports
  • Recommendations
  • Alerts

Best Practices in Hadoop Processing

1. Use Large Files

Small files increase NameNode overhead.

Good practice:

Use files larger than 128 MB

2. Partition Hive Tables

Improves query speed.

Example

Partition by date:

sales/year=2026/month=05/
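
With Spark, for example, this layout can be produced when writing the data (a sketch; orders_df and the output path are assumed):

# Write sales data partitioned by year and month,
# creating directories like sales/year=2026/month=5/
(orders_df.write
    .partitionBy("year", "month")
    .mode("overwrite")
    .parquet("sales/"))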

3. Choose Correct Tool

Requirement          | Best Tool
Batch Processing     | MapReduce
SQL Queries          | Hive
ETL/Data Cleaning    | Pig
Real-Time Processing | Spark

4. Compress Data

Compression improves:

  • Storage efficiency
  • Faster I/O

Formats:

  • Snappy (compression codec)
  • Gzip (compression codec)
  • Parquet (columnar file format with built-in compression)
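
For example, Spark can write Snappy-compressed Parquet output in one line (a sketch; df and the output path are assumed):

# Snappy-compressed Parquet: smaller files, faster scans
df.write.option("compression", "snappy").parquet("output/compressed_data")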

5. Monitor with YARN

YARN manages cluster resources.

It helps avoid:

  • Memory issues
  • Task failures

Simple Comparison Table

Feature   | MapReduce         | Hive          | Pig            | Spark
Type      | Processing Engine | SQL Tool      | Scripting Tool | Fast Processing Engine
Coding    | Complex           | Easy SQL      | Simple Scripts | Moderate
Speed     | Slow              | Medium        | Medium         | Fast
Real-Time | No                | No            | No             | Yes
Best Use  | Batch Jobs        | Data Analysis | ETL            | Real-Time + ML

 

18. Hadoop YARN

YARN (Yet Another Resource Negotiator) is the resource management and job scheduling system in Hadoop.

It was introduced in Hadoop 2.x to solve the limitations of Hadoop 1.x.

Before YARN (Hadoop 1.x)

In Hadoop 1.x:

  • MapReduce handled:
    • Data processing
    • Resource management

Because of this:

a. Only MapReduce jobs could run
b. Cluster scalability was limited
c. Resource usage was inefficient

After YARN (Hadoop 2.x)

YARN separated:

Resource Management ≠ Data Processing

Now different frameworks can run together:

  • MapReduce
  • Spark
  • Tez
  • Flink

on the same Hadoop cluster.

Simple Definition of YARN

YARN is the brain of Hadoop cluster management.

It manages:

  • CPU
  • Memory
  • Job scheduling
  • Resource allocation

Real-Life Analogy

Imagine a company:

Hadoop Component  | Real-Life Example
ResourceManager   | Company Manager
NodeManager       | Employees
ApplicationMaster | Team Leader
Containers        | Work desks/resources

The manager assigns resources, employees do work, and team leaders manage specific projects.

Purpose of YARN

YARN is used to:

a. Manage cluster resources efficiently
b. Schedule applications/jobs
c. Support multiple processing engines
d. Improve scalability and performance

YARN Architecture

YARN has 4 main components:

  1. ResourceManager (RM)
  2. NodeManager (NM)
  3. ApplicationMaster (AM)
  4. Containers

1. ResourceManager (RM)

ResourceManager is the master daemon of YARN.

It manages the entire cluster.

Responsibilities of ResourceManager

  • Tracks available resources in cluster
  • Allocates CPU and memory
  • Schedules jobs
  • Monitors applications

Internal Components of RM

A. Scheduler

Allocates resources to jobs.

Scheduling methods:

  • FIFO Scheduler
  • Capacity Scheduler
  • Fair Scheduler

B. ApplicationManager

  • Accepts job submissions
  • Starts ApplicationMaster
  • Monitors application status

Simple Example

Suppose:

  • Cluster has 100 GB RAM
  • Job A needs 20 GB
  • Job B needs 30 GB

ResourceManager decides:

Job A → 20 GB
Job B → 30 GB

and allocates resources accordingly.

Analogy

ResourceManager is like:

Office Boss

who assigns work and resources.

2. NodeManager (NM)

NodeManager runs on every worker node.

It manages resources of that specific machine.

Responsibilities of NodeManager

  • Manages CPU and memory of node
  • Launches containers
  • Monitors task execution
  • Sends reports to ResourceManager

Example

Suppose one node has:

16 GB RAM
8 CPU cores

NodeManager tracks how much is used and available.

Analogy

NodeManager is like:

Employee/Supervisor on each machine

3. ApplicationMaster (AM)

Each application/job gets its own ApplicationMaster.

It manages that specific job.

Responsibilities of ApplicationMaster

  • Requests resources from RM
  • Monitors tasks
  • Handles retries if task fails
  • Tracks job progress

Important Point

Every application has a separate AM.

Examples:

  • Spark job → Spark AM
  • MapReduce job → MR AM

Example

Suppose a Spark application needs:

  • 5 containers
  • 10 GB memory

ApplicationMaster requests these resources from RM.

Analogy

ApplicationMaster is like:

Project Team Leader

4. Containers

Containers are resource units allocated by YARN.

A container includes:

  • CPU
  • Memory
  • Disk resources

Example

A task may require:

2 CPU cores + 4 GB RAM

YARN creates a container with these resources.

Purpose of Containers

Containers run:

  • Map tasks
  • Reduce tasks
  • Spark executors
  • Other application tasks

Analogy

Container is like:

A workspace or desk given to an employee
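
When a Spark job runs on YARN, for instance, the container size follows the application's resource settings (a sketch; the values mirror the example above and are assumptions):

from pyspark.sql import SparkSession

# Each Spark executor runs inside a YARN container
# sized according to these settings
spark = (SparkSession.builder
         .appName("ContainerDemo")
         .config("spark.executor.cores", "2")    # 2 CPU cores per container
         .config("spark.executor.memory", "4g")  # 4 GB RAM per container
         .getOrCreate())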

Complete YARN Workflow

Step 1: User Submits Job

Example:

Run Spark job

Job is submitted to ResourceManager.

Step 2: RM Starts ApplicationMaster

ResourceManager launches ApplicationMaster for that job.

Step 3: AM Requests Resources

AM asks RM:

I need 5 containers
 

Step 4: RM Allocates Containers

RM allocates resources across nodes.

Step 5: NodeManagers Launch Containers

NodeManagers start containers and execute tasks.

Step 6: Progress Monitoring

Containers report progress to ApplicationMaster.

Step 7: Job Completion

After completion:

  • Resources are released
  • Containers stop
  • RM updates cluster status

Full Workflow Diagram

User
  ↓
ResourceManager
  ↓
ApplicationMaster
  ↓
NodeManagers
  ↓
Containers Execute Tasks
  ↓
Results Returned
 

Practical Example of YARN

Example: Video Processing Company

Suppose YouTube-like company processes videos.

Cluster Resources

Node  | RAM   | CPU
Node1 | 32 GB | 16 Cores
Node2 | 32 GB | 16 Cores
Node3 | 64 GB | 32 Cores

User Runs Spark Job

Task:

Process 1 million videos

What Happens?

Step 1

ResourceManager receives job.

Step 2

ApplicationMaster starts.

Step 3

AM requests:

20 containers

Step 4

RM allocates resources across nodes.

Step 5

NodeManagers launch containers.

Step 6

Spark tasks execute in parallel.

Step 7

Output stored in HDFS.

YARN Advantages

1. Scalability

Supports thousands of nodes.

Very large clusters can work efficiently.

2. Better Resource Utilization

Resources allocated dynamically.

No waste of CPU or memory.

3. Multi-Framework Support

Can run:

  • Spark
  • MapReduce
  • Tez
  • Flink

together.

4. Fault Tolerance

If a node fails:

a. Tasks restart automatically
b. Jobs continue running

5. Better Performance

Separates:

Resource Management
AND
Job Execution

which removes bottlenecks.

YARN vs Hadoop 1.x

Feature             | Hadoop 1.x        | Hadoop 2.x (YARN)
Resource Management | Part of MapReduce | Separate via YARN
Framework Support   | Only MapReduce    | Spark, Tez, Flink, etc.
Scalability         | Limited           | Very High
Resource Allocation | Fixed Slots       | Dynamic Containers
Fault Tolerance     | Limited           | Better Automatic Recovery

Important Terms Summary

Term      | Meaning
YARN      | Resource manager of Hadoop
RM        | Master resource controller
NM        | Worker node manager
AM        | Per-application manager
Container | CPU + Memory resource unit

Easy Memory Trick

YARN Components

RM → Gives resources
NM → Runs resources
AM → Manages application
Container → Executes tasks

19. MapReduce Programming

MapReduce is the main data processing model in Hadoop.

It is used to process very large datasets across many computers in a Hadoop cluster.

MapReduce works in a:

  • Parallel way
  • Fault-tolerant way
  • Scalable way

Simple Definition

MapReduce divides a big job into two parts:

  1. Map → Filtering, transforming, sorting
  2. Reduce → Combining and summarizing

Real-Life Example

Imagine a teacher wants to count how many times each word appears in 10,000 exam papers.

Instead of checking alone:

  • Different students count words on different papers (Map phase)
  • Final counts are combined (Reduce phase)

This is exactly how MapReduce works.

Main Components of MapReduce

There are 3 important stages:

  1. Map Function
  2. Shuffle and Sort
  3. Reduce Function

1. Map Function

Purpose

The Mapper reads input data and produces intermediate key-value pairs.

Input and Output

Input    | Output
Raw data | Key-value pairs

Example: Word Count Problem

Suppose input file contains:

Hadoop is big data
Hadoop is scalable

Mapper Processing

The Mapper reads the input line by line and emits a key-value pair for each word:

(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Hadoop,1)
(is,1)
(scalable,1)

Explanation

Each word becomes:

(word,1)

because each occurrence of a word is counted as 1; repeated words are combined later in the Reduce phase.

2. Shuffle and Sort

This is the middle phase of MapReduce.

Purpose

Shuffle and Sort:

a. Groups same keys together
b. Sorts data for reducer

Example

Input from Mapper:

(Hadoop,1)
(is,1)
(big,1)
(data,1)
(Hadoop,1)
(is,1)
(scalable,1)

After Shuffle and Sort:

(Hadoop,[1,1])
(is,[1,1])
(big,[1])
(data,[1])
(scalable,[1])
 

Meaning

All values of the same key are grouped together.

3. Reduce Function

Reducer combines all values for each key.

Example

Reducer receives:

(Hadoop,[1,1])

Reducer adds:

1 + 1 = 2

Final Output:

(Hadoop,2)
(is,2)
(big,1)
(data,1)
(scalable,1)
 

Complete MapReduce Workflow

Input Data

Map Phase

Shuffle & Sort

Reduce Phase

Final Output in HDFS
 

Step-by-Step Workflow

Step 1: User Submits Job

Example:

Run Word Count Program
 

Step 2: Hadoop Splits Input Data

Large files are divided into blocks.

Example:

1 GB file → 8 blocks of 128 MB each

Each block goes to a different Mapper.

Step 3: Mapper Executes

Each mapper processes records independently.

Produces intermediate key-value pairs.

Step 4: Shuffle and Sort

Hadoop automatically:

  • Groups same keys
  • Sorts keys

Step 5: Reducer Executes

Reducer aggregates values.

Example:

(Hadoop,[1,1,1]) → (Hadoop,3)
 

Step 6: Output Stored

Final results stored in HDFS.

Complete Real-Life Example

Example: Counting Product Sales

Suppose an e-commerce company has sales records:

Laptop
Mobile
Laptop
Tablet
Mobile
Laptop
 

Mapper Output

(Laptop,1)
(Mobile,1)
(Laptop,1)
(Tablet,1)
(Mobile,1)
(Laptop,1)
 

Shuffle and Sort

(Laptop,[1,1,1])
(Mobile,[1,1])
(Tablet,[1])
 

Reducer Output

(Laptop,3)
(Mobile,2)
(Tablet,1)
 

Java MapReduce Program Structure

A MapReduce program mainly has:

  1. Mapper Class
  2. Reducer Class
  3. Driver Class

1. Mapper Class

Mapper defines Map logic.

Example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit (word, 1) for each one
        String[] words = value.toString().split(" ");
        for (String w : words) {
            word.set(w);
            context.write(word, one);
        }
    }
}
 

What This Mapper Does

For every word:

(word,1)

is generated.

2. Reducer Class

Reducer combines counts.

Example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all counts received for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
 

What Reducer Does

Adds all counts for each word.

Example:

(Hadoop,[1,1,1]) → (Hadoop,3)
 

3. Driver Class

Driver configures and starts the job.

Example

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are passed as command-line arguments
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
 

Features of MapReduce

1. Parallel Processing

Multiple tasks run simultaneously on different nodes.

2. Fault Tolerance

If one node fails:

Hadoop automatically reruns the task on another node.

3. Scalability

Can process:

TBs → PBs of data

using thousands of machines.

4. Data Locality

Mapper runs near the data block.

This reduces network traffic.

5. High Throughput

Optimized for large-scale batch processing.

Advantages of MapReduce

a. Handles massive data efficiently
b. Highly scalable
c. Reliable and fault tolerant
d. Works well with HDFS

Limitations of MapReduce

a. High latency
b. Slow for real-time processing
c. Complex Java coding
d. Not efficient for iterative ML algorithms
e. Poor handling of many small files

Why Spark Became Popular

MapReduce writes intermediate data to disk repeatedly.

Spark keeps data in memory.

So Spark is much faster for:

  • Machine Learning
  • Real-time analytics
  • Graph processing

MapReduce Use Cases

Use Case      | Example
Word Count    | Text analysis
Log Analysis  | Website logs
ETL           | Data transformation
Analytics     | Aggregation and grouping
Preprocessing | Machine learning datasets

Hadoop Ecosystem Relation

Tool      | Purpose
HDFS      | Storage
YARN      | Resource management
MapReduce | Processing
Hive      | SQL queries
Pig       | ETL scripting
Spark     | Fast processing

Simple Memory Trick

MapReduce Formula

Map → Break & Transform
Shuffle → Group
Reduce → Combine