Big Data Notes
All Topics (12)
- 1. What is Big Data?
- 2. Big Data Characteristics
- 3. Types of Big Data
- 4. Traditional Data vs Big Data
- 5. Evolution of Big Data
- 6. Challenges with Big Data
- 7. Technologies Available for Big Data
- 8. Infrastructure for Big Data
- 9. Uses of Data Analytics
- 10. Hadoop
- 11. Hadoop Core Components
- 12. Hadoop Ecosystem
11. Hadoop Core Components
Hadoop is a framework used for storing and processing huge amounts of data in a distributed environment.
Its core components work together to handle big data efficiently.
The four core components of Hadoop are:
- HDFS (Hadoop Distributed File System) – Storage Layer
- MapReduce – Processing Layer
- YARN (Yet Another Resource Negotiator) – Resource Management Layer
- Hadoop Common – Shared Utilities and Libraries
1. HDFS (Hadoop Distributed File System)
What is HDFS?
HDFS is a distributed file system designed to store very large files across multiple machines.
It provides:
- High storage capacity
- Fault tolerance
- Scalability
It is built to run on commodity hardware.
Key Features of HDFS
- Fault Tolerance: Data is replicated across multiple nodes.
- Scalability: More nodes can be added easily.
- High Throughput: Optimized for large-scale data processing.
- Flexibility: Stores structured, semi-structured, and unstructured data.
HDFS Architecture
1. NameNode (Master Node)
The NameNode manages the file system metadata such as:
- File names
- Directories
- Permissions
- Block locations
It controls all DataNodes.
2. DataNode (Slave Node)
DataNodes store the actual data blocks.
Responsibilities:
- Store data
- Handle read/write operations
- Send heartbeat signals to the NameNode
How HDFS Works
- Large files are divided into blocks.
- Default block size = 128 MB
- Each block is replicated (usually 3 copies).
This ensures data safety even if a node fails.
Example of HDFS
Suppose you have a 1 TB video file.
HDFS will:
- Split it into 128 MB blocks
- Create 8192 blocks (1 TB = 1,048,576 MB; 1,048,576 ÷ 128 = 8192)
- Store each block on 3 different DataNodes
So if one machine crashes, data can still be recovered from another copy.
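In code, this storage layer is reached through the HDFS Java API. Below is a minimal sketch, assuming a running cluster whose address is configured in core-site.xml on the classpath; the file paths are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks
        // and replicates each block (3 copies by default) automatically
        fs.copyFromLocalFile(new Path("/tmp/video.mp4"),   // illustrative paths
                             new Path("/data/video.mp4"));

        // Confirm the replication factor applied to the stored file
        short replication = fs.getFileStatus(new Path("/data/video.mp4"))
                              .getReplication();
        System.out.println("Replication factor: " + replication);
        fs.close();
    }
}
```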
Real-Life Example of HDFS
Imagine keeping 3 photocopies of an important document in different rooms.
If one room is damaged, the document is still safe in the other rooms.
2. MapReduce
What is MapReduce?
MapReduce is a programming model used to process large datasets in parallel across a Hadoop cluster.
It works in three phases:
- Map Phase
- Shuffle and Sort Phase
- Reduce Phase
Phases of MapReduce
1. Map Phase
The mapper processes input data and converts it into key-value pairs.
Example
Input sentence:
Hadoop is fast Hadoop is scalable
Mapper Output:
(Hadoop,1)
(is,1)
(fast,1)
(Hadoop,1)
(is,1)
(scalable,1)
2. Shuffle and Sort Phase
The framework groups the values of identical keys together and sorts the keys.
(Hadoop,[1,1])
(is,[1,1])
(fast,[1])
(scalable,[1])
3. Reduce Phase
The reducer combines values and produces the final result.
Final Output:
(Hadoop,2)
(is,2)
(fast,1)
(scalable,1)
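The same word count can be expressed with Hadoop's Java MapReduce API. The sketch below follows the classic WordCount pattern; input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);     // e.g. (Hadoop, 1)
            }
        }
    }

    // Reduce phase: sum the 1s grouped under each key
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum)); // e.g. (Hadoop, 2)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```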
Advantages of MapReduce
- Parallel processing
- Faster execution
- Fault tolerance
- Handles petabytes of data
Real-Life Example of MapReduce
Imagine exam papers being checked by multiple teachers:
- Map: Teachers check papers separately
- Shuffle: Papers are grouped subject-wise
- Reduce: Final marks are calculated
3. YARN (Yet Another Resource Negotiator)
What is YARN?
YARN is the resource management framework in Hadoop.
It manages:
- CPU usage
- Memory allocation
- Task scheduling
YARN allows multiple applications like MapReduce, Spark, and Hive to run together.
Components of YARN
1. ResourceManager (Master)
Responsibilities:
- Allocates cluster resources
- Schedules applications
- Monitors resource usage
2. NodeManager (Slave)
Responsibilities:
- Manages resources on each node
- Executes tasks
- Reports status to ResourceManager
Example of YARN
Suppose:
- One user runs a Spark job
- Another user runs a MapReduce job
YARN allocates CPU and memory resources efficiently to both applications.
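For a programmatic view of this, the YarnClient API can list every application currently sharing the cluster. A minimal sketch, assuming a running ResourceManager described by the yarn-site.xml on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager named in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // One report per application (MapReduce, Spark, ...) on the cluster
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getName() + " [" + app.getApplicationType()
                    + "] -> " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```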
Real-Life Example of YARN
Think of a school principal:
- Assigns classrooms to teachers
- Ensures resources are properly used
4. Hadoop Common
What is Hadoop Common?
Hadoop Common is a collection of shared libraries and utilities required by all Hadoop modules.
It provides:
- Java libraries
- Configuration files
- Scripts for starting Hadoop services
- APIs for Hadoop operations
Features of Hadoop Common
- Supports communication between Hadoop modules
- Provides operating system utilities
- Helps integrate tools like Hive, Pig, HBase, and Sqoop
Example of Hadoop Common
Just like common system files support all applications in Windows, Hadoop Common supports all Hadoop components.
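In practice, the most visible piece of Hadoop Common is the Configuration class, which every module uses to read settings such as core-site.xml. A minimal sketch (the values printed depend on your installation):

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Configuration (from hadoop-common) loads core-default.xml
        // and core-site.xml automatically
        Configuration conf = new Configuration();

        // Every component asks this shared layer for its settings,
        // e.g. which file system URI to talk to
        String fsUri = conf.get("fs.defaultFS", "file:///");
        System.out.println("Default file system: " + fsUri);

        // Programs can also override settings at run time
        conf.set("dfs.replication", "3");
        System.out.println("Replication: " + conf.get("dfs.replication"));
    }
}
```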
Summary Table
| Component | Purpose | Example |
|---|---|---|
| HDFS | Stores data | File storage across nodes |
| MapReduce | Processes data | Word count program |
| YARN | Manages resources | CPU and memory allocation |
| Hadoop Common | Shared utilities | Libraries and APIs |
Simple Real-Life Analogy
| Hadoop Component | Real-Life Example |
|---|---|
| HDFS | Warehouse for storing goods |
| MapReduce | Workers processing tasks |
| YARN | Manager assigning resources |
| Hadoop Common | Common tools used by everyone |
12. Hadoop Ecosystem
The Hadoop Ecosystem is a collection of open-source tools and frameworks that work together to store, process, analyze, and manage Big Data.
While the core Hadoop components (HDFS and MapReduce) handle storage and batch processing, the ecosystem adds powerful tools for:
- Real-time processing
- Data analytics
- Data integration
- Workflow automation
- Machine learning
It can handle:
- Structured data (tables, SQL data)
- Semi-structured data (JSON, XML)
- Unstructured data (logs, images, videos)
Major Components of Hadoop Ecosystem
The ecosystem is modular, meaning you can use only the tools you need.
1. HDFS (Storage Layer)
Role
HDFS stores huge amounts of data across multiple machines.
Function
- Splits files into blocks
- Stores blocks on different nodes
- Keeps multiple copies for safety
Example
A 1 TB video file is split into smaller blocks and stored across many machines. If one machine fails, data is still available.
2. MapReduce (Processing Layer)
Role
Batch processing framework for large-scale data.
Function
Processes data in parallel across cluster machines.
Example
Word count program:
- Input: Large text file
- Output: Frequency of each word
Used in:
- Log analysis
- Clickstream analysis
- Data summarization
3. YARN (Resource Management Layer)
Role
Manages resources and schedules jobs in the Hadoop cluster.
Function
- Allocates CPU and memory
- Schedules multiple applications
- Manages cluster workload
Example
Running both Spark and MapReduce jobs on the same cluster without conflict.
4. Hive
Type
Data Warehouse tool (SQL-like system)
Function
Provides HiveQL (SQL-like language) to query big data.
Example
Instead of writing MapReduce code, you can write:

```sql
SELECT * FROM sales WHERE amount > 1000;
```
Use Case
- Business reports
- Sales analysis
- Data summarization
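Applications usually reach Hive through its JDBC driver and a HiveServer2 instance. A minimal sketch, assuming HiveServer2 runs on localhost:10000 with no authentication and that the sales table already exists:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver from the hive-jdbc artifact
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default"; // assumed endpoint

        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM sales WHERE amount > 1000")) {
            while (rs.next()) {
                // Hive compiles the query into jobs on the cluster
                System.out.println(rs.getString(1));
            }
        }
    }
}
```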
5. Pig
Type
Data processing scripting tool
Function
Uses Pig Latin language for data transformation.
Example
Convert raw logs into structured format.
Use Case
- ETL (Extract, Transform, Load) operations
- Data cleaning
- Data preparation
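Pig Latin scripts can also be embedded in Java through Pig's PigServer class. A rough sketch of a tiny cleaning step, assuming a logs.txt input file (the aliases and field names are illustrative):

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; use ExecType.MAPREDUCE on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load raw lines, drop empty ones, store the result
        pig.registerQuery("logs = LOAD 'logs.txt' AS (line:chararray);");
        pig.registerQuery("clean = FILTER logs BY line IS NOT NULL;");
        pig.store("clean", "clean_logs");
    }
}
```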
6. HBase
Type
NoSQL database (Column-oriented)
Function
Provides real-time read/write access to big data.
Example
- Social media user profiles
- IoT sensor data
- Banking transaction records
Feature
Very fast for random data access.
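A minimal sketch with the HBase Java client, assuming a users table with a profile column family has already been created:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write: one row, one column, available immediately
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"),
                          Bytes.toBytes("name"),
                          Bytes.toBytes("Asha"));
            table.put(put);

            // Read: random access by row key, no batch job needed
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("profile"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```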
7. Sqoop
Type
Data integration tool
Function
Transfers data between:
- RDBMS (MySQL, Oracle)
- Hadoop (HDFS, Hive)
Example
Import customer data from MySQL into Hadoop for analysis.
8. Flume
Type
Data ingestion tool
Function
Collects and moves streaming data into HDFS.
Example
- Twitter feeds
- Web server logs
- Application logs
9. Oozie
Type
Workflow scheduler
Function
Automates Hadoop jobs.
Example
A daily pipeline:
- Import data (Sqoop)
- Clean data (Pig)
- Query data (Hive)
- Generate report
Oozie runs all steps automatically.
10. Zookeeper
Type
Coordination service
Function
Manages:
- Cluster synchronization
- Configuration
- Naming services
Example
Used by HBase and Kafka to coordinate distributed systems.
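A minimal sketch with the ZooKeeper Java client, assuming a server on localhost:2181; the znode path and value are illustrative:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Block until the session to the ensemble is established
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode; every client in the cluster sees the same value,
        // which is how distributed systems share configuration
        zk.create("/demo-config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```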
11. Mahout
Type
Machine Learning library
Function
Provides scalable ML algorithms.
Example
- Recommendation systems (Netflix/Amazon style)
- Customer segmentation
- Clustering data
12. Spark
Type
Distributed processing engine
Function
Processes data in-memory for faster performance than MapReduce.
Example
- Real-time analytics
- Machine learning tasks
- Graph processing
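The word count from the MapReduce section becomes a few lines in Spark's Java API, with intermediate data kept in memory. A minimal sketch that runs locally, assuming an input.txt file (on a real cluster the job would be submitted through spark-submit):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("word-count")
                .setMaster("local[*]");   // local mode for illustration
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt");

            // Same map -> shuffle -> reduce idea as MapReduce,
            // but intermediate data stays in memory
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.collect().forEach(t -> System.out.println(t._1 + ": " + t._2));
        }
    }
}
```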
13. Kafka
Type
Streaming platform
Function
Handles real-time data streams.
Example
- Live user activity tracking
- Log streaming
- Event-driven systems
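A minimal Kafka producer in Java, assuming a broker on localhost:9092; the topic name and event payload are illustrative:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each event is appended to the topic and can be read
            // by many consumers in real time
            producer.send(new ProducerRecord<>("user-activity",
                    "user42", "clicked:home-page"));
        }
    }
}
```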
Applications of Hadoop Ecosystem
1. E-Commerce
- Product recommendations
- Customer behavior analysis
2. Social Media
- Sentiment analysis
- Trend detection
3. Banking & Finance
- Fraud detection
- Risk analysis
4. Healthcare
- Disease prediction
- Patient data analysis
5. Telecommunications
- Call data analysis
- Customer churn prediction
6. Government
- Census analysis
- Crime and traffic monitoring
Simple Summary Table
| Tool | Purpose | Example |
|---|---|---|
| HDFS | Storage | Distributed file storage |
| MapReduce | Processing | Word count |
| YARN | Resource management | Job scheduling |
| Hive | SQL querying | Sales reports |
| Pig | Data transformation | ETL jobs |
| HBase | NoSQL DB | Real-time data |
| Sqoop | Data transfer | MySQL → Hadoop |
| Flume | Data ingestion | Logs collection |
| Oozie | Workflow automation | Daily pipelines |
| Zookeeper | Coordination | Cluster sync |
| Mahout | Machine learning | Recommendations |
| Spark | Fast processing | Real-time analytics |
| Kafka | Streaming | Live data flow |