Big Data Notes
All Topics (12)
- 1. What is Big Data?
- 2. Big Data Characteristics
- 3. Types of Big Data
- 4. Traditional Data vs Big Data
- 5. Evolution of Big Data
- 6. Challenges with Big Data
- 7. Technologies Available for Big Data
- 8. Infrastructure for Big Data
- 9. Uses of Data Analytics
- 10. Hadoop
- 11. Hadoop Core Components
- 12. Hadoop Ecosystem
11. Hadoop Core Components
Hadoop is a framework used for storing and processing huge amounts of data in a distributed environment.
Its core components work together to handle big data efficiently.
The four core components of Hadoop are:
- HDFS (Hadoop Distributed File System) – Storage Layer
- MapReduce – Processing Layer
- YARN (Yet Another Resource Negotiator) – Resource Management Layer
- Hadoop Common – Shared Utilities and Libraries
1. HDFS (Hadoop Distributed File System)
What is HDFS?
HDFS is a distributed file system designed to store very large files across multiple machines.
It provides:
- High storage capacity
- Fault tolerance
- Scalability
It is built to run on commodity hardware.
Key Features of HDFS
- Fault Tolerance: Data is replicated across multiple nodes.
- Scalability: More nodes can be added easily.
- High Throughput: Optimized for large-scale data processing.
- Flexibility: Stores structured, semi-structured, and unstructured data.
HDFS Architecture
1. NameNode (Master Node)
The NameNode manages the file system metadata such as:
- File names
- Directories
- Permissions
- Block locations
It controls all DataNodes.
2. DataNode (Slave Node)
DataNodes store the actual data blocks.
Responsibilities:
- Store data
- Handle read/write operations
- Send heartbeat signals to the NameNode
How HDFS Works
- Large files are divided into blocks.
- Default block size = 128 MB
- Each block is replicated (usually 3 copies).
This ensures data safety even if a node fails.
Example of HDFS
Suppose you have a 1 TB video file.
HDFS will:
- Split it into 128 MB blocks
- Create 8192 blocks (1 TB = 1,048,576 MB; 1,048,576 ÷ 128 = 8192)
- Store each block on 3 different DataNodes
So if one machine crashes, data can still be recovered from another copy.
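In code, this storage layer is reached through the HDFS Java API. Below is a minimal sketch, assuming a running cluster whose address is configured in core-site.xml on the classpath; the file paths are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks
        // and replicates each block (3 copies by default) automatically
        fs.copyFromLocalFile(new Path("/tmp/video.mp4"),   // illustrative paths
                             new Path("/data/video.mp4"));

        // Confirm the replication factor applied to the stored file
        short replication = fs.getFileStatus(new Path("/data/video.mp4"))
                              .getReplication();
        System.out.println("Replication factor: " + replication);
        fs.close();
    }
}
```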
Real-Life Example of HDFS
Imagine keeping 3 photocopies of an important document in different rooms.
If one room is damaged, the document is still safe in the other rooms.
2. MapReduce
What is MapReduce?
MapReduce is a programming model used to process large datasets in parallel across a Hadoop cluster.
It works in three phases:
- Map Phase
- Shuffle and Sort Phase
- Reduce Phase
Phases of MapReduce
1. Map Phase
The mapper processes input data and converts it into key-value pairs.
Example
Input sentence:
Hadoop is fast Hadoop is scalable
Mapper Output:
(Hadoop,1)
(is,1)
(fast,1)
(Hadoop,1)
(is,1)
(scalable,1)
2. Shuffle and Sort Phase
The framework groups the values of identical keys together and sorts the keys.
(Hadoop,[1,1])
(is,[1,1])
(fast,[1])
(scalable,[1])
3. Reduce Phase
The reducer combines values and produces the final result.
Final Output:
(Hadoop,2)
(is,2)
(fast,1)
(scalable,1)
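The same word count can be expressed with Hadoop's Java MapReduce API. The sketch below follows the classic WordCount pattern; input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);     // e.g. (Hadoop, 1)
            }
        }
    }

    // Reduce phase: sum the 1s grouped under each key
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum)); // e.g. (Hadoop, 2)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```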
Advantages of MapReduce
- Parallel processing
- Faster execution
- Fault tolerance
- Handles petabytes of data
Real-Life Example of MapReduce
Imagine exam papers being checked by multiple teachers:
- Map: Teachers check papers separately
- Shuffle: Papers are grouped subject-wise
- Reduce: Final marks are calculated
3. YARN (Yet Another Resource Negotiator)
What is YARN?
YARN is the resource management framework in Hadoop.
It manages:
- CPU usage
- Memory allocation
- Task scheduling
YARN allows multiple applications like MapReduce, Spark, and Hive to run together.
Components of YARN
1. ResourceManager (Master)
Responsibilities:
- Allocates cluster resources
- Schedules applications
- Monitors resource usage
2. NodeManager (Slave)
Responsibilities:
- Manages resources on each node
- Executes tasks
- Reports status to ResourceManager
Example of YARN
Suppose:
- One user runs a Spark job
- Another user runs a MapReduce job
YARN allocates CPU and memory resources efficiently to both applications.
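For a programmatic view of this, the YarnClient API can list every application currently sharing the cluster. A minimal sketch, assuming a running ResourceManager described by the yarn-site.xml on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager named in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // One report per application (MapReduce, Spark, ...) on the cluster
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getName() + " [" + app.getApplicationType()
                    + "] -> " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```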
Real-Life Example of YARN
Think of a school principal:
- Assigns classrooms to teachers
- Ensures resources are properly used
4. Hadoop Common
What is Hadoop Common?
Hadoop Common is a collection of shared libraries and utilities required by all Hadoop modules.
It provides:
- Java libraries
- Configuration files
- Scripts for starting Hadoop services
- APIs for Hadoop operations
Features of Hadoop Common
- Supports communication between Hadoop modules
- Provides operating system utilities
- Helps integrate tools like Hive, Pig, HBase, and Sqoop
Example of Hadoop Common
Just like common system files support all applications in Windows, Hadoop Common supports all Hadoop components.
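In practice, the most visible piece of Hadoop Common is the Configuration class, which every module uses to read settings such as core-site.xml. A minimal sketch (the values printed depend on your installation):

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Configuration (from hadoop-common) loads core-default.xml
        // and core-site.xml automatically
        Configuration conf = new Configuration();

        // Every component asks this shared layer for its settings,
        // e.g. which file system URI to talk to
        String fsUri = conf.get("fs.defaultFS", "file:///");
        System.out.println("Default file system: " + fsUri);

        // Programs can also override settings at run time
        conf.set("dfs.replication", "3");
        System.out.println("Replication: " + conf.get("dfs.replication"));
    }
}
```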
Summary Table
| Component | Purpose | Example |
|---|---|---|
| HDFS | Stores data | File storage across nodes |
| MapReduce | Processes data | Word count program |
| YARN | Manages resources | CPU and memory allocation |
| Hadoop Common | Shared utilities | Libraries and APIs |
Simple Real-Life Analogy
| Hadoop Component | Real-Life Example |
|---|---|
| HDFS | Warehouse for storing goods |
| MapReduce | Workers processing tasks |
| YARN | Manager assigning resources |
| Hadoop Common | Common tools used by everyone |
12. Hadoop Ecosystem
The Hadoop Ecosystem is a collection of open-source tools and frameworks that work together to store, process, analyze, and manage Big Data.
While the core Hadoop components (HDFS and MapReduce) handle storage and batch processing, the ecosystem adds powerful tools for:
- Real-time processing
- Data analytics
- Data integration
- Workflow automation
- Machine learning
It can handle:
- Structured data (tables, SQL data)
- Semi-structured data (JSON, XML)
- Unstructured data (logs, images, videos)
Major Components of Hadoop Ecosystem
The ecosystem is modular, meaning you can use only the tools you need.
1. HDFS (Storage Layer)
Role
HDFS stores huge amounts of data across multiple machines.
Function
- Splits files into blocks
- Stores blocks on different nodes
- Keeps multiple copies for safety
Example
A 1 TB video file is split into smaller blocks and stored across many machines. If one machine fails, data is still available.
2. MapReduce (Processing Layer)
Role
Batch processing framework for large-scale data.
Function
Processes data in parallel across cluster machines.
Example
Word count program:
- Input: Large text file
- Output: Frequency of each word
Used in:
- Log analysis
- Clickstream analysis
- Data summarization
3. YARN (Resource Management Layer)
Role
Manages resources and schedules jobs in the Hadoop cluster.
Function
- Allocates CPU and memory
- Schedules multiple applications
- Manages cluster workload
Example
Running both Spark and MapReduce jobs on the same cluster without conflict.
4. Hive
Type
Data Warehouse tool (SQL-like system)
Function
Provides HiveQL (SQL-like language) to query big data.
Example
Instead of writing MapReduce code, you can write:

```sql
SELECT * FROM sales WHERE amount > 1000;
```
Use Case
- Business reports
- Sales analysis
- Data summarization
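Applications usually reach Hive through its JDBC driver and a HiveServer2 instance. A minimal sketch, assuming HiveServer2 runs on localhost:10000 with no authentication and that the sales table already exists:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver from the hive-jdbc artifact
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default"; // assumed endpoint

        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM sales WHERE amount > 1000")) {
            while (rs.next()) {
                // Hive compiles the query into jobs on the cluster
                System.out.println(rs.getString(1));
            }
        }
    }
}
```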
5. Pig
Type
Data processing scripting tool
Function
Uses Pig Latin language for data transformation.
Example
Convert raw logs into structured format.
Use Case
- ETL (Extract, Transform, Load) operations
- Data cleaning
- Data preparation
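Pig Latin scripts can also be embedded in Java through Pig's PigServer class. A rough sketch of a tiny cleaning step, assuming a logs.txt input file (the aliases and field names are illustrative):

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; use ExecType.MAPREDUCE on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load raw lines, drop empty ones, store the result
        pig.registerQuery("logs = LOAD 'logs.txt' AS (line:chararray);");
        pig.registerQuery("clean = FILTER logs BY line IS NOT NULL;");
        pig.store("clean", "clean_logs");
    }
}
```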
6. HBase
Type
NoSQL database (Column-oriented)
Function
Provides real-time read/write access to big data.
Example
- Social media user profiles
- IoT sensor data
- Banking transaction records
Feature
Very fast for random data access.
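A minimal sketch with the HBase Java client, assuming a users table with a profile column family has already been created:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write: one row, one column, available immediately
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"),
                          Bytes.toBytes("name"),
                          Bytes.toBytes("Asha"));
            table.put(put);

            // Read: random access by row key, no batch job needed
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("profile"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```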
7. Sqoop
Type
Data integration tool
Function
Transfers data between:
- RDBMS (MySQL, Oracle)
- Hadoop (HDFS, Hive)
Example
Import customer data from MySQL into Hadoop for analysis.
8. Flume
Type
Data ingestion tool
Function
Collects and moves streaming data into HDFS.
Example
- Twitter feeds
- Web server logs
- Application logs
9. Oozie
Type
Workflow scheduler
Function
Automates Hadoop jobs.
Example
A daily pipeline:
- Import data (Sqoop)
- Clean data (Pig)
- Query data (Hive)
- Generate report
Oozie runs all steps automatically.
10. Zookeeper
Type
Coordination service
Function
Manages:
- Cluster synchronization
- Configuration
- Naming services
Example
Used by HBase and Kafka to coordinate distributed systems.
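A minimal sketch with the ZooKeeper Java client, assuming a server on localhost:2181; the znode path and value are illustrative:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Block until the session to the ensemble is established
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode; every client in the cluster sees the same value,
        // which is how distributed systems share configuration
        zk.create("/demo-config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```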
11. Mahout
Type
Machine Learning library
Function
Provides scalable ML algorithms.
Example
- Recommendation systems (Netflix/Amazon style)
- Customer segmentation
- Clustering data
12. Spark
Type
Distributed processing engine
Function
Processes data in-memory for faster performance than MapReduce.
Example
- Real-time analytics
- Machine learning tasks
- Graph processing
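The word count from the MapReduce section becomes a few lines in Spark's Java API, with intermediate data kept in memory. A minimal sketch that runs locally, assuming an input.txt file (on a real cluster the job would be submitted through spark-submit):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("word-count")
                .setMaster("local[*]");   // local mode for illustration
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt");

            // Same map -> shuffle -> reduce idea as MapReduce,
            // but intermediate data stays in memory
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.collect().forEach(t -> System.out.println(t._1 + ": " + t._2));
        }
    }
}
```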
13. Kafka
Type
Streaming platform
Function
Handles real-time data streams.
Example
- Live user activity tracking
- Log streaming
- Event-driven systems
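A minimal Kafka producer in Java, assuming a broker on localhost:9092; the topic name and event payload are illustrative:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each event is appended to the topic and can be read
            // by many consumers in real time
            producer.send(new ProducerRecord<>("user-activity",
                    "user42", "clicked:home-page"));
        }
    }
}
```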
Applications of Hadoop Ecosystem
1. E-Commerce
- Product recommendations
- Customer behavior analysis
2. Social Media
- Sentiment analysis
- Trend detection
3. Banking & Finance
- Fraud detection
- Risk analysis
4. Healthcare
- Disease prediction
- Patient data analysis
5. Telecommunications
- Call data analysis
- Customer churn prediction
6. Government
- Census analysis
- Crime and traffic monitoring
Simple Summary Table
| Tool | Purpose | Example |
|---|---|---|
| HDFS | Storage | Distributed file storage |
| MapReduce | Processing | Word count |
| YARN | Resource management | Job scheduling |
| Hive | SQL querying | Sales reports |
| Pig | Data transformation | ETL jobs |
| HBase | NoSQL DB | Real-time data |
| Sqoop | Data transfer | MySQL → Hadoop |
| Flume | Data ingestion | Logs collection |
| Oozie | Workflow automation | Daily pipelines |
| Zookeeper | Coordination | Cluster sync |
| Mahout | Machine learning | Recommendations |
| Spark | Fast processing | Real-time analytics |
| Kafka | Streaming | Live data flow |