Big Data Notes


All Topics (12)

  • 1. What is Big Data?
  • 2. Big Data Characteristics
  • 3. Types of Big Data
  • 4. Traditional Data vs Big Data
  • 5. Evolution of Big Data
  • 6. Challenges with Big Data
  • 7. Technologies Available for Big Data
  • 8. Infrastructure for Big Data
  • 9. Uses of Data Analytics
  • 10. Hadoop
  • 11. Hadoop Core Components
  • 12. Hadoop Ecosystem

11. Hadoop Core Components

Hadoop is a framework used for storing and processing huge amounts of data in a distributed environment.
Its core components work together to handle big data efficiently.

The four core components of Hadoop are:

  1. HDFS (Hadoop Distributed File System) – Storage Layer
  2. MapReduce – Processing Layer
  3. YARN (Yet Another Resource Negotiator) – Resource Management Layer
  4. Hadoop Common – Shared Utilities and Libraries

1. HDFS (Hadoop Distributed File System)

 What is HDFS?

HDFS is a distributed file system designed to store very large files across multiple machines.
It provides:

  • High storage capacity
  • Fault tolerance
  • Scalability

It is built to run on commodity hardware.

 Key Features of HDFS

  • Fault Tolerance: Data is replicated across multiple nodes.
  • Scalability: More nodes can be added easily.
  • High Throughput: Optimized for large-scale data processing.
  • Flexibility: Stores structured, semi-structured, and unstructured data.

 HDFS Architecture

1. NameNode (Master Node)

The NameNode manages the file system metadata such as:

  • File names
  • Directories
  • Permissions
  • Block locations

It controls all DataNodes.

2. DataNode (Slave Node)

DataNodes store the actual data blocks.

Responsibilities:

  • Store data
  • Handle read/write operations
  • Send heartbeat signals to the NameNode

 How HDFS Works

  • Large files are divided into blocks.
  • Default block size = 128 MB
  • Each block is replicated (usually 3 copies).

This ensures data safety even if a node fails.

 Example of HDFS

Suppose you have a 1 TB video file.

HDFS will:

  • Split it into 128 MB blocks
  • Create about 8,192 blocks (1 TB ÷ 128 MB)
  • Store each block on 3 different DataNodes

So if one machine crashes, data can still be recovered from another copy.
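
The block arithmetic above can be checked with a short Python sketch (illustrative only; real HDFS performs this bookkeeping internally on the NameNode):

```python
# Back-of-the-envelope HDFS math: how many blocks and replicas a file
# needs under the default 128 MB block size and replication factor 3.

BLOCK_SIZE_MB = 128        # default HDFS block size
REPLICATION_FACTOR = 3     # default number of copies per block

def hdfs_blocks(file_size_mb: int) -> tuple[int, int]:
    """Return (number of blocks, total replicas stored)."""
    # Ceiling division: a partial final block still counts as a block.
    blocks = -(-file_size_mb // BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION_FACTOR

one_tb_mb = 1024 * 1024                  # 1 TB expressed in MB
blocks, replicas = hdfs_blocks(one_tb_mb)
print(blocks)    # 8192 blocks
print(replicas)  # 24576 block replicas across the cluster
```

Note that replication triples the raw storage needed: a 1 TB file occupies about 3 TB of cluster disk in exchange for fault tolerance.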

 Real-Life Example of HDFS

Imagine keeping 3 photocopies of an important document in different rooms.
If one room is damaged, the document is still safe in the other rooms.

2. MapReduce

 What is MapReduce?

MapReduce is a programming model used to process large datasets in parallel across a Hadoop cluster.

It works in three phases:

  1. Map Phase
  2. Shuffle and Sort Phase
  3. Reduce Phase

 Phases of MapReduce

1. Map Phase

The mapper processes input data and converts it into key-value pairs.

Example

Input sentence:

Hadoop is fast Hadoop is scalable

Mapper Output:

(Hadoop,1)
(is,1)
(fast,1)
(Hadoop,1)
(is,1)
(scalable,1)
 

2. Shuffle and Sort Phase

The system groups all values belonging to the same key together.

(Hadoop,[1,1])
(is,[1,1])
(fast,[1])
(scalable,[1])

3. Reduce Phase

The reducer combines values and produces the final result.

Final Output:

(Hadoop,2)
(is,2)
(fast,1)
(scalable,1)
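
The three phases above can be simulated in a few lines of Python (a toy in-memory sketch of the model, not actual Hadoop code, which distributes these steps across many nodes):

```python
from collections import defaultdict

sentence = "Hadoop is fast Hadoop is scalable"

# 1. Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for word in sentence.split()]

# 2. Shuffle and sort phase: group values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# 3. Reduce phase: sum the values for each key.
reduced = {key: sum(values) for key, values in grouped.items()}

print(reduced)  # {'Hadoop': 2, 'is': 2, 'fast': 1, 'scalable': 1}
```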
 

 Advantages of MapReduce

  • Parallel processing
  • Faster execution
  • Fault tolerance
  • Handles petabytes of data

 Real-Life Example of MapReduce

Imagine exam papers being checked by multiple teachers:

  • Map: Teachers check papers separately
  • Shuffle: Papers are grouped subject-wise
  • Reduce: Final marks are calculated

3. YARN (Yet Another Resource Negotiator)

 What is YARN?

YARN is the resource management framework in Hadoop.

It manages:

  • CPU usage
  • Memory allocation
  • Task scheduling

YARN allows multiple applications like MapReduce, Spark, and Hive to run together.

 Components of YARN

1. ResourceManager (Master)

Responsibilities:

  • Allocates cluster resources
  • Schedules applications
  • Monitors resource usage

2. NodeManager (Slave)

Responsibilities:

  • Manages resources on each node
  • Executes tasks
  • Reports status to ResourceManager

 Example of YARN

Suppose:

  • One user runs a Spark job
  • Another user runs a MapReduce job

YARN allocates CPU and memory resources efficiently to both applications.
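
A heavily simplified sketch of this idea: a "ResourceManager" grants an application its requested CPU and memory only if the cluster still has enough free capacity. All class names and numbers here are illustrative, not real YARN APIs:

```python
class ToyResourceManager:
    """Toy stand-in for YARN's ResourceManager (illustrative only)."""

    def __init__(self, total_vcores: int, total_memory_gb: int):
        self.free_vcores = total_vcores
        self.free_memory_gb = total_memory_gb

    def allocate(self, app: str, vcores: int, memory_gb: int) -> bool:
        """Grant resources if available; otherwise reject the request."""
        if vcores <= self.free_vcores and memory_gb <= self.free_memory_gb:
            self.free_vcores -= vcores
            self.free_memory_gb -= memory_gb
            print(f"{app}: granted {vcores} vcores, {memory_gb} GB")
            return True
        print(f"{app}: rejected (insufficient resources)")
        return False

rm = ToyResourceManager(total_vcores=16, total_memory_gb=64)
rm.allocate("spark-job", vcores=8, memory_gb=32)       # granted
rm.allocate("mapreduce-job", vcores=8, memory_gb=32)   # granted
rm.allocate("hive-query", vcores=4, memory_gb=16)      # rejected: cluster full
```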

 Real-Life Example of YARN

Think of a school principal:

  • Assigns classrooms to teachers
  • Ensures resources are properly used

4. Hadoop Common

 What is Hadoop Common?

Hadoop Common is a collection of shared libraries and utilities required by all Hadoop modules.

It provides:

  • Java libraries
  • Configuration files
  • Scripts for starting Hadoop services
  • APIs for Hadoop operations

 Features of Hadoop Common

  • Supports communication between Hadoop modules
  • Provides operating system utilities
  • Helps integrate tools like Hive, Pig, HBase, and Sqoop

 Example of Hadoop Common

Just like common system files support all applications in Windows, Hadoop Common supports all Hadoop components.

Summary Table

Component | Purpose | Example
HDFS | Stores data | File storage across nodes
MapReduce | Processes data | Word count program
YARN | Manages resources | CPU and memory allocation
Hadoop Common | Shared utilities | Libraries and APIs

Simple Real-Life Analogy

Hadoop Component | Real-Life Example
HDFS | Warehouse for storing goods
MapReduce | Workers processing tasks
YARN | Manager assigning resources
Hadoop Common | Common tools used by everyone

 

12. Hadoop Ecosystem

The Hadoop Ecosystem is a collection of open-source tools and frameworks that work together to store, process, analyze, and manage Big Data.

While the core Hadoop components (HDFS and MapReduce) handle storage and batch processing, the ecosystem adds powerful tools for:

  • Real-time processing
  • Data analytics
  • Data integration
  • Workflow automation
  • Machine learning

It can handle:

  • Structured data (tables, SQL data)
  • Semi-structured data (JSON, XML)
  • Unstructured data (logs, images, videos)

Major Components of Hadoop Ecosystem

The ecosystem is modular, meaning you can use only the tools you need.

1. HDFS (Storage Layer)

 Role

HDFS stores huge amounts of data across multiple machines.

 Function

  • Splits files into blocks
  • Stores blocks on different nodes
  • Keeps multiple copies for safety

 Example

A 1 TB video file is split into smaller blocks and stored across many machines. If one machine fails, data is still available.

2. MapReduce (Processing Layer)

 Role

Batch processing framework for large-scale data.

 Function

Processes data in parallel across cluster machines.

 Example

Word count program:

  • Input: Large text file
  • Output: Frequency of each word

Used in:

  • Log analysis
  • Clickstream analysis
  • Data summarization

3. YARN (Resource Management Layer)

 Role

Manages resources and schedules jobs in a Hadoop cluster.

 Function

  • Allocates CPU and memory
  • Schedules multiple applications
  • Manages cluster workload

 Example

Running both Spark and MapReduce jobs on the same cluster without conflict.

4. Hive

 Type

Data Warehouse tool (SQL-like system)

 Function

Provides HiveQL (SQL-like language) to query big data.

 Example

Instead of writing MapReduce code, you can write:

SELECT * FROM sales WHERE amount > 1000;
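
Conceptually, that HiveQL query is just a filter over rows. The same logic over an in-memory table looks like this in Python (the sample rows are made up for illustration; Hive would run the query over files in HDFS):

```python
# Hypothetical sample data standing in for a Hive "sales" table.
sales = [
    {"id": 1, "item": "laptop", "amount": 1200},
    {"id": 2, "item": "mouse",  "amount": 25},
    {"id": 3, "item": "phone",  "amount": 1050},
]

# Equivalent of: SELECT * FROM sales WHERE amount > 1000;
result = [row for row in sales if row["amount"] > 1000]
print(result)  # the laptop and phone rows
```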

 Use Case

  • Business reports
  • Sales analysis
  • Data summarization

5. Pig

 Type

Data processing scripting tool

 Function

Uses Pig Latin language for data transformation.

 Example

Convert raw logs into structured format.

 Use Case

  • ETL (Extract, Transform, Load) operations
  • Data cleaning
  • Data preparation

6. HBase

 Type

NoSQL database (Column-oriented)

 Function

Provides real-time read/write access to big data.

 Example

  • Social media user profiles
  • IoT sensor data
  • Banking transaction records

 Feature

Very fast for random data access.

7. Sqoop

 Type

Data integration tool

 Function

Transfers data between:

  • RDBMS (MySQL, Oracle)
  • Hadoop (HDFS, Hive)

 Example

Import customer data from MySQL into Hadoop for analysis.

8. Flume

 Type

Data ingestion tool

 Function

Collects and moves streaming data into HDFS.

 Example

  • Twitter feeds
  • Web server logs
  • Application logs

9. Oozie

 Type

Workflow scheduler

 Function

Automates Hadoop jobs.

 Example

A daily pipeline:

  1. Import data (Sqoop)
  2. Clean data (Pig)
  3. Query data (Hive)
  4. Generate report

Oozie runs all steps automatically.
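
The pipeline idea can be sketched in Python: run steps in order and stop if one fails. The step functions are stand-ins mirroring the daily pipeline above, not real Sqoop/Pig/Hive calls:

```python
def import_data():  print("Sqoop: importing data")   # stand-in for Sqoop
def clean_data():   print("Pig: cleaning data")      # stand-in for Pig
def query_data():   print("Hive: querying data")     # stand-in for Hive
def make_report():  print("Generating report")

PIPELINE = [import_data, clean_data, query_data, make_report]

def run_pipeline(steps) -> int:
    """Run each step in order; stop on failure. Return steps completed."""
    done = 0
    for step in steps:
        try:
            step()
            done += 1
        except Exception as err:
            print(f"Pipeline stopped at {step.__name__}: {err}")
            break
    return done

run_pipeline(PIPELINE)  # runs all four steps in order
```

Real Oozie adds what this sketch omits: time-based triggers, retries, and dependency tracking between workflows.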

10. Zookeeper

 Type

Coordination service

 Function

Manages:

  • Cluster synchronization
  • Configuration
  • Naming services

 Example

Used by HBase and Kafka to coordinate distributed systems.

11. Mahout

 Type

Machine Learning library

 Function

Provides scalable ML algorithms.

 Example

  • Recommendation systems (Netflix/Amazon style)
  • Customer segmentation
  • Clustering data

12. Spark

 Type

Distributed processing engine

 Function

Processes data in-memory for faster performance than MapReduce.

 Example

  • Real-time analytics
  • Machine learning tasks
  • Graph processing

13. Kafka

 Type

Streaming platform

 Function

Handles real-time data streams.

 Example

  • Live user activity tracking
  • Log streaming
  • Event-driven systems
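
The core idea behind Kafka, an append-only log that consumers read from their own offsets, can be sketched in a single-process toy. Real Kafka is distributed, persistent, and partitioned; this only illustrates the log-plus-offset model, and the names are invented for the example:

```python
class ToyTopic:
    """Toy single-process sketch of a Kafka-style topic (illustrative)."""

    def __init__(self):
        self.log = []        # append-only event log
        self.offsets = {}    # consumer name -> next index to read

    def publish(self, event: str):
        self.log.append(event)

    def consume(self, consumer: str):
        """Return unread events for this consumer and advance its offset."""
        start = self.offsets.get(consumer, 0)
        events = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return events

clicks = ToyTopic()
clicks.publish("user1:login")
clicks.publish("user1:view_page")
print(clicks.consume("analytics"))  # both events
clicks.publish("user2:login")
print(clicks.consume("analytics"))  # only the new event
```

Because each consumer tracks its own offset, two consumers can read the same events independently, one reason Kafka suits event-driven systems.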

Applications of Hadoop Ecosystem

1. E-Commerce

  • Product recommendations
  • Customer behavior analysis

2. Social Media

  • Sentiment analysis
  • Trend detection

3. Banking & Finance

  • Fraud detection
  • Risk analysis

4. Healthcare

  • Disease prediction
  • Patient data analysis

5. Telecommunications

  • Call data analysis
  • Customer churn prediction

6. Government

  • Census analysis
  • Crime and traffic monitoring

Simple Summary Table

Tool | Purpose | Example
HDFS | Storage | Distributed file storage
MapReduce | Processing | Word count
YARN | Resource management | Job scheduling
Hive | SQL querying | Sales reports
Pig | Data transformation | ETL jobs
HBase | NoSQL DB | Real-time data
Sqoop | Data transfer | MySQL → Hadoop
Flume | Data ingestion | Logs collection
Oozie | Workflow automation | Daily pipelines
Zookeeper | Coordination | Cluster sync
Mahout | Machine learning | Recommendations
Spark | Fast processing | Real-time analytics
Kafka | Streaming | Live data flow

 
