Big Data Notes
All Topics (16)
- 1. What is Big Data?
- 2. Big Data Characteristics
- 3. Types of Big Data
- 4. Traditional Data vs Big Data
- 5. Evolution of Big Data
- 6. Challenges with Big Data
- 7. Technologies Available for Big Data
- 8. Infrastructure for Big Data
- 9. Uses of Data Analytics
- 10. Hadoop
- 11. Hadoop Core Components
- 12. Hadoop Ecosystem
- 13. Hive Physical Architecture
- 14. Hadoop Limitations
- 15. RDBMS vs Hadoop
- 16. Hadoop Distributed File System (HDFS)
11. Hadoop Core Components
Hadoop is a framework used for storing and processing huge amounts of data in a distributed environment.
Its core components work together to handle big data efficiently.
The four core components of Hadoop are:
- HDFS (Hadoop Distributed File System) – Storage Layer
- MapReduce – Processing Layer
- YARN (Yet Another Resource Negotiator) – Resource Management Layer
- Hadoop Common – Shared Utilities and Libraries
1. HDFS (Hadoop Distributed File System)
What is HDFS?
HDFS is a distributed file system designed to store very large files across multiple machines.
It provides:
- High storage capacity
- Fault tolerance
- Scalability
It is built to run on commodity hardware.
Key Features of HDFS
- Fault Tolerance: Data is replicated across multiple nodes.
- Scalability: More nodes can be added easily.
- High Throughput: Optimized for large-scale data processing.
- Flexibility: Stores structured, semi-structured, and unstructured data.
HDFS Architecture
1. NameNode (Master Node)
The NameNode manages the file system metadata such as:
- File names
- Directories
- Permissions
- Block locations
It controls all DataNodes.
2. DataNode (Slave Node)
DataNodes store the actual data blocks.
Responsibilities:
- Store data
- Handle read/write operations
- Send heartbeat signals to the NameNode
How HDFS Works
- Large files are divided into blocks.
- Default block size = 128 MB (in Hadoop 2.x and later; older versions used 64 MB)
- Each block is replicated (usually 3 copies).
This ensures data safety even if a node fails.
Example of HDFS
Suppose you have a 1 TB video file.
HDFS will:
- Split it into 128 MB blocks
- Create about 8,192 blocks (1 TB ÷ 128 MB)
- Store each block on 3 different DataNodes
So if one machine crashes, data can still be recovered from another copy.
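A quick way to see this in practice is the HDFS command line. The commands below are standard hdfs utilities; the file name and paths are made up for illustration:
hdfs dfs -put video.mp4 /data/video.mp4
# fsck reports how the file was split into blocks and where the replicas live
hdfs fsck /data/video.mp4 -files -blocks -locations
# set the replication factor to 3 and wait until every block satisfies it
hdfs dfs -setrep -w 3 /data/video.mp4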
Real-Life Example of HDFS
Imagine keeping 3 photocopies of an important document in different rooms.
If one room is damaged, the document is still safe in the other rooms.
2. MapReduce
What is MapReduce?
MapReduce is a programming model used to process large datasets in parallel across a Hadoop cluster.
It works in three phases:
- Map Phase
- Shuffle and Sort Phase
- Reduce Phase
Phases of MapReduce
1. Map Phase
The mapper processes input data and converts it into key-value pairs.
Example
Input sentence:
Hadoop is fast Hadoop is scalable
Mapper Output:
(Hadoop,1)
(is,1)
(fast,1)
(Hadoop,1)
(is,1)
(scalable,1)
2. Shuffle and Sort Phase
The system groups together all values that share the same key.
(Hadoop,[1,1])
(is,[1,1])
(fast,[1])
(scalable,[1])
3. Reduce Phase
The reducer combines values and produces the final result.
Final Output:
(Hadoop,2)
(is,2)
(fast,1)
(scalable,1)
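For reference, the same word count expressed with Hadoop's Java MapReduce API is sketched below. It uses the standard Hadoop 2.x/3.x classes (Mapper, Reducer, Job); only the input and output paths passed on the command line are assumptions:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every word in the input line
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the grouped counts for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
The shuffle and sort phase between the two classes is handled by the framework itself; the programmer only writes the Mapper and Reducer.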
Advantages of MapReduce
- Parallel processing
- Faster execution
- Fault tolerance
- Handles petabytes of data
Real-Life Example of MapReduce
Imagine exam papers being checked by multiple teachers:
- Map: Teachers check papers separately
- Shuffle: Papers are grouped subject-wise
- Reduce: Final marks are calculated
3. YARN (Yet Another Resource Negotiator)
What is YARN?
YARN is the resource management framework in Hadoop.
It manages:
- CPU usage
- Memory allocation
- Task scheduling
YARN allows multiple applications like MapReduce, Spark, and Hive to run together.
Components of YARN
1. ResourceManager (Master)
Responsibilities:
- Allocates cluster resources
- Schedules applications
- Monitors resource usage
2. NodeManager (Slave)
Responsibilities:
- Manages resources on each node
- Executes tasks
- Reports status to ResourceManager
Example of YARN
Suppose:
- One user runs a Spark job
- Another user runs a MapReduce job
YARN allocates CPU and memory resources efficiently to both applications.
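How much CPU and memory each node offers to YARN is set in yarn-site.xml. A minimal sketch follows; the property names are the standard YARN settings, while the values are illustrative assumptions:
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value> <!-- RAM (MB) this node offers to containers -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value> <!-- virtual cores this node offers -->
  </property>
</configuration>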
Real-Life Example of YARN
Think of a school principal:
- Assigns classrooms to teachers
- Ensures resources are properly used
4. Hadoop Common
What is Hadoop Common?
Hadoop Common is a collection of shared libraries and utilities required by all Hadoop modules.
It provides:
- Java libraries
- Configuration files
- Scripts for starting Hadoop services
- APIs for Hadoop operations
Features of Hadoop Common
- Supports communication between Hadoop modules
- Provides operating system utilities
- Helps integrate tools like Hive, Pig, HBase, and Sqoop
Example of Hadoop Common
Just like common system files support all applications in Windows, Hadoop Common supports all Hadoop components.
Summary Table
| Component | Purpose | Example |
|---|---|---|
| HDFS | Stores data | File storage across nodes |
| MapReduce | Processes data | Word count program |
| YARN | Manages resources | CPU and memory allocation |
| Hadoop Common | Shared utilities | Libraries and APIs |
Simple Real-Life Analogy
| Hadoop Component | Real-Life Example |
|---|---|
| HDFS | Warehouse for storing goods |
| MapReduce | Workers processing tasks |
| YARN | Manager assigning resources |
| Hadoop Common | Common tools used by everyone |
12. Hadoop Ecosystem
The Hadoop Ecosystem is a collection of open-source tools and frameworks that work together to store, process, analyze, and manage Big Data.
While the core Hadoop components (HDFS and MapReduce) handle storage and batch processing, the ecosystem adds powerful tools for:
- Real-time processing
- Data analytics
- Data integration
- Workflow automation
- Machine learning
It can handle:
- Structured data (tables, SQL data)
- Semi-structured data (JSON, XML)
- Unstructured data (logs, images, videos)
Major Components of Hadoop Ecosystem
The ecosystem is modular, meaning you can use only the tools you need.
1. HDFS (Storage Layer)
Role
HDFS stores huge amounts of data across multiple machines.
Function
- Splits files into blocks
- Stores blocks on different nodes
- Keeps multiple copies for safety
Example
A 1 TB video file is split into smaller blocks and stored across many machines. If one machine fails, data is still available.
2. MapReduce (Processing Layer)
Role
Batch processing framework for large-scale data.
Function
Processes data in parallel across cluster machines.
Example
Word count program:
- Input: Large text file
- Output: Frequency of each word
Used in:
- Log analysis
- Clickstream analysis
- Data summarization
3. YARN (Resource Management Layer)
Role
Manages resources and schedules jobs in Hadoop cluster.
Function
- Allocates CPU and memory
- Schedules multiple applications
- Manages cluster workload
Example
Running both Spark and MapReduce jobs on the same cluster without conflict.
4. Hive
Type
Data Warehouse tool (SQL-like system)
Function
Provides HiveQL (SQL-like language) to query big data.
Example
Instead of writing MapReduce code, you can write:
SELECT * FROM sales WHERE amount > 1000;
Use Case
- Business reports
- Sales analysis
- Data summarization
5. Pig
Type
Data processing scripting tool
Function
Uses Pig Latin language for data transformation.
Example
Convert raw logs into structured format.
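A sketch of such a transformation in Pig Latin (the input path and field layout are assumptions):
-- load tab-separated raw logs and describe their fields
logs   = LOAD '/logs/raw' USING PigStorage('\t')
         AS (ip:chararray, ts:chararray, url:chararray, status:int);
-- keep only server-error requests
errors = FILTER logs BY status >= 500;
-- count errors per URL and write the structured result back to HDFS
by_url = GROUP errors BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(errors) AS hits;
STORE counts INTO '/logs/error_counts';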
Use Case
- ETL (Extract, Transform, Load) operations
- Data cleaning
- Data preparation
6. HBase
Type
NoSQL database (Column-oriented)
Function
Provides real-time read/write access to big data.
Example
- Social media user profiles
- IoT sensor data
- Banking transaction records
Feature
Very fast for random data access.
7. Sqoop
Type
Data integration tool
Function
Transfers data between:
- RDBMS (MySQL, Oracle)
- Hadoop (HDFS, Hive)
Example
Import customer data from MySQL into Hadoop for analysis.
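A sketch of such an import from the command line; the host, database, and paths are made up for illustration, while the flags are standard Sqoop options:
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username analyst \
  --password-file /user/analyst/.db.pass \
  --table customers \
  --target-dir /data/customers \
  --num-mappers 4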
8. Flume
Type
Data ingestion tool
Function
Collects and moves streaming data into HDFS.
Example
- Twitter feeds
- Web server logs
- Application logs
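Flume agents are wired together in a properties file. A minimal sketch that tails a web server log into HDFS; the agent name, log path, and HDFS path are assumptions:
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: follow the web server log as it grows
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access.log
a1.sources.r1.channels = c1

# channel: buffer events in memory between source and sink
a1.channels.c1.type = memory

# sink: write the events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/weblogs
a1.sinks.k1.channel = c1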
9. Oozie
Type
Workflow scheduler
Function
Automates Hadoop jobs.
Example
A daily pipeline:
- Import data (Sqoop)
- Clean data (Pig)
- Query data (Hive)
- Generate report
Oozie runs all steps automatically.
10. Zookeeper
Type
Coordination service
Function
Manages:
- Cluster synchronization
- Configuration
- Naming services
Example
Used by HBase and Kafka to coordinate distributed systems.
11. Mahout
Type
Machine Learning library
Function
Provides scalable ML algorithms.
Example
- Recommendation systems (Netflix/Amazon style)
- Customer segmentation
- Clustering data
12. Spark
Type
Distributed processing engine
Function
Processes data in-memory for faster performance than MapReduce.
Example
- Real-time analytics
- Machine learning tasks
- Graph processing
13. Kafka
Type
Streaming platform
Function
Handles real-time data streams.
Example
- Live user activity tracking
- Log streaming
- Event-driven systems
Applications of Hadoop Ecosystem
1. E-Commerce
- Product recommendations
- Customer behavior analysis
2. Social Media
- Sentiment analysis
- Trend detection
3. Banking & Finance
- Fraud detection
- Risk analysis
4. Healthcare
- Disease prediction
- Patient data analysis
5. Telecommunications
- Call data analysis
- Customer churn prediction
6. Government
- Census analysis
- Crime and traffic monitoring
Simple Summary Table
| Tool | Purpose | Example |
|---|---|---|
| HDFS | Storage | Distributed file storage |
| MapReduce | Processing | Word count |
| YARN | Resource management | Job scheduling |
| Hive | SQL querying | Sales reports |
| Pig | Data transformation | ETL jobs |
| HBase | NoSQL DB | Real-time data |
| Sqoop | Data transfer | MySQL → Hadoop |
| Flume | Data ingestion | Logs collection |
| Oozie | Workflow automation | Daily pipelines |
| Zookeeper | Coordination | Cluster sync |
| Mahout | Machine learning | Recommendations |
| Spark | Fast processing | Real-time analytics |
| Kafka | Streaming | Live data flow |
13. Hive Physical Architecture
Apache Hive is a data warehouse system built on top of Apache Hadoop.
It allows users to write SQL-like queries called HiveQL to analyze huge datasets stored in HDFS (Hadoop Distributed File System).
Instead of writing complex Java MapReduce programs, users can simply write SQL queries.
Hive converts these queries into:
- MapReduce jobs
- Tez jobs
- Spark jobs
which are executed on the Hadoop cluster.
Physical Architecture of Hive
The Physical Architecture explains how different Hive components work together to process a query and interact with Hadoop storage.
Main Components of Hive Physical Architecture
- Hive Clients / User Interface
- Driver
- Compiler
- Optimizer
- Execution Engine
- Metastore
- HDFS / Hadoop Storage
1. Hive Clients / User Interface (UI)
Role
This is the entry point where users interact with Hive.
Users write HiveQL queries using different interfaces.
Types of Hive Clients
1. Command Line Interface (CLI)
Users execute Hive commands directly in the terminal.
Example:
SELECT * FROM sales_data;
2. Web UI
Browser-based tools such as:
- Hue
- Ambari
allow users to run Hive queries visually.
3. JDBC / ODBC Clients
Applications connect to Hive using standard database drivers.
Example:
- Java application using JDBC
- BI tools like Tableau or Power BI
Function
The client sends the HiveQL query to the Hive Driver.
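For instance, a small Java program can submit HiveQL through HiveServer2's JDBC driver. This is a sketch: the host name, credentials, and table are assumptions, while the jdbc:hive2:// URL scheme and driver class are the standard ones:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 driver (needs hive-jdbc on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hive-host:10000/default", "analyst", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM sales_data LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}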
2. Hive Driver
Role
The Driver acts like the main controller of Hive.
It manages the entire lifecycle of a query.
Functions of Driver
- Receives query from client
- Creates session
- Maintains execution context
- Sends query to compiler
- Tracks execution progress
- Returns final result to user
Example
User writes:
SELECT product, SUM(price)
FROM sales_data
GROUP BY product;
The Driver receives this query and starts processing it.
3. Compiler
Role
The Compiler converts HiveQL into an execution plan.
Steps Performed by Compiler
Step 1: Parsing
The query is converted into an Abstract Syntax Tree (AST).
Example:
SELECT * FROM sales_data;
Hive checks:
- SQL syntax
- keywords
- structure
Step 2: Semantic Analysis
Hive verifies:
- Table exists or not
- Column names are correct
- Data types are valid
Example:
If table sales_data does not exist, Hive throws an error.
Step 3: Logical Plan Generation
Hive creates a logical workflow of operations.
Example operations:
- Scan table
- Filter rows
- Group data
- Join tables
This logical plan is represented as a DAG (Directed Acyclic Graph).
4. Optimizer
Role
The Optimizer improves query performance.
It converts the logical plan into the most efficient execution plan.
Types of Optimization
1. Rule-Based Optimization
Hive applies predefined rules.
Example:
Filter conditions are pushed close to the data source (predicate pushdown).
For a query such as:
SELECT * FROM sales_data
WHERE price > 1000;
Hive applies the filter while scanning, so only the required rows are read instead of the whole table.
This reduces I/O operations.
2. Cost-Based Optimization (CBO)
Hive calculates execution cost and chooses the best strategy.
Example:
For joining two tables:
- Which table should be processed first?
- Which join algorithm is faster?
Output
The optimizer generates a physical execution plan.
This plan may use:
- MapReduce
- Tez
- Spark
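You can inspect the plan the optimizer produced with the standard EXPLAIN statement, shown here on the running example query:
EXPLAIN
SELECT product, SUM(price)
FROM sales_data
GROUP BY product;
The output lists the stages (table scans, aggregations, and the MapReduce/Tez/Spark tasks) that the Execution Engine will run.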
5. Execution Engine
Role
The Execution Engine actually runs the query.
Functions
- Divides query into tasks
- Submits jobs to YARN
- Monitors execution
- Collects results
Interaction with Hadoop
The Execution Engine:
- Reads data from HDFS
- Processes data
- Writes output back to HDFS
Example
Suppose query:
SELECT COUNT(*) FROM sales_data;
Execution Engine:
- Creates MapReduce tasks
- Sends them to Hadoop cluster
- Each node processes data blocks
- Final count is returned
6. Metastore
Role
The Metastore stores metadata about Hive tables.
Metadata means:
- Table names
- Columns
- Data types
- Partition info
- HDFS file locations
Types of Metastore
1. Embedded Metastore
Uses an embedded Derby database.
Suitable for:
- Single user
- Testing
2. Standalone Metastore
Uses:
- MySQL
- PostgreSQL
Suitable for:
- Multiple users
- Production systems
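A standalone metastore is configured in hive-site.xml. A minimal sketch pointing Hive at a MySQL database; the host and database names are illustrative, while the javax.jdo.option.* keys are the standard settings:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>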
Example Metadata
Table:
sales_data
Columns:
id, product, price, date
HDFS Location:
/user/hive/warehouse/sales_data
Why Metastore is Important
Whenever a query runs, Hive first checks metadata.
Without metastore:
- Hive cannot locate table data
- Query execution becomes impossible
7. HDFS / Hadoop Storage
Role
HDFS stores the actual data files.
Hive tables are physically stored inside Hadoop Distributed File System.
Supported File Formats
Hive supports many formats:
- Text File
- ORC
- Parquet
- Avro
- RCFile
Example
Table:
sales_data
Stored in:
/user/hive/warehouse/sales_data
inside HDFS.
Complete Workflow of Hive Physical Architecture
Step-by-Step Flow
Step 1: User Submits Query
Example:
SELECT product, SUM(price)
FROM sales_data
GROUP BY product;
via:
- CLI
- JDBC
- Web UI
Step 2: Driver Receives Query
Driver:
- Creates session
- Sends query to compiler
Step 3: Compiler Processes Query
Compiler:
- Parses query
- Performs semantic analysis
- Creates logical plan
Step 4: Optimizer Improves Plan
Optimizer:
- Reduces unnecessary operations
- Chooses efficient execution strategy
Step 5: Execution Engine Executes Tasks
Execution Engine:
- Converts plan into MapReduce/Tez/Spark jobs
- Sends tasks to YARN
Step 6: HDFS Data Processing
Tasks:
- Read data from HDFS
- Process records
- Store intermediate/final results
Step 7: Results Returned
Final output is sent back to user interface.
Example Output:
Laptop 50000
Mobile 30000
TV 20000
Real-Life Example of Hive Architecture
Suppose an e-commerce company stores 10 TB sales data in Hadoop.
A data analyst wants to know:
SELECT city, SUM(revenue)
FROM orders
GROUP BY city;
What Happens Internally?
1. User submits query
via Hive CLI.
2. Driver receives query
Creates execution environment.
3. Compiler checks:
- Does table orders exist?
- Does column revenue exist?
4. Optimizer:
- Uses partition pruning
- Reduces unnecessary scans
5. Execution Engine:
- Creates Spark/MapReduce jobs
- Sends jobs to cluster
6. Hadoop Nodes:
- Process different blocks in parallel
7. Result returned:
Delhi 5,00,000
Mumbai 8,00,000
Bhopal 2,00,000
Advantages of Hive Physical Architecture
1. Scalability
Can process petabytes of data.
2. SQL Support
Easy for SQL users.
3. Distributed Processing
Uses Hadoop cluster for parallel execution.
4. Fault Tolerance
HDFS automatically handles failures.
5. Multiple Execution Engines
Supports:
- MapReduce
- Tez
- Spark
14. Hadoop Limitations
Apache Hadoop is a powerful framework used for storing and processing huge amounts of data across distributed systems.
It is highly useful for:
- Big Data analytics
- Distributed storage
- Batch data processing
However, Hadoop also has several limitations.
Understanding these limitations helps organizations decide when Hadoop is the right choice and when other technologies may perform better.
Limitations of Hadoop
1. Not Suitable for Small Data
Explanation
Hadoop is designed for processing very large datasets such as:
- Terabytes (TB)
- Petabytes (PB)
For small datasets, Hadoop becomes inefficient because:
- Starting MapReduce jobs takes time
- Cluster communication creates overhead
- Task scheduling adds delay
Example
Suppose you want to process:
500 MB sales data
Using Hadoop may take more time than:
- MySQL
- PostgreSQL
because Hadoop first:
- Divides tasks
- Allocates cluster resources
- Starts MapReduce jobs
This overhead is unnecessary for small data.
Conclusion
Traditional databases are faster for small-scale processing.
2. Complex Programming
Explanation
Native Hadoop programming uses:
- Java
- MapReduce model
Writing MapReduce code is difficult for beginners.
Developers must:
- Write Mapper functions
- Write Reducer functions
- Handle key-value pairs
- Debug distributed jobs
Example
A simple word count program in MapReduce may require:
- Multiple Java classes
- Configuration setup
- JAR file creation
while the same task in SQL needs only:
SELECT word, COUNT(*)
FROM documents
GROUP BY word;
Problem
Debugging distributed systems is more complex than debugging traditional applications.
Solution
Tools like:
- Apache Hive
- Apache Pig
simplify Hadoop programming.
But internally they still generate MapReduce jobs.
3. High Latency / Batch Processing Only
Explanation
Hadoop is mainly designed for:
- Batch processing
- Long-running analytics
It is not suitable for:
- Real-time systems
- Instant query processing
- Fast transactions
Example
Suppose a banking application needs:
instant account balance updates
Hadoop cannot provide millisecond-level response.
Why?
Because:
- HDFS is optimized for large sequential reads
- MapReduce jobs take time to initialize
Real-Time Alternatives
For low-latency processing:
- Apache Spark
- Apache Flink
- Apache Storm
are better options.
4. Data Security Limitations
Explanation
Early Hadoop versions had weak security features.
Problems included:
- No strong authentication
- Weak authorization
- Limited encryption
Modern Hadoop Security Improvements
Newer Hadoop versions support:
- Kerberos authentication
- Encryption
- Access control systems
Tools used:
- Apache Ranger
- Apache Sentry
Example
In a healthcare system:
- patient records must be secure
- unauthorized access must be blocked
Configuring Hadoop security for such environments becomes complicated.
Limitation
Strong security exists, but configuration and management are complex.
5. Inefficient for Iterative Processing
Explanation
Machine learning and graph algorithms require:
- repeated processing of same data
- multiple iterations
MapReduce writes intermediate results to HDFS after every step.
This causes:
- heavy disk I/O
- slower performance
Example
Machine Learning Algorithm:
K-Means Clustering
requires repeated iterations.
In Hadoop:
- Data is read from HDFS
- Processed
- Written back to HDFS
- Re-read again
This becomes slow.
Better Alternative
Apache Spark processes data in memory and is much faster for iterative workloads.
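The sketch below contrasts the two styles using Spark's Java API: the dataset is loaded from HDFS once, cached in memory, and then scanned repeatedly by a toy iterative update. The file path and update rule are illustrative assumptions:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("iterative-sketch"));
    // Read once from HDFS, then keep the records in cluster memory
    JavaRDD<Double> points = sc.textFile("/data/points.txt")
                               .map(Double::parseDouble)
                               .cache();
    long n = points.count();          // materializes the cache
    double center = 0.0;
    for (int i = 0; i < 10; i++) {
      final double c = center;
      // Each pass scans the cached data; no HDFS round-trip per iteration
      double shift = points.map(p -> p - c).reduce(Double::sum) / n;
      center = c + 0.5 * shift;       // toy update rule
    }
    System.out.println("center = " + center);
    sc.stop();
  }
}
In pure MapReduce, each of those ten passes would be a separate job that reads its input from HDFS and writes its output back to HDFS.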
6. Limited SQL Support
Explanation
Hive provides SQL-like querying using HiveQL.
But compared to traditional RDBMS:
- joins can be slow
- subqueries are less efficient
- transactions are limited
Example
Complex SQL query:
SELECT *
FROM orders o
JOIN customers c
ON o.customer_id = c.id;
may execute slowly on large Hadoop clusters.
Problem
Hadoop is not designed for:
- OLTP systems
- high-frequency transactions
- real-time updates
Conclusion
Traditional databases perform better for transactional applications.
7. Difficulty in Handling Small Files
Explanation
HDFS works efficiently with large files.
Many small files create problems because:
- each file metadata is stored in NameNode memory
- NameNode memory gets overloaded
Example
Suppose there are:
10 million files of 1 KB each
The NameNode may run out of memory due to excessive metadata storage.
Result
- Reduced performance
- Slower processing
- Lower throughput
Solutions
Combine small files using:
- SequenceFiles
- HAR (Hadoop Archive) files
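For example, Hadoop's built-in archive tool can pack a whole directory of small files into one HAR file (the paths are illustrative):
# packs /user/raw/logs into /user/archived/logs.har
hadoop archive -archiveName logs.har -p /user/raw logs /user/archived
The archived files remain readable through the har:// URI scheme, while the NameNode now tracks a few archive files instead of millions of tiny ones.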
8. Requires Skilled Workforce
Explanation
Managing Hadoop clusters is difficult.
Administrators must understand:
- distributed systems
- cluster management
- YARN
- HDFS
- security configuration
- node failure handling
Example
If one node fails:
- data recovery
- replication management
- workload balancing
must be handled correctly.
Limitation
Organizations need experienced Hadoop engineers, which increases operational cost.
9. High Memory Usage
Explanation
Hadoop ecosystem tools consume large amounts of RAM.
Components requiring memory:
- MapReduce
- YARN
- Spark
- HBase
Improper memory allocation can cause:
- task failure
- slow execution
- node crashes
Example
A Spark job processing huge datasets may require:
64 GB or more RAM
per node.
Limitation
High hardware requirements increase infrastructure cost.
10. Lack of Standardized Ecosystem
Explanation
The Hadoop ecosystem contains many independent tools.
Examples:
- Hive
- Pig
- HBase
- Spark
- Kafka
- Flume
- Oozie
Integrating these tools can be difficult.
Example
Different tools may have:
- dependency conflicts
- version incompatibility
- configuration mismatch
Result
Setup and maintenance become complex.
Summary Table of Hadoop Limitations
| Limitation | Description |
|---|---|
| Not suitable for small data | Hadoop overhead makes small data processing slow |
| Complex programming | MapReduce coding is difficult |
| High latency | Not suitable for real-time systems |
| Security complexity | Advanced security setup is difficult |
| Poor iterative processing | Repeated HDFS reads/writes slow ML tasks |
| Limited SQL support | Slow joins and weak OLTP support |
| Small file problem | Millions of small files overload NameNode |
| Skilled workforce needed | Cluster management is complex |
| High memory usage | Requires large RAM resources |
| Fragmented ecosystem | Tool integration is difficult |
Real-Life Scenario
Suppose an e-commerce company uses Hadoop to analyze:
500 TB customer clickstream data
Hadoop works well for:
- daily reports
- trend analysis
- batch analytics
But Hadoop struggles with:
- instant product recommendations
- real-time fraud detection
- live transactions
For these tasks, companies often use:
- Spark
- Kafka
- Flink
- NoSQL databases
along with Hadoop.
15. RDBMS vs Hadoop
An RDBMS (Relational Database Management System) is used to store and manage structured data in tables.
Examples:
- MySQL
- Oracle Database
- PostgreSQL
- Microsoft SQL Server
Apache Hadoop is a distributed framework designed for storing and processing huge amounts of Big Data across clusters.
Both are used for data management, but they differ greatly in:
- architecture
- scalability
- performance
- data handling
- use cases
RDBMS vs Hadoop Comparison Table
| Feature | RDBMS | Hadoop |
|---|---|---|
| Data Type | Structured data only | Structured, semi-structured, unstructured |
| Schema | Schema-on-write | Schema-on-read |
| Storage | Single server or limited clusters | Distributed storage using HDFS |
| Scalability | Vertical scaling | Horizontal scaling |
| Processing | OLTP and some OLAP | Batch and Big Data processing |
| Fault Tolerance | Backup and replication | Built-in replication in HDFS |
| Cost | Expensive hardware and licenses | Low-cost commodity hardware |
| Performance | Fast for small-medium data | High throughput for massive data |
| Data Volume | GB to low TB | TB to PB |
| Query Language | SQL | HiveQL, Pig Latin, Spark SQL |
| Latency | Low latency | Higher latency |
| Consistency | Full ACID support | Eventual consistency |
| Maintenance | Easier | Complex cluster management |
| Examples | Oracle, MySQL | Hadoop, Hive, Spark |
1. Data Type
RDBMS
RDBMS handles only structured data.
Data is stored in:
- rows
- columns
- tables
Example
| ID | Name | Salary |
|---|---|---|
| 1 | Rahul | 50000 |
Hadoop
Hadoop can handle:
- structured data
- semi-structured data
- unstructured data
Examples:
- text files
- logs
- images
- videos
- social media posts
Example
Hadoop can store:
Facebook posts + images + videos + chat logs
while RDBMS cannot efficiently handle such diverse data.
2. Schema
RDBMS → Schema-on-Write
Schema must be defined before inserting data.
Example:
CREATE TABLE employee(
id INT,
name VARCHAR(50),
salary FLOAT
);
Data must match the schema.
Hadoop → Schema-on-Read
Data can be stored in raw format.
Schema is applied only during analysis.
Example
Raw JSON logs:
{
"user":"Rahul",
"action":"login"
}
can be stored directly in Hadoop and analyzed later.
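As a sketch of schema-on-read in HiveQL: a table definition is layered over the raw JSON only when analysis begins. The JsonSerDe class ships with Hive's HCatalog module; the HDFS location is an assumption:
CREATE EXTERNAL TABLE login_events (`user` STRING, action STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/raw/logins';

SELECT action, COUNT(*) AS total
FROM login_events
GROUP BY action;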
3. Storage
RDBMS
Data is usually stored:
- on a single server
- or limited database clusters
Suitable for:
- GBs
- low TBs
Hadoop
Uses:
- HDFS
Data is distributed across multiple nodes.
Can store:
- terabytes
- petabytes
Example
A company storing:
500 TB web logs
would prefer Hadoop over RDBMS.
4. Scalability
RDBMS → Vertical Scaling
Increase:
- CPU
- RAM
- storage
on a single machine.
This becomes expensive.
Hadoop → Horizontal Scaling
Add more nodes to cluster.
Example:
- Add 10 more commodity servers
This is cheaper and easier.
5. Processing Type
RDBMS
Best for:
- transactional processing (OLTP)
- real-time queries
Examples:
- banking
- ATM transactions
- ERP systems
Hadoop
Best for:
- batch processing
- analytics
- big data computation
Uses:
- MapReduce
- Spark
- Flink
Example
Analyzing:
10 years of customer purchase history
is ideal for Hadoop.
6. Fault Tolerance
RDBMS
Uses:
- backups
- replication
for recovery.
Failure recovery may be expensive.
Hadoop
Provides built-in fault tolerance.
HDFS automatically replicates data blocks.
Usually:
3 copies of data
are stored on different nodes.
Example
If one node crashes:
- Hadoop retrieves data from another node automatically.
7. Cost
RDBMS
Requires:
- high-end servers
- licensed software
Examples:
- Oracle licensing can be expensive.
Hadoop
Uses:
- open-source software
- commodity hardware
which reduces infrastructure cost.
8. Performance
RDBMS
Very fast for:
- small datasets
- indexed queries
- transactions
Hadoop
Optimized for:
- large-scale parallel processing
Not ideal for quick single-record lookups.
Example
- Finding one customer record → faster in MySQL
- Processing 100 TB of clickstream data → faster in Hadoop
9. Data Volume Handling
RDBMS
Efficient for:
- GBs
- small TBs
Hadoop
Efficient for:
- TBs
- PBs
10. Query Language
RDBMS
Uses standardized SQL.
Example:
SELECT * FROM employee;
Hadoop
Uses:
- HiveQL
- Pig Latin
- Spark SQL
Native MapReduce requires coding.
11. Latency
RDBMS
Provides:
- low latency
- fast response
Suitable for real-time applications.
Hadoop
Traditionally batch-oriented.
MapReduce jobs take time to start.
Improvement
Tools like:
- Apache Spark
- Hive on Tez
reduce latency.
12. Consistency
RDBMS
Fully ACID compliant.
ACID means:
- Atomicity
- Consistency
- Isolation
- Durability
Hadoop
Default Hadoop is not fully ACID compliant.
Some components like:
- Apache HBase
provide better consistency support.
13. Maintenance
RDBMS
Managed by DBAs.
Maintenance is relatively easier.
Hadoop
Requires:
- Hadoop administrators
- cluster management skills
- HDFS knowledge
- YARN configuration expertise
Key Differences Explained
1. Data Handling
RDBMS
Only structured tables.
Hadoop
Handles all data types.
2. Schema
RDBMS
Fixed schema before insertion.
Hadoop
Flexible schema during reading.
3. Scalability
RDBMS
Scale UP (bigger machine).
Hadoop
Scale OUT (more machines).
4. Fault Tolerance
RDBMS
Uses manual backup systems.
Hadoop
Automatic replication.
5. Cost
RDBMS
Expensive.
Hadoop
Cost-effective.
6. Processing
RDBMS
Transactional systems.
Hadoop
Big data analytics.
7. Latency
RDBMS
Real-time.
Hadoop
Mostly batch processing.
8. Maintenance
RDBMS
Simpler administration.
Hadoop
Complex ecosystem management.
Use Case Comparison
| Use Case | RDBMS | Hadoop |
|---|---|---|
| Banking transactions | ✅ | ❌ |
| Inventory management | ✅ | ❌ |
| Social media analytics | ❌ | ✅ |
| Web clickstream analysis | ❌ | ✅ |
| Fraud detection (batch) | ❌ | ✅ |
| IoT sensor data | ❌ | ✅ |
Real-Life Example
Banking System
A bank needs:
- instant transactions
- account updates
- ACID compliance
Best Choice:
RDBMS
because transactions must be real-time and consistent.
Social Media Company
A social media platform stores:
- photos
- videos
- billions of user logs
Best Choice:
Hadoop
because it handles massive unstructured data efficiently.
Advantages of RDBMS
- Fast transactions
- Strong ACID properties
- Low latency
- Easy querying with SQL
Advantages of Hadoop
- Massive scalability
- Handles all data types
- Cost-effective
- Distributed processing
- Fault tolerant
Use RDBMS when:
- data is structured
- transactions are frequent
- real-time processing is required
Use Hadoop when:
- data is huge
- data is unstructured
- large-scale analytics is needed
In modern systems, companies often use both together:
- RDBMS for transactions
- Hadoop for analytics and Big Data processing.