Big Data Notes
All Topics (16)
- 1. What is Big Data?
- 2. Big Data Characteristics
- 3. Types of Big Data
- 4. Traditional Data vs Big Data
- 5. Evolution of Big Data
- 6. Challenges with Big Data
- 7. Technologies Available for Big Data
- 8. Infrastructure for Big Data
- 9. Uses of Data Analytics
- 10. Hadoop
- 11. Hadoop Core Components
- 12. Hadoop Ecosystem
- 13. Hive Physical Architecture
- 14. Hadoop Limitations
- 15. RDBMS vs Hadoop
- 16. Hadoop Distributed File System (HDFS)
11. Hadoop Core Components
Hadoop is a framework used for storing and processing huge amounts of data in a distributed environment.
Its core components work together to handle big data efficiently.
The four core components of Hadoop are:
- HDFS (Hadoop Distributed File System) – Storage Layer
- MapReduce – Processing Layer
- YARN (Yet Another Resource Negotiator) – Resource Management Layer
- Hadoop Common – Shared Utilities and Libraries
1. HDFS (Hadoop Distributed File System)
What is HDFS?
HDFS is a distributed file system designed to store very large files across multiple machines.
It provides:
- High storage capacity
- Fault tolerance
- Scalability
It is built to run on commodity hardware.
Key Features of HDFS
- Fault Tolerance: Data is replicated across multiple nodes.
- Scalability: More nodes can be added easily.
- High Throughput: Optimized for large-scale data processing.
- Flexibility: Stores structured, semi-structured, and unstructured data.
HDFS Architecture
1. NameNode (Master Node)
The NameNode manages the file system metadata such as:
- File names
- Directories
- Permissions
- Block locations
It controls all DataNodes.
2. DataNode (Slave Node)
DataNodes store the actual data blocks.
Responsibilities:
- Store data
- Handle read/write operations
- Send heartbeat signals to the NameNode
How HDFS Works
- Large files are divided into blocks.
- Default block size = 128 MB (in Hadoop 2.x and later; older versions used 64 MB)
- Each block is replicated (usually 3 copies).
This ensures data safety even if a node fails.
Example of HDFS
Suppose you have a 1 TB video file.
HDFS will:
- Split it into 128 MB blocks
- Create about 8,192 blocks (1 TB ÷ 128 MB)
- Store each block on 3 different DataNodes
So if one machine crashes, data can still be recovered from another copy.
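A quick way to see this in practice is the HDFS command line. The commands below are standard hdfs utilities; the file name and paths are made up for illustration:
hdfs dfs -put video.mp4 /data/video.mp4
# fsck reports how the file was split into blocks and where the replicas live
hdfs fsck /data/video.mp4 -files -blocks -locations
# set the replication factor to 3 and wait until every block satisfies it
hdfs dfs -setrep -w 3 /data/video.mp4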
Real-Life Example of HDFS
Imagine keeping 3 photocopies of an important document in different rooms.
If one room is damaged, the document is still safe in the other rooms.
2. MapReduce
What is MapReduce?
MapReduce is a programming model used to process large datasets in parallel across a Hadoop cluster.
It works in three phases:
- Map Phase
- Shuffle and Sort Phase
- Reduce Phase
Phases of MapReduce
1. Map Phase
The mapper processes input data and converts it into key-value pairs.
Example
Input sentence:
Hadoop is fast Hadoop is scalable
Mapper Output:
(Hadoop,1)
(is,1)
(fast,1)
(Hadoop,1)
(is,1)
(scalable,1)
2. Shuffle and Sort Phase
The system groups together all values that share the same key.
(Hadoop,[1,1])
(is,[1,1])
(fast,[1])
(scalable,[1])
3. Reduce Phase
The reducer combines values and produces the final result.
Final Output:
(Hadoop,2)
(is,2)
(fast,1)
(scalable,1)
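For reference, the same word count expressed with Hadoop's Java MapReduce API is sketched below. It uses the standard Hadoop 2.x/3.x classes (Mapper, Reducer, Job); only the input and output paths passed on the command line are assumptions:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every word in the input line
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the grouped counts for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
The shuffle and sort phase between the two classes is handled by the framework itself; the programmer only writes the Mapper and Reducer.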
Advantages of MapReduce
- Parallel processing
- Faster execution
- Fault tolerance
- Handles petabytes of data
Real-Life Example of MapReduce
Imagine exam papers being checked by multiple teachers:
- Map: Teachers check papers separately
- Shuffle: Papers are grouped subject-wise
- Reduce: Final marks are calculated
3. YARN (Yet Another Resource Negotiator)
What is YARN?
YARN is the resource management framework in Hadoop.
It manages:
- CPU usage
- Memory allocation
- Task scheduling
YARN allows multiple applications like MapReduce, Spark, and Hive to run together.
Components of YARN
1. ResourceManager (Master)
Responsibilities:
- Allocates cluster resources
- Schedules applications
- Monitors resource usage
2. NodeManager (Slave)
Responsibilities:
- Manages resources on each node
- Executes tasks
- Reports status to ResourceManager
Example of YARN
Suppose:
- One user runs a Spark job
- Another user runs a MapReduce job
YARN allocates CPU and memory resources efficiently to both applications.
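How much CPU and memory each node offers to YARN is set in yarn-site.xml. A minimal sketch follows; the property names are the standard YARN settings, while the values are illustrative assumptions:
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value> <!-- RAM (MB) this node offers to containers -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value> <!-- virtual cores this node offers -->
  </property>
</configuration>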
Real-Life Example of YARN
Think of a school principal:
- Assigns classrooms to teachers
- Ensures resources are properly used
4. Hadoop Common
What is Hadoop Common?
Hadoop Common is a collection of shared libraries and utilities required by all Hadoop modules.
It provides:
- Java libraries
- Configuration files
- Scripts for starting Hadoop services
- APIs for Hadoop operations
Features of Hadoop Common
- Supports communication between Hadoop modules
- Provides operating system utilities
- Helps integrate tools like Hive, Pig, HBase, and Sqoop
Example of Hadoop Common
Just like common system files support all applications in Windows, Hadoop Common supports all Hadoop components.
Summary Table
| Component | Purpose | Example |
|---|---|---|
| HDFS | Stores data | File storage across nodes |
| MapReduce | Processes data | Word count program |
| YARN | Manages resources | CPU and memory allocation |
| Hadoop Common | Shared utilities | Libraries and APIs |
Simple Real-Life Analogy
| Hadoop Component | Real-Life Example |
|---|---|
| HDFS | Warehouse for storing goods |
| MapReduce | Workers processing tasks |
| YARN | Manager assigning resources |
| Hadoop Common | Common tools used by everyone |
12. Hadoop Ecosystem
The Hadoop Ecosystem is a collection of open-source tools and frameworks that work together to store, process, analyze, and manage Big Data.
While the core Hadoop components (HDFS and MapReduce) handle storage and batch processing, the ecosystem adds powerful tools for:
- Real-time processing
- Data analytics
- Data integration
- Workflow automation
- Machine learning
It can handle:
- Structured data (tables, SQL data)
- Semi-structured data (JSON, XML)
- Unstructured data (logs, images, videos)
Major Components of Hadoop Ecosystem
The ecosystem is modular, meaning you can use only the tools you need.
1. HDFS (Storage Layer)
Role
HDFS stores huge amounts of data across multiple machines.
Function
- Splits files into blocks
- Stores blocks on different nodes
- Keeps multiple copies for safety
Example
A 1 TB video file is split into smaller blocks and stored across many machines. If one machine fails, data is still available.
2. MapReduce (Processing Layer)
Role
Batch processing framework for large-scale data.
Function
Processes data in parallel across cluster machines.
Example
Word count program:
- Input: Large text file
- Output: Frequency of each word
Used in:
- Log analysis
- Clickstream analysis
- Data summarization
3. YARN (Resource Management Layer)
Role
Manages resources and schedules jobs in Hadoop cluster.
Function
- Allocates CPU and memory
- Schedules multiple applications
- Manages cluster workload
Example
Running both Spark and MapReduce jobs on the same cluster without conflict.
4. Hive
Type
Data Warehouse tool (SQL-like system)
Function
Provides HiveQL (SQL-like language) to query big data.
Example
Instead of writing MapReduce code, you can write:
SELECT * FROM sales WHERE amount > 1000;
Use Case
- Business reports
- Sales analysis
- Data summarization
5. Pig
Type
Data processing scripting tool
Function
Uses Pig Latin language for data transformation.
Example
Convert raw logs into structured format.
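A sketch of such a transformation in Pig Latin (the input path and field layout are assumptions):
-- load tab-separated raw logs and describe their fields
logs   = LOAD '/logs/raw' USING PigStorage('\t')
         AS (ip:chararray, ts:chararray, url:chararray, status:int);
-- keep only server-error requests
errors = FILTER logs BY status >= 500;
-- count errors per URL and write the structured result back to HDFS
by_url = GROUP errors BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(errors) AS hits;
STORE counts INTO '/logs/error_counts';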
Use Case
- ETL (Extract, Transform, Load) operations
- Data cleaning
- Data preparation
6. HBase
Type
NoSQL database (Column-oriented)
Function
Provides real-time read/write access to big data.
Example
- Social media user profiles
- IoT sensor data
- Banking transaction records
Feature
Very fast for random data access.
7. Sqoop
Type
Data integration tool
Function
Transfers data between:
- RDBMS (MySQL, Oracle)
- Hadoop (HDFS, Hive)
Example
Import customer data from MySQL into Hadoop for analysis.
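A sketch of such an import from the command line; the host, database, and paths are made up for illustration, while the flags are standard Sqoop options:
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username analyst \
  --password-file /user/analyst/.db.pass \
  --table customers \
  --target-dir /data/customers \
  --num-mappers 4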
8. Flume
Type
Data ingestion tool
Function
Collects and moves streaming data into HDFS.
Example
- Twitter feeds
- Web server logs
- Application logs
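Flume agents are wired together in a properties file. A minimal sketch that tails a web server log into HDFS; the agent name, log path, and HDFS path are assumptions:
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: follow the web server log as it grows
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access.log
a1.sources.r1.channels = c1

# channel: buffer events in memory between source and sink
a1.channels.c1.type = memory

# sink: write the events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/weblogs
a1.sinks.k1.channel = c1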
9. Oozie
Type
Workflow scheduler
Function
Automates Hadoop jobs.
Example
A daily pipeline:
- Import data (Sqoop)
- Clean data (Pig)
- Query data (Hive)
- Generate report
Oozie runs all steps automatically.
10. Zookeeper
Type
Coordination service
Function
Manages:
- Cluster synchronization
- Configuration
- Naming services
Example
Used by HBase and Kafka to coordinate distributed systems.
11. Mahout
Type
Machine Learning library
Function
Provides scalable ML algorithms.
Example
- Recommendation systems (Netflix/Amazon style)
- Customer segmentation
- Clustering data
12. Spark
Type
Distributed processing engine
Function
Processes data in-memory for faster performance than MapReduce.
Example
- Real-time analytics
- Machine learning tasks
- Graph processing
13. Kafka
Type
Streaming platform
Function
Handles real-time data streams.
Example
- Live user activity tracking
- Log streaming
- Event-driven systems
Applications of Hadoop Ecosystem
1. E-Commerce
- Product recommendations
- Customer behavior analysis
2. Social Media
- Sentiment analysis
- Trend detection
3. Banking & Finance
- Fraud detection
- Risk analysis
4. Healthcare
- Disease prediction
- Patient data analysis
5. Telecommunications
- Call data analysis
- Customer churn prediction
6. Government
- Census analysis
- Crime and traffic monitoring
Simple Summary Table
| Tool | Purpose | Example |
|---|---|---|
| HDFS | Storage | Distributed file storage |
| MapReduce | Processing | Word count |
| YARN | Resource management | Job scheduling |
| Hive | SQL querying | Sales reports |
| Pig | Data transformation | ETL jobs |
| HBase | NoSQL DB | Real-time data |
| Sqoop | Data transfer | MySQL → Hadoop |
| Flume | Data ingestion | Logs collection |
| Oozie | Workflow automation | Daily pipelines |
| Zookeeper | Coordination | Cluster sync |
| Mahout | Machine learning | Recommendations |
| Spark | Fast processing | Real-time analytics |
| Kafka | Streaming | Live data flow |
13. Hive Physical Architecture
Apache Hive is a data warehouse system built on top of Apache Hadoop.
It allows users to write SQL-like queries called HiveQL to analyze huge datasets stored in HDFS (Hadoop Distributed File System).
Instead of writing complex Java MapReduce programs, users can simply write SQL queries.
Hive converts these queries into:
- MapReduce jobs
- Tez jobs
- Spark jobs
which are executed on the Hadoop cluster.
Physical Architecture of Hive
The Physical Architecture explains how different Hive components work together to process a query and interact with Hadoop storage.
Main Components of Hive Physical Architecture
- Hive Clients / User Interface
- Driver
- Compiler
- Optimizer
- Execution Engine
- Metastore
- HDFS / Hadoop Storage
1. Hive Clients / User Interface (UI)
Role
This is the entry point where users interact with Hive.
Users write HiveQL queries using different interfaces.
Types of Hive Clients
1. Command Line Interface (CLI)
Users execute Hive commands directly in the terminal.
Example:
SELECT * FROM sales_data;
2. Web UI
Browser-based tools such as:
- Hue
- Ambari
allow users to run Hive queries visually.
3. JDBC / ODBC Clients
Applications connect to Hive using standard database drivers.
Example:
- Java application using JDBC
- BI tools like Tableau or Power BI
Function
The client sends the HiveQL query to the Hive Driver.
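For instance, a small Java program can submit HiveQL through HiveServer2's JDBC driver. This is a sketch: the host name, credentials, and table are assumptions, while the jdbc:hive2:// URL scheme and driver class are the standard ones:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 driver (needs hive-jdbc on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hive-host:10000/default", "analyst", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM sales_data LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}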
2. Hive Driver
Role
The Driver acts like the main controller of Hive.
It manages the entire lifecycle of a query.
Functions of Driver
- Receives query from client
- Creates session
- Maintains execution context
- Sends query to compiler
- Tracks execution progress
- Returns final result to user
Example
User writes:
SELECT product, SUM(price)
FROM sales_data
GROUP BY product;
The Driver receives this query and starts processing it.
3. Compiler
Role
The Compiler converts HiveQL into an execution plan.
Steps Performed by Compiler
Step 1: Parsing
The query is converted into an Abstract Syntax Tree (AST).
Example:
SELECT * FROM sales_data;
Hive checks:
- SQL syntax
- keywords
- structure
Step 2: Semantic Analysis
Hive verifies:
- Table exists or not
- Column names are correct
- Data types are valid
Example:
If table sales_data does not exist, Hive throws an error.
Step 3: Logical Plan Generation
Hive creates a logical workflow of operations.
Example operations:
- Scan table
- Filter rows
- Group data
- Join tables
This logical plan is represented as a DAG (Directed Acyclic Graph).
4. Optimizer
Role
The Optimizer improves query performance.
It converts the logical plan into the most efficient execution plan.
Types of Optimization
1. Rule-Based Optimization
Hive applies predefined rules.
Example:
Filter conditions are pushed close to the data source (predicate pushdown).
For a query such as:
SELECT * FROM sales_data
WHERE price > 1000;
Hive applies the filter while scanning, so only the required rows are read instead of the whole table.
This reduces I/O operations.
2. Cost-Based Optimization (CBO)
Hive calculates execution cost and chooses the best strategy.
Example:
For joining two tables:
- Which table should be processed first?
- Which join algorithm is faster?
Output
The optimizer generates a physical execution plan.
This plan may use:
- MapReduce
- Tez
- Spark
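You can inspect the plan the optimizer produced with the standard EXPLAIN statement, shown here on the running example query:
EXPLAIN
SELECT product, SUM(price)
FROM sales_data
GROUP BY product;
The output lists the stages (table scans, aggregations, and the MapReduce/Tez/Spark tasks) that the Execution Engine will run.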
5. Execution Engine
Role
The Execution Engine actually runs the query.
Functions
- Divides query into tasks
- Submits jobs to YARN
- Monitors execution
- Collects results
Interaction with Hadoop
The Execution Engine:
- Reads data from HDFS
- Processes data
- Writes output back to HDFS
Example
Suppose query:
SELECT COUNT(*) FROM sales_data;
Execution Engine:
- Creates MapReduce tasks
- Sends them to Hadoop cluster
- Each node processes data blocks
- Final count is returned
6. Metastore
Role
The Metastore stores metadata about Hive tables.
Metadata means:
- Table names
- Columns
- Data types
- Partition info
- HDFS file locations
Types of Metastore
1. Embedded Metastore
Uses an embedded Derby database.
Suitable for:
- Single user
- Testing
2. Standalone Metastore
Uses:
- MySQL
- PostgreSQL
Suitable for:
- Multiple users
- Production systems
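A standalone metastore is configured in hive-site.xml. A minimal sketch pointing Hive at a MySQL database; the host and database names are illustrative, while the javax.jdo.option.* keys are the standard settings:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>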
Example Metadata
Table:
sales_data
Columns:
id, product, price, date
HDFS Location:
/user/hive/warehouse/sales_data
Why Metastore is Important
Whenever a query runs, Hive first checks metadata.
Without metastore:
- Hive cannot locate table data
- Query execution becomes impossible
7. HDFS / Hadoop Storage
Role
HDFS stores the actual data files.
Hive tables are physically stored inside Hadoop Distributed File System.
Supported File Formats
Hive supports many formats:
- Text File
- ORC
- Parquet
- Avro
- RCFile
Example
Table:
sales_data
Stored in:
/user/hive/warehouse/sales_data
inside HDFS.
Complete Workflow of Hive Physical Architecture
Step-by-Step Flow
Step 1: User Submits Query
Example:
SELECT product, SUM(price)
FROM sales_data
GROUP BY product;
via:
- CLI
- JDBC
- Web UI
Step 2: Driver Receives Query
Driver:
- Creates session
- Sends query to compiler
Step 3: Compiler Processes Query
Compiler:
- Parses query
- Performs semantic analysis
- Creates logical plan
Step 4: Optimizer Improves Plan
Optimizer:
- Reduces unnecessary operations
- Chooses efficient execution strategy
Step 5: Execution Engine Executes Tasks
Execution Engine:
- Converts plan into MapReduce/Tez/Spark jobs
- Sends tasks to YARN
Step 6: HDFS Data Processing
Tasks:
- Read data from HDFS
- Process records
- Store intermediate/final results
Step 7: Results Returned
Final output is sent back to user interface.
Example Output:
Laptop 50000
Mobile 30000
TV 20000
Real-Life Example of Hive Architecture
Suppose an e-commerce company stores 10 TB sales data in Hadoop.
A data analyst wants to know:
SELECT city, SUM(revenue)
FROM orders
GROUP BY city;
What Happens Internally?
1. User submits query
via Hive CLI.
2. Driver receives query
Creates execution environment.
3. Compiler checks:
- Does table orders exist?
- Does column revenue exist?
4. Optimizer:
- Uses partition pruning
- Reduces unnecessary scans
5. Execution Engine:
- Creates Spark/MapReduce jobs
- Sends jobs to cluster
6. Hadoop Nodes:
- Process different blocks in parallel
7. Result returned:
Delhi 5,00,000
Mumbai 8,00,000
Bhopal 2,00,000
Advantages of Hive Physical Architecture
1. Scalability
Can process petabytes of data.
2. SQL Support
Easy for SQL users.
3. Distributed Processing
Uses Hadoop cluster for parallel execution.
4. Fault Tolerance
HDFS automatically handles failures.
5. Multiple Execution Engines
Supports:
- MapReduce
- Tez
- Spark
14. Hadoop Limitations
Apache Hadoop is a powerful framework used for storing and processing huge amounts of data across distributed systems.
It is highly useful for:
- Big Data analytics
- Distributed storage
- Batch data processing
However, Hadoop also has several limitations.
Understanding these limitations helps organizations decide when Hadoop is the right choice and when other technologies may perform better.
Limitations of Hadoop
1. Not Suitable for Small Data
Explanation
Hadoop is designed for processing very large datasets such as:
- Terabytes (TB)
- Petabytes (PB)
For small datasets, Hadoop becomes inefficient because:
- Starting MapReduce jobs takes time
- Cluster communication creates overhead
- Task scheduling adds delay
Example
Suppose you want to process:
500 MB sales data
Using Hadoop may take more time than:
- MySQL
- PostgreSQL
because Hadoop first:
- Divides tasks
- Allocates cluster resources
- Starts MapReduce jobs
This overhead is unnecessary for small data.
Conclusion
Traditional databases are faster for small-scale processing.
2. Complex Programming
Explanation
Native Hadoop programming uses:
- Java
- MapReduce model
Writing MapReduce code is difficult for beginners.
Developers must:
- Write Mapper functions
- Write Reducer functions
- Handle key-value pairs
- Debug distributed jobs
Example
A simple word count program in MapReduce may require:
- Multiple Java classes
- Configuration setup
- JAR file creation
while the same task in SQL needs only:
SELECT word, COUNT(*)
FROM documents
GROUP BY word;
Problem
Debugging distributed systems is more complex than debugging traditional applications.
Solution
Tools like:
- Apache Hive
- Apache Pig
simplify Hadoop programming.
But internally they still generate MapReduce jobs.
3. High Latency / Batch Processing Only
Explanation
Hadoop is mainly designed for:
- Batch processing
- Long-running analytics
It is not suitable for:
- Real-time systems
- Instant query processing
- Fast transactions
Example
Suppose a banking application needs:
instant account balance updates
Hadoop cannot provide millisecond-level response.
Why?
Because:
- HDFS is optimized for large sequential reads
- MapReduce jobs take time to initialize
Real-Time Alternatives
For low-latency processing:
- Apache Spark
- Apache Flink
- Apache Storm
are better options.
4. Data Security Limitations
Explanation
Early Hadoop versions had weak security features.
Problems included:
- No strong authentication
- Weak authorization
- Limited encryption
Modern Hadoop Security Improvements
Newer Hadoop versions support:
- Kerberos authentication
- Encryption
- Access control systems
Tools used:
- Apache Ranger
- Apache Sentry
Example
In a healthcare system:
- patient records must be secure
- unauthorized access must be blocked
Configuring Hadoop security for such environments becomes complicated.
Limitation
Strong security exists, but configuration and management are complex.
5. Inefficient for Iterative Processing
Explanation
Machine learning and graph algorithms require:
- repeated processing of same data
- multiple iterations
MapReduce writes intermediate results to HDFS after every step.
This causes:
- heavy disk I/O
- slower performance
Example
Machine Learning Algorithm:
K-Means Clustering
requires repeated iterations.
In Hadoop:
- Data is read from HDFS
- Processed
- Written back to HDFS
- Re-read again
This becomes slow.
Better Alternative
Apache Spark processes data in memory and is much faster for iterative workloads.
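The sketch below contrasts the two styles using Spark's Java API: the dataset is loaded from HDFS once, cached in memory, and then scanned repeatedly by a toy iterative update. The file path and update rule are illustrative assumptions:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("iterative-sketch"));
    // Read once from HDFS, then keep the records in cluster memory
    JavaRDD<Double> points = sc.textFile("/data/points.txt")
                               .map(Double::parseDouble)
                               .cache();
    long n = points.count();          // materializes the cache
    double center = 0.0;
    for (int i = 0; i < 10; i++) {
      final double c = center;
      // Each pass scans the cached data; no HDFS round-trip per iteration
      double shift = points.map(p -> p - c).reduce(Double::sum) / n;
      center = c + 0.5 * shift;       // toy update rule
    }
    System.out.println("center = " + center);
    sc.stop();
  }
}
In pure MapReduce, each of those ten passes would be a separate job that reads its input from HDFS and writes its output back to HDFS.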
6. Limited SQL Support
Explanation
Hive provides SQL-like querying using HiveQL.
But compared to traditional RDBMS:
- joins can be slow
- subqueries are less efficient
- transactions are limited
Example
Complex SQL query:
SELECT *
FROM orders o
JOIN customers c
ON o.customer_id = c.id;
may execute slowly on large Hadoop clusters.
Problem
Hadoop is not designed for:
- OLTP systems
- high-frequency transactions
- real-time updates
Conclusion
Traditional databases perform better for transactional applications.
7. Difficulty in Handling Small Files
Explanation
HDFS works efficiently with large files.
Many small files create problems because:
- each file metadata is stored in NameNode memory
- NameNode memory gets overloaded
Example
Suppose there are:
10 million files of 1 KB each
The NameNode may run out of memory due to excessive metadata storage.
Result
- Reduced performance
- Slower processing
- Lower throughput
Solutions
Combine small files using:
- SequenceFiles
- HAR (Hadoop Archive) files
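For example, Hadoop's built-in archive tool can pack a whole directory of small files into one HAR file (the paths are illustrative):
# packs /user/raw/logs into /user/archived/logs.har
hadoop archive -archiveName logs.har -p /user/raw logs /user/archived
The archived files remain readable through the har:// URI scheme, while the NameNode now tracks a few archive files instead of millions of tiny ones.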
8. Requires Skilled Workforce
Explanation
Managing Hadoop clusters is difficult.
Administrators must understand:
- distributed systems
- cluster management
- YARN
- HDFS
- security configuration
- node failure handling
Example
If one node fails:
- data recovery
- replication management
- workload balancing
must be handled correctly.
Limitation
Organizations need experienced Hadoop engineers, which increases operational cost.
9. High Memory Usage
Explanation
Hadoop ecosystem tools consume large amounts of RAM.
Components requiring memory:
- MapReduce
- YARN
- Spark
- HBase
Improper memory allocation can cause:
- task failure
- slow execution
- node crashes
Example
A Spark job processing huge datasets may require:
64 GB or more RAM
per node.
Limitation
High hardware requirements increase infrastructure cost.
10. Lack of Standardized Ecosystem
Explanation
The Hadoop ecosystem contains many independent tools.
Examples:
- Hive
- Pig
- HBase
- Spark
- Kafka
- Flume
- Oozie
Integrating these tools can be difficult.
Example
Different tools may have:
- dependency conflicts
- version incompatibility
- configuration mismatch
Result
Setup and maintenance become complex.
Summary Table of Hadoop Limitations
| Limitation | Description |
|---|---|
| Not suitable for small data | Hadoop overhead makes small data processing slow |
| Complex programming | MapReduce coding is difficult |
| High latency | Not suitable for real-time systems |
| Security complexity | Advanced security setup is difficult |
| Poor iterative processing | Repeated HDFS reads/writes slow ML tasks |
| Limited SQL support | Slow joins and weak OLTP support |
| Small file problem | Millions of small files overload NameNode |
| Skilled workforce needed | Cluster management is complex |
| High memory usage | Requires large RAM resources |
| Fragmented ecosystem | Tool integration is difficult |
Real-Life Scenario
Suppose an e-commerce company uses Hadoop to analyze:
500 TB customer clickstream data
Hadoop works well for:
- daily reports
- trend analysis
- batch analytics
But Hadoop struggles with:
- instant product recommendations
- real-time fraud detection
- live transactions
For these tasks, companies often use:
- Spark
- Kafka
- Flink
- NoSQL databases
along with Hadoop.
15. RDBMS vs Hadoop
An RDBMS (Relational Database Management System) is used to store and manage structured data in tables.
Examples:
- MySQL
- Oracle Database
- PostgreSQL
- Microsoft SQL Server
Apache Hadoop is a distributed framework designed for storing and processing huge amounts of Big Data across clusters.
Both are used for data management, but they differ greatly in:
- architecture
- scalability
- performance
- data handling
- use cases
RDBMS vs Hadoop Comparison Table
| Feature | RDBMS | Hadoop |
|---|---|---|
| Data Type | Structured data only | Structured, semi-structured, unstructured |
| Schema | Schema-on-write | Schema-on-read |
| Storage | Single server or limited clusters | Distributed storage using HDFS |
| Scalability | Vertical scaling | Horizontal scaling |
| Processing | OLTP and some OLAP | Batch and Big Data processing |
| Fault Tolerance | Backup and replication | Built-in replication in HDFS |
| Cost | Expensive hardware and licenses | Low-cost commodity hardware |
| Performance | Fast for small-medium data | High throughput for massive data |
| Data Volume | GB to low TB | TB to PB |
| Query Language | SQL | HiveQL, Pig Latin, Spark SQL |
| Latency | Low latency | Higher latency |
| Consistency | Full ACID support | Eventual consistency |
| Maintenance | Easier | Complex cluster management |
| Examples | Oracle, MySQL | Hadoop, Hive, Spark |
1. Data Type
RDBMS
RDBMS handles only structured data.
Data is stored in:
- rows
- columns
- tables
Example
| ID | Name | Salary |
|---|---|---|
| 1 | Rahul | 50000 |
Hadoop
Hadoop can handle:
- structured data
- semi-structured data
- unstructured data
Examples:
- text files
- logs
- images
- videos
- social media posts
Example
Hadoop can store:
Facebook posts + images + videos + chat logs
while RDBMS cannot efficiently handle such diverse data.
2. Schema
RDBMS → Schema-on-Write
Schema must be defined before inserting data.
Example:
CREATE TABLE employee(
id INT,
name VARCHAR(50),
salary FLOAT
);
Data must match the schema.
Hadoop → Schema-on-Read
Data can be stored in raw format.
Schema is applied only during analysis.
Example
Raw JSON logs:
{
"user":"Rahul",
"action":"login"
}
can be stored directly in Hadoop and analyzed later.
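As a sketch of schema-on-read in HiveQL: a table definition is layered over the raw JSON only when analysis begins. The JsonSerDe class ships with Hive's HCatalog module; the HDFS location is an assumption:
CREATE EXTERNAL TABLE login_events (`user` STRING, action STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/raw/logins';

SELECT action, COUNT(*) AS total
FROM login_events
GROUP BY action;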
3. Storage
RDBMS
Data is usually stored:
- on a single server
- or limited database clusters
Suitable for:
- GBs
- low TBs
Hadoop
Uses:
- HDFS
Data is distributed across multiple nodes.
Can store:
- terabytes
- petabytes
Example
A company storing:
500 TB web logs
would prefer Hadoop over RDBMS.
4. Scalability
RDBMS → Vertical Scaling
Increase:
- CPU
- RAM
- storage
on a single machine.
This becomes expensive.
Hadoop → Horizontal Scaling
Add more nodes to cluster.
Example:
- Add 10 more commodity servers
This is cheaper and easier.
5. Processing Type
RDBMS
Best for:
- transactional processing (OLTP)
- real-time queries
Examples:
- banking
- ATM transactions
- ERP systems
Hadoop
Best for:
- batch processing
- analytics
- big data computation
Uses:
- MapReduce
- Spark
- Flink
Example
Analyzing:
10 years of customer purchase history
is ideal for Hadoop.
6. Fault Tolerance
RDBMS
Uses:
- backups
- replication
for recovery.
Failure recovery may be expensive.
Hadoop
Provides built-in fault tolerance.
HDFS automatically replicates data blocks.
Usually:
3 copies of data
are stored on different nodes.
Example
If one node crashes:
- Hadoop retrieves data from another node automatically.
7. Cost
RDBMS
Requires:
- high-end servers
- licensed software
Examples:
- Oracle licensing can be expensive.
Hadoop
Uses:
- open-source software
- commodity hardware
which reduces infrastructure cost.
8. Performance
RDBMS
Very fast for:
- small datasets
- indexed queries
- transactions
Hadoop
Optimized for:
- large-scale parallel processing
Not ideal for quick single-record lookups.
Example
- Finding one customer record → faster in MySQL
- Processing 100 TB of clickstream data → faster in Hadoop
9. Data Volume Handling
RDBMS
Efficient for:
- GBs
- small TBs
Hadoop
Efficient for:
- TBs
- PBs
10. Query Language
RDBMS
Uses standardized SQL.
Example:
SELECT * FROM employee;
Hadoop
Uses:
- HiveQL
- Pig Latin
- Spark SQL
Native MapReduce requires coding.
11. Latency
RDBMS
Provides:
- low latency
- fast response
Suitable for real-time applications.
Hadoop
Traditionally batch-oriented.
MapReduce jobs take time to start.
Improvement
Tools like:
- Apache Spark
- Hive on Tez
reduce latency.
12. Consistency
RDBMS
Fully ACID compliant.
ACID means:
- Atomicity
- Consistency
- Isolation
- Durability
Hadoop
Default Hadoop is not fully ACID compliant.
Some components like:
- Apache HBase
provide better consistency support.
13. Maintenance
RDBMS
Managed by DBAs.
Maintenance is relatively easier.
Hadoop
Requires:
- Hadoop administrators
- cluster management skills
- HDFS knowledge
- YARN configuration expertise
Key Differences Explained
1. Data Handling
RDBMS
Only structured tables.
Hadoop
Handles all data types.
2. Schema
RDBMS
Fixed schema before insertion.
Hadoop
Flexible schema during reading.
3. Scalability
RDBMS
Scale UP (bigger machine).
Hadoop
Scale OUT (more machines).
4. Fault Tolerance
RDBMS
Uses manual backup systems.
Hadoop
Automatic replication.
5. Cost
RDBMS
Expensive.
Hadoop
Cost-effective.
6. Processing
RDBMS
Transactional systems.
Hadoop
Big data analytics.
7. Latency
RDBMS
Real-time.
Hadoop
Mostly batch processing.
8. Maintenance
RDBMS
Simpler administration.
Hadoop
Complex ecosystem management.
Use Case Comparison
| Use Case | RDBMS | Hadoop |
|---|---|---|
| Banking transactions | ✅ | ❌ |
| Inventory management | ✅ | ❌ |
| Social media analytics | ❌ | ✅ |
| Web clickstream analysis | ❌ | ✅ |
| Fraud detection (batch) | ❌ | ✅ |
| IoT sensor data | ❌ | ✅ |
Real-Life Example
Banking System
A bank needs:
- instant transactions
- account updates
- ACID compliance
Best Choice:
RDBMS
because transactions must be real-time and consistent.
Social Media Company
A social media platform stores:
- photos
- videos
- billions of user logs
Best Choice:
Hadoop
because it handles massive unstructured data efficiently.
Advantages of RDBMS
- Fast transactions
- Strong ACID properties
- Low latency
- Easy querying with SQL
Advantages of Hadoop
- Massive scalability
- Handles all data types
- Cost-effective
- Distributed processing
- Fault tolerant
Use RDBMS when:
- data is structured
- transactions are frequent
- real-time processing is required
Use Hadoop when:
- data is huge
- data is unstructured
- large-scale analytics is needed
In modern systems, companies often use both together:
- RDBMS for transactions
- Hadoop for analytics and Big Data processing.