Big Data Notes

All Topics (16)

  • 1. What is Big Data?
  • 2. Big Data Characteristics
  • 3. Types of Big Data
  • 4. Traditional Data vs Big Data
  • 5. Evolution of Big Data
  • 6. Challenges with Big Data
  • 7. Technologies Available for Big Data
  • 8. Infrastructure for Big Data
  • 9. Uses of Data Analytics
  • 10. Hadoop
  • 11. Hadoop Core Components
  • 12. Hadoop Ecosystem
  • 13. Hive Physical Architecture
  • 14. Hadoop Limitations
  • 15. RDBMS vs Hadoop
  • 16. Hadoop Distributed File System (HDFS)

11. Hadoop Core Components

Hadoop is a framework used for storing and processing huge amounts of data in a distributed environment.
Its core components work together to handle big data efficiently.

The four core components of Hadoop are:

  1. HDFS (Hadoop Distributed File System) – Storage Layer
  2. MapReduce – Processing Layer
  3. YARN (Yet Another Resource Negotiator) – Resource Management Layer
  4. Hadoop Common – Shared Utilities and Libraries

1. HDFS (Hadoop Distributed File System)

 What is HDFS?

HDFS is a distributed file system designed to store very large files across multiple machines.
It provides:

  • High storage capacity
  • Fault tolerance
  • Scalability

It is built to run on commodity hardware.

 Key Features of HDFS

  • Fault Tolerance: Data is replicated across multiple nodes.
  • Scalability: More nodes can be added easily.
  • High Throughput: Optimized for large-scale data processing.
  • Flexibility: Stores structured, semi-structured, and unstructured data.

 HDFS Architecture

1. NameNode (Master Node)

The NameNode manages the file system metadata such as:

  • File names
  • Directories
  • Permissions
  • Block locations

It controls all DataNodes.

2. DataNode (Slave Node)

DataNodes store the actual data blocks.

Responsibilities:

  • Store data
  • Handle read/write operations
  • Send heartbeat signals to the NameNode

 How HDFS Works

  • Large files are divided into blocks.
  • Default block size = 128 MB
  • Each block is replicated (usually 3 copies).

This ensures data safety even if a node fails.

 Example of HDFS

Suppose you have a 1 TB video file.

HDFS will:

  • Split it into 128 MB blocks
  • Create approximately 8,192 blocks (1 TB ÷ 128 MB)
  • Store each block on 3 different DataNodes

So if one machine crashes, data can still be recovered from another copy.
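
The block-and-replica arithmetic above can be sketched in a few lines of Python. This is purely illustrative (it is not HDFS code); it only assumes the default 128 MB block size and 3-way replication mentioned earlier, and the DataNode names are made up.

# Conceptual sketch of HDFS block splitting and replication (not real HDFS code)
import itertools

BLOCK_SIZE_MB = 128          # default HDFS block size
REPLICATION_FACTOR = 3       # default number of copies per block

def number_of_blocks(file_size_mb):
    # Ceiling division: a partially filled last block still counts as a block
    return -(-file_size_mb // BLOCK_SIZE_MB)

def place_replicas(num_blocks, nodes):
    # Assign each block to REPLICATION_FACTOR distinct DataNodes (simple round-robin)
    ring = itertools.cycle(nodes)
    return {b: [next(ring) for _ in range(REPLICATION_FACTOR)] for b in range(num_blocks)}

print(number_of_blocks(1024 * 1024))   # 1 TB expressed in MB -> 8192 blocks
print(place_replicas(2, ["node1", "node2", "node3", "node4", "node5", "node6"]))

Losing any one of those nodes still leaves two other copies of every block, which is exactly the photocopy analogy below.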

 Real-Life Example of HDFS

Imagine keeping 3 photocopies of an important document in different rooms.
If one room is damaged, the document is still safe in the other rooms.

2. MapReduce

 What is MapReduce?

MapReduce is a programming model used to process large datasets in parallel across a Hadoop cluster.

It works in three phases:

  1. Map Phase
  2. Shuffle and Sort Phase
  3. Reduce Phase

 Phases of MapReduce

1. Map Phase

The mapper processes input data and converts it into key-value pairs.

Example

Input sentence:

Hadoop is fast Hadoop is scalable

Mapper Output:

(Hadoop,1)
(is,1)
(fast,1)
(Hadoop,1)
(is,1)
(scalable,1)
 

2. Shuffle and Sort Phase

The system groups all similar keys together.

(Hadoop,[1,1])
(is,[1,1])
(fast,[1])
(scalable,[1])

3. Reduce Phase

The reducer combines values and produces the final result.

Final Output:

(Hadoop,2)
(is,2)
(fast,1)
(scalable,1)
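
The same word-count flow can be imitated in plain Python. This is a toy, single-machine sketch of the Map, Shuffle and Sort, and Reduce phases, not Hadoop MapReduce API code.

# Toy single-machine imitation of the MapReduce word-count flow
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    # Group all values that share the same key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    # Reducer: sum the values for each key
    return {key: sum(values) for key, values in grouped.items()}

pairs = map_phase("Hadoop is fast Hadoop is scalable")
grouped = shuffle_and_sort(pairs)   # {'Hadoop': [1, 1], 'fast': [1], 'is': [1, 1], 'scalable': [1]}
print(reduce_phase(grouped))        # {'Hadoop': 2, 'fast': 1, 'is': 2, 'scalable': 1}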
 

 Advantages of MapReduce

  • Parallel processing
  • Faster execution
  • Fault tolerance
  • Handles petabytes of data

 Real-Life Example of MapReduce

Imagine exam papers being checked by multiple teachers:

  • Map: Teachers check papers separately
  • Shuffle: Papers are grouped subject-wise
  • Reduce: Final marks are calculated

3. YARN (Yet Another Resource Negotiator)

 What is YARN?

YARN is the resource management framework in Hadoop.

It manages:

  • CPU usage
  • Memory allocation
  • Task scheduling

YARN allows multiple applications like MapReduce, Spark, and Hive to run together.

 Components of YARN

1. ResourceManager (Master)

Responsibilities:

  • Allocates cluster resources
  • Schedules applications
  • Monitors resource usage

2. NodeManager (Slave)

Responsibilities:

  • Manages resources on each node
  • Executes tasks
  • Reports status to ResourceManager

 Example of YARN

Suppose:

  • One user runs a Spark job
  • Another user runs a MapReduce job

YARN allocates CPU and memory resources efficiently to both applications.
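
One way to picture this is a toy allocator that hands out CPU cores and memory from a fixed cluster pool to whichever application asks first. The Python sketch below is purely illustrative; real YARN scheduling (queues, containers, fairness policies) is far more sophisticated, and the job names and numbers here are made up.

# Toy illustration of cluster resource allocation, loosely inspired by the ResourceManager
cluster = {"vcores": 64, "memory_gb": 256}

def allocate(app_name, vcores, memory_gb):
    # Grant the request only if enough free capacity remains
    if cluster["vcores"] >= vcores and cluster["memory_gb"] >= memory_gb:
        cluster["vcores"] -= vcores
        cluster["memory_gb"] -= memory_gb
        print(f"{app_name}: granted {vcores} vcores, {memory_gb} GB")
    else:
        print(f"{app_name}: queued (not enough free resources)")

allocate("spark-job", 32, 128)       # first user's Spark job
allocate("mapreduce-job", 24, 96)    # second user's MapReduce job
allocate("hive-query", 16, 64)       # exceeds what is left, so it has to wait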

 Real-Life Example of YARN

Think of a school principal:

  • Assigns classrooms to teachers
  • Ensures resources are properly used

4. Hadoop Common

 What is Hadoop Common?

Hadoop Common is a collection of shared libraries and utilities required by all Hadoop modules.

It provides:

  • Java libraries
  • Configuration files
  • Scripts for starting Hadoop services
  • APIs for Hadoop operations

 Features of Hadoop Common

  • Supports communication between Hadoop modules
  • Provides operating system utilities
  • Helps integrate tools like Hive, Pig, HBase, and Sqoop

 Example of Hadoop Common

Just like common system files support all applications in Windows, Hadoop Common supports all Hadoop components.

Summary Table

Component     | Purpose           | Example
HDFS          | Stores data       | File storage across nodes
MapReduce     | Processes data    | Word count program
YARN          | Manages resources | CPU and memory allocation
Hadoop Common | Shared utilities  | Libraries and APIs

Simple Real-Life Analogy

Hadoop Component | Real-Life Example
HDFS             | Warehouse for storing goods
MapReduce        | Workers processing tasks
YARN             | Manager assigning resources
Hadoop Common    | Common tools used by everyone

 

12. Hadoop Ecosystem

The Hadoop Ecosystem is a collection of open-source tools and frameworks that work together to store, process, analyze, and manage Big Data.

While the core Hadoop components (HDFS and MapReduce) handle storage and batch processing, the ecosystem adds powerful tools for:

  • Real-time processing
  • Data analytics
  • Data integration
  • Workflow automation
  • Machine learning

It can handle:

  • Structured data (tables, SQL data)
  • Semi-structured data (JSON, XML)
  • Unstructured data (logs, images, videos)

Major Components of Hadoop Ecosystem

The ecosystem is modular, meaning you can use only the tools you need.

1. HDFS (Storage Layer)

 Role

HDFS stores huge amounts of data across multiple machines.

 Function

  • Splits files into blocks
  • Stores blocks on different nodes
  • Keeps multiple copies for safety

 Example

A 1 TB video file is split into smaller blocks and stored across many machines. If one machine fails, data is still available.

2. MapReduce (Processing Layer)

 Role

Batch processing framework for large-scale data.

 Function

Processes data in parallel across cluster machines.

 Example

Word count program:

  • Input: Large text file
  • Output: Frequency of each word

Used in:

  • Log analysis
  • Clickstream analysis
  • Data summarization

3. YARN (Resource Management Layer)

 Role

Manages resources and schedules jobs in Hadoop cluster.

 Function

  • Allocates CPU and memory
  • Schedules multiple applications
  • Manages cluster workload

 Example

Running both Spark and MapReduce jobs on the same cluster without conflict.

4. Hive

 Type

Data Warehouse tool (SQL-like system)

 Function

Provides HiveQL (SQL-like language) to query big data.

 Example

Instead of writing MapReduce code, you can write:

SELECT * FROM sales WHERE amount > 1000;

 Use Case

  • Business reports
  • Sales analysis
  • Data summarization

5. Pig

 Type

Data processing scripting tool

 Function

Uses Pig Latin language for data transformation.

 Example

Convert raw logs into structured format.

 Use Case

  • ETL (Extract, Transform, Load) operations
  • Data cleaning
  • Data preparation

6. HBase

 Type

NoSQL database (Column-oriented)

 Function

Provides real-time read/write access to big data.

 Example

  • Social media user profiles
  • IoT sensor data
  • Banking transaction records

 Feature

Very fast for random data access.

7. Sqoop

 Type

Data integration tool

 Function

Transfers data between:

  • RDBMS (MySQL, Oracle)
  • Hadoop (HDFS, Hive)

 Example

Import customer data from MySQL into Hadoop for analysis.

8. Flume

 Type

Data ingestion tool

 Function

Collects and moves streaming data into HDFS.

 Example

  • Twitter feeds
  • Web server logs
  • Application logs

9. Oozie

 Type

Workflow scheduler

 Function

Automates Hadoop jobs.

 Example

A daily pipeline:

  1. Import data (Sqoop)
  2. Clean data (Pig)
  3. Query data (Hive)
  4. Generate report

Oozie runs all steps automatically.
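
The essence of such a pipeline is simply ordered steps, where each one runs only after the previous step succeeds. The Python sketch below is a stand-in for that idea; the step names are placeholders, and real Oozie adds scheduling (for example daily runs), retries, and failure handling.

# Toy stand-in for an Oozie-style workflow: steps run strictly in order
def import_data():     print("Sqoop-style import finished")
def clean_data():      print("Pig-style cleaning finished")
def query_data():      print("Hive-style query finished")
def generate_report(): print("report generated")

pipeline = [import_data, clean_data, query_data, generate_report]

for step in pipeline:
    step()    # Oozie would trigger this on a schedule and retry failed steps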

10. Zookeeper

 Type

Coordination service

 Function

Manages:

  • Cluster synchronization
  • Configuration
  • Naming services

 Example

Used by HBase and Kafka to coordinate distributed systems.

11. Mahout

 Type

Machine Learning library

 Function

Provides scalable ML algorithms.

 Example

  • Recommendation systems (Netflix/Amazon style)
  • Customer segmentation
  • Clustering data

12. Spark

 Type

Distributed processing engine

 Function

Processes data in-memory for faster performance than MapReduce.

 Example

  • Real-time analytics
  • Machine learning tasks
  • Graph processing

13. Kafka

 Type

Streaming platform

 Function

Handles real-time data streams.

 Example

  • Live user activity tracking
  • Log streaming
  • Event-driven systems

Applications of Hadoop Ecosystem

1. E-Commerce

  • Product recommendations
  • Customer behavior analysis

2. Social Media

  • Sentiment analysis
  • Trend detection

3. Banking & Finance

  • Fraud detection
  • Risk analysis

4. Healthcare

  • Disease prediction
  • Patient data analysis

5. Telecommunications

  • Call data analysis
  • Customer churn prediction

6. Government

  • Census analysis
  • Crime and traffic monitoring

Simple Summary Table

Tool      | Purpose             | Example
HDFS      | Storage             | Distributed file storage
MapReduce | Processing          | Word count
YARN      | Resource management | Job scheduling
Hive      | SQL querying        | Sales reports
Pig       | Data transformation | ETL jobs
HBase     | NoSQL DB            | Real-time data
Sqoop     | Data transfer       | MySQL → Hadoop
Flume     | Data ingestion      | Logs collection
Oozie     | Workflow automation | Daily pipelines
Zookeeper | Coordination        | Cluster sync
Mahout    | Machine learning    | Recommendations
Spark     | Fast processing     | Real-time analytics
Kafka     | Streaming           | Live data flow

 

13. Hive Physical Architecture

Apache Hive is a data warehouse system built on top of Apache Hadoop.
It allows users to write SQL-like queries called HiveQL to analyze huge datasets stored in HDFS (Hadoop Distributed File System).

Instead of writing complex Java MapReduce programs, users can simply write SQL queries.

Hive converts these queries into:

  • MapReduce jobs
  • Tez jobs
  • Spark jobs

which are executed on the Hadoop cluster.

Physical Architecture of Hive

The Physical Architecture explains how different Hive components work together to process a query and interact with Hadoop storage.

Main Components of Hive Physical Architecture

  1. Hive Clients / User Interface
  2. Driver
  3. Compiler
  4. Optimizer
  5. Execution Engine
  6. Metastore
  7. HDFS / Hadoop Storage

1. Hive Clients / User Interface (UI)

Role

This is the entry point where users interact with Hive.

Users write HiveQL queries using different interfaces.

Types of Hive Clients

1. Command Line Interface (CLI)

Users execute Hive commands directly in the terminal.

Example:

SELECT * FROM sales_data;

2. Web UI

Browser-based tools such as:

  • Hue
  • Ambari

allow users to run Hive queries visually.

3. JDBC / ODBC Clients

Applications connect to Hive using standard database drivers.

Example:

  • Java application using JDBC
  • BI tools like Tableau or Power BI

Function

The client sends the HiveQL query to the Hive Driver.

2. Hive Driver

Role

The Driver acts like the main controller of Hive.

It manages the entire lifecycle of a query.

Functions of Driver

  • Receives query from client
  • Creates session
  • Maintains execution context
  • Sends query to compiler
  • Tracks execution progress
  • Returns final result to user

Example

User writes:

SELECT product, SUM(price)
FROM sales_data
GROUP BY product;

The Driver receives this query and starts processing it.

3. Compiler

Role

The Compiler converts HiveQL into an execution plan.

Steps Performed by Compiler

Step 1: Parsing

The query is converted into an Abstract Syntax Tree (AST).

Example:

SELECT * FROM sales_data;

Hive checks:

  • SQL syntax
  • keywords
  • structure

Step 2: Semantic Analysis

Hive verifies:

  • Table exists or not
  • Column names are correct
  • Data types are valid

Example:
If table sales_data does not exist, Hive throws an error.

Step 3: Logical Plan Generation

Hive creates a logical workflow of operations.

Example operations:

  • Scan table
  • Filter rows
  • Group data
  • Join tables

This logical plan is represented as a DAG (Directed Acyclic Graph).

4. Optimizer

Role

The Optimizer improves query performance.

It converts the logical plan into the most efficient execution plan.

Types of Optimization

1. Rule-Based Optimization

Hive applies predefined rules.

Example:

Push filter conditions closer to the data source.

For a query such as:

SELECT * FROM sales_data
WHERE price > 1000;

Hive applies the price filter while scanning the table, so only the required rows are read.

This reduces I/O operations.
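
The benefit can be sketched with a small Python comparison that counts how many rows each plan has to materialise. This is only an illustration of the idea; in Hive the pushdown happens inside the optimizer and the file readers (for example ORC and Parquet), and the table here is made up.

# Illustration of filter pushdown: how many rows does each plan keep in memory?
rows = [{"id": i, "price": i * 25} for i in range(1, 101)]   # pretend this table lives on disk

def scan_then_filter(table, predicate):
    loaded = list(table)                          # naive plan: materialise every row first
    return len(loaded), [r for r in loaded if predicate(r)]

def filter_while_scanning(table, predicate):
    kept = [r for r in table if predicate(r)]     # pushed-down plan: unwanted rows are skipped immediately
    return len(kept), kept

naive_rows, _ = scan_then_filter(rows, lambda r: r["price"] > 1000)
pushed_rows, _ = filter_while_scanning(rows, lambda r: r["price"] > 1000)
print(naive_rows, pushed_rows)                    # 100 vs 60 rows materialised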

2. Cost-Based Optimization (CBO)

Hive calculates execution cost and chooses the best strategy.

Example:

For joining two tables:

  • Which table should be processed first?
  • Which join algorithm is faster?

Output

The optimizer generates a physical execution plan.

This plan may use:

  • MapReduce
  • Tez
  • Spark

5. Execution Engine

Role

The Execution Engine actually runs the query.

Functions

  • Divides query into tasks
  • Submits jobs to YARN
  • Monitors execution
  • Collects results

Interaction with Hadoop

The Execution Engine:

  • Reads data from HDFS
  • Processes data
  • Writes output back to HDFS

Example

Suppose query:

SELECT COUNT(*) FROM sales_data;

Execution Engine:

  1. Creates MapReduce tasks
  2. Sends them to Hadoop cluster
  3. Each node processes data blocks
  4. Final count is returned
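
Conceptually, that COUNT(*) turns into map tasks that count the rows inside their own data block and a reduce task that sums the partial counts. A minimal Python sketch of the idea (not actual Hive or MapReduce code; the blocks and rows are made up):

# Toy sketch of how COUNT(*) maps onto map and reduce tasks
blocks = [
    ["row1", "row2", "row3"],              # block on DataNode A
    ["row4", "row5"],                      # block on DataNode B
    ["row6", "row7", "row8", "row9"],      # block on DataNode C
]

def map_task(block):
    # Each map task counts the rows in its own block
    return len(block)

def reduce_task(partial_counts):
    # The reduce task sums the partial counts into the final answer
    return sum(partial_counts)

partial = [map_task(b) for b in blocks]    # [3, 2, 4]; on a real cluster these run in parallel
print(reduce_task(partial))                # 9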

6. Metastore

Role

The Metastore stores metadata about Hive tables.

Metadata means:

  • Table names
  • Columns
  • Data types
  • Partition info
  • HDFS file locations

Types of Metastore

1. Embedded Metastore

Uses Derby database.

Suitable for:

  • Single user
  • Testing

2. Standalone Metastore

Uses:

  • MySQL
  • PostgreSQL

Suitable for:

  • Multiple users
  • Production systems

Example Metadata

Table:

sales_data

Columns:

id, product, price, date

HDFS Location:

/user/hive/warehouse/sales_data

Why Metastore is Important

Whenever a query runs, Hive first checks metadata.

Without metastore:

  • Hive cannot locate table data
  • Query execution becomes impossible

7. HDFS / Hadoop Storage

Role

HDFS stores the actual data files.

Hive tables are physically stored inside Hadoop Distributed File System.

Supported File Formats

Hive supports many formats:

  • Text File
  • ORC
  • Parquet
  • Avro
  • RCFile

Example

Table:

sales_data

Stored in:

/user/hive/warehouse/sales_data

inside HDFS.

Complete Workflow of Hive Physical Architecture

Step-by-Step Flow

Step 1: User Submits Query

Example:

SELECT product, SUM(price)
FROM sales_data
GROUP BY product;

via:

  • CLI
  • JDBC
  • Web UI

Step 2: Driver Receives Query

Driver:

  • Creates session
  • Sends query to compiler

Step 3: Compiler Processes Query

Compiler:

  1. Parses query
  2. Performs semantic analysis
  3. Creates logical plan

Step 4: Optimizer Improves Plan

Optimizer:

  • Reduces unnecessary operations
  • Chooses efficient execution strategy

Step 5: Execution Engine Executes Tasks

Execution Engine:

  • Converts plan into MapReduce/Tez/Spark jobs
  • Sends tasks to YARN

Step 6: HDFS Data Processing

Tasks:

  • Read data from HDFS
  • Process records
  • Store intermediate/final results

Step 7: Results Returned

Final output is sent back to user interface.

Example Output:

Laptop  50000
Mobile  30000
TV      20000

Real-Life Example of Hive Architecture

Suppose an e-commerce company stores 10 TB sales data in Hadoop.

A data analyst wants to know:

SELECT city, SUM(revenue)
FROM orders
GROUP BY city;
 

What Happens Internally?

1. User submits query

via Hive CLI.

2. Driver receives query

Creates execution environment.

3. Compiler checks:

  • Does table orders exist?
  • Does column revenue exist?

4. Optimizer:

  • Uses partition pruning
  • Reduces unnecessary scans

5. Execution Engine:

  • Creates Spark/MapReduce jobs
  • Sends jobs to cluster

6. Hadoop Nodes:

  • Process different blocks in parallel

7. Result returned:

Delhi   5,00,000
Mumbai  8,00,000
Bhopal  2,00,000

Advantages of Hive Physical Architecture

1. Scalability

Can process petabytes of data.

2. SQL Support

Easy for SQL users.

3. Distributed Processing

Uses Hadoop cluster for parallel execution.

4. Fault Tolerance

HDFS automatically handles failures.

5. Multiple Execution Engines

Supports:

  • MapReduce
  • Tez
  • Spark

14. Hadoop Limitations

Apache Hadoop is a powerful framework used for storing and processing huge amounts of data across distributed systems.

It is highly useful for:

  • Big Data analytics
  • Distributed storage
  • Batch data processing

However, Hadoop also has several limitations.
Understanding these limitations helps organizations decide when Hadoop is the right choice and when other technologies may perform better.

Limitations of Hadoop

1. Not Suitable for Small Data

Explanation

Hadoop is designed for processing very large datasets such as:

  • Terabytes (TB)
  • Petabytes (PB)

For small datasets, Hadoop becomes inefficient because:

  • Starting MapReduce jobs takes time
  • Cluster communication creates overhead
  • Task scheduling adds delay

Example

Suppose you want to process:

500 MB sales data

Using Hadoop may take more time than:

  • MySQL
  • PostgreSQL

because Hadoop first:

  1. Divides tasks
  2. Allocates cluster resources
  3. Starts MapReduce jobs

This overhead is unnecessary for small data.

Conclusion

Traditional databases are faster for small-scale processing.

2. Complex Programming

Explanation

Native Hadoop programming uses:

  • Java
  • MapReduce model

Writing MapReduce code is difficult for beginners.

Developers must:

  • Write Mapper functions
  • Write Reducer functions
  • Handle key-value pairs
  • Debug distributed jobs

Example

A simple word count program in MapReduce may require:

  • Multiple Java classes
  • Configuration setup
  • JAR file creation

while the same task in SQL needs only:

SELECT word, COUNT(*)
FROM documents
GROUP BY word;

Problem

Debugging distributed systems is more complex than debugging traditional applications.

Solution

Tools like:

  • Apache Hive
  • Apache Pig

simplify Hadoop programming.

But internally they still generate MapReduce (or Tez/Spark) jobs.

3. High Latency / Batch Processing Only

Explanation

Hadoop is mainly designed for:

  • Batch processing
  • Long-running analytics

It is not suitable for:

  • Real-time systems
  • Instant query processing
  • Fast transactions

Example

Suppose a banking application needs:

instant account balance updates

Hadoop cannot provide millisecond-level response.

Why?
Because:

  • HDFS is optimized for large sequential reads
  • MapReduce jobs take time to initialize

Real-Time Alternatives

For low-latency processing:

  • Apache Spark
  • Apache Flink
  • Apache Storm

are better options.

4. Data Security Limitations

Explanation

Early Hadoop versions had weak security features.

Problems included:

  • No strong authentication
  • Weak authorization
  • Limited encryption

Modern Hadoop Security Improvements

Newer Hadoop versions support:

  • Kerberos authentication
  • Encryption
  • Access control systems

Tools used:

  • Apache Ranger
  • Apache Sentry

Example

In a healthcare system:

  • patient records must be secure
  • unauthorized access must be blocked

Configuring Hadoop security for such environments becomes complicated.

Limitation

Strong security exists, but configuration and management are complex.

5. Inefficient for Iterative Processing

Explanation

Machine learning and graph algorithms require:

  • repeated processing of same data
  • multiple iterations

MapReduce writes intermediate results to HDFS after every step.

This causes:

  • heavy disk I/O
  • slower performance

Example

Machine Learning Algorithm:

K-Means Clustering

requires repeated iterations.

In Hadoop:

  1. Data is read from HDFS
  2. Processed
  3. Written back to HDFS
  4. Re-read again

This becomes slow.

Better Alternative

Apache Spark processes data in memory and is much faster for iterative workloads.
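
The contrast can be sketched in Python: a MapReduce-style loop persists and re-reads its intermediate result after every iteration, while an in-memory loop simply keeps it. This is a simplified illustration (the "work" per iteration is a trivial placeholder), not real Hadoop or Spark code.

# Simplified contrast between disk-based and in-memory iteration
import json, os, tempfile

data = list(range(1000))

def iterate_via_disk(values, iterations):
    # MapReduce style: write the intermediate result to storage and read it back each time
    path = os.path.join(tempfile.gettempdir(), "intermediate.json")
    for _ in range(iterations):
        values = [v + 1 for v in values]        # placeholder for one iteration of real work
        with open(path, "w") as f:
            json.dump(values, f)
        with open(path) as f:
            values = json.load(f)
    return values

def iterate_in_memory(values, iterations):
    # Spark style: keep the intermediate result in memory between iterations
    for _ in range(iterations):
        values = [v + 1 for v in values]
    return values

assert iterate_via_disk(data, 5) == iterate_in_memory(data, 5)   # same answer, very different I/O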

6. Limited SQL Support

Explanation

Hive provides SQL-like querying using HiveQL.

But compared to traditional RDBMS:

  • joins can be slow
  • subqueries are less efficient
  • transactions are limited

Example

Complex SQL query:

SELECT *
FROM orders o
JOIN customers c
ON o.customer_id = c.id;

may execute slowly on large Hadoop clusters.

Problem

Hadoop is not designed for:

  • OLTP systems
  • high-frequency transactions
  • real-time updates

Conclusion

Traditional databases perform better for transactional applications.

7. Difficulty in Handling Small Files

Explanation

HDFS works efficiently with large files.

Many small files create problems because:

  • metadata for every file and block is stored in NameNode memory
  • the NameNode's memory gets overloaded

Example

Suppose there are:

10 million files of 1 KB each

The NameNode may run out of memory due to excessive metadata storage.
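
A commonly quoted rule of thumb is that every file, directory, and block consumes roughly 150 bytes of NameNode heap. Treating that figure as an assumption (it is an approximation, not an exact constant), a quick calculation shows the scale of the problem:

# Rough estimate of NameNode heap used by metadata
BYTES_PER_OBJECT = 150   # assumed approximation, not an exact constant

def namenode_metadata_mb(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)      # one file object plus its block objects
    return objects * BYTES_PER_OBJECT / (1024 ** 2)

print(round(namenode_metadata_mb(10_000_000)))       # ~2861 MB of heap for 10 million tiny files

Ten million 1 KB files hold only about 10 GB of data yet tie up close to 3 GB of NameNode memory, while the same 10 GB stored as 128 MB blocks would need only around 80 namespace objects.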

Result

  • Reduced performance
  • Slower processing
  • Lower throughput

Solutions

Combine small files using:

  • SequenceFiles
  • HAR (Hadoop Archive) files

8. Requires Skilled Workforce

Explanation

Managing Hadoop clusters is difficult.

Administrators must understand:

  • distributed systems
  • cluster management
  • YARN
  • HDFS
  • security configuration
  • node failure handling

Example

If one node fails:

  • data recovery
  • replication management
  • workload balancing

must be handled correctly.

Limitation

Organizations need experienced Hadoop engineers, which increases operational cost.

9. High Memory Usage

Explanation

Hadoop ecosystem tools consume large amounts of RAM.

Components requiring memory:

  • MapReduce
  • YARN
  • Spark
  • HBase

Improper memory allocation can cause:

  • task failure
  • slow execution
  • node crashes

Example

A Spark job processing huge datasets may require:

64 GB or more RAM

per node.

Limitation

High hardware requirements increase infrastructure cost.

10. Lack of Standardized Ecosystem

Explanation

The Hadoop ecosystem contains many independent tools.

Examples:

  • Hive
  • Pig
  • HBase
  • Spark
  • Kafka
  • Flume
  • Oozie

Integrating these tools can be difficult.

Example

Different tools may have:

  • dependency conflicts
  • version incompatibility
  • configuration mismatch

Result

Setup and maintenance become complex.

Summary Table of Hadoop Limitations

Limitation                  | Description
Not suitable for small data | Hadoop overhead makes small data processing slow
Complex programming         | MapReduce coding is difficult
High latency                | Not suitable for real-time systems
Security complexity         | Advanced security setup is difficult
Poor iterative processing   | Repeated HDFS reads/writes slow ML tasks
Limited SQL support         | Slow joins and weak OLTP support
Small file problem          | Millions of small files overload NameNode
Skilled workforce needed    | Cluster management is complex
High memory usage           | Requires large RAM resources
Fragmented ecosystem        | Tool integration is difficult

Real-Life Scenario

Suppose an e-commerce company uses Hadoop to analyze:

500 TB customer clickstream data

Hadoop works well for:

  • daily reports
  • trend analysis
  • batch analytics

But Hadoop struggles with:

  • instant product recommendations
  • real-time fraud detection
  • live transactions

For these tasks, companies often use:

  • Spark
  • Kafka
  • Flink
  • NoSQL databases

along with Hadoop.

15. RDBMS vs Hadoop

An RDBMS (Relational Database Management System) is used to store and manage structured data in tables.

Examples:

  • MySQL
  • Oracle Database
  • PostgreSQL
  • Microsoft SQL Server

Apache Hadoop is a distributed framework designed for storing and processing huge amounts of Big Data across clusters.

Both are used for data management, but they differ greatly in:

  • architecture
  • scalability
  • performance
  • data handling
  • use cases

RDBMS vs Hadoop Comparison Table

Feature         | RDBMS                             | Hadoop
Data Type       | Structured data only              | Structured, semi-structured, unstructured
Schema          | Schema-on-write                   | Schema-on-read
Storage         | Single server or limited clusters | Distributed storage using HDFS
Scalability     | Vertical scaling                  | Horizontal scaling
Processing      | OLTP and some OLAP                | Batch and Big Data processing
Fault Tolerance | Backup and replication            | Built-in replication in HDFS
Cost            | Expensive hardware and licenses   | Low-cost commodity hardware
Performance     | Fast for small-medium data        | High throughput for massive data
Data Volume     | GB to low TB                      | TB to PB
Query Language  | SQL                               | HiveQL, Pig Latin, Spark SQL
Latency         | Low latency                       | Higher latency
Consistency     | Full ACID support                 | Eventual consistency
Maintenance     | Easier                            | Complex cluster management
Examples        | Oracle, MySQL                     | Hadoop, Hive, Spark

1. Data Type

RDBMS

RDBMS handles only structured data.

Data is stored in:

  • rows
  • columns
  • tables

Example

ID | Name  | Salary
1  | Rahul | 50000

Hadoop

Hadoop can handle:

  • structured data
  • semi-structured data
  • unstructured data

Examples:

  • text files
  • logs
  • images
  • videos
  • social media posts

Example

Hadoop can store:

Facebook posts + images + videos + chat logs

while RDBMS cannot efficiently handle such diverse data.

2. Schema

RDBMS → Schema-on-Write

Schema must be defined before inserting data.

Example:

CREATE TABLE employee(
id INT,
name VARCHAR(50),
salary FLOAT
);

Data must match the schema.

Hadoop → Schema-on-Read

Data can be stored in raw format.

Schema is applied only during analysis.

Example

Raw JSON logs:

{
"user":"Rahul",
"action":"login"
}

can be stored directly in Hadoop and analyzed later.
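
Schema-on-read can be sketched in a few lines of Python: the raw JSON lines are stored exactly as they arrive, and a column structure is imposed only when someone reads them. This is illustrative only; in Hadoop that reading role is played by tools such as Hive or Spark.

# Schema-on-read sketch: store raw lines untouched, apply a schema at read time
import json

raw_lines = [
    '{"user":"Rahul","action":"login"}',
    '{"user":"Priya","action":"purchase","amount":1200}',   # an extra field is no problem
]

def read_with_schema(lines, columns):
    # Missing fields simply come back as None instead of causing a load-time error
    for line in lines:
        record = json.loads(line)
        yield {col: record.get(col) for col in columns}

for row in read_with_schema(raw_lines, ["user", "action", "amount"]):
    print(row)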

3. Storage

RDBMS

Data is usually stored:

  • on a single server
  • or limited database clusters

Suitable for:

  • GBs
  • low TBs

Hadoop

Uses:

  • HDFS

Data is distributed across multiple nodes.

Can store:

  • terabytes
  • petabytes

Example

A company storing:

500 TB web logs

would prefer Hadoop over RDBMS.

4. Scalability

RDBMS → Vertical Scaling

Increase:

  • CPU
  • RAM
  • storage

on a single machine.

This becomes expensive.

Hadoop → Horizontal Scaling

Add more nodes to cluster.

Example:

  • Add 10 more commodity servers

This is cheaper and easier.

5. Processing Type

RDBMS

Best for:

  • transactional processing (OLTP)
  • real-time queries

Examples:

  • banking
  • ATM transactions
  • ERP systems

Hadoop

Best for:

  • batch processing
  • analytics
  • big data computation

Uses:

  • MapReduce
  • Spark
  • Flink

Example

Analyzing:

10 years of customer purchase history

is ideal for Hadoop.

6. Fault Tolerance

RDBMS

Uses:

  • backups
  • replication

for recovery.

Failure recovery may be expensive.

Hadoop

Provides built-in fault tolerance.

HDFS automatically replicates data blocks.

Usually:

3 copies of data

are stored on different nodes.

Example

If one node crashes:

  • Hadoop retrieves data from another node automatically.

7. Cost

RDBMS

Requires:

  • high-end servers
  • licensed software

Examples:

  • Oracle licensing can be expensive.

Hadoop

Uses:

  • open-source software
  • commodity hardware

which reduces infrastructure cost.

8. Performance

RDBMS

Very fast for:

  • small datasets
  • indexed queries
  • transactions

Hadoop

Optimized for:

  • large-scale parallel processing

Not ideal for quick single-record lookups.

Example

Finding one customer record is faster in MySQL.

Processing 100 TB of clickstream data is faster in Hadoop.

9. Data Volume Handling

RDBMS

Efficient for:

  • GBs
  • small TBs

Hadoop

Efficient for:

  • TBs
  • PBs

10. Query Language

RDBMS

Uses standardized SQL.

Example:

SELECT * FROM employee;

Hadoop

Uses:

  • HiveQL
  • Pig Latin
  • Spark SQL

Native MapReduce requires coding.

11. Latency

RDBMS

Provides:

  • low latency
  • fast response

Suitable for real-time applications.

Hadoop

Traditionally batch-oriented.

MapReduce jobs take time to start.

Improvement

Tools like:

  • Apache Spark
  • Hive on Tez

reduce latency.

12. Consistency

RDBMS

Fully ACID compliant.

ACID means:

  • Atomicity
  • Consistency
  • Isolation
  • Durability

Hadoop

Default Hadoop is not fully ACID compliant.

Some components like:

  • Apache HBase

provide better consistency support.

13. Maintenance

RDBMS

Managed by DBAs.

Maintenance is relatively easier.

Hadoop

Requires:

  • Hadoop administrators
  • cluster management skills
  • HDFS knowledge
  • YARN configuration expertise

Key Differences Explained

1. Data Handling

RDBMS

Only structured tables.

Hadoop

Handles all data types.

2. Schema

RDBMS

Fixed schema before insertion.

Hadoop

Flexible schema during reading.

3. Scalability

RDBMS

Scale UP (bigger machine).

Hadoop

Scale OUT (more machines).

4. Fault Tolerance

RDBMS

Uses manual backup systems.

Hadoop

Automatic replication.

5. Cost

RDBMS

Expensive.

Hadoop

Cost-effective.

6. Processing

RDBMS

Transactional systems.

Hadoop

Big data analytics.

7. Latency

RDBMS

Real-time.

Hadoop

Mostly batch processing.

8. Maintenance

RDBMS

Simpler administration.

Hadoop

Complex ecosystem management.

Use Case Comparison

Use Case                 | RDBMS | Hadoop
Banking transactions     | ✓     |
Inventory management     | ✓     |
Social media analytics   |       | ✓
Web clickstream analysis |       | ✓
Fraud detection (batch)  |       | ✓
IoT sensor data          |       | ✓

Real-Life Example

Banking System

A bank needs:

  • instant transactions
  • account updates
  • ACID compliance

Best Choice:

RDBMS

because transactions must be real-time and consistent.

Social Media Company

A social media platform stores:

  • photos
  • videos
  • billions of user logs

Best Choice:

Hadoop

because it handles massive unstructured data efficiently.

Advantages of RDBMS

  • Fast transactions
  • Strong ACID properties
  • Low latency
  • Easy querying with SQL

Advantages of Hadoop

  • Massive scalability
  • Handles all data types
  • Cost-effective
  • Distributed processing
  • Fault tolerant

Use RDBMS when:

  • data is structured
  • transactions are frequent
  • real-time processing is required

Use Hadoop when:

  • data is huge
  • data is unstructured
  • large-scale analytics is needed

In modern systems, companies often use both together:

  • RDBMS for transactions
  • Hadoop for analytics and Big Data processing.