Big Data Notes

All Topics (10)

  • 1. What is Big Data?
  • 2. Big Data Characteristics
  • 3. Types of Big Data
  • 4. Traditional Data vs Big Data
  • 5. Evolution of Big Data
  • 6. Challenges with Big Data
  • 7. Technologies Available for Big Data
  • 8. Infrastructure for Big Data
  • 9. Uses of Data Analytics
  • 10. Hadoop

6. Challenges with Big Data

Big Data provides many benefits, but it also creates several challenges because of its:

  • Huge size
  • High speed
  • Complex nature

Let’s understand each challenge one by one with simple examples.

1. Data Storage and Management

Why is it a challenge?

Big Data is extremely large and can reach:

  • TBs (Terabytes)
  • PBs (Petabytes)
  • EBs (Exabytes)

Traditional storage systems cannot store such massive data efficiently.

Problems

  • Requires huge disk space
  • Needs distributed storage systems
  • Hardware cost increases
  • Data gets spread across many servers

Example

Facebook, Google, and Amazon generate enormous amounts of data every day.

A single server cannot store all this data.

2. Data Processing Speed

Why is this difficult?

Big Data is generated very quickly.

Examples of fast data generation

  • Online transactions
  • Social media posts
  • IoT sensor readings
  • GPS tracking

Traditional systems are too slow for real-time processing.

Real-Life Examples

  • Stock market prices change within milliseconds
  • Google Maps updates traffic every second

Organizations use fast technologies like Apache Spark for quick processing.

3. Data Variety

Why is variety a challenge?

Big Data comes in different formats:

Types of Data

  1. Structured data
  2. Semi-structured data
  3. Unstructured data

Problem with Unstructured Data

Images, videos, and audio files are difficult to:

  • Store
  • Process
  • Analyze

Example

Analyzing millions of YouTube videos requires powerful computing systems and advanced tools.

4. Data Quality (Veracity Issues)

What is the challenge?

Big Data often contains:

  • Incomplete data
  • Duplicate records
  • Incorrect information
  • Noise (unwanted data)

Poor-quality data can produce wrong results.

Example

Fake likes and comments on social media can mislead sentiment analysis.

If data is incorrect, business decisions may also become incorrect.
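The cleaning step that guards against these quality problems can be sketched in a few lines. This is a minimal, illustrative example (the record fields `user` and `rating` are invented), not a production data-quality pipeline:

```python
# Minimal data-cleaning sketch: drop incomplete records, then duplicates.
records = [
    {"user": "A", "rating": 5},
    {"user": "A", "rating": 5},      # duplicate record
    {"user": "B", "rating": None},   # incomplete data
    {"user": "C", "rating": 4},
]

def clean(records):
    seen = set()
    cleaned = []
    for r in records:
        if r["rating"] is None:      # discard incomplete data
            continue
        key = (r["user"], r["rating"])
        if key in seen:              # discard duplicate records
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned

cleaned = clean(records)             # only A and C survive
```

Real systems add many more checks (range validation, outlier and noise filtering), but the principle is the same: filter bad records before analysis, or the analysis inherits their errors.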

5. Data Security and Privacy

Why is this a major challenge?

Big Data includes sensitive information from:

  • Social media
  • Banks
  • Hospitals
  • IoT devices

This data is vulnerable to:

  • Hacking
  • Cyberattacks
  • Unauthorized access

Examples

  • Bank data breaches
  • Social media privacy leaks

Organizations must use:

  • Encryption
  • Authentication
  • Security monitoring

6. Data Integration

Why is integration difficult?

Data comes from many different sources:

  • Websites
  • Mobile apps
  • Sensors
  • Databases
  • Cloud platforms

Combining all this data correctly is very challenging.

Example

A company may collect customer data from:

  • Website purchases
  • Mobile app activity
  • Offline store transactions

Merging all records accurately is difficult.
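When every source shares a common customer ID, the merge itself is simple; a toy sketch (source names and fields are invented for illustration):

```python
# Toy data-integration sketch: merge per-customer records from three
# sources keyed on a shared customer ID.
website = {"c1": {"purchases": 3}}
mobile_app = {"c1": {"sessions": 12}, "c2": {"sessions": 4}}
offline_store = {"c2": {"visits": 1}}

def integrate(*sources):
    merged = {}
    for source in sources:
        for cid, fields in source.items():
            # combine fields from every source under one customer profile
            merged.setdefault(cid, {}).update(fields)
    return merged

profiles = integrate(website, mobile_app, offline_store)
```

The hard part in practice is exactly what this sketch assumes away: sources rarely share clean, consistent IDs, so matching the same customer across systems (entity resolution) is where most integration effort goes.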

7. Scalability Issues

What is the challenge?

As data grows, systems must also grow.

Traditional systems use:

  • Vertical scaling (upgrading one machine)

Big Data systems require:

  • Horizontal scaling (adding more machines)

Problems

  • Infrastructure becomes complex
  • Network management becomes difficult
  • Cost increases

Example

Netflix continuously adds more servers as the number of users increases worldwide.

8. High Cost of Big Data Technologies

Why is it expensive?

Big Data systems often require:

  • Large server clusters
  • Cloud storage
  • High-performance hardware
  • Skilled professionals

Example

Running Hadoop or Spark clusters requires many servers and maintenance teams.

Even cloud services can become costly for huge datasets.

9. Shortage of Skilled Professionals

Why is this difficult?

Big Data technologies are complex.

Companies need experts in:

  • Hadoop
  • Spark
  • NoSQL
  • Machine Learning
  • Cloud computing

Problem

Experienced professionals are:

  • Limited
  • Expensive

Example

Hiring a Big Data engineer or data scientist can be costly for small companies.

10. Real-Time Data Analysis

What is the challenge?

Many applications require instant data analysis.

This needs:

  • Fast processing engines
  • Low-latency networks
  • High availability systems

Example

Bank fraud detection systems must identify suspicious transactions immediately.

Even a few seconds of delay can cause financial loss.
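A real fraud-detection system uses trained models, but the core idea of an instant check can be shown with a toy rule: flag any transaction far above the customer's recent average. All numbers and the threshold factor here are illustrative:

```python
from statistics import mean

# Recent transaction amounts for one customer (illustrative values).
history = [120, 95, 110, 130]

def is_suspicious(amount, history, factor=5):
    # Toy rule: flag amounts more than `factor` times the recent average.
    return amount > factor * mean(history)

assert is_suspicious(2500, history)       # far above the usual pattern
assert not is_suspicious(150, history)    # within the usual pattern
```

Because the rule is a single comparison, it runs in microseconds per transaction, which is what "real-time" demands.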

11. Data Governance and Compliance

Why is this important?

Organizations must follow strict data laws and regulations.

Rules include:

  • GDPR
  • HIPAA

These laws control:

  • Data collection
  • Storage
  • Sharing
  • Usage

Problem

Failure to follow these laws can result in:

  • Heavy penalties
  • Legal issues

Example

Healthcare companies must protect patient medical records carefully.

12. Data Visualization

Why is visualization difficult?

Big Data is huge and complex.

Creating meaningful:

  • Charts
  • Dashboards
  • Graphs

becomes challenging.

Requirements

Visualizations must be:

  • Accurate
  • Easy to understand
  • Real-time

Example

Displaying live sales data from millions of online transactions requires advanced dashboards.

13. Fault Tolerance and System Failure

Why is this a challenge?

Big Data systems use many distributed machines.

If one machine fails:

  • Data may be lost
  • The system may crash

Required Solutions

  • Data replication
  • Backup systems
  • Recovery mechanisms

Example

Hadoop stores multiple copies of data so that if one server fails, data is still available from another server.
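The replication idea can be simulated in a few lines. This sketch assumes HDFS's default replication factor of 3 (node and block names are invented):

```python
import random

# Simulated cluster: place each block on 3 distinct nodes, HDFS-style.
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2"]
REPLICATION = 3

placement = {b: random.sample(nodes, REPLICATION) for b in blocks}

# Fail one node: every block still has at least two surviving copies,
# because each block lives on 3 of the 4 nodes.
failed = "node2"
for block, holders in placement.items():
    survivors = [n for n in holders if n != failed]
    assert len(survivors) >= 2
```

This is why a single server failure in a Hadoop cluster causes no data loss: reads are simply redirected to a surviving replica while the system re-replicates in the background.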

Summary Table of Big Data Challenges

| Challenge | Description | Example |
| --- | --- | --- |
| Storage | Huge data requires distributed storage | Facebook data |
| Processing Speed | Fast data generation | Stock market updates |
| Variety | Multiple data formats | Videos, images |
| Data Quality | Incorrect or duplicate data | Fake social media activity |
| Security | Risk of hacking and privacy issues | Bank data breaches |
| Integration | Combining data from many sources | Website + app data |
| Scalability | Systems must grow with data | Netflix servers |
| Cost | Infrastructure and tools are expensive | Hadoop clusters |
| Skilled Workforce | Experts are limited | Data scientists |
| Real-Time Analysis | Instant processing required | Fraud detection |
| Governance | Legal compliance required | GDPR rules |
| Visualization | Difficult to display huge data | Real-time dashboards |
| Fault Tolerance | Machine failure risks | Hadoop replication |

One-Line Conclusion

The biggest challenge of Big Data is managing huge, fast, and complex data securely, accurately, and efficiently in real time.

7. Technologies Available for Big Data

Big Data technologies are used to:

  • Store huge data
  • Process data quickly
  • Analyze data
  • Visualize insights
  • Handle real-time streaming

These technologies are divided into different categories.

Categories of Big Data Technologies

  1. Storage Technologies
  2. Processing Technologies
  3. Databases (NoSQL & NewSQL)
  4. Analytics & Machine Learning Tools
  5. Data Ingestion & ETL Tools
  6. Data Visualization Tools
  7. Cloud-Based Big Data Platforms

1. Big Data Storage Technologies

These technologies store massive amounts of data across multiple machines.

A. Hadoop Distributed File System (HDFS)

What is HDFS?

HDFS is a distributed storage system used in Hadoop.

It stores data across many computers instead of one single machine.

Features

  • Fault tolerant
  • Highly scalable
  • Cost-effective
  • Stores structured and unstructured data

Example

If one server fails, HDFS automatically retrieves data from another server copy.

Real-Life Example

YouTube videos can be stored across thousands of servers using distributed storage.

B. Google File System (GFS)

What is GFS?

A distributed file system developed by Google.

It inspired the creation of Hadoop and HDFS.

Features

  • Highly scalable
  • Handles huge datasets
  • Distributed storage

Example

Google Search stores billions of web pages using distributed file systems.

C. Cloud Storage Systems

Modern Big Data storage often uses cloud platforms.

Examples

  • Amazon S3
  • Google Cloud Storage
  • Microsoft Azure Blob Storage

Features

  • On-demand storage
  • Highly scalable
  • Secure
  • No physical hardware needed

Example

Netflix stores huge video content using cloud storage systems.

2. Big Data Processing Technologies

These technologies process and analyze huge datasets.

A. MapReduce

What is MapReduce?

A processing model used in Hadoop.

It divides large tasks into smaller tasks and processes them across many machines.

Components

Map Phase

Splits the task into smaller parts.

Reduce Phase

Combines the results.

Use

Batch processing of large datasets.

Example

Counting the number of words in millions of documents.

B. Apache Spark

Why is Spark Important?

Spark is much faster than MapReduce because it uses in-memory processing.

Features

  • Real-time processing
  • Fast analytics
  • Machine learning support

Spark Components

  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX

Example

Netflix uses Spark for movie recommendation systems.

C. Apache Flink

What is Flink?

A real-time data streaming engine.

Features

  • Low latency
  • Real-time analytics
  • Fast stream processing

Uses

  • Banking
  • Fraud detection
  • IoT systems

Example

Detecting suspicious bank transactions instantly.

D. Apache Storm

What is Storm?

A real-time processing framework.

Used For

  • Twitter streams
  • Weather monitoring
  • Live analytics

Example

Analyzing live tweets during sports events.

E. Apache Samza

What is Samza?

A distributed stream-processing system that works with Kafka.

Uses

  • Real-time pipelines
  • Streaming analytics

Example

Processing live customer activity in e-commerce systems.

3. Databases for Big Data (NoSQL & NewSQL)

Traditional SQL databases struggle with Big Data, so special databases are used.

A. NoSQL Databases

NoSQL databases handle large, flexible, and unstructured data.

Types of NoSQL Databases

| Type | Example |
| --- | --- |
| Document-based | MongoDB |
| Column-based | Cassandra, HBase |
| Key-Value Store | Redis |
| Graph Database | Neo4j |

MongoDB

Features

  • Document-oriented database
  • Stores JSON-like documents
  • Flexible schema

Example

{
  "name": "Rahul",
  "city": "Delhi"
}

Used in web and mobile applications.
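The "flexible schema" property means two documents in the same collection can have different fields. A stdlib-only sketch of that idea using Python dicts (the documents are invented; real MongoDB access would go through a driver such as pymongo):

```python
import json

# Two documents in the same "collection" with different fields —
# no schema migration is needed to add a field to one document only.
users = [
    {"name": "Rahul", "city": "Delhi"},
    {"name": "Priya", "city": "Mumbai", "interests": ["cricket", "music"]},
]

# Each document serializes independently, just like MongoDB's BSON records.
serialized = [json.dumps(doc) for doc in users]
```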

Cassandra

Features

  • Highly scalable
  • Distributed database
  • High availability

Used By

  • Netflix
  • Facebook

Example

Handling millions of user requests simultaneously.

HBase

Features

  • Column-oriented database
  • Works with HDFS
  • Handles massive datasets

Example

Storing billions of records in Hadoop systems.

B. NewSQL Databases

What is NewSQL?

Combines:

  • SQL features
  • Big Data scalability
  • High performance

Examples

  • Google Spanner
  • VoltDB
  • CockroachDB

Example

Large banking systems needing both scalability and transaction safety.

4. Big Data Analytics & Machine Learning Tools

These tools analyze data and generate insights.

A. Apache Hive

What is Hive?

A SQL-like tool for Hadoop.

Features

  • Data warehousing
  • Converts SQL queries into MapReduce jobs

Example

Analyzing sales data using SQL queries on Hadoop.

B. Apache Pig

What is Pig?

A scripting platform for processing Big Data.

Uses a language called Pig Latin.

Example

Transforming and cleaning huge datasets.

C. R Programming

Used For

  • Statistical analysis
  • Data visualization
  • Research work

Example

Predicting election results using statistical models.

D. Python Libraries

Popular Python libraries include:

  • Pandas
  • NumPy
  • SciPy
  • Matplotlib
  • Scikit-learn

Uses

  • Data analysis
  • Machine learning
  • Visualization

Example

Building AI prediction models.

E. Apache Mahout

What is Mahout?

A machine learning framework for Hadoop.

Uses

  • Clustering
  • Classification
  • Recommendation systems

Example

Movie recommendation systems.

F. RapidMiner

Features

  • Drag-and-drop analytics tool
  • No coding required

Uses

  • Machine learning
  • Data mining

Example

Business analysts creating predictive models easily.

5. Data Ingestion & ETL Tools

These tools collect and move data into Big Data systems.

A. Apache Kafka

What is Kafka?

A high-speed messaging and streaming platform.

Used By

  • Uber
  • Netflix
  • LinkedIn

Example

Processing millions of real-time messages.
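Kafka's core model is an append-only log that decouples producers from consumers: each consumer tracks its own read position, so the same messages can be read independently by many consumers. A stdlib-only simulation of that idea (this is not the Kafka API; topic and consumer names are invented):

```python
from collections import deque

# A toy "topic": producers append, each consumer keeps its own offset.
topic = deque()
offsets = {"analytics": 0, "alerts": 0}

def produce(message):
    topic.append(message)

def consume(consumer):
    """Return this consumer's unread messages without removing them."""
    start = offsets[consumer]
    messages = list(topic)[start:]
    offsets[consumer] = len(topic)
    return messages

produce("order#1")
produce("order#2")
assert consume("analytics") == ["order#1", "order#2"]
produce("order#3")
assert consume("analytics") == ["order#3"]  # only messages since last read
assert consume("alerts") == ["order#1", "order#2", "order#3"]
```

Because messages are retained rather than deleted on read, slow consumers never cause fast producers to block — the property that lets Kafka absorb millions of messages per second.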

B. Apache Sqoop

What is Sqoop?

Transfers data between:

  • Hadoop
  • Relational databases

Example

Moving customer records from MySQL to Hadoop.

C. Apache Flume

What is Flume?

Collects log and event data from servers.

Example

Collecting website visitor logs.

D. Talend

What is Talend?

An ETL (Extract, Transform, Load) tool.

Uses

  • Data integration
  • Connecting multiple systems

Example

Combining data from websites, apps, and databases.

6. Big Data Visualization Tools

Visualization tools convert complex data into charts and dashboards.

A. Tableau

Features

  • Interactive dashboards
  • Business intelligence reports

Example

Company sales analysis dashboard.

B. Power BI

Developed By

Microsoft

Features

  • Data visualization
  • Excel integration
  • Cloud support

Example

Analyzing monthly business performance.

C. QlikView / Qlik Sense

Features

  • Enterprise reporting
  • Visual analytics

Example

Large company performance tracking.

D. Google Data Studio

Features

  • Free cloud-based visualization
  • Interactive reports

Example

Website traffic analysis.

7. Cloud Platforms for Big Data

Cloud platforms provide storage and processing services for Big Data.

A. Amazon Web Services (AWS)

Big Data Tools

  • EMR
  • Redshift
  • AWS Glue
  • Kinesis

Example

Streaming and analyzing online shopping data.

B. Google Cloud Platform (GCP)

Tools

  • BigQuery
  • Dataproc
  • Dataflow
  • Cloud Storage

Example

Analyzing petabytes of search data.

C. Microsoft Azure

Tools

  • Azure HDInsight
  • Azure Databricks
  • Data Lake Storage

Example

Enterprise-level Big Data analytics.

Summary Table

| Category | Technologies |
| --- | --- |
| Storage | HDFS, GFS, Cloud Storage |
| Processing | MapReduce, Spark, Flink |
| Databases | MongoDB, Cassandra, HBase |
| Analytics | Hive, Pig, Python, R |
| Ingestion | Kafka, Sqoop, Flume |
| Visualization | Tableau, Power BI |
| Cloud | AWS, GCP, Azure |

 

8. Infrastructure for Big Data

Big Data Infrastructure means the complete setup of hardware, software, storage, network, and tools used to store, process, manage, and analyze huge amounts of data.

Big Data infrastructure is designed to handle the 3Vs:

  1. Volume → Huge amount of data
  2. Velocity → Fast speed of data generation
  3. Variety → Different types of data (text, video, images, logs, etc.)

Example

Companies like Netflix and Amazon generate terabytes of data every day.
Normal databases cannot manage such huge data, so Big Data infrastructure is required.

1. Storage Infrastructure

Storage infrastructure stores massive amounts of data across many machines.

Traditional databases store data on one server, but Big Data uses distributed storage systems.

Main Storage Technologies

1. HDFS (Hadoop Distributed File System)

  • Stores data across many computers (nodes)
  • Fault tolerant (data is safe even if one machine fails)
  • Highly scalable

Example

If a company stores 500 TB of customer data, HDFS divides the data into small blocks and stores them on multiple servers.
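The block arithmetic behind this is simple. Assuming HDFS's common default block size of 128 MB (the file sizes below are illustrative):

```python
import math

BLOCK_MB = 128  # a common HDFS default block size

def num_blocks(file_mb):
    # A file is split into fixed-size blocks; the last block may be partial.
    return math.ceil(file_mb / BLOCK_MB)

assert num_blocks(1024) == 8                        # 1 GB -> 8 blocks
assert num_blocks(500 * 1024 * 1024) == 4_096_000   # 500 TB -> ~4.1 million blocks
```

Each of those blocks is then replicated and spread across the cluster's DataNodes, which is why no single server ever has to hold the whole file.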

2. Cloud Storage

Cloud platforms provide unlimited scalable storage.

Examples

  • Amazon Web Services S3
  • Google Cloud Storage
  • Microsoft Azure Blob Storage

Example

A video streaming company stores millions of videos in cloud storage instead of local hard disks.

3. Data Lakes

A Data Lake stores raw and unprocessed data.

It can store:

  • Structured data
  • Semi-structured data
  • Unstructured data

Example

A hospital stores all of the following together in a data lake:

  • Patient records
  • X-ray images
  • Audio reports
  • Sensor data

4. Distributed File Systems

Special systems designed for distributed storage.

Examples

  • Google File System (GFS)
  • GlusterFS
  • CephFS

Example

Google stores search engine data using distributed file systems spread across data centers.

2. Compute / Processing Infrastructure

This infrastructure processes and analyzes Big Data.

Instead of one computer, many computers work together in parallel.

Processing Frameworks

1. MapReduce

  • Batch processing framework
  • Breaks one large task into smaller tasks

Example

To count word frequency in 1 million documents:

  • Map phase counts words
  • Reduce phase combines results

2. Apache Spark

  • Fast in-memory processing
  • Supports real-time analytics

Example

Banks use Spark to detect fraudulent transactions instantly.

3. Apache Flink / Storm / Samza

Used for real-time stream processing.

Example

Stock market apps analyze live trading data every second using stream processing tools.

4. Distributed Clusters

Thousands of servers work together as one system.

Example

Facebook uses large server clusters to process user activity data.

3. Database Infrastructure

Big Data uses different databases because all data is not structured.

NoSQL Databases

Examples

  • MongoDB
  • Cassandra
  • HBase
  • Redis
  • Neo4j

Why NoSQL?

  • Handles unstructured data
  • Fast read/write operations
  • Easy horizontal scaling

Example

A social media platform stores posts, comments, and images using MongoDB.

SQL / NewSQL Databases

Examples

  • Google Spanner
  • VoltDB
  • CockroachDB

Example

Banking systems use NewSQL databases for fast and reliable transactions.

4. Data Ingestion Infrastructure

This layer collects data from multiple sources and sends it into Big Data systems.

Tools

1. Apache Kafka

  • High-speed data streaming platform

Example

Uber uses Kafka to process ride requests in real time.

2. Apache Flume

Used for collecting log data.

Example

Web server logs are collected continuously using Flume.

3. Apache Sqoop

Transfers data between SQL databases and Hadoop.

Example

A company transfers MySQL customer data into Hadoop for analytics.

4. Apache NiFi

Automates and manages data pipelines.

Example

IoT sensor data is automatically collected and transferred using NiFi.

5. Networking Infrastructure

Big Data systems require fast and secure networks.

Requirements

  • High bandwidth
  • Low latency
  • Secure communication
  • Load balancing

Example

In a Hadoop cluster, huge data blocks move between servers, so fast Ethernet networks are necessary.

6. Server & Hardware Infrastructure

Big Data requires many machines working together.

Hardware Components

1. Commodity Servers

Low-cost servers used in clusters.

Example

A Hadoop cluster may contain hundreds of low-cost servers.

2. CPU / GPU Servers

  • CPUs handle general processing
  • GPUs are used for AI and machine learning

Example

AI companies use GPU servers for deep learning.

3. Memory (RAM)

Large RAM is needed for Spark’s in-memory processing.

Example

Spark keeps data in RAM for faster analytics.
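The benefit of keeping data in memory can be shown on a tiny scale with a cache: a result computed once is reused instead of being recomputed (or re-read from disk). This is only an analogy for Spark's in-memory RDD caching, not Spark code:

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def expensive_aggregate(key):
    # Stands in for a costly scan over a large on-disk dataset.
    calls["count"] += 1
    return sum(range(1000)) + key

expensive_aggregate(1)
expensive_aggregate(1)   # second call is served from memory
assert calls["count"] == 1
```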

4. Storage Disks

  • SSD → Fast access
  • HDD → Large storage capacity

Example

SSDs are used for real-time analytics systems.

5. Clusters

Many machines connected together.

Example

A cluster of 100 servers processes data simultaneously.

7. Processing Framework Infrastructure

These tools manage cluster resources and job scheduling.

Tools

1. YARN

Manages Hadoop cluster resources.

Example

YARN decides which application gets CPU and memory resources.

2. Mesos

Shares resources among applications.

Example

Multiple Big Data applications run together using Mesos.

3. Kubernetes

Manages containerized applications.

Example

Companies deploy Spark applications on Kubernetes clusters.

8. Visualization Infrastructure

Visualization tools display Big Data insights in charts and dashboards.

Tools

  • Tableau
  • Microsoft Power BI
  • QlikView
  • Google Data Studio
  • Apache Superset

Example

A sales dashboard shows:

  • Monthly profit
  • Customer trends
  • Product performance

using Tableau or Power BI.

9. Security Infrastructure

Big Data contains sensitive information, so strong security is required.

Components

1. Data Encryption

Protects data from hackers.

Example

Bank transaction data is encrypted before storage.

2. Authentication

Example: Kerberos verifies user identity.

3. Authorization

Tools:

  • Ranger
  • Sentry

Example

Only managers can access financial reports.

4. Firewall Protection

Blocks unauthorized network access.

5. Auditing Systems

Tracks who accessed the data.

Example

Hospitals maintain audit logs of patient data access.

10. Cloud Infrastructure for Big Data

Cloud platforms are widely used because they provide:

  • Low cost
  • High scalability
  • Easy maintenance

Major Cloud Platforms

1. Amazon Web Services

Services:

  • EMR
  • S3
  • Redshift
  • Glue
  • Kinesis

Example

A company uses AWS EMR for Hadoop processing and S3 for storage.

2. Google Cloud

Services:

  • BigQuery
  • Dataflow
  • Dataproc

Example

BigQuery analyzes billions of records in seconds.

3. Microsoft Azure

Services:

  • HDInsight
  • Azure Databricks
  • Azure Data Lake

Example

Azure Databricks is used for AI and Big Data analytics.

Real-Life Example

Netflix uses:

  • Cloud storage for movies
  • Kafka for streaming data
  • Spark for recommendations
  • Visualization dashboards for analytics
  • Security systems for user privacy

This complete setup forms a Big Data infrastructure.

9. Uses of Data Analytics

Data Analytics is the process of examining raw data to discover:

  • Useful insights
  • Patterns
  • Trends
  • Hidden information

It helps organizations make better decisions, improve performance, reduce costs, and predict future outcomes.

Data Analytics is widely used in:

  • Business
  • Healthcare
  • Banking
  • Education
  • Government
  • Sports
  • Social Media
  • Transportation

1. Business Analytics

Purpose

  • Improve business performance
  • Make data-driven decisions
  • Increase profits

Applications

1. Customer Analytics

Studies customer behavior, interests, and buying patterns.

Example

Amazon recommends products based on previous purchases and search history.

2. Sales & Marketing Analytics

Used to:

  • Predict sales trends
  • Improve advertising
  • Optimize pricing

Example

Netflix suggests movies and shows based on viewing history.

3. Supply Chain & Inventory Management

Helps maintain proper stock levels.

Example

Walmart uses analytics to avoid overstocking and understocking products.

4. Fraud Detection

Detects suspicious activities in real time.

Example

Banks identify unusual credit card transactions using analytics.

2. Healthcare Analytics

Purpose

  • Improve patient care
  • Reduce medical costs
  • Predict diseases

Applications

1. Predictive Analytics

Predicts disease risks using patient data.

Example

Hospitals predict heart attack risk using patient health records.

2. Operational Analytics

Improves hospital management and staff allocation.

Example

Analytics helps hospitals reduce waiting times.

3. Medical Research

Analyzes:

  • Clinical trials
  • DNA data
  • Genetic information

Example

Researchers study cancer treatment effectiveness using analytics.

4. Patient Engagement

Provides personalized treatment plans and reminders.

Example

Health apps send medicine reminders to patients.

3. Financial Analytics

Purpose

  • Improve financial decisions
  • Reduce risk
  • Detect fraud

Applications

1. Credit Risk Management

Evaluates loan eligibility.

Example

Banks analyze credit scores before approving loans.

2. Fraud Detection

Detects suspicious banking transactions.

Example

If a card is used in another country suddenly, analytics systems may block it.

3. Investment & Portfolio Analytics

Predicts stock market trends.

Example

Investment firms analyze historical stock prices before investing.

4. Budgeting & Forecasting

Plans future expenses and revenue.

Example

Companies forecast next year’s profits using analytics.

4. Retail and E-commerce Analytics

Purpose

  • Improve customer experience
  • Increase sales

Applications

1. Recommendation Systems

Suggest products based on customer interests.

Example

Flipkart and Amazon recommend products to users.

2. Customer Sentiment Analysis

Analyzes reviews and feedback.

Example

Companies study customer reviews to improve products.

3. Inventory & Pricing Analytics

Optimizes stock and pricing strategies.

Example

Online stores change prices during high demand periods.

5. Manufacturing and Operations Analytics

Purpose

  • Improve efficiency
  • Reduce production costs

Applications

1. Predictive Maintenance

Predicts machine failures before breakdown.

Example

BMW uses analytics to monitor machine performance.

2. Supply Chain Optimization

Improves delivery routes and warehouse management.

Example

Factories optimize transportation routes using analytics.

3. Quality Control

Detects defective products.

Example

Manufacturing companies identify damaged products automatically.

6. Government and Public Sector Analytics

Purpose

  • Improve governance
  • Increase public safety

Applications

1. Crime Analytics

Predicts crime-prone areas.

Example

Police departments deploy officers based on crime analysis.

2. Urban Planning

Analyzes traffic and infrastructure needs.

Example

Governments use traffic data to build new roads.

3. Tax and Revenue Analytics

Detects tax fraud and improves revenue collection.

Example

Tax departments identify suspicious financial activities.

4. Disaster Management

Predicts natural disasters and manages relief efforts.

Example

Weather agencies predict floods using analytics.

7. Telecommunications Analytics

Purpose

  • Improve customer retention
  • Optimize network performance

Applications

1. Churn Prediction

Identifies customers likely to leave the service.

Example

Telecom companies offer discounts to customers planning to switch providers.

2. Network Optimization

Improves internet and call quality.

Example

Analytics helps reduce mobile network downtime.

3. Fraud Detection

Detects fake calls and SIM misuse.

Example

Telecom companies identify suspicious calling patterns.

8. Sports Analytics

Purpose

  • Improve player performance
  • Develop better game strategies

Applications

1. Player Performance Analytics

Tracks:

  • Fitness
  • Injuries
  • Performance

Example

Cricket teams analyze player statistics before matches.

2. Game Strategy Analytics

Studies opponent strengths and weaknesses.

Example

Football teams use analytics to plan defensive strategies.

3. Fan Engagement

Provides personalized content to fans.

Example

Sports apps recommend highlights based on user interests.

9. Education Analytics

Purpose

  • Improve student performance
  • Enhance learning systems

Applications

1. Student Performance Analysis

Identifies weak students.

Example

Schools provide extra support to low-performing students.

2. Course Recommendation

Suggests suitable courses.

Example

Online learning platforms recommend courses based on interests.

3. Administrative Planning

Optimizes:

  • Classroom allocation
  • Staff scheduling
  • Resource management

Example

Universities manage timetables using analytics.

10. Social Media Analytics

Purpose

  • Understand user behavior
  • Improve marketing

Applications

1. Sentiment Analysis

Determines public opinion.

Example

Companies analyze tweets and comments about products.

2. Trend Analysis

Identifies trending topics and hashtags.

Example

Social media platforms track viral topics.

3. Influencer Analytics

Measures influencer impact.

Example

Brands analyze engagement rates before collaborations.

11. Transportation & Logistics Analytics

Purpose

  • Reduce transportation costs
  • Improve delivery operations

Applications

1. Route Optimization

Finds shortest and fastest routes.

Example

Uber and Google Maps use traffic analytics for route suggestions.

2. Predictive Maintenance

Predicts vehicle failures.

Example

Logistics companies monitor truck conditions using analytics.

3. Supply Chain Analytics

Improves warehouse and delivery management.

Example

E-commerce companies optimize delivery schedules.

12. Energy & Utilities Analytics

Purpose

  • Optimize energy usage
  • Reduce operational costs

Applications

1. Smart Grid Analytics

Monitors electricity distribution.

Example

Power companies balance electricity supply using analytics.

2. Predictive Maintenance

Detects equipment faults early.

Example

Electricity plants monitor turbine health continuously.

3. Consumption Forecasting

Predicts future energy demand.

Example

Energy companies estimate summer electricity usage.

Real-Life Example

Amazon uses data analytics for:

  • Product recommendations
  • Customer behavior analysis
  • Inventory management
  • Fraud detection
  • Delivery optimization

This helps the company provide faster and smarter services to customers.


10. Hadoop

What is Hadoop?

Apache Hadoop is an open-source framework used for:

  • Storing huge amounts of data
  • Processing Big Data
  • Distributed computing

It was developed by Doug Cutting and Mike Cafarella in 2005.

Hadoop was inspired by:

  • Google MapReduce
  • Google File System (GFS)

It is managed by the Apache Software Foundation.

Why Hadoop is Needed

Traditional databases cannot efficiently handle Big Data because:

  • Data size is extremely large (TBs to PBs)
  • Data comes in different formats
  • Data is generated very fast
  • Traditional systems are expensive and difficult to scale

Hadoop solves these problems by providing:

  • Scalability
  • Fault tolerance
  • Distributed storage
  • Parallel processing
  • Low-cost infrastructure

Example of Hadoop

Example

Facebook stores and analyzes billions of posts, likes, comments, and images using Hadoop clusters.

Without Hadoop, processing such huge data would be very slow and expensive.

Key Features of Hadoop

1. Open Source

  • Free to use
  • Anyone can modify and distribute it

Example

Companies can customize Hadoop according to their business needs without paying license fees.

2. Scalable

Hadoop can grow by adding more machines (nodes).

Example

If storage becomes full, new servers can simply be added to the cluster.

3. Fault Tolerant

Data is automatically copied to multiple nodes.

Example

If one server fails, data can still be accessed from another server copy.

4. Cost-Effective

Uses low-cost commodity hardware.

Example

Organizations use normal servers instead of expensive supercomputers.

5. High Throughput

Processes large volumes of data efficiently.

Example

Hadoop can process terabytes of log data in parallel.

6. Flexibility

Can handle:

  • Structured data
  • Semi-structured data
  • Unstructured data

Example

Hadoop stores:

  • Text files
  • Images
  • Videos
  • Sensor data
  • Social media posts

7. Distributed Processing

Processing happens near the data location.

This is called the Data Locality Principle.

Example

Instead of moving huge data across the network, Hadoop sends computation to the node where data already exists.

Hadoop Ecosystem / Components

Hadoop is a complete ecosystem with multiple components.

1. HDFS (Hadoop Distributed File System)

Purpose

Stores Big Data across multiple machines.

Features

  • Fault tolerant
  • Highly scalable
  • Stores any type of data

Example

A 1 TB file is divided into smaller blocks and stored across different DataNodes.

2. MapReduce

Purpose

Processes large datasets in parallel.

Phases

Map Phase

Converts input data into key-value pairs.

Reduce Phase

Combines and summarizes results.

Example of MapReduce

Suppose we count words in documents:

Input

"Big Data Hadoop Hadoop"

Map Output

(Big,1)
(Data,1)
(Hadoop,1)
(Hadoop,1)

Reduce Output

Big = 1
Data = 1
Hadoop = 2
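The same word count can be written directly in the MapReduce style. This is a single-machine Python sketch of the two phases shown above, not Hadoop's Java API:

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) key-value pair for every word.
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

pairs = map_phase("Big Data Hadoop Hadoop")
result = reduce_phase(pairs)
assert result == {"Big": 1, "Data": 1, "Hadoop": 2}
```

On a real cluster, the map calls run in parallel on the nodes holding the input blocks, and the framework groups the pairs by key (the shuffle) before the reduce calls run.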

3. YARN (Yet Another Resource Negotiator)

Purpose

Manages resources in the Hadoop cluster.

Functions

  • Allocates CPU and memory
  • Schedules jobs
  • Monitors tasks

Example

If multiple users submit jobs, YARN decides resource allocation.

4. Hadoop Common

Contains common libraries and utilities required by Hadoop modules.

Example

Provides APIs and tools used by HDFS and MapReduce.

Hadoop Architecture

Hadoop follows a Master-Slave Architecture.

1. HDFS Architecture

NameNode (Master)

Functions

  • Manages metadata
  • Controls file locations
  • Maintains permissions

Example

Tracks where file blocks are stored.

DataNode (Slave)

Functions

  • Stores actual data
  • Performs read/write operations

Example

Stores chunks of video files across servers.

2. MapReduce Architecture (Hadoop 1.x)

In Hadoop 1.x, job scheduling used a JobTracker and TaskTrackers; in Hadoop 2 and later, YARN took over resource management and scheduling.

JobTracker (Master)

  • Assigns tasks
  • Monitors job execution

TaskTracker (Slave)

  • Executes tasks on nodes

Example

TaskTrackers process data blocks simultaneously.

3. YARN Architecture

ResourceManager (Master)

Allocates cluster resources.

NodeManager (Slave)

Manages tasks on each node.

Hadoop Workflow

Step 1

Data is stored in HDFS.

Step 2

A MapReduce job is submitted.

Step 3

The job is divided into smaller tasks.

Step 4

Tasks run on nodes where data is stored.

Step 5

Reduce phase combines outputs.

Step 6

Final result is stored back in HDFS.

Advantages of Hadoop

| Advantage | Explanation |
| --- | --- |
| Scalability | Easily add more nodes |
| Fault Tolerance | Data replication prevents loss |
| Cost-Effective | Uses cheap hardware |
| Flexibility | Handles all data types |
| High Throughput | Processes huge datasets efficiently |
| Open Source | Free to use |

Limitations of Hadoop

1. Not Good for Small Data

Overhead is high for small datasets.

Example

Using Hadoop for a few MBs of data is unnecessary.

2. Complex Programming

MapReduce programming can be difficult for beginners.

3. Limited Real-Time Processing

Hadoop mainly supports batch processing.

Example

It is slower for live streaming analytics.

4. High Latency

Slower compared to in-memory systems like Spark.

Hadoop Ecosystem Tools

1. Apache Hive

Used for SQL-like queries on Hadoop data.

Example

Analysts use Hive to query sales data using SQL syntax.

2. Apache Pig

Used for data transformation scripts.

Example

Converts raw logs into structured reports.

3. Apache HBase

Column-oriented NoSQL database built on Hadoop.

Example

Stores billions of user records.

4. Apache Sqoop

Transfers data between SQL databases and Hadoop.

Example

Imports MySQL data into HDFS.

5. Apache Flume

Collects log data into Hadoop.

Example

Collects website traffic logs continuously.

6. Apache Oozie

Schedules Hadoop jobs.

Example

Runs daily data processing automatically.

7. Apache Mahout

Provides machine learning algorithms.

Example

Recommendation systems use Mahout algorithms.

Applications of Hadoop

1. Social Media Analytics

Analyzes user activity and trends.

Example

Twitter analyzes tweets and hashtags.

2. E-commerce

Used for recommendations and customer analytics.

Example

Amazon suggests products based on customer behavior.

3. Banking & Finance

Used for:

  • Fraud detection
  • Risk analysis

Example

Banks monitor unusual transactions using Hadoop.

4. Healthcare

Analyzes patient records and disease patterns.

Example

Hospitals predict disease risks using medical data.

5. Telecom

Analyzes network traffic and call records.

Example

Telecom companies detect network failures using Hadoop analytics.

6. Government

Used for:

  • Census analysis
  • Crime analysis
  • Policy planning

Example

Governments analyze population data for development planning.
