Data Mining and Warehousing Notes

C
DSA
Software Engineering
Software Architecture
Operating System
Big Data
Data Mining and Warehousing
TOC
Ada
CPP
DBMS

All Topics (21)

  • 1. What is Data Warehouse ?
  • 2. Key Characteristics of a Data Warehouse
  • 3. Need, Importance & Components of Data Warehousing
  • 4. Data Warehouse vs OLTP
  • 5. Architecture, Advantages & Applications of a Data Warehouse
  • 6. Data Warehouse (DW) Delivery Process
  • 7. Data Warehouse Architecture
  • 8. Data Cleaning
  • 9. Data Integration
  • 10. Data Transformation
  • 11. Data Reduction
  • 12. Data Warehouse Schema Design
  • 13. Partitioning Strategy in Data Warehouse Implementation
  • 14. Data Marts
  • 15. Metadata
  • 16. Example of a Multidimensional Data Model
  • 17. Pattern Warehousing
  • 18. OLAP Systems
  • 19. OLAP Queries
  • 20. OLAP Servers
  • 21. OLAP Operations

16. Example of a Multidimensional Data Model

A Multidimensional Data Model organizes data into:

  • Facts → Numeric/measurable data
  • Dimensions → Descriptive attributes

 It allows analysis from multiple perspectives using OLAP operations like:

  • Slice
  • Dice
  • Drill-down
  • Roll-up
  • Pivot

Scenario: Sales Analysis

We want to analyze sales across:

  • Time
  • Product
  • Location

 This forms a 3-Dimensional Data Cube

1. Fact Table: Sales_Fact

Attribute Description
product_id Foreign Key → Product Dimension
time_id Foreign Key → Time Dimension
location_id Foreign Key → Location Dimension
sales_amount Measure
quantity_sold Measure

Measures:

  • Sales Amount
  • Quantity Sold

2. Dimension Tables

A. Product Dimension

product_id product_name category brand price
1 Laptop Electronics Dell 60000
2 Mobile Electronics Samsung 20000

B. Time Dimension

time_id day month quarter year
101 12 March Q1 2024
102 15 March Q1 2024

C. Location Dimension (Example)

location_id city state country
201 Mumbai Maharashtra India
202 Delhi Delhi India

Multidimensional Cube Representation

                             Time (Year/Quarter/Month)
                                             ↑
                                             |
Location (City/State) ← Sales Cube → Product (Category/Brand)
 

 Each cell in the cube stores a value like:

Sales_Amount = f(Product, Time, Location)

Example Data Cube Values

Product Time Location Sales Amount Quantity Sold
Laptop March 2024 Mumbai 25,00,000 42
Mobile March 2024 Delhi 8,00,000 60

OLAP Operations on Data Cube

 1. Slice

Select a single dimension value.

Example:

  • Sales for March 2024
     Result: Product × Location

 2. Dice

Select multiple values from multiple dimensions.

Example:

  • Products: Laptop, Mobile
  • Time: Q1 2024
  • Location: Mumbai, Delhi

 3. Drill-Down

Move from higher level to detailed level.

Example:

  • Year → Quarter → Month → Day

4. Roll-Up

Aggregate data to a higher level.

Example:

  • City → State → Country

 5. Pivot (Rotate)

Reorient the cube (change axes).

Example:

  • Swap Product and Time dimensions

17. Pattern Warehousing

Pattern Warehousing is an extension of traditional data warehousing where patterns, knowledge, and insights discovered from data mining are stored instead of raw data.

Formal Definition

Pattern Warehousing is a data warehousing approach that stores patterns, rules, trends, or models extracted from data mining processes to enable faster and more intelligent decision-making.

Key Idea

  • Traditional Data Warehouse → Stores raw data
  • Pattern Warehouse → Stores patterns derived from data

Types of Patterns Stored

  • Association Rules
  • Classification Models
  • Clustering Patterns
  • Sequential Patterns

 This avoids repeated data mining and saves time & computation.

Architecture of Pattern Warehousing

1. Operational Data Sources

  • Raw transactional data from multiple systems

2. Data Warehouse

  • Cleaned, integrated, historical data

3. Data Mining Layer

  • Extracts:
    • Patterns
    • Rules
    • Models

4. Pattern Warehouse

  • Stores discovered patterns for reuse

5. Decision Support / OLAP Layer

  • Users query patterns instead of raw data

Types of Patterns Stored (Detailed)

Pattern Type Description
Association Rules “Customers who buy A also buy B”
Classification Models Predict categories (e.g., spam/not spam)
Clustering Patterns Groups similar data points
Sequential Patterns Order of events over time

Advantages of Pattern Warehousing

  • Faster decision-making (pre-computed patterns)
  • Reduces repeated data mining
  • Quick discovery of insights
  • Integrates patterns from multiple sources
  • Supports advanced analytics and forecasting

Example: Retail Scenario

Step 1: Data Warehouse Stores

  • Customer purchases
  • Product information
  • Time and location data

Step 2: Data Mining Discovers

  • Association Rule:
     “70% of customers who buy bread also buy butter”
  • Seasonal Trend:
     “Ice cream sales increase by 50% in summer”

Step 3: Pattern Warehouse Stores

  • These rules and trends

Step 4: Usage

  • Management queries patterns directly
  • Helps in:
    • Marketing campaigns
    • Product placement
    • Sales strategy

18. OLAP Systems

OLAP (Online Analytical Processing) is a technology that allows users to analyze large amounts of data from multiple perspectives quickly and interactively, mainly for decision-making.

It is different from OLTP (transaction systems) because:

  • OLTP → handles daily operations (insert/update)
  • OLAP → handles analysis, reporting, and trends

Key Points

  • Works on historical and aggregated data
  • Provides multidimensional views
  • Supports advanced analysis operations:
    • Slice
    • Dice
    • Drill-down
    • Roll-up

 Characteristics of OLAP

Feature Description
Multidimensional Data stored in cubes (Time, Product, Location)
Aggregated Data Precomputed summaries for fast results
Interactive Analysis Users explore data dynamically
Complex Queries Handles trends, patterns, comparisons
Time-Variant Supports historical data

 Components of OLAP Systems

1. Fact Table

  • Stores numerical data (measures)
  • Example: Sales, Profit, Revenue

2. Dimension Table

  • Stores descriptive attributes
  • Example: Time, Product, Customer, Region

3. Data Cube

  • Multidimensional structure combining facts + dimensions

4. OLAP Server (Engine)

  • Processes queries
  • Performs:
    • Aggregation
    • Drill-down
    • Roll-up

5. Client Tools

  • Used for:
    • Reports
    • Dashboards
    • Visualization

 Types of OLAP Systems

Type Description Example
MOLAP Data stored in cube format Microsoft Analysis Services
ROLAP Data stored in relational tables Oracle OLAP
HOLAP Combination of MOLAP + ROLAP SQL Server HOLAP
DOLAP Runs on desktop Personal analytics tools
WOLAP Web-based OLAP Web BI tools

 OLAP Operations (Very Important)

1. Slice

  • Select one dimension value
  • Reduces cube

  Example:
Sales in March 2024

2. Dice

  • Select multiple dimension values
  • Creates sub-cube

  Example:
Sales of Laptops & Mobiles in Mumbai & Delhi

3. Drill-Down

  • Go from summary → detail

  Example:
Year → Quarter → Month → Day

4. Roll-Up (Aggregation)

  • Go from detail → summary

  Example:
City → State → Country

5. Pivot (Rotate)

  • Change data view (swap axes)

  Example:
Swap Time and Product

 OLAP Data Models

1. Multidimensional Model

  • Data cube format
  • Example: Sales(Time × Product × Region)

2. Star Schema

  • One fact table
  • Connected to denormalized dimension tables

3. Snowflake Schema

  • Dimension tables are normalized
  • More complex but saves space

 Advantages of OLAP

  • Fast query performance
  • Handles complex analysis
  • Helps identify:
    • Trends
    • Patterns
    • Exceptions
  • Interactive data exploration
  • Supports decision-making

 Limitations of OLAP

  • Expensive for very large datasets
  • Requires preprocessing (aggregation)
  • Complex setup
  • ROLAP can be slower due to joins

 Example: Retail Sales Analysis

Dimensions

  • Product
  • Time
  • Location

Measures

  • Sales Amount
  • Quantity Sold

Operations Example

Operation Example
Slice Sales in March 2024
Dice Laptops & Mobiles in Mumbai & Delhi
Drill-down Q1 → January → 12 March
Roll-up City → State → Country

 

19. OLAP Queries

OLAP Queries are used to analyze data stored in multidimensional cubes.
They help in getting:

  • Summarized data
  • Aggregated results
  • Detailed insights

 Unlike normal SQL (OLTP), OLAP queries focus on analysis, not transactions.

 Basic Concepts

Concept Meaning Real-Life Example
Fact Table Stores numeric data Sales amount, profit
Dimension Table Provides context Time, Product, City
Cube Multidimensional data Sales by Time × Product × Location
Measures Values to analyze Revenue, Quantity
Dimensions Ways to view data Year, City, Product

 OLAP Query Operations (With Real-Life Examples)

1. Slice

 Concept:

Select one value from a dimension

Real-Life Example:

A store manager wants to see:
“Total sales only for March 2024”

 Query:

SELECT product, location, SUM(sales_amount)
FROM sales_cube
WHERE time = 'March 2024'
GROUP BY product, location;
 

2. Dice

 Concept:

Select multiple values from multiple dimensions

 Real-Life Example:

A regional manager asks:
“Sales of Laptops & Mobiles in Mumbai & Delhi during Q1 2024”

 Query:

SELECT product, location, SUM(sales_amount)
FROM sales_cube
WHERE product IN ('Laptop','Mobile')
AND location IN ('Mumbai','Delhi')
AND time BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY product, location;
 

3. Drill-Down

 Concept:

Go from summary → detailed data

 Real-Life Example:

CEO sees:
 Q1 sales report
Then wants:
 Monthly → Daily breakdown

 Query:

SELECT day, SUM(sales_amount)
FROM sales_cube
WHERE quarter = 'Q1 2024'
GROUP BY day;
 

4. Roll-Up (Consolidation)

 Concept:

Go from detailed → summarized data

 Real-Life Example:

A company wants:
 City-wise sales → State-wise → Country-wise summary

 Query:

SELECT state, SUM(sales_amount)
FROM sales_cube
GROUP BY state;
 

5. Pivot (Rotate)

 Concept:

Change the view of data (rotate axes)

 Real-Life Example:

In Excel/dashboard:
  Rows = Product
  Columns = Time

Then swapped to:
  Rows = Time
  Columns = Product

  No SQL needed — done in tools like dashboards.

6. Ranking / Top-N Query

  Concept:

Find Top or Bottom performers

  Real-Life Example:

Sales head asks:
 “Top 3 selling products in March”

  Query:

SELECT product, SUM(sales_amount) AS total_sales
FROM sales_cube
WHERE time = 'March 2024'
GROUP BY product
ORDER BY total_sales DESC
LIMIT 3;
 

7. Drill-Across

  Concept:

Combine data from multiple fact tables

  Real-Life Example:

Business wants:
  Compare Sales vs Inventory

  “Are we selling more than we stock?”

  Query:

SELECT s.product, SUM(s.sales_amount), SUM(i.quantity)
FROM sales_fact s
JOIN inventory_fact i
ON s.product_id = i.product_id
GROUP BY s.product;
 

 Quick Summary Table

Operation Meaning Real-Life Example
Slice One dimension Sales in March
Dice Multiple filters Laptops in Mumbai & Delhi
Drill-Down Summary → Detail Year → Month → Day
Roll-Up Detail → Summary City → State
Pivot Change view Swap rows & columns
Top-N Best performers Top 3 products
Drill-Across Multiple tables Sales + Inventory

 

20. OLAP Servers

An OLAP Server is the main engine that:

  • Stores data (cube or tables)
  • Processes queries
  • Provides fast analytical results

 It helps users analyze multidimensional data quickly from a data warehouse.

 Types of OLAP Servers

There are 3 main types:

  1. MOLAP
  2. ROLAP
  3. HOLAP

 1. MOLAP (Multidimensional OLAP)

 Concept

  • Data stored in multidimensional cubes
  • Aggregations are precomputed

 Real-Life Example

Imagine a shopping mall dashboard:

 Manager clicks:

  • “Show total sales for 2024”

 Result comes instantly because:

  • Data is already stored in cube format

 Features

  • Uses cube storage (not tables)
  • Pre-calculates totals (SUM, AVG)
  • Very fast performance

 Advantages

  • Extremely fast queries ⚡
  • Best for reports & dashboards
  • Handles complex calculations easily

 Disadvantages

  • Not good for very large data
  • Storage grows quickly
  • Less flexible for new queries

 Examples

  • Microsoft Analysis Services (SSAS)
  • Oracle Essbase
  • IBM Cognos TM1

 2. ROLAP (Relational OLAP)

 Concept

  • Data stored in relational tables (database)
  • Uses SQL queries for analysis

 Real-Life Example

Think of Amazon-like system:

 You ask:

  • “Show last 5 years sales data”

 System:

  • Fetches data from huge database tables
  • Calculates results on demand

 Features

  • No cube storage
  • Uses existing databases
  • Aggregation done dynamically

 Advantages

  • Handles very large datasets
  • Flexible for ad-hoc queries
  • Uses existing DB systems

 Disadvantages

  • Slower than MOLAP
  • Heavy load on database
  • Complex queries take time

 Examples

  • Oracle OLAP
  • IBM Cognos (ROLAP mode)
  • MicroStrategy

 3. HOLAP (Hybrid OLAP)

 Concept

  • Combines MOLAP + ROLAP

 Stores:

  • Summary → Cube
  • Detailed data → Tables

 Real-Life Example

Think of a large e-commerce company:

 Dashboard shows:

  • Total sales →  fast (cube)

 But when you click:

  • “View all orders of 2024”

 It fetches:

  • Detailed data from database

 Features

  • Fast summary + detailed access
  • Balanced approach

 Advantages

  • Best of both worlds
  • Fast + scalable
  • Good for mixed workloads

 Disadvantages

  • Complex to design
  • Needs synchronization
  • Slightly harder to maintain

 Examples

  • Microsoft SSAS (HOLAP mode)
  • SAP BW
  • IBM Cognos TM1 HOLAP

 Comparison Table (Very Important for Exams)

Feature MOLAP ROLAP HOLAP
Storage Cube Tables Cube + Tables
Speed Very Fast Slow Medium/Fast
Scalability Medium High High
Flexibility Low High Medium
Data Size Small–Medium Very Large Large
Best Use Dashboards Big Data Analysis Mixed Use

 

Page 4 of 5