Data Mining and Warehousing Notes
All Topics (21)
- 1. What is Data Warehouse ?
- 2. Key Characteristics of a Data Warehouse
- 3. Need, Importance & Components of Data Warehousing
- 4. Data Warehouse vs OLTP
- 5. Architecture, Advantages & Applications of a Data Warehouse
- 6. Data Warehouse (DW) Delivery Process
- 7. Data Warehouse Architecture
- 8. Data Cleaning
- 9. Data Integration
- 10. Data Transformation
- 11. Data Reduction
- 12. Data Warehouse Schema Design
- 13. Partitioning Strategy in Data Warehouse Implementation
- 14. Data Marts
- 15. Metadata
- 16. Example of a Multidimensional Data Model
- 17. Pattern Warehousing
- 18. OLAP Systems
- 19. OLAP Queries
- 20. OLAP Servers
- 21. OLAP Operations
16. Example of a Multidimensional Data Model
A Multidimensional Data Model organizes data into:
- Facts → Numeric/measurable data
- Dimensions → Descriptive attributes
It allows analysis from multiple perspectives using OLAP operations like:
- Slice
- Dice
- Drill-down
- Roll-up
- Pivot
Scenario: Sales Analysis
We want to analyze sales across:
- Time
- Product
- Location
This forms a 3-Dimensional Data Cube
1. Fact Table: Sales_Fact
| Attribute | Description |
|---|---|
| product_id | Foreign Key → Product Dimension |
| time_id | Foreign Key → Time Dimension |
| location_id | Foreign Key → Location Dimension |
| sales_amount | Measure |
| quantity_sold | Measure |
Measures:
- Sales Amount
- Quantity Sold
2. Dimension Tables
A. Product Dimension
| product_id | product_name | category | brand | price |
|---|---|---|---|---|
| 1 | Laptop | Electronics | Dell | 60000 |
| 2 | Mobile | Electronics | Samsung | 20000 |
B. Time Dimension
| time_id | day | month | quarter | year |
|---|---|---|---|---|
| 101 | 12 | March | Q1 | 2024 |
| 102 | 15 | March | Q1 | 2024 |
C. Location Dimension (Example)
| location_id | city | state | country |
|---|---|---|---|
| 201 | Mumbai | Maharashtra | India |
| 202 | Delhi | Delhi | India |
Multidimensional Cube Representation
↑
|
Location (City/State) ← Sales Cube → Product (Category/Brand)
Each cell in the cube stores a value like:
Sales_Amount = f(Product, Time, Location)
Example Data Cube Values
| Product | Time | Location | Sales Amount | Quantity Sold |
|---|---|---|---|---|
| Laptop | March 2024 | Mumbai | 25,00,000 | 42 |
| Mobile | March 2024 | Delhi | 8,00,000 | 60 |
OLAP Operations on Data Cube
1. Slice
Select a single dimension value.
Example:
- Sales for March 2024
Result: Product × Location
2. Dice
Select multiple values from multiple dimensions.
Example:
- Products: Laptop, Mobile
- Time: Q1 2024
- Location: Mumbai, Delhi
3. Drill-Down
Move from higher level to detailed level.
Example:
- Year → Quarter → Month → Day
4. Roll-Up
Aggregate data to a higher level.
Example:
- City → State → Country
5. Pivot (Rotate)
Reorient the cube (change axes).
Example:
- Swap Product and Time dimensions
17. Pattern Warehousing
Pattern Warehousing is an extension of traditional data warehousing where patterns, knowledge, and insights discovered from data mining are stored instead of raw data.
Formal Definition
Pattern Warehousing is a data warehousing approach that stores patterns, rules, trends, or models extracted from data mining processes to enable faster and more intelligent decision-making.
Key Idea
- Traditional Data Warehouse → Stores raw data
- Pattern Warehouse → Stores patterns derived from data
Types of Patterns Stored
- Association Rules
- Classification Models
- Clustering Patterns
- Sequential Patterns
This avoids repeated data mining and saves time & computation.
Architecture of Pattern Warehousing
1. Operational Data Sources
- Raw transactional data from multiple systems
2. Data Warehouse
- Cleaned, integrated, historical data
3. Data Mining Layer
- Extracts:
- Patterns
- Rules
- Models
4. Pattern Warehouse
- Stores discovered patterns for reuse
5. Decision Support / OLAP Layer
- Users query patterns instead of raw data
Types of Patterns Stored (Detailed)
| Pattern Type | Description |
|---|---|
| Association Rules | “Customers who buy A also buy B” |
| Classification Models | Predict categories (e.g., spam/not spam) |
| Clustering Patterns | Groups similar data points |
| Sequential Patterns | Order of events over time |
Advantages of Pattern Warehousing
- Faster decision-making (pre-computed patterns)
- Reduces repeated data mining
- Quick discovery of insights
- Integrates patterns from multiple sources
- Supports advanced analytics and forecasting
Example: Retail Scenario
Step 1: Data Warehouse Stores
- Customer purchases
- Product information
- Time and location data
Step 2: Data Mining Discovers
- Association Rule:
“70% of customers who buy bread also buy butter” - Seasonal Trend:
“Ice cream sales increase by 50% in summer”
Step 3: Pattern Warehouse Stores
- These rules and trends
Step 4: Usage
- Management queries patterns directly
- Helps in:
- Marketing campaigns
- Product placement
- Sales strategy
18. OLAP Systems
OLAP (Online Analytical Processing) is a technology that allows users to analyze large amounts of data from multiple perspectives quickly and interactively, mainly for decision-making.
It is different from OLTP (transaction systems) because:
- OLTP → handles daily operations (insert/update)
- OLAP → handles analysis, reporting, and trends
Key Points
- Works on historical and aggregated data
- Provides multidimensional views
- Supports advanced analysis operations:
- Slice
- Dice
- Drill-down
- Roll-up
Characteristics of OLAP
| Feature | Description |
|---|---|
| Multidimensional | Data stored in cubes (Time, Product, Location) |
| Aggregated Data | Precomputed summaries for fast results |
| Interactive Analysis | Users explore data dynamically |
| Complex Queries | Handles trends, patterns, comparisons |
| Time-Variant | Supports historical data |
Components of OLAP Systems
1. Fact Table
- Stores numerical data (measures)
- Example: Sales, Profit, Revenue
2. Dimension Table
- Stores descriptive attributes
- Example: Time, Product, Customer, Region
3. Data Cube
- Multidimensional structure combining facts + dimensions
4. OLAP Server (Engine)
- Processes queries
- Performs:
- Aggregation
- Drill-down
- Roll-up
5. Client Tools
- Used for:
- Reports
- Dashboards
- Visualization
Types of OLAP Systems
| Type | Description | Example |
|---|---|---|
| MOLAP | Data stored in cube format | Microsoft Analysis Services |
| ROLAP | Data stored in relational tables | Oracle OLAP |
| HOLAP | Combination of MOLAP + ROLAP | SQL Server HOLAP |
| DOLAP | Runs on desktop | Personal analytics tools |
| WOLAP | Web-based OLAP | Web BI tools |
OLAP Operations (Very Important)
1. Slice
- Select one dimension value
- Reduces cube
Example:
Sales in March 2024
2. Dice
- Select multiple dimension values
- Creates sub-cube
Example:
Sales of Laptops & Mobiles in Mumbai & Delhi
3. Drill-Down
- Go from summary → detail
Example:
Year → Quarter → Month → Day
4. Roll-Up (Aggregation)
- Go from detail → summary
Example:
City → State → Country
5. Pivot (Rotate)
- Change data view (swap axes)
Example:
Swap Time and Product
OLAP Data Models
1. Multidimensional Model
- Data cube format
- Example: Sales(Time × Product × Region)
2. Star Schema
- One fact table
- Connected to denormalized dimension tables
3. Snowflake Schema
- Dimension tables are normalized
- More complex but saves space
Advantages of OLAP
- Fast query performance
- Handles complex analysis
- Helps identify:
- Trends
- Patterns
- Exceptions
- Interactive data exploration
- Supports decision-making
Limitations of OLAP
- Expensive for very large datasets
- Requires preprocessing (aggregation)
- Complex setup
- ROLAP can be slower due to joins
Example: Retail Sales Analysis
Dimensions
- Product
- Time
- Location
Measures
- Sales Amount
- Quantity Sold
Operations Example
| Operation | Example |
|---|---|
| Slice | Sales in March 2024 |
| Dice | Laptops & Mobiles in Mumbai & Delhi |
| Drill-down | Q1 → January → 12 March |
| Roll-up | City → State → Country |
19. OLAP Queries
OLAP Queries are used to analyze data stored in multidimensional cubes.
They help in getting:
- Summarized data
- Aggregated results
- Detailed insights
Unlike normal SQL (OLTP), OLAP queries focus on analysis, not transactions.
Basic Concepts
| Concept | Meaning | Real-Life Example |
|---|---|---|
| Fact Table | Stores numeric data | Sales amount, profit |
| Dimension Table | Provides context | Time, Product, City |
| Cube | Multidimensional data | Sales by Time × Product × Location |
| Measures | Values to analyze | Revenue, Quantity |
| Dimensions | Ways to view data | Year, City, Product |
OLAP Query Operations (With Real-Life Examples)
1. Slice
Concept:
Select one value from a dimension
Real-Life Example:
A store manager wants to see:
“Total sales only for March 2024”
Query:
SELECT product, location, SUM(sales_amount)
FROM sales_cube
WHERE time = 'March 2024'
GROUP BY product, location;
2. Dice
Concept:
Select multiple values from multiple dimensions
Real-Life Example:
A regional manager asks:
“Sales of Laptops & Mobiles in Mumbai & Delhi during Q1 2024”
Query:
SELECT product, location, SUM(sales_amount)
FROM sales_cube
WHERE product IN ('Laptop','Mobile')
AND location IN ('Mumbai','Delhi')
AND time BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY product, location;
3. Drill-Down
Concept:
Go from summary → detailed data
Real-Life Example:
CEO sees:
Q1 sales report
Then wants:
Monthly → Daily breakdown
Query:
SELECT day, SUM(sales_amount)
FROM sales_cube
WHERE quarter = 'Q1 2024'
GROUP BY day;
4. Roll-Up (Consolidation)
Concept:
Go from detailed → summarized data
Real-Life Example:
A company wants:
City-wise sales → State-wise → Country-wise summary
Query:
SELECT state, SUM(sales_amount)
FROM sales_cube
GROUP BY state;
5. Pivot (Rotate)
Concept:
Change the view of data (rotate axes)
Real-Life Example:
In Excel/dashboard:
Rows = Product
Columns = Time
Then swapped to:
Rows = Time
Columns = Product
No SQL needed — done in tools like dashboards.
6. Ranking / Top-N Query
Concept:
Find Top or Bottom performers
Real-Life Example:
Sales head asks:
“Top 3 selling products in March”
Query:
SELECT product, SUM(sales_amount) AS total_sales
FROM sales_cube
WHERE time = 'March 2024'
GROUP BY product
ORDER BY total_sales DESC
LIMIT 3;
7. Drill-Across
Concept:
Combine data from multiple fact tables
Real-Life Example:
Business wants:
Compare Sales vs Inventory
“Are we selling more than we stock?”
Query:
SELECT s.product, SUM(s.sales_amount), SUM(i.quantity)
FROM sales_fact s
JOIN inventory_fact i
ON s.product_id = i.product_id
GROUP BY s.product;
Quick Summary Table
| Operation | Meaning | Real-Life Example |
|---|---|---|
| Slice | One dimension | Sales in March |
| Dice | Multiple filters | Laptops in Mumbai & Delhi |
| Drill-Down | Summary → Detail | Year → Month → Day |
| Roll-Up | Detail → Summary | City → State |
| Pivot | Change view | Swap rows & columns |
| Top-N | Best performers | Top 3 products |
| Drill-Across | Multiple tables | Sales + Inventory |
20. OLAP Servers
An OLAP Server is the main engine that:
- Stores data (cube or tables)
- Processes queries
- Provides fast analytical results
It helps users analyze multidimensional data quickly from a data warehouse.
Types of OLAP Servers
There are 3 main types:
- MOLAP
- ROLAP
- HOLAP
1. MOLAP (Multidimensional OLAP)
Concept
- Data stored in multidimensional cubes
- Aggregations are precomputed
Real-Life Example
Imagine a shopping mall dashboard:
Manager clicks:
- “Show total sales for 2024”
Result comes instantly because:
- Data is already stored in cube format
Features
- Uses cube storage (not tables)
- Pre-calculates totals (SUM, AVG)
- Very fast performance
Advantages
- Extremely fast queries ⚡
- Best for reports & dashboards
- Handles complex calculations easily
Disadvantages
- Not good for very large data
- Storage grows quickly
- Less flexible for new queries
Examples
- Microsoft Analysis Services (SSAS)
- Oracle Essbase
- IBM Cognos TM1
2. ROLAP (Relational OLAP)
Concept
- Data stored in relational tables (database)
- Uses SQL queries for analysis
Real-Life Example
Think of Amazon-like system:
You ask:
- “Show last 5 years sales data”
System:
- Fetches data from huge database tables
- Calculates results on demand
Features
- No cube storage
- Uses existing databases
- Aggregation done dynamically
Advantages
- Handles very large datasets
- Flexible for ad-hoc queries
- Uses existing DB systems
Disadvantages
- Slower than MOLAP
- Heavy load on database
- Complex queries take time
Examples
- Oracle OLAP
- IBM Cognos (ROLAP mode)
- MicroStrategy
3. HOLAP (Hybrid OLAP)
Concept
- Combines MOLAP + ROLAP
Stores:
- Summary → Cube
- Detailed data → Tables
Real-Life Example
Think of a large e-commerce company:
Dashboard shows:
- Total sales → fast (cube)
But when you click:
- “View all orders of 2024”
It fetches:
- Detailed data from database
Features
- Fast summary + detailed access
- Balanced approach
Advantages
- Best of both worlds
- Fast + scalable
- Good for mixed workloads
Disadvantages
- Complex to design
- Needs synchronization
- Slightly harder to maintain
Examples
- Microsoft SSAS (HOLAP mode)
- SAP BW
- IBM Cognos TM1 HOLAP
Comparison Table (Very Important for Exams)
| Feature | MOLAP | ROLAP | HOLAP |
|---|---|---|---|
| Storage | Cube | Tables | Cube + Tables |
| Speed | Very Fast | Slow | Medium/Fast |
| Scalability | Medium | High | High |
| Flexibility | Low | High | Medium |
| Data Size | Small–Medium | Very Large | Large |
| Best Use | Dashboards | Big Data Analysis | Mixed Use |