Data Mining and Warehousing Notes

C
DSA
Software Engineering
Software Architecture
Operating System
Big Data
Data Mining and Warehousing
TOC
Ada
CPP
DBMS

All Topics (21)

  • 1. What is Data Warehouse ?
  • 2. Key Characteristics of a Data Warehouse
  • 3. Need, Importance & Components of Data Warehousing
  • 4. Data Warehouse vs OLTP
  • 5. Architecture, Advantages & Applications of a Data Warehouse
  • 6. Data Warehouse (DW) Delivery Process
  • 7. Data Warehouse Architecture
  • 8. Data Cleaning
  • 9. Data Integration
  • 10. Data Transformation
  • 11. Data Reduction
  • 12. Data Warehouse Schema Design
  • 13. Partitioning Strategy in Data Warehouse Implementation
  • 14. Data Marts
  • 15. Metadata
  • 16. Example of a Multidimensional Data Model
  • 17. Pattern Warehousing
  • 18. OLAP Systems
  • 19. OLAP Queries
  • 20. OLAP Servers
  • 21. OLAP Operations

11. Data Reduction

Data Reduction is the process of reducing the volume of data while maintaining its integrity and analytical value.

Why Data Reduction is Needed

In data warehousing, datasets are very large. Data reduction helps to:

  • Improve query performance
  • Reduce storage cost
  • Speed up data mining
  • Make analysis faster and easier

Goal

To obtain a reduced representation of data that:

  • Is smaller in size
  • Produces almost the same analytical results

Types of Data Reduction Techniques

1. Data Cube Aggregation

Data is stored at detailed (low-level) form and then aggregated into higher-level summaries.

Examples:

  • Daily sales → Monthly sales
  • Monthly → Yearly
  • City → State → Country

Used in:

  • OLAP Cubes
  • Roll-up operations
  • Summary tables

2. Dimensionality Reduction

Reduces the number of attributes (features) in the dataset.

Techniques:

  • PCA (Principal Component Analysis)
  • Feature Selection:
    • Correlation
    • Chi-square
    • RFE (Recursive Feature Elimination)
  • Wavelet Transform

Benefits:

  • Reduces model complexity
  • Removes irrelevant/redundant data
  • Improves accuracy and speed

3. Data Compression

Stores data in a compact format.

Types:

  • Lossless Compression
    • No data loss
    • Examples: ZIP, Huffman Coding
  • Lossy Compression
    • Some data loss allowed
    • Used in images, audio, video
  • Other Methods
    • Run-Length Encoding
    • String Compression

4. Numerosity Reduction

Replaces the original dataset with a smaller representation.

(a) Parametric Methods

Assume a model and store only its parameters.

Examples:

  • Regression Models
  • Log-linear Models

 Store model instead of full data.

(b) Non-Parametric Methods

No model assumption.

Examples:

  • Histograms
  • Clustering (e.g., K-means)
  • Sampling

5. Sampling

Selecting a small but representative subset of data.

Types:

  • Simple Random Sampling
  • Stratified Sampling
  • Systematic Sampling

Use:

  • When full dataset is too large to process

6. Discretization and Binning

Reduces the number of distinct values by grouping.

Examples:

  • Age (1–100) → (0–10, 11–20, …)
  • Income → Low, Medium, High

Benefits:

  • Reduces data size
  • Simplifies analysis
  • Improves model performance

12. Data Warehouse Schema Design

A schema in a data warehouse defines how data is organized and how tables are logically connected.

It is based on Dimensional Modeling, which divides data into:

  • Fact Tables → Store numeric measures (Sales, Revenue, Quantity)
  • Dimension Tables → Store descriptive attributes (Time, Product, Customer)

 Schema design determines how fact and dimension tables are arranged.

Types of Data Warehouse Schemas

1. Star Schema

Definition:
The simplest and most commonly used schema.

Structure:

  • One central Fact Table
  • Multiple surrounding Dimension Tables
  • All dimensions directly connected to the fact table

Diagram (Text):

 
Product Dimension
|
Customer — Fact Table — Time
|
Store Dimension
 

Features:

  • Easy to understand
  • Fast query performance
  • Denormalized dimension tables (redundancy allowed)

Use Cases:

  • Retail sales
  • Finance dashboards
  • OLAP queries

2. Snowflake Schema

Definition:
A normalized version of the star schema.

Structure:

  • One central Fact Table
  • Dimension tables further divided into sub-dimensions (normalized)

Diagram (Text):

         Product Category
                     |
Product — Fact Table — Time
                     |
Customer → Customer City
 

Features:

  • Reduced data redundancy
  • More complex joins
  • Saves storage space

Use Cases:

  • Large and complex data warehouses
  • Systems with many attributes

3. Galaxy Schema (Fact Constellation Schema)

Definition:
A schema with multiple fact tables sharing common dimension tables.

Structure:

  • Multiple fact tables (e.g., Sales Fact, Inventory Fact)
  • Shared dimension tables (Product, Time, Store)

Diagram (Text):

      Product Dimension
            /                 \
Sales Fact       Inventory Fact
          \                     /
         Time Dimension

Features:

  • Handles complex business scenarios
  • Supports multiple data marts
  • Flexible but complex

Use Cases:

  • Enterprise-level data warehouses
  • Large organizations

Components of a Data Warehouse Schema

1. Fact Table

  • Stores numerical/measurable data
  • Contains foreign keys to dimension tables

Examples:

  • Sales Amount
  • Quantity
  • Profit

2. Dimension Table

  • Stores descriptive attributes

Examples:

  • Time Dimension
  • Product Dimension
  • Customer Dimension
  • Store / Location Dimension

3. Measures and Metrics

  • Calculated values used for analysis

Examples:

  • SUM(Sales)
  • AVG(Profit)

4. Hierarchies

Used for Drill-Down and Roll-Up operations in OLAP.

Examples:

  • Day → Month → Quarter → Year
  • City → State → Country

Comparison of Schemas

Feature Star Schema Snowflake Schema Galaxy Schema
Normalization Low High Medium
Query Speed Fast Slower Moderate
Complexity Simple Complex High
Storage Large Smaller Medium
Fact Tables One One Many
Dimensions Denormalized Normalized Shared

 

13. Partitioning Strategy in Data Warehouse Implementation

Partitioning is a physical design technique used in data warehouses to divide large tables (especially fact tables) into smaller, manageable pieces called partitions.

Why Partitioning is Needed

Data warehouses store massive volumes of historical data (often billions of rows). Partitioning helps in:

  • Faster Query Processing
    Only relevant partitions are scanned
  • Faster ETL & Data Loading
    Only new/recent partitions are updated
  • Easy Data Archiving / Purging
    Old partitions can be dropped without affecting others
  • Better Indexing & Storage Management
    Smaller, more efficient indexes

Types of Partitioning Strategies

1. Range Partitioning (Most Common)

Data is divided based on a continuous range of values.

Example (Date-based):

  • Partition 1 → Jan 2024
  • Partition 2 → Feb 2024
  • Partition 3 → Mar 2024

Advantages:

  • Best for time-based data
  • Easy archiving and deletion
  • Efficient for range queries (e.g., last 3 months)

2. List Partitioning

Data is divided based on a predefined list of values.

Example (Region-based):

  • Partition A → USA, Canada
  • Partition B → India, Nepal, Sri Lanka
  • Partition C → UK, Germany, France

Advantages:

  • Good for categorical data
  • Faster queries on specific categories

3. Hash Partitioning

Data is distributed using a hash function.

Example:

Partition = HASH(Customer_ID) mod 4

Advantages:

  • Uniform data distribution
  • Avoids data skew
  • Useful when no natural partition key exists

4. Composite Partitioning (Hybrid)

Combination of two or more partitioning techniques.

Example (Range + Hash):

  • First partition by Month
  • Then apply hash on Customer_ID

Advantages:

  • Suitable for very large datasets
  • Balances performance and distribution

5. Column-based Partitioning (Columnar Storage)

Used in modern data warehouses.

Features:

  • Stores data column-wise instead of row-wise
  • High compression
  • Very fast analytical queries

Examples of systems:

  • BigQuery
  • Snowflake
  • Amazon Redshift

How Partitioning Works (Steps)

Step 1: Identify Large Tables

  • Usually fact tables with billions of rows

Step 2: Select Partition Key

Common Keys:

  • Date
  • Product Category
  • Region
  • Department
  • Customer ID

Step 3: Choose Partition Type

  • Range
  • List
  • Hash
  • Composite

Step 4: Create Partitions

  • Table is divided into multiple segments on disk

Step 5: Manage Partitions

  • Add new partitions (e.g., new month/year)
  • Drop old partitions (archival)
  • Merge or split partitions

Partition Pruning

Definition:
Query engines automatically skip irrelevant partitions.

Example Query:

 
SELECT * FROM sales_fact
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';
 

 Only January partition is scanned, not the full table.

Result:
 Faster query performance

 Benefits of Partitioning

Benefit Explanation
Faster Query Performance Only required partitions are scanned
Better ETL Speed Incremental data loading
Easy Archival Drop old partitions easily
Improved Index Efficiency Smaller indexes per partition
High Availability Operations at partition level
Manageability Easier maintenance

Example Scenario

Table: Sales_Fact
Partition Key: Sale_Date
Partition Type: Range Partitioning

Partition Name Date Range
SALES_2023_Q1 Jan – Mar 2023
SALES_2023_Q2 Apr – Jun 2023
SALES_2023_Q3 Jul – Sep 2023
SALES_2023_Q4 Oct – Dec 2023

Operations You Can Perform

  • Archive old partition (e.g., SALES_2021)
  • Add new partition (e.g., SALES_2025)
  • Query only recent data (last 6 months)

14. Data Marts

A Data Mart is a subset of a data warehouse that focuses on a specific business area such as Sales, Marketing, Finance, HR, or Inventory.

 It is smaller, faster, and easier to manage than a full data warehouse.

Formal Definition

A Data Mart is a subject-oriented, integrated, and optimized collection of data designed to serve the needs of a particular department or business function.

Why Use Data Marts

  • Provide quick access to relevant data
  • Support department-specific reporting
  • Reduce load on the main Data Warehouse
  • Improve performance and simplify analytics
  • Lower storage and management cost

Types of Data Marts

1. Dependent Data Mart

Definition:
Created from the Enterprise Data Warehouse (EDW).

Structure:

 
Operational Systems → Data Warehouse → Data Mart
 

Features:

  • Uses centralized and clean data
  • Highly consistent and reliable

Example:
Sales Data Mart extracted from EDW

2. Independent Data Mart

Definition:
Created directly from operational systems without a data warehouse.

Structure:

 
Operational Systems → Data Mart
 

Features:

  • Faster to build
  • May cause data inconsistency

Example:
Marketing Data Mart built from CRM system

3. Hybrid Data Mart

Definition:
Combination of both dependent and independent approaches.

Features:

  • Uses both EDW and operational systems
  • Flexible and faster implementation

Example:
Finance Data Mart using EDW + external banking data

Data Mart Architecture

A Data Mart consists of 3 layers:

1. Source Layer

  • Operational databases
  • Data warehouse
  • External data sources

2. ETL Layer (Extract – Transform – Load)

  • Data cleaning
  • Data integration
  • Data transformation
  • Loading into data mart

3. Data Mart Storage / Presentation Layer

  • Fact tables
  • Dimension tables
  • OLAP cubes
  • Reports and dashboards

Schema Used in Data Marts

Data marts typically use Dimensional Modeling:

  • Star Schema
  • Snowflake Schema

Data Mart Components

1. Fact Tables

  • Store transactional data
  • Contain measures

Examples:

  • Sales amount
  • Revenue
  • Profit

2. Dimension Tables

  • Store descriptive attributes

Examples:

  • Time
  • Product
  • Customer
  • Region

Examples of Data Marts

1. Sales Data Mart

  • Measures: Sales amount, Quantity
  • Dimensions: Time, Product, Store

2. Finance Data Mart

  • Measures: Revenue, Expenses, Profit
  • Dimensions: Time, Department, Account

3. HR Data Mart

  • Measures: Employee count, Salary
  • Dimensions: Department, Location, Job Role

Advantages of Data Marts

Advantage Explanation
Faster Access Smaller dataset → quick queries
Better Performance Optimized for specific tasks
Lower Cost Less storage and processing
Easy to Manage Can be handled by small teams
Department-Focused Tailored for specific needs
Reduces DW Load Offloads queries from main warehouse

Disadvantages of Data Marts

Disadvantage Explanation
Data Inconsistency Occurs in independent marts
Data Redundancy Same data stored multiple times
Limited Scope Only covers one department
Integration Issues Hard to combine multiple marts

 

15. Metadata

Metadata means “data about data.”
It describes, explains, and gives meaning to the data stored in a data warehouse.

Formal Definition

Metadata is descriptive information that defines the structure, operations, rules, and contents of data in a data warehouse, enabling users and systems to understand and manage the data.

What Metadata Helps Users Understand

  • What the data is
  • Where it came from
  • How it is structured
  • How it should be used

Why Metadata is Important

  • Helps users understand data definitions
  • Helps ETL developers track data movement
  • Provides source-to-target mapping
  • Supports data governance
  • Improves data quality and consistency
  • Helps analysts generate accurate reports

Types of Metadata

1. Technical Metadata

Used by: IT teams, DBAs, ETL developers

Includes:

  • Table names and column names
  • Data types
  • Primary keys and foreign keys
  • Indexes and constraints
  • Schema and table structure
  • File paths and storage details

Example:

  • Customer_ID → Integer (Primary Key)
  • Date → Format: YYYY-MM-DD

2. Business Metadata

Used by: Business users, analysts

Includes:

  • Business definitions
  • KPIs (Key Performance Indicators)
  • Data ownership
  • Business rules
  • Meaning of data fields

Example:

  • Sales_Amount = Total value of products sold (excluding tax)

3. Operational Metadata

Describes: ETL and system operations

Includes:

  • Data load time
  • Record counts (before/after load)
  • Error logs
  • Job execution logs
  • Data source details

Example:

  • ETL Load Time: 02:00 AM
  • Rows Loaded: 2,500,000
  • Failed Rows: 1,200

Metadata in ETL Process

Metadata stores information about:

  • Source extraction rules
  • Transformation logic
  • Mapping between source and target
  • Load schedules
  • Job dependencies

Metadata Repository (Metadata Catalog)

Definition:
A Metadata Repository is a central storage location where all metadata is stored.

Used By:

  • ETL Tools
  • BI Tools
  • Data Warehouse Administrators
  • Analysts

Examples of Metadata Tools:

  • Informatica Repository
  • SSIS Catalog
  • AWS Glue Data Catalog
  • Hive Metastore

Examples of Metadata

1. Technical Metadata

  • Table: sales_fact
  • Column: sale_date (DATE)
  • Column: amount (NUMBER, NOT NULL)

2. Business Metadata

  • Revenue = Total sales after discount

3. Operational Metadata

  • Last ETL Run: 1 March 2025, 3:00 AM
  • Status: Success

Advantages of Metadata

Benefit Explanation
Improves Data Understanding Clear meaning of data
Better Data Quality Detects errors and inconsistencies
Easier ETL Development Provides clear documentation
Faster Data Retrieval Helps optimize queries
Supports Governance Tracks data origin and lineage

 

Page 3 of 5