Data Mining and Warehousing Notes

DSA

Software Engineering

Software Architecture

Operating System

Big Data

Data Mining and Warehousing

TOC

Ada

CPP

DBMS

All Topics (21)

1. What is Data Warehouse ?
2. Key Characteristics of a Data Warehouse
3. Need, Importance & Components of Data Warehousing
4. Data Warehouse vs OLTP
5. Architecture, Advantages & Applications of a Data Warehouse
6. Data Warehouse (DW) Delivery Process
7. Data Warehouse Architecture
8. Data Cleaning
9. Data Integration
10. Data Transformation
11. Data Reduction
12. Data Warehouse Schema Design
13. Partitioning Strategy in Data Warehouse Implementation
14. Data Marts
15. Metadata
16. Example of a Multidimensional Data Model
17. Pattern Warehousing
18. OLAP Systems
19. OLAP Queries
20. OLAP Servers
21. OLAP Operations

11. Data Reduction

Data Reduction is the process of reducing the volume of data while maintaining its integrity and analytical value.

Why Data Reduction is Needed

In data warehousing, datasets are very large. Data reduction helps to:

Improve query performance
Reduce storage cost
Speed up data mining
Make analysis faster and easier

Goal

To obtain a reduced representation of data that:

Is smaller in size
Produces almost the same analytical results

Types of Data Reduction Techniques

1. Data Cube Aggregation

Data is stored at detailed (low-level) form and then aggregated into higher-level summaries.

Examples:

Daily sales → Monthly sales
Monthly → Yearly
City → State → Country

Used in:

OLAP Cubes
Roll-up operations
Summary tables

2. Dimensionality Reduction

Reduces the number of attributes (features) in the dataset.

Techniques:

PCA (Principal Component Analysis)
Feature Selection:
- Correlation
- Chi-square
- RFE (Recursive Feature Elimination)
Wavelet Transform

Benefits:

Reduces model complexity
Removes irrelevant/redundant data
Improves accuracy and speed

3. Data Compression

Stores data in a compact format.

Types:

Lossless Compression
- No data loss
- Examples: ZIP, Huffman Coding
Lossy Compression
- Some data loss allowed
- Used in images, audio, video
Other Methods
- Run-Length Encoding
- String Compression

4. Numerosity Reduction

Replaces the original dataset with a smaller representation.

(a) Parametric Methods

Assume a model and store only its parameters.

Examples:

Regression Models
Log-linear Models

Store model instead of full data.

(b) Non-Parametric Methods

No model assumption.

Examples:

Histograms
Clustering (e.g., K-means)
Sampling

5. Sampling

Selecting a small but representative subset of data.

Types:

Simple Random Sampling
Stratified Sampling
Systematic Sampling

Use:

When full dataset is too large to process

6. Discretization and Binning

Reduces the number of distinct values by grouping.

Examples:

Age (1–100) → (0–10, 11–20, …)
Income → Low, Medium, High

Benefits:

Reduces data size
Simplifies analysis
Improves model performance

12. Data Warehouse Schema Design

A schema in a data warehouse defines how data is organized and how tables are logically connected.

It is based on Dimensional Modeling, which divides data into:

Fact Tables → Store numeric measures (Sales, Revenue, Quantity)
Dimension Tables → Store descriptive attributes (Time, Product, Customer)

Schema design determines how fact and dimension tables are arranged.

Types of Data Warehouse Schemas

1. Star Schema

Definition:
The simplest and most commonly used schema.

Structure:

One central Fact Table
Multiple surrounding Dimension Tables
All dimensions directly connected to the fact table

Diagram (Text):

Product Dimension
|
Customer — Fact Table — Time
|
Store Dimension

Features:

Easy to understand
Fast query performance
Denormalized dimension tables (redundancy allowed)

Use Cases:

Retail sales
Finance dashboards
OLAP queries

2. Snowflake Schema

Definition:
A normalized version of the star schema.

Structure:

One central Fact Table
Dimension tables further divided into sub-dimensions (normalized)

Diagram (Text):

         Product Category
                     |
Product — Fact Table — Time
                     |
Customer → Customer City

Features:

Reduced data redundancy
More complex joins
Saves storage space

Use Cases:

Large and complex data warehouses
Systems with many attributes

3. Galaxy Schema (Fact Constellation Schema)

Definition:
A schema with multiple fact tables sharing common dimension tables.

Structure:

Multiple fact tables (e.g., Sales Fact, Inventory Fact)
Shared dimension tables (Product, Time, Store)

Diagram (Text):

      Product Dimension
            /                 \
Sales Fact       Inventory Fact
          \                     /
         Time Dimension

Features:

Handles complex business scenarios
Supports multiple data marts
Flexible but complex

Use Cases:

Enterprise-level data warehouses
Large organizations

Components of a Data Warehouse Schema

1. Fact Table

Stores numerical/measurable data
Contains foreign keys to dimension tables

Examples:

Sales Amount
Quantity
Profit

2. Dimension Table

Stores descriptive attributes

Examples:

Time Dimension
Product Dimension
Customer Dimension
Store / Location Dimension

3. Measures and Metrics

Calculated values used for analysis

Examples:

SUM(Sales)
AVG(Profit)

4. Hierarchies

Used for Drill-Down and Roll-Up operations in OLAP.

Examples:

Day → Month → Quarter → Year
City → State → Country

Comparison of Schemas

Feature	Star Schema	Snowflake Schema	Galaxy Schema
Normalization	Low	High	Medium
Query Speed	Fast	Slower	Moderate
Complexity	Simple	Complex	High
Storage	Large	Smaller	Medium
Fact Tables	One	One	Many
Dimensions	Denormalized	Normalized	Shared

13. Partitioning Strategy in Data Warehouse Implementation

Partitioning is a physical design technique used in data warehouses to divide large tables (especially fact tables) into smaller, manageable pieces called partitions.

Why Partitioning is Needed

Data warehouses store massive volumes of historical data (often billions of rows). Partitioning helps in:

Faster Query Processing
Only relevant partitions are scanned
Faster ETL & Data Loading
Only new/recent partitions are updated
Easy Data Archiving / Purging
Old partitions can be dropped without affecting others
Better Indexing & Storage Management
Smaller, more efficient indexes

Types of Partitioning Strategies

1. Range Partitioning (Most Common)

Data is divided based on a continuous range of values.

Example (Date-based):

Partition 1 → Jan 2024
Partition 2 → Feb 2024
Partition 3 → Mar 2024

Advantages:

Best for time-based data
Easy archiving and deletion
Efficient for range queries (e.g., last 3 months)

2. List Partitioning

Data is divided based on a predefined list of values.

Example (Region-based):

Partition A → USA, Canada
Partition B → India, Nepal, Sri Lanka
Partition C → UK, Germany, France

Advantages:

Good for categorical data
Faster queries on specific categories

3. Hash Partitioning

Data is distributed using a hash function.

Example:

Partition = HASH(Customer_ID) mod 4

Advantages:

Uniform data distribution
Avoids data skew
Useful when no natural partition key exists

4. Composite Partitioning (Hybrid)

Combination of two or more partitioning techniques.

Example (Range + Hash):

First partition by Month
Then apply hash on Customer_ID

Advantages:

Suitable for very large datasets
Balances performance and distribution

5. Column-based Partitioning (Columnar Storage)

Used in modern data warehouses.

Features:

Stores data column-wise instead of row-wise
High compression
Very fast analytical queries

Examples of systems:

BigQuery
Snowflake
Amazon Redshift

How Partitioning Works (Steps)

Step 1: Identify Large Tables

Usually fact tables with billions of rows

Step 2: Select Partition Key

Common Keys:

Date
Product Category
Region
Department
Customer ID

Step 3: Choose Partition Type

Range
List
Hash
Composite

Step 4: Create Partitions

Table is divided into multiple segments on disk

Step 5: Manage Partitions

Add new partitions (e.g., new month/year)
Drop old partitions (archival)
Merge or split partitions

Partition Pruning

Definition:
Query engines automatically skip irrelevant partitions.

Example Query:

SELECT * FROM sales_fact
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';

Only January partition is scanned, not the full table.

Result:
Faster query performance

Benefits of Partitioning

Benefit	Explanation
Faster Query Performance	Only required partitions are scanned
Better ETL Speed	Incremental data loading
Easy Archival	Drop old partitions easily
Improved Index Efficiency	Smaller indexes per partition
High Availability	Operations at partition level
Manageability	Easier maintenance

Example Scenario

Table: Sales_Fact
Partition Key: Sale_Date
Partition Type: Range Partitioning

Partition Name	Date Range
SALES_2023_Q1	Jan – Mar 2023
SALES_2023_Q2	Apr – Jun 2023
SALES_2023_Q3	Jul – Sep 2023
SALES_2023_Q4	Oct – Dec 2023

Operations You Can Perform

Archive old partition (e.g., SALES_2021)
Add new partition (e.g., SALES_2025)
Query only recent data (last 6 months)

14. Data Marts

A Data Mart is a subset of a data warehouse that focuses on a specific business area such as Sales, Marketing, Finance, HR, or Inventory.

It is smaller, faster, and easier to manage than a full data warehouse.

Formal Definition

A Data Mart is a subject-oriented, integrated, and optimized collection of data designed to serve the needs of a particular department or business function.

Why Use Data Marts

Provide quick access to relevant data
Support department-specific reporting
Reduce load on the main Data Warehouse
Improve performance and simplify analytics
Lower storage and management cost

Types of Data Marts

1. Dependent Data Mart

Definition:
Created from the Enterprise Data Warehouse (EDW).

Structure:

Operational Systems → Data Warehouse → Data Mart

Features:

Uses centralized and clean data
Highly consistent and reliable

Example:
Sales Data Mart extracted from EDW

2. Independent Data Mart

Definition:
Created directly from operational systems without a data warehouse.

Structure:

Operational Systems → Data Mart

Features:

Faster to build
May cause data inconsistency

Example:
Marketing Data Mart built from CRM system

3. Hybrid Data Mart

Definition:
Combination of both dependent and independent approaches.

Features:

Uses both EDW and operational systems
Flexible and faster implementation

Example:
Finance Data Mart using EDW + external banking data

Data Mart Architecture

A Data Mart consists of 3 layers:

1. Source Layer

Operational databases
Data warehouse
External data sources

2. ETL Layer (Extract – Transform – Load)

Data cleaning
Data integration
Data transformation
Loading into data mart

3. Data Mart Storage / Presentation Layer

Fact tables
Dimension tables
OLAP cubes
Reports and dashboards

Schema Used in Data Marts

Data marts typically use Dimensional Modeling:

Star Schema
Snowflake Schema

Data Mart Components

1. Fact Tables

Store transactional data
Contain measures

Examples:

Sales amount
Revenue
Profit

2. Dimension Tables

Store descriptive attributes

Examples:

Time
Product
Customer
Region

Examples of Data Marts

1. Sales Data Mart

Measures: Sales amount, Quantity
Dimensions: Time, Product, Store

2. Finance Data Mart

Measures: Revenue, Expenses, Profit
Dimensions: Time, Department, Account

3. HR Data Mart

Measures: Employee count, Salary
Dimensions: Department, Location, Job Role

Advantages of Data Marts

Advantage	Explanation
Faster Access	Smaller dataset → quick queries
Better Performance	Optimized for specific tasks
Lower Cost	Less storage and processing
Easy to Manage	Can be handled by small teams
Department-Focused	Tailored for specific needs
Reduces DW Load	Offloads queries from main warehouse

Disadvantages of Data Marts

Disadvantage	Explanation
Data Inconsistency	Occurs in independent marts
Data Redundancy	Same data stored multiple times
Limited Scope	Only covers one department
Integration Issues	Hard to combine multiple marts

15. Metadata

Metadata means “data about data.”
It describes, explains, and gives meaning to the data stored in a data warehouse.

Formal Definition

Metadata is descriptive information that defines the structure, operations, rules, and contents of data in a data warehouse, enabling users and systems to understand and manage the data.

What Metadata Helps Users Understand

What the data is
Where it came from
How it is structured
How it should be used

Why Metadata is Important

Helps users understand data definitions
Helps ETL developers track data movement
Provides source-to-target mapping
Supports data governance
Improves data quality and consistency
Helps analysts generate accurate reports

Types of Metadata

1. Technical Metadata

Used by: IT teams, DBAs, ETL developers

Includes:

Table names and column names
Data types
Primary keys and foreign keys
Indexes and constraints
Schema and table structure
File paths and storage details

Example:

Customer_ID → Integer (Primary Key)
Date → Format: YYYY-MM-DD

2. Business Metadata

Used by: Business users, analysts

Includes:

Business definitions
KPIs (Key Performance Indicators)
Data ownership
Business rules
Meaning of data fields

Example:

Sales_Amount = Total value of products sold (excluding tax)

3. Operational Metadata

Describes: ETL and system operations

Includes:

Data load time
Record counts (before/after load)
Error logs
Job execution logs
Data source details

Example:

ETL Load Time: 02:00 AM
Rows Loaded: 2,500,000
Failed Rows: 1,200

Metadata in ETL Process

Metadata stores information about:

Source extraction rules
Transformation logic
Mapping between source and target
Load schedules
Job dependencies

Metadata Repository (Metadata Catalog)

Definition:
A Metadata Repository is a central storage location where all metadata is stored.

Used By:

ETL Tools
BI Tools
Data Warehouse Administrators
Analysts

Examples of Metadata Tools:

Informatica Repository
SSIS Catalog
AWS Glue Data Catalog
Hive Metastore

Examples of Metadata

1. Technical Metadata

Table: sales_fact
Column: sale_date (DATE)
Column: amount (NUMBER, NOT NULL)

2. Business Metadata

Revenue = Total sales after discount

3. Operational Metadata

Last ETL Run: 1 March 2025, 3:00 AM
Status: Success

Advantages of Metadata

Benefit	Explanation
Improves Data Understanding	Clear meaning of data
Better Data Quality	Detects errors and inconsistencies
Easier ETL Development	Provides clear documentation
Faster Data Retrieval	Helps optimize queries
Supports Governance	Tracks data origin and lineage

Page 3 of 5