Data Mining and Warehousing Notes

C
DSA
Software Engineering
Software Architecture
Operating System
Big Data
Data Mining and Warehousing
TOC
Ada
CPP
DBMS

All Topics (21)

  • 1. What is Data Warehouse ?
  • 2. Key Characteristics of a Data Warehouse
  • 3. Need, Importance & Components of Data Warehousing
  • 4. Data Warehouse vs OLTP
  • 5. Architecture, Advantages & Applications of a Data Warehouse
  • 6. Data Warehouse (DW) Delivery Process
  • 7. Data Warehouse Architecture
  • 8. Data Cleaning
  • 9. Data Integration
  • 10. Data Transformation
  • 11. Data Reduction
  • 12. Data Warehouse Schema Design
  • 13. Partitioning Strategy in Data Warehouse Implementation
  • 14. Data Marts
  • 15. Metadata
  • 16. Example of a Multidimensional Data Model
  • 17. Pattern Warehousing
  • 18. OLAP Systems
  • 19. OLAP Queries
  • 20. OLAP Servers
  • 21. OLAP Operations

6. Data Warehouse (DW) Delivery Process

A Data Warehouse Delivery Process is the end-to-end lifecycle of delivering a working, reliable DW solution—from requirements gathering to maintenance.

It consists of 6 major stages:

1. Requirements Gathering & Business Analysis

Purpose: Understand business needs for data and analytics.

Activities:

  • Identify business processes: Sales, Finance, Inventory, CRM, etc.

  • Determine analytical requirements: KPIs, metrics, reports, dashboards, data granularity (daily, hourly, transactional).

  • Conduct stakeholder interviews.

  • Gather functional & non-functional requirements (performance, SLAs, security).

Outputs:
a) Business Requirement Document (BRD)
b) Use cases, dashboard/report mockups
c) List of data sources

2. Data Warehouse Architecture & Design

Purpose: Define how the DW will be structured.

2.1 High-Level Architecture

  • Choose DW architecture:

    • Kimball (Dimensional – Star Schema)

    • Inmon (Normalized – Enterprise DW)

    • Data Vault

    • Modern Cloud DW (Snowflake, BigQuery, Redshift)

2.2 Data Modeling

  • Identify facts and dimensions

  • Define grain (level of detail)

  • Design schemas: Star/Snowflake, conformed dimensions

  • Plan surrogate keys and Slowly Changing Dimensions (SCD types 1, 2, 3)

2.3 Technical Design

  • ETL/ELT architecture

  • Data staging design

  • Metadata repository

  • Error handling, logging

  • Security & access control

  • Data quality rules

Outputs:
a) Logical & physical data models
b) Architecture diagrams
c) ETL specification documents

3. ETL / ELT Development

ETL is the backbone of DW delivery.

3.1 Extraction

  • Connect to sources: Databases, files, APIs, SaaS, streams

  • Extract full or incremental data

  • Handle CDC (Change Data Capture), timestamps, and logs

3.2 Transformation

  • Data cleansing (nulls, invalid values)

  • Standardization (dates, units)

  • Apply business rules

  • Deduplication

  • Derive metrics and surrogate keys

  • Handle SCD for dimensions

3.3 Loading

  • Load data: Stage → Integration → Dimensional layer

  • Load facts & dimensions

  • Log errors/exceptions

  • Optimize performance (bulk loads, partitioning)

Outputs:
a) ETL/ELT pipelines
b) Staging & transformed datasets
c) Data validation scripts

4. Testing & Quality Assurance

Ensures accuracy, completeness, and performance.

Types of Testing:

  1. Unit Testing: Individual ETL modules

  2. System/Integration Testing: Full pipeline flow

  3. Data Quality Testing: Nulls, duplicates, integrity constraints, referential integrity

  4. Performance Testing: Query optimization, batch load performance

  5. User Acceptance Testing (UAT): Business validation of reports/KPIs

Outputs:
a) Test cases & results
b) Sign-off from business users

5. Deployment & Production Rollout

Activities:

  • Infrastructure setup (on-premises or cloud)

  • Automate workflows using schedulers (Airflow, Cron, Data Factory)

  • Deploy ETL pipelines

  • Initial historical load

  • Configure access & permissions

  • Set up monitoring dashboards

Outputs:
a) Production-ready data warehouse
b) Automated nightly/real-time pipelines
c) Security & governance policies

6. Maintenance, Monitoring & Enhancements

Activities:

  • Monitor pipeline health and performance

  • Handle incremental loads

  • Resolve data quality issues

  • Manage schema changes

  • Add new data sources or KPIs

  • Continuous optimization

Monitoring Tools:

  • Airflow / Luigi

  • CloudWatch / Stackdriver

  • Grafana

  • ETL tool dashboards

Outputs:
a) Stable, scalable DW
b) Continuous improvements

Delivery Process Flow 

 
1. Requirements Gathering
             ↓
2. Architecture & Data Modeling
              ↓
3. ETL/ELT Development
             ↓
4. Testing (Unit → UAT)
             ↓
5. Deployment to Production
             ↓
6. Maintenance & Enhancements

7. Data Warehouse Architecture

A Data Warehouse Architecture defines how data is collected, stored, managed, and delivered for analysis and reporting.

Key Characteristics

  • Subject-oriented → Organized by business subjects (sales, finance)
  • Integrated → Combines data from multiple sources
  • Time-variant → Stores historical data
  • Non-volatile → Data is stable (not frequently changed)

1. Data Sources (Operational Systems)

These are the original systems from where data is collected.

Types

  • OLTP systems (ERP, CRM, Billing)
  • External data (market data, APIs, flat files)
  • Logs & semi-structured data

Role

  • Provide raw transactional data

2. ETL / ELT Layer (Data Integration Layer)

ETL (Extract – Transform – Load)

  • Data is transformed before loading into warehouse

ELT (Extract – Load – Transform)

  • Data is loaded first, then transformed inside warehouse
  • Common in cloud systems

Key Functions

  • Data cleaning
  • Data validation
  • Data transformation
  • Data integration (multiple sources)
  • Data loading (batch / real-time)

3. Staging Area

A temporary storage area used during ETL.

Features

  • Stores raw extracted data
  • Used for cleansing, sorting, transformation
  • Not accessible to end users

Purpose

  • Prevents overloading of source systems

4. Data Storage Layer (Data Warehouse Repository)

a. Enterprise Data Warehouse (EDW)

  • Centralized warehouse
  • Provides single version of truth

b. Data Marts

  • Subsets of warehouse
  • Department-specific (Sales, Finance)

Types:

  • Dependent → From EDW
  • Independent → Directly from sources

Storage Approaches

  • Relational databases (SQL)
  • Cloud warehouses (Snowflake, BigQuery, Redshift)
  • Distributed systems (Hadoop, Hive)

5. Metadata Layer

Metadata = “Data about data”

Types

  • Technical metadata → tables, columns, data types
  • Business metadata → rules, KPIs, definitions
  • Operational metadata → ETL logs, load timings

Purpose

  • Helps users understand and trust data

6. OLAP Engine (Query Processing Layer)

Used for fast analytical processing.

Types of OLAP

  • ROLAP → Relational databases
  • MOLAP → Multidimensional cubes
  • HOLAP → Hybrid approach

Features

  • Slice & Dice
  • Drill-down / Roll-up
  • Pivoting
  • Aggregation

7. Presentation Layer (BI / Reporting Tools)

Provides access to end-users.

Tools

  • Power BI
  • Tableau
  • Qlik
  • Excel

Purpose

  • Data visualization
  • Reporting
  • Analytics
  • Business insights

8. Data Cleaning

Data Cleaning is the process of detecting, correcting, or removing corrupt, inaccurate, incomplete, inconsistent, and irrelevant data from a dataset.
It is essential because real-world data is messy, and poor-quality data leads to poor-quality models.

1. Handling Missing Data

Missing values can occur due to human error, system issues, or incomplete records.

Techniques:

a. Deletion Methods

  • Listwise Deletion (Remove Entire Record)
    • If a row has missing values, delete the entire row.
    • Useful when: Few records are missing, and data size is large.
  • Pairwise Deletion
    • Use available values for each analysis without deleting whole rows.

b. Imputation Methods

  • Mean/Median/Mode Imputation → Replace missing values using basic statistics.
  • K-Nearest Neighbors (KNN) Imputation → Uses similar rows to estimate missing values.
  • Regression Imputation → Predict missing values using regression models.
  • Multiple Imputation → Advanced statistical technique for unbiased estimates.

2. Handling Noisy Data

Noise = random error or variance in data.

Techniques:

  • Smoothing
    • Moving averages
    • Binning (equal width, equal frequency)
    • Regression smoothing
  • Outlier Detection & Removal
    • Z-score method
    • IQR method (Q1 − 1.5IQR, Q3 + 1.5IQR)
    • DBSCAN clustering
  • Transformation
    • Log transform
    • Power transform

3. Handling Inconsistent Data

Inconsistencies occur due to different naming conventions or data entry errors.

Examples:

  • “Male”, “M”, “m”, “MALE”
  • "USA", "United States", "U.S."

Techniques:

  • Standardization Rules → Define uniform formats for text, dates, categories.
  • Data Validation Rules → Example: age should be 0–120.
  • Referential Integrity Checks → Ensure foreign keys match primary keys.

4. Handling Duplicate Data

Duplicate records arise from merging datasets or repeated entries.

Techniques:

  • Exact match detection (same values across rows)
  • Fuzzy matching (similar but not identical values)
  • Record linkage (identify same entity across sources)
  • Remove / merge duplicates

5. Handling Incorrect or Invalid Data

Incorrect data includes typos, wrong values, or impossible values.

Techniques:

  • Validation Checks
    • Range checks (e.g., salary ≥ 0)
    • Type checks (integers, strings)
    • Format checks (email, phone number)
  • Dictionary / Master Data Lookup → To correct category mistakes (e.g., "India" → “India”)
  • Business Rules → Example: transaction date cannot be in the future

6. Handling Irrelevant Data

Irrelevant features do not contribute to analysis or modeling.

Techniques:

  • Feature Selection
    • Filter methods (correlation, chi-square)
    • Wrapper methods (RFE)
    • Embedded methods (Lasso)
  • Domain Knowledge Filtering → Remove attributes not needed for analysis

7. Data Integration Issues (Part of Cleaning)

When combining multiple sources, errors may occur.

Problems:

  • Naming conflicts
  • Unit inconsistencies (kg vs lbs)
  • Schema mismatch (different column layouts)
  • Duplicate records across sources

Solutions:

  • Schema integration
  • Entity resolution
  • Standardization of measurement units

9. Data Integration

Data Integration is the process of combining data from multiple heterogeneous sources into a unified, consistent view.

Sources may include:

  • Databases (SQL/NoSQL)
  • Flat files (CSV, Excel)
  • ERP / CRM systems
  • Web data / Logs
  • Cloud storage

Goal: Create a single, consistent dataset for analysis or storage in a data warehouse.

Key Issues in Data Integration

1.1 Schema Integration

  • Different sources may use different structures.
    Example:
  • Source A: Cust_ID, FName, LName
  • Source B: CustomerNo, FirstName, LastName

Solution:

  • Match attributes using schema mapping / metadata

1.2 Data Value Conflicts

  • Same data represented differently.
    Examples:
  • “Male” vs “M”
  • “USA” vs “United States”
  • Date formats: DD/MM/YYYY vs MM/DD/YYYY
  • Units: kg vs lbs

Solution:

  • Create standardized formats or conversion rules

1.3 Entity Identification Problem

  • Determining if records from different sources refer to the same real-world entity.

Example:

  • John Doe, 123-45-6789
  • J. Doe, SSN: 123456789

Solution:

  • Matching keys (primary / foreign keys)
  • Fuzzy matching / record linkage
  • Master Data Management (MDM)

1.4 Redundancy and Duplicate Detection

  • Combining data may introduce duplicates.

Solution:

  • Remove duplicates
  • Merge records
  • Use deduplication tools / matching algorithms

1.5 Data Granularity Issues

  • Different levels of detail in different sources.

Example:

  • Sales per day in one source
  • Sales per month in another

Solution:

  • Aggregation or disaggregation during integration

Techniques Used in Data Integration

  • ETL (Extract, Transform, Load)
  • Data federation / Virtual integration
  • Data warehousing
  • Data Lake integration
  • Enterprise Information Integration (EII)
  • Schema matching and mapping

2. Data Transformation

Data Transformation modifies, consolidates, or restructures data into a suitable format for analysis or mining.

  • Usually comes after data integration in pipelines.

2.1 Smoothing

  • Reduces noise in data.
    Techniques:
  • Moving average
  • Binning
  • Regression smoothing

2.2 Aggregation

  • Combines data to produce summary information.

Examples:

  • Summing monthly sales → yearly sales
  • Converting hourly data → daily averages

Used in:

  • Data warehousing (fact tables)
  • OLAP cube creation

2.3 Generalization

  • Replacing low-level data with higher-level concepts.

Examples:

  • City → State → Country
  • Age → Age group (0–10, 11–20, etc.)

Used in:

  • Data cubes
  • Concept hierarchies

2.4 Normalization (Feature Scaling)

  • Brings numerical values to a common scale for ML.

Techniques:

  • Min-Max Normalization:

    x=  x min

Z-score Standardization:

  • Decimal Scaling

            x= x μ / σ

2.5 Attribute / Feature Construction

  • Creating new attributes from existing ones.

Examples:

  • BMI = weight / height²
  • FullName = FirstName + LastName
  • Revenue = Price × Quantity

Benefit: Improves data quality & model performance

2.6 Discretization

  • Converting continuous data → discrete bins

Example:

  • Age → {Child, Adult, Senior}

Techniques:

  • Binning
  • Histogram-based
  • Decision-tree-based

2.7 Encoding Categorical Data

  • Prepares categorical data for ML models

Techniques:

  • Label Encoding
  • One-hot Encoding
  • Target Encoding

2.8 Data Reduction (Part of Transformation)

  • Reduces data size while maintaining integrity

Techniques:

  • PCA (Principal Component Analysis)
  • Sampling
  • Data cube aggregation
  • Attribute subset selection

10. Data Transformation

Data Transformation is the process of converting data into a proper, meaningful, and useful format for analysis or data mining.
It is performed after Data Cleaning and Data Integration.

1. Smoothing (Noise Removal)

Used to remove random errors or fluctuations (noise) from data.

Techniques:

  • Binning
    • Equal-width binning
    • Equal-frequency binning
  • Moving Average
  • Regression Smoothing

2. Aggregation

Combining data to create summarized information.

Examples:

  • Daily data → Monthly data
  • Monthly sales → Yearly sales
  • Operations: Sum, Average, Count

Uses:

  • OLAP
  • Data Cubes
  • Data Warehouses

3. Generalization

Replacing low-level (detailed) data with higher-level concepts.

Examples:

  • City → State → Country
  • Age → Age Group (0–10, 11–20, 21–30)

Uses:

  • Concept Hierarchies
  • Data Cubes

4. Normalization (Feature Scaling)

Scaling data values to a common range for better performance of ML models.

Types:

(a) Min–Max Normalization:

x'  = ( x - min ) / ( max - min )

(b) Z-score Standardization:

x' = (x - μ) / σ

5. Attribute Construction (Feature Construction)

Creating new attributes from existing data.

Examples:

  • BMI = Weight / Height²
  • Full Name = First Name + Last Name
  • Revenue = Price × Quantity

Advantage: Improves model performance.

6. Discretization

Converting continuous data into discrete categories (bins).

Example:

  • Age → {Child, Adult, Senior}

Techniques:

  • Binning
  • Histogram Method
  • Decision Tree-Based Discretization

7. Categorical Data Transformation

Converting categorical (text) data into numeric form for ML models.

Techniques:

  • Label Encoding
  • One-Hot Encoding
  • Binary Encoding
  • Target Encoding

8. Data Reduction (Optional in Transformation)

Reducing data size while preserving important information.

Techniques:

  • PCA (Principal Component Analysis)
  • Sampling
  • Attribute Subset Selection
  • Data Cube Aggregation
Page 2 of 5