Topics
- 1. What is a Data Warehouse?
- 2. Key Characteristics of a Data Warehouse
- 3. Need, Importance & Components of Data Warehousing
- 4. Data Warehouse vs OLTP
- 5. Architecture, Advantages & Applications of a Data Warehouse
- 6. Data Warehouse (DW) Delivery Process
- 7. Data Warehouse Architecture
- 8. Data Cleaning
- 9. Data Integration
- 10. Data Transformation
- 11. Data Reduction
- 12. Data Warehouse Schema Design
- 13. Partitioning Strategy in Data Warehouse Implementation
- 14. Data Marts
- 15. Metadata
- 16. Example of a Multidimensional Data Model
- 17. Pattern Warehousing
A Data Warehouse is a centralized repository that stores integrated data from multiple heterogeneous sources (databases, CRM systems, ERP, logs, spreadsheets, etc.). It supports business intelligence (BI), reporting, and decision-making.
Formal Definition (Bill Inmon – Father of Data Warehousing): A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data to support management’s decision-making process.
Ralph Kimball’s Definition (Dimensional Model Advocate): A data warehouse is a centralized repository providing users access to data organized for analysis, typically using facts and dimensions.
| Characteristic | Description |
|---|---|
| Subject-Oriented | Organized around key business subjects such as Sales, Customers, Products, Finance, Inventory. Focuses on analysis, not transactional processes. |
| Integrated | Combines data from multiple sources: cleaned, transformed, standardized. Example: consistent customer names, date formats, currencies, product codes. |
| Non-Volatile | Data rarely changes once stored. Mostly read-only. Refreshed periodically via ETL processes. |
| Time-Variant | Stores historical data over long periods (months/years) for trend analysis, forecasting, and comparisons. Example: Sales by region over 5 years, seasonal trends. |
- Businesses generate huge amounts of daily data.
- Operational systems (OLTP) focus on transaction processing, speed, and routine operations, but are not ideal for analysis.
Benefits of Data Warehousing:
a) Better decision-making
b) Data consistency and accuracy
c) Fast query performance
d) Historical trend analysis
e) Supports BI and data mining
Components:
- Data Sources – OLTP databases, CRM, ERP, spreadsheets, logs.
- ETL (Extract, Transform, Load) – Extracts data, cleans it, standardizes formats, and loads into DW.
- Data Storage – Can be:
  - Relational databases (RDBMS)
  - Multidimensional databases
  - Cloud warehouses (Snowflake, BigQuery, Redshift)
- Metadata Repository – Stores information about data: definitions, source mapping, ETL rules, table descriptions.
- Data Marts – Focused subsets of DW for specific business areas (e.g., Marketing Mart, Sales Mart, Finance Mart).
- OLAP (Online Analytical Processing) – Enables complex, multi-dimensional analysis (slice, dice, drill-down, roll-up).
| Feature | Data Warehouse (OLAP) | OLTP (Operational DB) |
|---|---|---|
| Purpose | Analysis & reporting | Day-to-day transactions |
| Data Type | Historical | Current |
| Updates | Periodic (batch) | Frequent (real-time) |
| Queries | Complex | Simple |
| Orientation | Subject-oriented | Process-oriented |
| Design | De-normalized | Highly normalized |
- Single-tier Architecture – Minimizes data redundancy; rarely used.
- Two-tier Architecture – Client → Server; suitable for small systems.
- Three-tier Architecture (Most Common):
  - Bottom Tier: Data sources + ETL
  - Middle Tier: OLAP server / Data Warehouse
  - Top Tier: Front-end tools (reports, dashboards, BI tools)
Advantages
a) Improved data quality & consistency
b) Faster decision-making
c) Historical analysis & future forecasting
d) Better business intelligence
e) Supports data mining & analytics
Applications
- Retail: Sales forecasting, customer behavior analysis
- Banking: Fraud detection, risk analysis
- Healthcare: Treatment patterns, insurance claims
- Telecom: Customer churn analysis
- Government: Tax analysis, social data
- Manufacturing: Supply chain optimization
A Data Warehouse Delivery Process is the end-to-end lifecycle of delivering a working, reliable DW solution—from requirements gathering to maintenance.
It consists of 6 major stages:
1. Requirements Gathering & Business Analysis
Purpose: Understand business needs for data and analytics.
Activities:
- Identify business processes: Sales, Finance, Inventory, CRM, etc.
- Determine analytical requirements: KPIs, metrics, reports, dashboards, data granularity (daily, hourly, transactional).
- Conduct stakeholder interviews.
- Gather functional & non-functional requirements (performance, SLAs, security).
Outputs:
a) Business Requirement Document (BRD)
b) Use cases, dashboard/report mockups
c) List of data sources
2. Data Warehouse Architecture & Design
Purpose: Define how the DW will be structured.
2.1 High-Level Architecture
- Choose DW architecture:
  - Kimball (Dimensional – Star Schema)
  - Inmon (Normalized – Enterprise DW)
  - Data Vault
  - Modern Cloud DW (Snowflake, BigQuery, Redshift)
2.2 Data Modeling
- Identify facts and dimensions
- Define grain (level of detail)
- Design schemas: Star/Snowflake, conformed dimensions
- Plan surrogate keys and Slowly Changing Dimensions (SCD types 1, 2, 3)
2.3 Technical Design
- ETL/ELT architecture
- Data staging design
- Metadata repository
- Error handling, logging
- Security & access control
- Data quality rules
Outputs:
a) Logical & physical data models
b) Architecture diagrams
c) ETL specification documents
3. ETL / ELT Development
ETL is the backbone of DW delivery.
3.1 Extraction
- Connect to sources: Databases, files, APIs, SaaS, streams
- Extract full or incremental data
- Handle CDC (Change Data Capture), timestamps, and logs
3.2 Transformation
- Data cleansing (nulls, invalid values)
- Standardization (dates, units)
- Apply business rules
- Deduplication
- Derive metrics and surrogate keys
- Handle SCD for dimensions
3.3 Loading
- Load data: Stage → Integration → Dimensional layer
- Load facts & dimensions
- Log errors/exceptions
- Optimize performance (bulk loads, partitioning)
Outputs:
a) ETL/ELT pipelines
b) Staging & transformed datasets
c) Data validation scripts
4. Testing & Quality Assurance
Ensures accuracy, completeness, and performance.
Types of Testing:
- Unit Testing: Individual ETL modules
- System/Integration Testing: Full pipeline flow
- Data Quality Testing: Nulls, duplicates, integrity constraints, referential integrity
- Performance Testing: Query optimization, batch load performance
- User Acceptance Testing (UAT): Business validation of reports/KPIs
Outputs:
a) Test cases & results
b) Sign-off from business users
5. Deployment & Production Rollout
Activities:
- Infrastructure setup (on-premises or cloud)
- Automate workflows using schedulers (Airflow, Cron, Data Factory)
- Deploy ETL pipelines
- Initial historical load
- Configure access & permissions
- Set up monitoring dashboards
Outputs:
a) Production-ready data warehouse
b) Automated nightly/real-time pipelines
c) Security & governance policies
6. Maintenance, Monitoring & Enhancements
Activities:
- Monitor pipeline health and performance
- Handle incremental loads
- Resolve data quality issues
- Manage schema changes
- Add new data sources or KPIs
- Continuous optimization
Monitoring Tools:
- Airflow / Luigi
- CloudWatch / Stackdriver
- Grafana
- ETL tool dashboards
Outputs:
a) Stable, scalable DW
b) Continuous improvements
Delivery Process Flow
1. Requirements Gathering
↓
2. Architecture & Data Modeling
↓
3. ETL/ELT Development
↓
4. Testing (Unit → UAT)
↓
5. Deployment to Production
↓
6. Maintenance & Enhancements
A Data Warehouse Architecture defines how data is collected, stored, managed, and delivered for analysis and reporting.
Key Characteristics
- Subject-oriented → Organized by business subjects (sales, finance)
- Integrated → Combines data from multiple sources
- Time-variant → Stores historical data
- Non-volatile → Data is stable (not frequently changed)
1. Data Sources (Operational Systems)
These are the original systems from where data is collected.
Types
- OLTP systems (ERP, CRM, Billing)
- External data (market data, APIs, flat files)
- Logs & semi-structured data
Role
- Provide raw transactional data
2. ETL / ELT Layer (Data Integration Layer)
ETL (Extract – Transform – Load)
- Data is transformed before loading into warehouse
ELT (Extract – Load – Transform)
- Data is loaded first, then transformed inside warehouse
- Common in cloud systems
Key Functions
- Data cleaning
- Data validation
- Data transformation
- Data integration (multiple sources)
- Data loading (batch / real-time)
3. Staging Area
A temporary storage area used during ETL.
Features
- Stores raw extracted data
- Used for cleansing, sorting, transformation
- Not accessible to end users
Purpose
- Prevents overloading of source systems
4. Data Storage Layer (Data Warehouse Repository)
a. Enterprise Data Warehouse (EDW)
- Centralized warehouse
- Provides single version of truth
b. Data Marts
- Subsets of warehouse
- Department-specific (Sales, Finance)
Types:
- Dependent → From EDW
- Independent → Directly from sources
Storage Approaches
- Relational databases (SQL)
- Cloud warehouses (Snowflake, BigQuery, Redshift)
- Distributed systems (Hadoop, Hive)
5. Metadata Layer
Metadata = “Data about data”
Types
- Technical metadata → tables, columns, data types
- Business metadata → rules, KPIs, definitions
- Operational metadata → ETL logs, load timings
Purpose
- Helps users understand and trust data
6. OLAP Engine (Query Processing Layer)
Used for fast analytical processing.
Types of OLAP
- ROLAP → Relational databases
- MOLAP → Multidimensional cubes
- HOLAP → Hybrid approach
Features
- Slice & Dice
- Drill-down / Roll-up
- Pivoting
- Aggregation
7. Presentation Layer (BI / Reporting Tools)
Provides access to end-users.
Tools
- Power BI
- Tableau
- Qlik
- Excel
Purpose
- Data visualization
- Reporting
- Analytics
- Business insights

Data Cleaning is the process of detecting, correcting, or removing corrupt, inaccurate, incomplete, inconsistent, and irrelevant data from a dataset.
It is essential because real-world data is messy, and poor-quality data leads to poor-quality models.
1. Handling Missing Data
Missing values can occur due to human error, system issues, or incomplete records.
Techniques:
a. Deletion Methods
- Listwise Deletion (Remove Entire Record)
  - If a row has missing values, delete the entire row.
  - Useful when: few records are missing and the data size is large.
- Pairwise Deletion
  - Use available values for each analysis without deleting whole rows.
b. Imputation Methods
- Mean/Median/Mode Imputation → Replace missing values using basic statistics.
- K-Nearest Neighbors (KNN) Imputation → Uses similar rows to estimate missing values.
- Regression Imputation → Predict missing values using regression models.
- Multiple Imputation → Advanced statistical technique for unbiased estimates.
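The simplest of the imputation methods above can be sketched with the standard library; the toy `ages` column (with `None` marking missing values) is invented for illustration.

```python
from statistics import mean, median

# Toy column with missing values marked as None.
ages = [25, None, 31, 40, None, 28]

def impute(values, strategy=mean):
    """Replace None with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

print(impute(ages))          # missing values replaced by the mean (31)
print(impute(ages, median))  # or by the median (29.5)
```

KNN, regression, and multiple imputation follow the same interface — compute a fill value per missing cell — but estimate it from similar rows or a fitted model instead of one column statistic.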
2. Handling Noisy Data
Noise = random error or variance in data.
Techniques:
- Smoothing
  - Moving averages
  - Binning (equal width, equal frequency)
  - Regression smoothing
- Outlier Detection & Removal
  - Z-score method
  - IQR method (Q1 − 1.5·IQR, Q3 + 1.5·IQR)
  - DBSCAN clustering
- Transformation
  - Log transform
  - Power transform
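The IQR rule above can be sketched directly with `statistics.quantiles`; the small data list (with one planted outlier) is invented for illustration.

```python
from statistics import quantiles

# Toy data with one obvious outlier (95).
data = [10, 12, 11, 13, 12, 14, 11, 95]

# Quartiles via the statistics module (Python 3.8+).
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < low or x > high]
cleaned = [x for x in data if low <= x <= high]
print(outliers)  # → [95]
```

The z-score method is analogous: flag values with |x − μ| / σ above a threshold (commonly 3).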
3. Handling Inconsistent Data
Inconsistencies occur due to different naming conventions or data entry errors.
Examples:
- “Male”, “M”, “m”, “MALE”
- "USA", "United States", "U.S."
Techniques:
- Standardization Rules → Define uniform formats for text, dates, categories.
- Data Validation Rules → Example: age should be 0–120.
- Referential Integrity Checks → Ensure foreign keys match primary keys.
4. Handling Duplicate Data
Duplicate records arise from merging datasets or repeated entries.
Techniques:
- Exact match detection (same values across rows)
- Fuzzy matching (similar but not identical values)
- Record linkage (identify same entity across sources)
- Remove / merge duplicates
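Fuzzy matching, mentioned above, can be sketched with `difflib` from the standard library; the company names and the 0.8 similarity threshold are invented for illustration.

```python
from difflib import SequenceMatcher

# Toy records: "Acme Corp" and "Acme Corp." are duplicates in all but punctuation.
records = ["Acme Corp", "ACME Corporation", "Globex Inc", "Acme Corp."]

def similar(a, b, threshold=0.8):
    """Case-insensitive similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Compare every pair once and keep the near-identical ones.
dupes = [(a, b) for i, a in enumerate(records)
         for b in records[i + 1:] if similar(a, b)]
print(dupes)  # → [('Acme Corp', 'Acme Corp.')]
```

Production deduplication adds blocking (only compare within likely groups) so the pairwise comparison does not become quadratic over millions of rows.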
5. Handling Incorrect or Invalid Data
Incorrect data includes typos, wrong values, or impossible values.
Techniques:
- Validation Checks
  - Range checks (e.g., salary ≥ 0)
  - Type checks (integers, strings)
  - Format checks (email, phone number)
- Dictionary / Master Data Lookup → To correct category mistakes (e.g., "Indai" → "India")
- Business Rules → Example: transaction date cannot be in the future
6. Handling Irrelevant Data
Irrelevant features do not contribute to analysis or modeling.
Techniques:
- Feature Selection
  - Filter methods (correlation, chi-square)
  - Wrapper methods (RFE)
  - Embedded methods (Lasso)
- Domain Knowledge Filtering → Remove attributes not needed for analysis
7. Data Integration Issues (Part of Cleaning)
When combining multiple sources, errors may occur.
Problems:
- Naming conflicts
- Unit inconsistencies (kg vs lbs)
- Schema mismatch (different column layouts)
- Duplicate records across sources
Solutions:
- Schema integration
- Entity resolution
- Standardization of measurement units
Data Integration is the process of combining data from multiple heterogeneous sources into a unified, consistent view.
Sources may include:
- Databases (SQL/NoSQL)
- Flat files (CSV, Excel)
- ERP / CRM systems
- Web data / Logs
- Cloud storage
Goal: Create a single, consistent dataset for analysis or storage in a data warehouse.
Key Issues in Data Integration
1.1 Schema Integration
- Different sources may use different structures.
Example:
- Source A: Cust_ID, FName, LName
- Source B: CustomerNo, FirstName, LastName
Solution:
- Match attributes using schema mapping / metadata
1.2 Data Value Conflicts
- Same data represented differently.
Examples:
- “Male” vs “M”
- “USA” vs “United States”
- Date formats: DD/MM/YYYY vs MM/DD/YYYY
- Units: kg vs lbs
Solution:
- Create standardized formats or conversion rules
1.3 Entity Identification Problem
- Determining if records from different sources refer to the same real-world entity.
Example:
- John Doe, 123-45-6789
- J. Doe, SSN: 123456789
Solution:
- Matching keys (primary / foreign keys)
- Fuzzy matching / record linkage
- Master Data Management (MDM)
1.4 Redundancy and Duplicate Detection
- Combining data may introduce duplicates.
Solution:
- Remove duplicates
- Merge records
- Use deduplication tools / matching algorithms
1.5 Data Granularity Issues
- Different levels of detail in different sources.
Example:
- Sales per day in one source
- Sales per month in another
Solution:
- Aggregation or disaggregation during integration
Techniques Used in Data Integration
- ETL (Extract, Transform, Load)
- Data federation / Virtual integration
- Data warehousing
- Data Lake integration
- Enterprise Information Integration (EII)
- Schema matching and mapping
2. Data Transformation
Data Transformation modifies, consolidates, or restructures data into a suitable format for analysis or mining.
- Usually comes after data integration in pipelines.
2.1 Smoothing
- Reduces noise in data.
Techniques:
- Moving average
- Binning
- Regression smoothing
2.2 Aggregation
- Combines data to produce summary information.
Examples:
- Summing monthly sales → yearly sales
- Converting hourly data → daily averages
Used in:
- Data warehousing (fact tables)
- OLAP cube creation
2.3 Generalization
- Replacing low-level data with higher-level concepts.
Examples:
- City → State → Country
- Age → Age group (0–10, 11–20, etc.)
Used in:
- Data cubes
- Concept hierarchies
2.4 Normalization (Feature Scaling)
- Brings numerical values to a common scale for ML.
Techniques:
- Min-Max Normalization:
  x′ = (x − min) / (max − min)
- Z-score Standardization:
  x′ = (x − μ) / σ
- Decimal Scaling:
  x′ = x / 10^j, where j is the smallest integer such that max(|x′|) < 1
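The two most common scalings can be sketched as follows; the toy column `xs` is invented for illustration.

```python
from statistics import mean, stdev

# Toy numeric column.
xs = [10.0, 20.0, 30.0, 40.0, 50.0]

def min_max(values):
    """x' = (x − min) / (max − min); results lie in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_score(values):
    """x' = (x − μ) / σ, using the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(x - mu) / sigma for x in values]

print(min_max(xs))  # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score(xs))  # centered on 0
```

Min-max is sensitive to outliers (one extreme value squashes everything else toward 0), which is why z-score is often preferred on noisy data.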
2.5 Attribute / Feature Construction
- Creating new attributes from existing ones.
Examples:
- BMI = weight / height²
- FullName = FirstName + LastName
- Revenue = Price × Quantity
Benefit: Improves data quality & model performance
2.6 Discretization
- Converting continuous data → discrete bins
Example:
- Age → {Child, Adult, Senior}
Techniques:
- Binning
- Histogram-based
- Decision-tree-based
2.7 Encoding Categorical Data
- Prepares categorical data for ML models
Techniques:
- Label Encoding
- One-hot Encoding
- Target Encoding
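Label and one-hot encoding can be sketched without any ML library; the `colors` column is invented for illustration.

```python
# Toy categorical column.
colors = ["red", "green", "blue", "green"]

# Sorted category list gives each category a stable index.
categories = sorted(set(colors))            # ['blue', 'green', 'red']
label = {c: i for i, c in enumerate(categories)}

# Label encoding: one integer per category.
label_encoded = [label[c] for c in colors]

# One-hot encoding: one 0/1 column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(label_encoded)  # → [2, 1, 0, 1]
print(one_hot[0])     # 'red' → [0, 0, 1]
```

Label encoding implies an ordering (blue < green < red) that usually does not exist, so one-hot is safer for nominal categories; target encoding instead replaces each category with a statistic of the target variable.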
2.8 Data Reduction (Part of Transformation)
- Reduces data size while maintaining integrity
Techniques:
- PCA (Principal Component Analysis)
- Sampling
- Data cube aggregation
- Attribute subset selection
Data Transformation is the process of converting data into a proper, meaningful, and useful format for analysis or data mining.
It is performed after Data Cleaning and Data Integration.
1. Smoothing (Noise Removal)
Used to remove random errors or fluctuations (noise) from data.
Techniques:
- Binning
  - Equal-width binning
  - Equal-frequency binning
- Moving Average
- Regression Smoothing
2. Aggregation
Combining data to create summarized information.
Examples:
- Daily data → Monthly data
- Monthly sales → Yearly sales
- Operations: Sum, Average, Count
Uses:
- OLAP
- Data Cubes
- Data Warehouses
3. Generalization
Replacing low-level (detailed) data with higher-level concepts.
Examples:
- City → State → Country
- Age → Age Group (0–10, 11–20, 21–30)
Uses:
- Concept Hierarchies
- Data Cubes
4. Normalization (Feature Scaling)
Scaling data values to a common range for better performance of ML models.
Types:
(a) Min–Max Normalization: x′ = (x − min) / (max − min)
(b) Z-score Standardization: x′ = (x − μ) / σ
5. Attribute Construction (Feature Construction)
Creating new attributes from existing data.
Examples:
- BMI = Weight / Height²
- Full Name = First Name + Last Name
- Revenue = Price × Quantity
Advantage: Improves model performance.
6. Discretization
Converting continuous data into discrete categories (bins).
Example:
- Age → {Child, Adult, Senior}
Techniques:
- Binning
- Histogram Method
- Decision Tree-Based Discretization
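Equal-width binning, the simplest of these techniques, can be sketched as follows; the toy `ages` list and the three bin labels are invented for illustration.

```python
# Toy continuous attribute.
ages = [5, 17, 23, 35, 47, 62, 71, 80]

def equal_width_bins(values, labels):
    """Split [min, max] into len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    out = []
    for v in values:
        # Clamp the top value into the last bin.
        idx = min(int((v - lo) / width), len(labels) - 1)
        out.append(labels[idx])
    return out

print(equal_width_bins(ages, ["Child", "Adult", "Senior"]))
```

Equal-frequency binning would instead sort the values and cut so each bin holds the same number of rows, which handles skewed data better.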
7. Categorical Data Transformation
Converting categorical (text) data into numeric form for ML models.
Techniques:
- Label Encoding
- One-Hot Encoding
- Binary Encoding
- Target Encoding
8. Data Reduction (Optional in Transformation)
Reducing data size while preserving important information.
Techniques:
- PCA (Principal Component Analysis)
- Sampling
- Attribute Subset Selection
- Data Cube Aggregation
Data Reduction is the process of reducing the volume of data while maintaining its integrity and analytical value.
Why Data Reduction is Needed
In data warehousing, datasets are very large. Data reduction helps to:
- Improve query performance
- Reduce storage cost
- Speed up data mining
- Make analysis faster and easier
Goal
To obtain a reduced representation of data that:
- Is smaller in size
- Produces almost the same analytical results
Types of Data Reduction Techniques
1. Data Cube Aggregation
Data is stored at detailed (low-level) form and then aggregated into higher-level summaries.
Examples:
- Daily sales → Monthly sales
- Monthly → Yearly
- City → State → Country
Used in:
- OLAP Cubes
- Roll-up operations
- Summary tables
2. Dimensionality Reduction
Reduces the number of attributes (features) in the dataset.
Techniques:
- PCA (Principal Component Analysis)
- Feature Selection:
- Correlation
- Chi-square
- RFE (Recursive Feature Elimination)
- Wavelet Transform
Benefits:
- Reduces model complexity
- Removes irrelevant/redundant data
- Improves accuracy and speed
3. Data Compression
Stores data in a compact format.
Types:
- Lossless Compression
  - No data loss
  - Examples: ZIP, Huffman Coding
- Lossy Compression
  - Some data loss allowed
  - Used in images, audio, video
- Other Methods
  - Run-Length Encoding
  - String Compression
4. Numerosity Reduction
Replaces the original dataset with a smaller representation.
(a) Parametric Methods
Assume a model and store only its parameters.
Examples:
- Regression Models
- Log-linear Models
Store model instead of full data.
(b) Non-Parametric Methods
No model assumption.
Examples:
- Histograms
- Clustering (e.g., K-means)
- Sampling
5. Sampling
Selecting a small but representative subset of data.
Types:
- Simple Random Sampling
- Stratified Sampling
- Systematic Sampling
Use:
- When full dataset is too large to process
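Stratified sampling can be sketched as follows: group rows by a stratum key and take the same fraction from every group, so small groups stay represented. The region data and the 10% fraction are invented for illustration.

```python
import random

# Toy dataset: 90 North rows, 10 South rows.
rows = ([{"region": "North", "id": i} for i in range(90)] +
        [{"region": "South", "id": i} for i in range(10)])

def stratified_sample(rows, key, fraction, seed=42):
    """Sample the same fraction from every stratum."""
    random.seed(seed)
    strata = {}
    for row in rows:
        strata.setdefault(row[key], []).append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

sample = stratified_sample(rows, "region", 0.1)
print(len(sample))  # → 10: 9 North + 1 South
```

Simple random sampling over the same data could easily return zero South rows; stratification guarantees one.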
6. Discretization and Binning
Reduces the number of distinct values by grouping.
Examples:
- Age (1–100) → (0–10, 11–20, …)
- Income → Low, Medium, High
Benefits:
- Reduces data size
- Simplifies analysis
- Improves model performance
A schema in a data warehouse defines how data is organized and how tables are logically connected.
It is based on Dimensional Modeling, which divides data into:
- Fact Tables → Store numeric measures (Sales, Revenue, Quantity)
- Dimension Tables → Store descriptive attributes (Time, Product, Customer)
Schema design determines how fact and dimension tables are arranged.
Types of Data Warehouse Schemas
1. Star Schema
Definition:
The simplest and most commonly used schema.
Structure:
- One central Fact Table
- Multiple surrounding Dimension Tables
- All dimensions directly connected to the fact table
Diagram (Text):
          Product
             |
Customer — Fact Table — Time
             |
           Store
Features:
- Easy to understand
- Fast query performance
- Denormalized dimension tables (redundancy allowed)
Use Cases:
- Retail sales
- Finance dashboards
- OLAP queries
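A star-schema query can be demonstrated end to end with the standard-library `sqlite3` module. All table names, columns, and rows below are invented for illustration; the point is the shape of the query: fact table joined to each dimension, filtered and grouped by dimension attributes.

```python
import sqlite3

# In-memory toy star schema: one fact table, two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Electronics'), (2, 'Clothing');
    INSERT INTO dim_time VALUES (10, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 500.0), (1, 10, 300.0), (2, 10, 200.0);
""")

# Typical OLAP-style question: total sales per category in 2024.
rows = con.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time t    ON f.time_id = t.time_id
    WHERE t.year = 2024
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)  # → [('Clothing', 200.0), ('Electronics', 800.0)]
```

Because every dimension joins directly to the fact table, star-schema queries stay a fixed, shallow join depth regardless of how many dimensions exist.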
2. Snowflake Schema
Definition:
A normalized version of the star schema.
Structure:
- One central Fact Table
- Dimension tables further divided into sub-dimensions (normalized)
Diagram (Text):
Product — Fact Table — Time
             |
Customer → Customer City (normalized sub-dimension)
Features:
- Reduced data redundancy
- More complex joins
- Saves storage space
Use Cases:
- Large and complex data warehouses
- Systems with many attributes
3. Galaxy Schema (Fact Constellation Schema)
Definition:
A schema with multiple fact tables sharing common dimension tables.
Structure:
- Multiple fact tables (e.g., Sales Fact, Inventory Fact)
- Shared dimension tables (Product, Time, Store)
Diagram (Text):
     Product Dimension
      /           \
 Sales Fact   Inventory Fact
      \           /
     Time Dimension
Features:
- Handles complex business scenarios
- Supports multiple data marts
- Flexible but complex
Use Cases:
- Enterprise-level data warehouses
- Large organizations
Components of a Data Warehouse Schema
1. Fact Table
- Stores numerical/measurable data
- Contains foreign keys to dimension tables
Examples:
- Sales Amount
- Quantity
- Profit
2. Dimension Table
- Stores descriptive attributes
Examples:
- Time Dimension
- Product Dimension
- Customer Dimension
- Store / Location Dimension
3. Measures and Metrics
- Calculated values used for analysis
Examples:
- SUM(Sales)
- AVG(Profit)
4. Hierarchies
Used for Drill-Down and Roll-Up operations in OLAP.
Examples:
- Day → Month → Quarter → Year
- City → State → Country
Comparison of Schemas
| Feature | Star Schema | Snowflake Schema | Galaxy Schema |
|---|---|---|---|
| Normalization | Low | High | Medium |
| Query Speed | Fast | Slower | Moderate |
| Complexity | Simple | Complex | High |
| Storage | Large | Smaller | Medium |
| Fact Tables | One | One | Many |
| Dimensions | Denormalized | Normalized | Shared |
Partitioning is a physical design technique used in data warehouses to divide large tables (especially fact tables) into smaller, manageable pieces called partitions.
Why Partitioning is Needed
Data warehouses store massive volumes of historical data (often billions of rows). Partitioning helps in:
- Faster Query Processing → Only relevant partitions are scanned
- Faster ETL & Data Loading → Only new/recent partitions are updated
- Easy Data Archiving / Purging → Old partitions can be dropped without affecting others
- Better Indexing & Storage Management → Smaller, more efficient indexes
Types of Partitioning Strategies
1. Range Partitioning (Most Common)
Data is divided based on a continuous range of values.
Example (Date-based):
- Partition 1 → Jan 2024
- Partition 2 → Feb 2024
- Partition 3 → Mar 2024
Advantages:
- Best for time-based data
- Easy archiving and deletion
- Efficient for range queries (e.g., last 3 months)
2. List Partitioning
Data is divided based on a predefined list of values.
Example (Region-based):
- Partition A → USA, Canada
- Partition B → India, Nepal, Sri Lanka
- Partition C → UK, Germany, France
Advantages:
- Good for categorical data
- Faster queries on specific categories
3. Hash Partitioning
Data is distributed using a hash function.
Example:
- Partition = hash(Customer_ID) mod 4 → rows are spread evenly across 4 partitions
Advantages:
- Uniform data distribution
- Avoids data skew
- Useful when no natural partition key exists
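Hash partitioning can be sketched in a few lines; the customer-ID format and the choice of 4 partitions are invented for illustration. Note the use of `zlib.crc32` rather than Python's built-in `hash()`, which is salted per process and would route the same key differently between runs.

```python
from zlib import crc32

NUM_PARTITIONS = 4

def partition_for(customer_id: str) -> int:
    """Route a row to a partition via a stable hash of the partition key."""
    return crc32(customer_id.encode()) % NUM_PARTITIONS

# Distribute 1000 toy customer IDs across the partitions.
partitions = {i: [] for i in range(NUM_PARTITIONS)}
for cid in (f"CUST-{i:04d}" for i in range(1000)):
    partitions[partition_for(cid)].append(cid)

sizes = [len(p) for p in partitions.values()]
print(sizes)  # roughly uniform sizes, near 250 each
```

This uniformity is exactly why hash partitioning avoids the skew that a range partition on, say, an alphabetical key would produce.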
4. Composite Partitioning (Hybrid)
Combination of two or more partitioning techniques.
Example (Range + Hash):
- First partition by Month
- Then apply hash on Customer_ID
Advantages:
- Suitable for very large datasets
- Balances performance and distribution
5. Column-based Partitioning (Columnar Storage)
Used in modern data warehouses.
Features:
- Stores data column-wise instead of row-wise
- High compression
- Very fast analytical queries
Examples of systems:
- BigQuery
- Snowflake
- Amazon Redshift
How Partitioning Works (Steps)
Step 1: Identify Large Tables
- Usually fact tables with billions of rows
Step 2: Select Partition Key
Common Keys:
- Date
- Product Category
- Region
- Department
- Customer ID
Step 3: Choose Partition Type
- Range
- List
- Hash
- Composite
Step 4: Create Partitions
- Table is divided into multiple segments on disk
Step 5: Manage Partitions
- Add new partitions (e.g., new month/year)
- Drop old partitions (archival)
- Merge or split partitions
Partition Pruning
Definition:
Query engines automatically skip irrelevant partitions.
Example Query:
SELECT * FROM Sales_Fact
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';
Only January partition is scanned, not the full table.
Result:
Faster query performance
Benefits of Partitioning
| Benefit | Explanation |
|---|---|
| Faster Query Performance | Only required partitions are scanned |
| Better ETL Speed | Incremental data loading |
| Easy Archival | Drop old partitions easily |
| Improved Index Efficiency | Smaller indexes per partition |
| High Availability | Operations at partition level |
| Manageability | Easier maintenance |
Example Scenario
Table: Sales_Fact
Partition Key: Sale_Date
Partition Type: Range Partitioning
| Partition Name | Date Range |
|---|---|
| SALES_2023_Q1 | Jan – Mar 2023 |
| SALES_2023_Q2 | Apr – Jun 2023 |
| SALES_2023_Q3 | Jul – Sep 2023 |
| SALES_2023_Q4 | Oct – Dec 2023 |
Operations You Can Perform
- Archive old partition (e.g., SALES_2021)
- Add new partition (e.g., SALES_2025)
- Query only recent data (last 6 months)
A Data Mart is a subset of a data warehouse that focuses on a specific business area such as Sales, Marketing, Finance, HR, or Inventory.
It is smaller, faster, and easier to manage than a full data warehouse.
Formal Definition
A Data Mart is a subject-oriented, integrated, and optimized collection of data designed to serve the needs of a particular department or business function.
Why Use Data Marts
- Provide quick access to relevant data
- Support department-specific reporting
- Reduce load on the main Data Warehouse
- Improve performance and simplify analytics
- Lower storage and management cost
Types of Data Marts
1. Dependent Data Mart
Definition:
Created from the Enterprise Data Warehouse (EDW).
Structure:
Operational Systems → EDW → Data Mart
Features:
- Uses centralized and clean data
- Highly consistent and reliable
Example:
Sales Data Mart extracted from EDW
2. Independent Data Mart
Definition:
Created directly from operational systems without a data warehouse.
Structure:
Operational Systems → Data Mart (no central EDW)
Features:
- Faster to build
- May cause data inconsistency
Example:
Marketing Data Mart built from CRM system
3. Hybrid Data Mart
Definition:
Combination of both dependent and independent approaches.
Features:
- Uses both EDW and operational systems
- Flexible and faster implementation
Example:
Finance Data Mart using EDW + external banking data
Data Mart Architecture
A Data Mart consists of 3 layers:
1. Source Layer
- Operational databases
- Data warehouse
- External data sources
2. ETL Layer (Extract – Transform – Load)
- Data cleaning
- Data integration
- Data transformation
- Loading into data mart
3. Data Mart Storage / Presentation Layer
- Fact tables
- Dimension tables
- OLAP cubes
- Reports and dashboards
Schema Used in Data Marts
Data marts typically use Dimensional Modeling:
- Star Schema
- Snowflake Schema
Data Mart Components
1. Fact Tables
- Store transactional data
- Contain measures
Examples:
- Sales amount
- Revenue
- Profit
2. Dimension Tables
- Store descriptive attributes
Examples:
- Time
- Product
- Customer
- Region
Examples of Data Marts
1. Sales Data Mart
- Measures: Sales amount, Quantity
- Dimensions: Time, Product, Store
2. Finance Data Mart
- Measures: Revenue, Expenses, Profit
- Dimensions: Time, Department, Account
3. HR Data Mart
- Measures: Employee count, Salary
- Dimensions: Department, Location, Job Role
Advantages of Data Marts
| Advantage | Explanation |
|---|---|
| Faster Access | Smaller dataset → quick queries |
| Better Performance | Optimized for specific tasks |
| Lower Cost | Less storage and processing |
| Easy to Manage | Can be handled by small teams |
| Department-Focused | Tailored for specific needs |
| Reduces DW Load | Offloads queries from main warehouse |
Disadvantages of Data Marts
| Disadvantage | Explanation |
|---|---|
| Data Inconsistency | Occurs in independent marts |
| Data Redundancy | Same data stored multiple times |
| Limited Scope | Only covers one department |
| Integration Issues | Hard to combine multiple marts |
Metadata means “data about data.”
It describes, explains, and gives meaning to the data stored in a data warehouse.
Formal Definition
Metadata is descriptive information that defines the structure, operations, rules, and contents of data in a data warehouse, enabling users and systems to understand and manage the data.
What Metadata Helps Users Understand
- What the data is
- Where it came from
- How it is structured
- How it should be used
Why Metadata is Important
- Helps users understand data definitions
- Helps ETL developers track data movement
- Provides source-to-target mapping
- Supports data governance
- Improves data quality and consistency
- Helps analysts generate accurate reports
Types of Metadata
1. Technical Metadata
Used by: IT teams, DBAs, ETL developers
Includes:
- Table names and column names
- Data types
- Primary keys and foreign keys
- Indexes and constraints
- Schema and table structure
- File paths and storage details
Example:
- Customer_ID → Integer (Primary Key)
- Date → Format: YYYY-MM-DD
2. Business Metadata
Used by: Business users, analysts
Includes:
- Business definitions
- KPIs (Key Performance Indicators)
- Data ownership
- Business rules
- Meaning of data fields
Example:
- Sales_Amount = Total value of products sold (excluding tax)
3. Operational Metadata
Describes: ETL and system operations
Includes:
- Data load time
- Record counts (before/after load)
- Error logs
- Job execution logs
- Data source details
Example:
- ETL Load Time: 02:00 AM
- Rows Loaded: 2,500,000
- Failed Rows: 1,200
Metadata in ETL Process
Metadata stores information about:
- Source extraction rules
- Transformation logic
- Mapping between source and target
- Load schedules
- Job dependencies
Metadata Repository (Metadata Catalog)
Definition:
A Metadata Repository is a central storage location where all metadata is stored.
Used By:
- ETL Tools
- BI Tools
- Data Warehouse Administrators
- Analysts
Examples of Metadata Tools:
- Informatica Repository
- SSIS Catalog
- AWS Glue Data Catalog
- Hive Metastore
Examples of Metadata
1. Technical Metadata
- Table: sales_fact
- Column: sale_date (DATE)
- Column: amount (NUMBER, NOT NULL)
2. Business Metadata
- Revenue = Total sales after discount
3. Operational Metadata
- Last ETL Run: 1 March 2025, 3:00 AM
- Status: Success
Advantages of Metadata
| Benefit | Explanation |
|---|---|
| Improves Data Understanding | Clear meaning of data |
| Better Data Quality | Detects errors and inconsistencies |
| Easier ETL Development | Provides clear documentation |
| Faster Data Retrieval | Helps optimize queries |
| Supports Governance | Tracks data origin and lineage |
A Multidimensional Data Model organizes data into:
- Facts → Numeric/measurable data
- Dimensions → Descriptive attributes
It allows analysis from multiple perspectives using OLAP operations like:
- Slice
- Dice
- Drill-down
- Roll-up
- Pivot
Scenario: Sales Analysis
We want to analyze sales across:
- Time
- Product
- Location
This forms a 3-Dimensional Data Cube
1. Fact Table: Sales_Fact
| Attribute | Description |
|---|---|
| product_id | Foreign Key → Product Dimension |
| time_id | Foreign Key → Time Dimension |
| location_id | Foreign Key → Location Dimension |
| sales_amount | Measure |
| quantity_sold | Measure |
Measures:
- Sales Amount
- Quantity Sold
2. Dimension Tables
A. Product Dimension
| product_id | product_name | category | brand | price |
|---|---|---|---|---|
| 1 | Laptop | Electronics | Dell | 60000 |
| 2 | Mobile | Electronics | Samsung | 20000 |
B. Time Dimension
| time_id | day | month | quarter | year |
|---|---|---|---|---|
| 101 | 12 | March | Q1 | 2024 |
| 102 | 15 | March | Q1 | 2024 |
C. Location Dimension (Example)
| location_id | city | state | country |
|---|---|---|---|
| 201 | Mumbai | Maharashtra | India |
| 202 | Delhi | Delhi | India |
Multidimensional Cube Representation
                Time (Day/Month/Quarter/Year)
                          ↑
                          |
Location (City/State) ← Sales Cube → Product (Category/Brand)
Each cell in the cube stores a value like:
Sales_Amount = f(Product, Time, Location)
Example Data Cube Values
| Product | Time | Location | Sales Amount | Quantity Sold |
|---|---|---|---|---|
| Laptop | March 2024 | Mumbai | 25,00,000 | 42 |
| Mobile | March 2024 | Delhi | 8,00,000 | 60 |
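The cube cells above can be sketched as a Python dictionary keyed by the three dimension values, with each cell holding the two measures (amounts written in plain digits):

```python
# Each cell: (product, time, location) -> measures
# Values mirror the example table above (25,00,000 = 2,500,000)
cube = {
    ("Laptop", "2024-03", "Mumbai"): {"sales_amount": 2_500_000, "quantity_sold": 42},
    ("Mobile", "2024-03", "Delhi"):  {"sales_amount":   800_000, "quantity_sold": 60},
}

# A cell lookup is exactly Sales_Amount = f(Product, Time, Location)
cell = cube[("Laptop", "2024-03", "Mumbai")]
print(cell["sales_amount"])  # 2500000
```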
OLAP Operations on Data Cube
1. Slice
Select a single dimension value.
Example:
- Sales for March 2024
Result: Product × Location
2. Dice
Select multiple values from multiple dimensions.
Example:
- Products: Laptop, Mobile
- Time: Q1 2024
- Location: Mumbai, Delhi
3. Drill-Down
Move from higher level to detailed level.
Example:
- Year → Quarter → Month → Day
4. Roll-Up
Aggregate data to a higher level.
Example:
- City → State → Country
5. Pivot (Rotate)
Reorient the cube (change axes).
Example:
- Swap Product and Time dimensions
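Slice, dice, and roll-up can each be sketched in a few lines over a dictionary-based cube. The cube data below is a small illustrative sample so the example is self-contained:

```python
# Cube cells: (product, time, location) -> sales amount (sample data)
cube = {
    ("Laptop", "2024-03", "Mumbai"): 2_500_000,
    ("Mobile", "2024-03", "Delhi"):    800_000,
    ("Laptop", "2024-04", "Delhi"):  1_200_000,
}

# Slice: fix the time dimension to one value -> a Product x Location sub-cube
slice_march = {(p, l): v for (p, t, l), v in cube.items() if t == "2024-03"}

# Dice: select sets of values on several dimensions at once
dice = {k: v for k, v in cube.items()
        if k[0] in {"Laptop", "Mobile"} and k[2] in {"Mumbai", "Delhi"}}

# Roll-up: aggregate the location dimension away (sum over all cities)
rollup = {}
for (p, t, l), v in cube.items():
    rollup[(p, t)] = rollup.get((p, t), 0) + v

print(slice_march[("Laptop", "Mumbai")])   # 2500000
print(rollup[("Laptop", "2024-04")])       # 1200000
```

Drill-down is the inverse of roll-up (moving back to the finer-grained cells), and pivot only reorders the key tuple, so neither changes the stored values.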
Pattern Warehousing is an extension of traditional data warehousing where patterns, knowledge, and insights discovered from data mining are stored instead of raw data.
Formal Definition
Pattern Warehousing is a data warehousing approach that stores patterns, rules, trends, or models extracted from data mining processes to enable faster and more intelligent decision-making.
Key Idea
- Traditional Data Warehouse → Stores raw data
- Pattern Warehouse → Stores patterns derived from data
Types of Patterns Stored
- Association Rules
- Classification Models
- Clustering Patterns
- Sequential Patterns
This avoids repeated data mining and saves time & computation.
Architecture of Pattern Warehousing
1. Operational Data Sources
- Raw transactional data from multiple systems
2. Data Warehouse
- Cleaned, integrated, historical data
3. Data Mining Layer
- Extracts:
- Patterns
- Rules
- Models
4. Pattern Warehouse
- Stores discovered patterns for reuse
5. Decision Support / OLAP Layer
- Users query patterns instead of raw data
Types of Patterns Stored (Detailed)
| Pattern Type | Description |
|---|---|
| Association Rules | “Customers who buy A also buy B” |
| Classification Models | Predict categories (e.g., spam/not spam) |
| Clustering Patterns | Groups similar data points |
| Sequential Patterns | Order of events over time |
Advantages of Pattern Warehousing
- Faster decision-making (pre-computed patterns)
- Reduces repeated data mining
- Quick discovery of insights
- Integrates patterns from multiple sources
- Supports advanced analytics and forecasting
Example: Retail Scenario
Step 1: Data Warehouse Stores
- Customer purchases
- Product information
- Time and location data
Step 2: Data Mining Discovers
- Association Rule:
  “70% of customers who buy bread also buy butter”
- Seasonal Trend:
  “Ice cream sales increase by 50% in summer”
Step 3: Pattern Warehouse Stores
- These rules and trends
Step 4: Usage
- Management queries patterns directly
- Helps in:
- Marketing campaigns
- Product placement
- Sales strategy
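The retail scenario above can be sketched as a minimal pattern warehouse: rules are stored once after mining, then queried directly without re-mining the raw data. The record layout and the `rules_for` helper are assumptions made for illustration:

```python
# Patterns stored once after the mining step (illustrative schema)
pattern_warehouse = [
    {"type": "association", "antecedent": "bread", "consequent": "butter",
     "confidence": 0.70},
    {"type": "trend", "item": "ice cream",
     "description": "sales increase by 50% in summer"},
]

def rules_for(item):
    """Return stored association rules whose antecedent matches `item`."""
    return [p for p in pattern_warehouse
            if p["type"] == "association" and p["antecedent"] == item]

# Marketing query: "what do bread buyers also buy?" -- answered from the
# pattern warehouse, with no repeated data mining
for rule in rules_for("bread"):
    print(rule["consequent"], rule["confidence"])  # butter 0.7
```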