ETL Data Warehouse Testing Interview Questions – Complete Real-Time Guide

1. What is ETL Testing? (Definition + Example)

ETL Testing is the process of validating data during Extract, Transform, and Load operations to ensure accuracy, completeness, consistency, and performance when data moves from source systems into a data warehouse (DW).

Real-World Example

In a telecom analytics project:

  • Source: CRM, billing system, flat files
  • Target: Enterprise Data Warehouse
  • ETL testing validates:
    • Source-to-target record counts
    • Transformation logic (usage aggregation, billing rules)
    • SCD Type 1 & Type 2 handling for customers
    • Audit fields (batch_id, load_date, checksum)
    • Performance SLA for daily loads

ETL data warehouse testing focuses on data correctness rather than UI behavior.


2. Data Warehouse Flow – Source → Staging → Transform → Load → Reporting

  1. Source Layer
    • OLTP databases, APIs, CSV/JSON files
  2. Staging Layer
    • Raw extracted data
    • Minimal validation
  3. Transformation Layer
    • Business rules
    • Joins, aggregations, lookups
    • SCD handling
  4. Load Layer
    • Fact and dimension tables
  5. Reporting Layer
    • BI dashboards, analytics, regulatory reports

Testing focus: S2T mapping validation, reconciliation, referential integrity, aggregates, and performance.


3. ETL & Data Warehouse Architecture (Tester’s View)

  • ETL Tools: Informatica, SSIS, Ab Initio
  • Metadata & Mapping Layer: S2T documents
  • Control Tables: Job status, row counts, reject counts
  • Audit Framework: Batch tracking, hashing
  • Parallel Processing: Partitioning for performance

ETL testers ensure data quality, restartability, and SLA compliance.


4. ETL Data Warehouse Testing Interview Questions & Answers (Basic → Advanced)

🔹 Basic Questions (1–15)

  1. What is ETL data warehouse testing?
    Validation of data loaded into a DW from source systems.
  2. Why is ETL testing critical?
    Incorrect data leads to wrong business decisions.
  3. What is a data warehouse?
    Central repository for reporting and analytics.
  4. What is staging area?
    Temporary storage for raw extracted data.
  5. What is S2T mapping?
    Source-to-Target document defining transformations.
  6. What is a fact table?
    Stores measurable metrics (sales, revenue).
  7. What is a dimension table?
    Stores descriptive attributes (customer, product).
  8. What are audit fields?
    batch_id, load_ts, source_system, checksum.
  9. What is reconciliation testing?
    Comparing source and target data.
  10. What is full load?
    Reloading entire target table.
  11. What is incremental load?
    Loading only new or changed records.
  12. What is primary key validation?
    Ensuring uniqueness and non-null values.
  13. What is reject data?
    Records failing validation rules.
  14. What is data profiling?
    Analyzing source data patterns.
  15. What is null validation?
    Checking null handling as per rules.

🔹 Intermediate Questions (16–35)

  1. Explain SCD Type 1.
    Overwrites old dimension data.
  2. Explain SCD Type 2.
    Maintains historical records.
  3. How do you test SCD2?
    Validate new row insert and old row expiry.
  4. What is a surrogate key?
    System-generated unique key.
  5. How do you validate surrogate keys?
    Check uniqueness and non-null.
  6. What is CDC?
    Change Data Capture for delta loads.
  7. How do you test CDC?
    Compare before/after snapshots.
  8. What is referential integrity testing?
    Fact FK exists in dimension PK.
  9. What is aggregation testing?
    Validating SUM, COUNT, AVG.
  10. What is lookup testing?
    Validating reference data mappings.
  11. What is deduplication?
    Removing duplicate business keys.
  12. How do you test dedup logic?
    GROUP BY or window functions.
  13. What is late-arriving dimension?
    Fact arrives before dimension.
  14. What is data skew?
    Uneven data distribution.
  15. What is restartability testing?
    Ensuring job resumes correctly after failure.
  16. What is metadata testing?
    Validating column names, types, lengths.
  17. What is data lineage?
    Tracking data from source to report.
  18. What is threshold testing?
    Job fails if rejects exceed limit.
  19. What is hashing in ETL?
    Detecting data changes efficiently.
  20. What is SLA testing?
    Ensuring ETL jobs complete on time.

🔹 Advanced & Scenario-Based Questions (36–60)

  1. How do you handle record count mismatch?
    Check filters, joins, rejects, CDC logic.
  2. How do you validate null handling?
    Defaults vs rejects vs allowed nulls.
  3. How do you test ETL performance?
    Partitioning, indexing, parallelism.
  4. How do you test incremental loads?
    Validate watermark logic.
  5. How do you test multi-source joins?
    Validate join keys and cardinality.
  6. How do you test aggregation failures?
    Recalculate totals using SQL.
  7. How do you test timezone conversions?
    Validate timestamps.
  8. How do you test re-runs?
    Ensure no duplicate data.
  9. How do you test historical data loads?
    Validate backdated inserts.
  10. How do you validate audit tables?
    Compare source_count vs target_count.
  11. What causes ETL performance issues?
    Data skew, large joins, missing indexes.
  12. How do you test file-based ETL?
    Header, footer, delimiter checks.
  13. How do you test schema changes?
    Backward compatibility validation.
  14. How do you validate reporting data?
    BI totals vs DW aggregates.
  15. Explain a critical ETL defect you found.
    Example: SCD2 failure, wrong aggregation.

5. Real SQL Query Examples for ETL & DW Validation

Sample Tables

src_orders(order_id, cust_id, amount, order_date)
fact_sales(order_sk, cust_sk, sales_amt, order_date, batch_id)

JOIN Validation (Missing Records)

SELECT COUNT(*)

FROM src_orders s

LEFT JOIN fact_sales f

ON s.order_id = f.order_sk

WHERE f.order_sk IS NULL;

GROUP BY Aggregation Validation

SELECT order_date, SUM(amount)

FROM src_orders

GROUP BY order_date;

SELECT order_date, SUM(sales_amt)

FROM fact_sales

GROUP BY order_date;

Window Function – De-duplication

SELECT *

FROM (

  SELECT *,

         ROW_NUMBER() OVER

         (PARTITION BY order_id ORDER BY load_date DESC) rn

  FROM stage_orders

) t

WHERE rn = 1;

Performance Tuning Validation

EXPLAIN ANALYZE

SELECT cust_sk, SUM(sales_amt)

FROM fact_sales

GROUP BY cust_sk;


6. Scenario-Based ETL Data Warehouse Testing

ScenarioValidation
Record mismatchSource vs target counts
Null handlingDefault or reject logic
Duplicate dataWindow functions
Late-arriving dataSCD2 backdated insert
Slow jobPartitioning & indexing

7. Common ETL Tools for DW Testing

  • Informatica
  • Microsoft SSIS
  • Ab Initio
  • Pentaho
  • Talend

Concepts remain same across tools.


8. ETL Defect Examples + Sample Test Case

Defect: SCD2 history not maintained

  • Expected: Old row expired, new row inserted
  • Actual: Two active records
  • Severity: High

Test Case:

  • Update dimension attribute
  • Validate effective_end_date and current_flag

9. ETL Data Warehouse Testing – Quick Revision Sheet

  • ETL flow & architecture
  • S2T mapping validation
  • SCD1 vs SCD2
  • SQL JOIN, GROUP BY, window functions
  • Hashing, CDC, audit fields
  • Performance & reconciliation

10. FAQs – Featured Snippet Ready

Q1. Is SQL mandatory for ETL data warehouse testing?
Yes, SQL is essential for validation.

Q2. Which ETL tool is most important?
Concepts matter more than tools.

Q3. Is ETL testing suitable for automation?
Yes, especially for regression and large datasets.

Leave a Comment

Your email address will not be published. Required fields are marked *