ETL QA Testing Interview Questions – Complete Real-Time Interview Guide

1. What is ETL Testing? (Definition + Example)

ETL Testing is the process of validating data during Extract, Transform, and Load operations to ensure accuracy, completeness, consistency, and performance when data moves from source systems into a data warehouse (DW).

Real-Time Example (ETL QA Role)

In an e-commerce analytics project:

  • Source: Order DB, customer CRM, product files
  • Target: Enterprise Data Warehouse
  • ETL QA validates:
    • Source vs target record counts
    • Transformation rules (discounts, tax, currency conversion)
    • SCD Type 1 & Type 2 for customer/product dimensions
    • Audit fields (batch_id, load_date, checksum)
    • Performance SLA for daily loads

ETL QA testing focuses on data correctness and business logic, not UI behavior.


2. DW Flow – Source → Staging → Transform → Load → Reporting

  1. Source Layer
    OLTP databases, APIs, flat files (CSV, JSON, XML)
  2. Staging Layer
    Raw extracted data with minimal validation
  3. Transformation Layer
    Business rules, joins, aggregations, lookups, SCD handling
  4. Load Layer
    Fact tables and dimension tables
  5. Reporting Layer
    BI dashboards, analytics, regulatory reports

QA focus: S2T mapping validation, reconciliation, referential integrity, aggregates, and performance.


3. ETL Architecture – QA Tester’s Perspective

  • ETL Tools Layer: Informatica, SSIS, Ab Initio
  • Mapping & Metadata Layer: S2T documents, business rules
  • Control Tables: Job status, row counts, error counts
  • Audit Framework: Batch tracking, hashing, rejects
  • Performance Layer: Partitioning, parallel processing

ETL QA ensures data quality, restartability, and SLA compliance.


4. ETL QA Testing Interview Questions & Answers (Basic → Advanced)

🔹 Basic ETL QA Interview Questions (1–15)

  1. What is ETL QA testing?
    Validation of data extraction, transformation, and loading into a DW.
  2. Why is ETL QA important?
    Incorrect data leads to wrong business decisions.
  3. What is a data warehouse?
    Centralized repository for analytics and reporting.
  4. What is a staging area?
    Temporary storage for raw extracted data.
  5. What is S2T mapping?
    Source-to-Target document defining column mappings and rules.
  6. What is a fact table?
    Stores measurable business metrics.
  7. What is a dimension table?
    Stores descriptive attributes.
  8. What are audit fields?
    batch_id, load_date, source_system, checksum.
  9. What is data reconciliation?
    Comparing source and target data.
  10. What is full load?
    Reloading entire target table.
  11. What is incremental load?
    Loading only new or changed records.
  12. What is primary key validation?
    Ensuring uniqueness and non-null values.
  13. What is reject data?
    Records failing validation rules.
  14. What is data profiling?
    Analyzing source data patterns.
  15. What is null validation?
    Verifying null handling per business rules.

🔹 Intermediate ETL QA Questions (16–35)

  1. Explain SCD Type 1.
    Overwrites old dimension data without history.
  2. Explain SCD Type 2.
    Maintains historical records using effective dates and flags.
  3. How do you test SCD2?
    Validate new row insertion, old row expiry, current_flag.
  4. What is a surrogate key?
    System-generated unique identifier.
  5. How do you validate surrogate keys?
    Check uniqueness and non-null constraints.
  6. What is CDC (Change Data Capture)?
    Processing only changed or new records.
  7. How do you test CDC logic?
    Compare before/after snapshots.
  8. What is referential integrity testing?
    Fact FK must exist in dimension PK.
  9. What is aggregation testing?
    Validating SUM, COUNT, AVG logic.
  10. What is lookup testing?
    Verifying reference data mappings.
  11. What is deduplication?
    Removing duplicate business keys.
  12. How do you test dedup logic?
    Using GROUP BY or window functions.
  13. What is late-arriving dimension?
    Fact arrives before dimension.
  14. What is data skew?
    Uneven data distribution impacting performance.
  15. What is restartability testing?
    Ensuring ETL resumes correctly after failure.
  16. What is metadata testing?
    Validating column names, types, lengths.
  17. What is data lineage?
    Tracking data from source to report.
  18. What is threshold testing?
    Job fails if reject count exceeds limit.
  19. What is hashing in ETL?
    Detecting data changes efficiently.
  20. What is SLA testing?
    Ensuring ETL jobs complete on time.

🔹 Advanced & Scenario-Based Questions (36–60)

  1. How do you handle record count mismatch?
    Check filters, joins, rejects, CDC logic.
  2. How do you validate null handling?
    Defaults vs rejects vs allowed nulls.
  3. How do you test ETL performance?
    Partitioning, indexing, parallelism.
  4. How do you test incremental loads?
    Validate watermark logic.
  5. How do you test multi-source joins?
    Validate join keys and cardinality.
  6. How do you test aggregation failures?
    Recalculate totals using SQL.
  7. How do you test timezone conversions?
    Validate timestamps across zones.
  8. How do you test re-runs?
    Ensure no duplicate data.
  9. How do you test historical data loads?
    Validate backdated inserts.
  10. How do you validate audit tables?
    Compare source_count vs target_count.
  11. What causes ETL performance issues?
    Data skew, large joins, missing indexes.
  12. How do you test file-based ETL?
    Header, footer, delimiter, encoding.
  13. How do you test schema changes?
    Backward compatibility validation.
  14. How do you validate reporting data?
    BI totals vs DW aggregates.
  15. Explain a critical ETL QA defect you found.
    Example: SCD2 failure, incorrect aggregation.

5. Real SQL Query Examples for ETL QA Validation

Sample Tables

src_orders(order_id, cust_id, amount, order_date)
fact_sales(order_sk, cust_sk, sales_amt, order_date, batch_id)

JOIN Validation – Missing Records

SELECT COUNT(*)

FROM src_orders s

LEFT JOIN fact_sales f

ON s.order_id = f.order_sk

WHERE f.order_sk IS NULL;

GROUP BY – Aggregation Validation

SELECT order_date, SUM(amount)

FROM src_orders

GROUP BY order_date;

SELECT order_date, SUM(sales_amt)

FROM fact_sales

GROUP BY order_date;

Window Function – De-duplication

SELECT *

FROM (

  SELECT *,

         ROW_NUMBER() OVER

         (PARTITION BY order_id ORDER BY load_date DESC) rn

  FROM stage_orders

) t

WHERE rn = 1;

Performance Tuning Validation

EXPLAIN ANALYZE

SELECT cust_sk, SUM(sales_amt)

FROM fact_sales

GROUP BY cust_sk;


6. Scenario-Based ETL QA Testing Examples

ScenarioQA Validation
Record mismatchSource vs target counts
Null handlingDefault or reject logic
Duplicate dataWindow function checks
Late-arriving dataSCD2 backdated insert
Slow jobPartitioning & indexing

7. ETL Tools Commonly Used by QA Teams

  • Informatica
  • Microsoft SSIS
  • Ab Initio
  • Pentaho
  • Talend

ETL QA concepts remain same across tools.


8. ETL QA Defect Examples + Sample Test Case

Defect: SCD2 history not maintained

  • Expected: Old row expired, new row inserted
  • Actual: Two active records
  • Severity: High

Sample Test Case:

  • Update dimension attribute
  • Validate effective_start_date, effective_end_date, current_flag

9. ETL QA Testing – Quick Revision Sheet

  • ETL flow & architecture
  • S2T mapping validation
  • SCD1 vs SCD2
  • SQL JOIN, GROUP BY, window functions
  • Hashing, CDC, audit fields
  • Performance & reconciliation

10. FAQs – Featured Snippet Ready

Q1. Is SQL mandatory for ETL QA testing roles?
Yes. SQL is essential for data validation.

Q2. Are ETL QA interviews tool-specific?
Mostly no. Concepts matter more than tools.

Q3. Can ETL QA testing be automated?
Yes, especially for regression and large datasets.

Leave a Comment

Your email address will not be published. Required fields are marked *