1. What is ETL Testing? (Definition + Example)
ETL Testing is the process of validating data accuracy, completeness, transformation logic, and performance as data moves from source systems into a data warehouse.
Real-World Example (IBM Project Context)
- Source: Banking transaction system (DB2 / Oracle)
- Transformation: Currency conversion, deduplication, SCD handling
- Target: Enterprise Data Warehouse (EDW)
- Reporting: IBM Cognos / BI dashboards
Goal: Ensure business-critical reports use trusted, reconciled data.
2. Data Warehouse Flow: Source → Staging → Transform → Load → Reporting
DW Layer Responsibilities
| Layer | Description |
| Source | OLTP systems, files, APIs |
| Staging | Raw data landing zone |
| Transformation | Business rules, joins, aggregations |
| Target (DW) | Fact & dimension tables |
| Reporting | BI tools, KPIs, analytics |
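The layers above can be sketched end to end with an in-memory SQLite database. This is only a minimal illustration: the table names, columns, and sample rows are hypothetical, and a real IBM project would use DB2 or Oracle and a dedicated ETL tool rather than hand-written Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging_orders (order_id INTEGER, customer_id INTEGER, order_amount REAL);
CREATE TABLE fact_orders (customer_id INTEGER, total_amount REAL, order_count INTEGER);
""")

# Source: rows as they might arrive from an OLTP system (hypothetical data).
source_rows = [(1, 10, 100.0), (2, 10, 40.0), (3, 11, 60.0)]

# Staging: land the raw rows unchanged.
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", source_rows)

# Transformation + load: aggregate per customer into the fact table.
conn.execute("""
INSERT INTO fact_orders (customer_id, total_amount, order_count)
SELECT customer_id, SUM(order_amount), COUNT(*)
FROM staging_orders
GROUP BY customer_id
""")

# Checks a tester might run at each hop.
staged_count = conn.execute("SELECT COUNT(*) FROM staging_orders").fetchone()[0]
fact_rows = conn.execute(
    "SELECT customer_id, total_amount, order_count FROM fact_orders ORDER BY customer_id"
).fetchall()

# Reporting: a single KPI computed from the target table.
total_revenue = conn.execute("SELECT SUM(total_amount) FROM fact_orders").fetchone()[0]

print(staged_count, fact_rows, total_revenue)
```

The tester's job is to confirm that what lands in staging matches the source, and that the fact table and the KPI agree with the staged rows.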
3. ETL Architecture & Source-to-Target (S2T) Mapping
ETL Architecture Components
- Source systems (DB2, Oracle, Flat files)
- Staging schema
- ETL tool
- Data warehouse
- Reporting layer
S2T Mapping Validation Includes
- Column-to-column mapping
- Data type & length checks
- Transformation logic
- Default values & audit fields
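One way to automate part of an S2T check is to compare the target table's metadata against the mapping document. The sketch below assumes the mapping has been transcribed into a Python dict (the `S2T_MAPPING` structure, column names, and types are all hypothetical) and uses SQLite's `PRAGMA table_info` as a stand-in for the catalog views a DB2 or Oracle tester would query.

```python
import sqlite3

# Hypothetical S2T mapping: target column -> (source column, expected target type).
S2T_MAPPING = {
    "order_id": ("ord_id", "INTEGER"),
    "order_amount": ("amt", "DECIMAL(10,2)"),
    "customer_name": ("cust_nm", "VARCHAR(50)"),
    "load_date": (None, "DATE"),  # audit field with no source column
}


def target_schema(conn, table):
    """Return {column_name: declared_type} using SQLite's PRAGMA table_info."""
    # The table name is interpolated directly, so pass trusted identifiers only.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2].upper() for row in rows}


def validate_mapping(conn, table, mapping):
    """List columns that are missing, unmapped, or declared with a different type."""
    actual = target_schema(conn, table)
    issues = []
    for column, (_source, expected_type) in mapping.items():
        if column not in actual:
            issues.append(f"missing column: {column}")
        elif actual[column] != expected_type.upper():
            issues.append(
                f"type mismatch on {column}: expected {expected_type}, found {actual[column]}"
            )
    for column in actual:
        if column not in mapping:
            issues.append(f"unmapped column: {column}")
    return issues


conn = sqlite3.connect(":memory:")
# customer_name is deliberately declared shorter than the mapping requires.
conn.execute(
    "CREATE TABLE target_fact_orders ("
    "order_id INTEGER, order_amount DECIMAL(10,2), "
    "customer_name VARCHAR(20), load_date DATE)"
)
issues = validate_mapping(conn, "target_fact_orders", S2T_MAPPING)
print(issues)
```

The comparison here is a plain string match on the declared type, which is enough to catch a length mismatch that could cause truncation. Transformation logic and default values still need data-level checks.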
4. IBM ETL Testing Interview Questions (Basic → Advanced)
Below are 75 IBM ETL testing interview questions with concise answers.
A. Basic ETL Testing Interview Questions (1–20)
- What is ETL testing?
Validating data extraction, transformation, and loading.
- Why is ETL testing important in IBM projects?
IBM clients rely on the EDW for regulatory and financial reporting.
- What is a data warehouse?
A central repository for analytical data.
- What is a staging area?
Temporary storage for raw extracted data.
- What is S2T mapping?
A document defining source-to-target transformation rules.
- What is the difference between ETL and ELT?
ETL transforms data before loading; ELT transforms it after loading.
- What is data reconciliation?
Comparing source and target data to confirm they match.
- What is a surrogate key?
A system-generated unique identifier.
- What is the difference between fact and dimension tables?
Fact tables hold measures; dimension tables hold descriptive attributes.
- What is a full load?
Loading the entire dataset.
- What is an incremental load?
Loading only changed data.
- What is an audit column?
A column that records load metadata, such as load_date, batch_id, or updated_ts.
- What is data profiling?
Analyzing source data quality.
- What is truncation testing?
Ensuring data is not cut off because of column size.
- What is referential integrity?
Every foreign key in a fact table must exist in the corresponding dimension.
- What is data validation?
Verifying that data meets business rules.
- What is data verification?
Verifying that data moved from source to target correctly.
- What is CDC?
Change Data Capture: identifying the rows that changed at the source so only those are loaded.
- What is a reject table?
A table that stores records that failed validation or loading.
- What is data lineage?
Tracking data from source to report.
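Two of the ideas above, data reconciliation and referential integrity, can be shown together. The following is a minimal sketch using an in-memory SQLite database. The table names follow the ones used elsewhere in this article, while the sample rows and the `count_and_sum` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (order_id INTEGER, customer_id INTEGER, order_amount REAL);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE target_fact_orders (order_id INTEGER, customer_id INTEGER, order_amount REAL);

INSERT INTO source_orders VALUES (1, 10, 100.0), (2, 11, 250.5), (3, 12, 75.0);
INSERT INTO dim_customer VALUES (10, 'Asha'), (11, 'Ben');
INSERT INTO target_fact_orders VALUES (1, 10, 100.0), (2, 11, 250.5), (3, 12, 75.0);
""")
# Note: order 3 references customer 12, which is absent from dim_customer.


def count_and_sum(conn, table):
    """Return (row count, total order_amount) for a table (trusted table names only)."""
    return conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(order_amount), 0) FROM {table}"
    ).fetchone()


# Reconciliation: counts and totals should match between source and target.
source_totals = count_and_sum(conn, "source_orders")
target_totals = count_and_sum(conn, "target_fact_orders")
reconciled = source_totals == target_totals

# Referential integrity: fact rows whose customer_id has no dimension row.
orphans = conn.execute("""
SELECT f.order_id, f.customer_id
FROM target_fact_orders f
LEFT JOIN dim_customer d ON f.customer_id = d.customer_id
WHERE d.customer_id IS NULL
""").fetchall()

print(reconciled, orphans)  # True [(3, 12)]
```

The load reconciles on counts and sums yet still carries an orphaned fact row, which is why matching totals alone do not prove the target is correct.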
B. SQL-Based ETL Testing Questions (21–45)
- How do you validate record count?
SELECT COUNT(*) FROM source_orders;
SELECT COUNT(*) FROM target_fact_orders;
- How to find duplicate records?
SELECT order_id, COUNT(*)
FROM staging_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
- How do you validate JOIN logic?
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c
ON o.customer_id = c.customer_id;
- How to validate aggregation logic?
SELECT customer_id, SUM(order_amount)
FROM fact_orders
GROUP BY customer_id;
- How to detect missing records?
SELECT s.id
FROM source_table s
LEFT JOIN target_table t
ON s.id = t.id
WHERE t.id IS NULL;
- What is GROUP BY used for in ETL testing?
Validating summaries and totals.
- How to validate null handling?
SELECT COUNT(*) FROM dim_customer WHERE email IS NULL;
- What is a Slowly Changing Dimension (SCD)?
A dimension whose attribute changes are managed over time.
- Difference between SCD Type 1 and Type 2?
Type 1 overwrites the old value; Type 2 preserves history by adding a new row.
- SCD2 validation query
-- Lists customers with more than one version row, which is expected under Type 2.
SELECT customer_id, COUNT(*)
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;
- How to validate current record in SCD2?
Filter on current_flag = 'Y'; each customer_id should have exactly one such row.
- What is hashing in ETL?
Detecting data changes by comparing hash keys computed over a row's columns.
- How to validate derived columns?
-- Recompute the expected value from staging, then compare it with the target column.
SELECT amount * tax_rate AS expected_tax FROM staging_sales;
- How to validate date transformations?
SELECT * FROM fact_orders WHERE order_date > CURRENT_DATE;
- What is lookup validation?
Ensuring lookup values exist and match.
- What is a control table?
A table that tracks batch status and load counts.
- What is a watermark column?
A column that marks the last loaded position, used for incremental loads.
- What is a late arriving dimension?
A fact arrives before its dimension record.
- What is a late arriving fact?
A fact arrives after its reporting window.
- What is data balancing?
Matching totals across systems.
- How do you validate decimal precision?
-- Returns rows whose amount would change when cast to the target precision.
SELECT amount FROM staging WHERE amount <> CAST(amount AS DECIMAL(10,2));
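- Difference between truncate and delete?
TRUNCATE removes all rows and is usually faster; whether it can be rolled back depends on the database (in Oracle it commits implicitly and cannot be). DELETE can filter rows with WHERE and can be rolled back.
- What is metadata testing?
Validating schema and data types.
- What is a factless fact table?
A fact table that tracks events without measures.
- What is idempotent ETL?
An ETL job that produces the same result when run multiple times on the same input.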
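The SCD Type 2 answers above say that each business key should have exactly one current record. A minimal sketch of that check follows, using an in-memory SQLite database. The `dim_customer` columns other than `customer_id` and `current_flag` (such as `city`, `effective_date`, and `end_date`) and all sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    sk INTEGER PRIMARY KEY,
    customer_id INTEGER,
    city TEXT,
    effective_date TEXT,
    end_date TEXT,
    current_flag TEXT
);
INSERT INTO dim_customer VALUES
    (1, 100, 'Pune',    '2023-01-01', '2023-06-30', 'N'),
    (2, 100, 'Mumbai',  '2023-07-01', NULL,         'Y'),
    (3, 200, 'Delhi',   '2023-01-01', NULL,         'Y'),
    (4, 200, 'Noida',   '2023-09-01', NULL,         'Y'),
    (5, 300, 'Chennai', '2023-01-01', '2023-03-31', 'N');
""")
# Customer 100 is healthy: one expired row and one current row.
# Customer 200 has two current rows; customer 300 has none. Both are defects.

bad_current = conn.execute("""
SELECT customer_id,
       SUM(CASE WHEN current_flag = 'Y' THEN 1 ELSE 0 END) AS current_rows
FROM dim_customer
GROUP BY customer_id
HAVING SUM(CASE WHEN current_flag = 'Y' THEN 1 ELSE 0 END) <> 1
ORDER BY customer_id
""").fetchall()

print(bad_current)  # [(200, 2), (300, 0)]
```

Counting the flag per key catches both failure modes in one pass. A plain `WHERE current_flag = 'Y'` filter would miss keys that have no current row at all.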
C. Advanced & Performance ETL Questions (46–75)
- Window function example
SELECT customer_id,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY updated_ts DESC) rn
FROM dim_customer;
- Why use window functions in ETL?
For ranking, deduplication, and SCD logic.
- What is ETL performance testing?
Measuring load time and throughput.
- How to tune slow ETL jobs?
Indexing, partitioning, and parallelism.
- What is pushdown optimization?
Running transformations inside the database instead of in the ETL engine.
- What is partitioning?
Dividing tables into smaller segments to improve performance.
- How do you validate data freshness?
SELECT MAX(load_date) FROM fact_sales;
- What is ETL regression testing?
Ensuring changes don’t break existing flows.
- How do you test error handling?
Validate reject tables and logs.
- What is data skew?
Uneven data distribution.
- What is bulk load?
A high-volume loading strategy.
- How do you validate historical accuracy?
Check effective_date ranges.
- What is schema evolution testing?
Validating that the ETL handles source schema changes.
- What is data latency?
The delay between data being available at the source and in the DW.
- How do you validate negative scenarios?
Test with invalid, null, and boundary data.
- What is a reconciliation report?
A summary of counts, sums, and rejects.
- What is ETL restartability?
The ability to resume after a failure.
- What is data anonymization testing?
Validating that PII is masked.
- How do you validate surrogate key uniqueness?
SELECT sk, COUNT(*) FROM dim_customer GROUP BY sk HAVING COUNT(*) > 1;
- What is parallel processing?
Running ETL jobs concurrently.
- What is audit trail testing?
Validating batch_id and timestamps.
- What is data archival testing?
Verifying that old data is moved correctly.
- What is transformation logic testing?
Validating business rules.
- What is end-to-end ETL testing?
Validating data from source through to report.
- What is OLTP vs OLAP?
OLTP handles transactions; OLAP handles analytics.
- What is data drift?
Unexpected changes in data.
- What is reject analysis?
Finding the root cause of rejected records.
- How do you validate currency conversion?
-- Returns rows where the converted amount differs from usd_amt by more than a rounding tolerance (0.01 is an assumed value).
SELECT * FROM staging WHERE ABS(local_amt * rate - usd_amt) > 0.01;
- What is data mart testing?
Validating subject-specific subsets of the DW.
- What is the most critical ETL testing skill for IBM interviews?
Strong SQL + business understanding.
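The window-function pattern from this section is often used to pick the latest record per key during deduplication. The sketch below runs it against an in-memory SQLite database. It assumes SQLite 3.25 or later (the first version with window functions), and the `staging_customer` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging_customer (customer_id INTEGER, email TEXT, updated_ts TEXT);
INSERT INTO staging_customer VALUES
    (1, 'old@example.com',  '2024-01-01 09:00:00'),
    (1, 'new@example.com',  '2024-02-01 09:00:00'),
    (2, 'solo@example.com', '2024-01-15 12:00:00');
""")

# Rank each customer's rows from newest to oldest and keep only rank 1.
latest = conn.execute("""
SELECT customer_id, email
FROM (
    SELECT customer_id, email,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_ts DESC) AS rn
    FROM staging_customer
) AS ranked
WHERE rn = 1
ORDER BY customer_id
""").fetchall()

print(latest)  # [(1, 'new@example.com'), (2, 'solo@example.com')]
```

A tester can compare this expected result with what the ETL job actually loaded into the dimension. Any difference points to a deduplication or ordering defect.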
5. ETL Tools Used in IBM Projects
- Informatica
- Microsoft SSIS
- Ab Initio
- Pentaho
- Talend
6. ETL Defect Examples + Sample Test Case
Defect: Duplicate records in Fact table
Root Cause: Business key missing from the JOIN condition
Fix: Correct the JOIN and add hash-based duplicate detection
Sample Test Case
| Field | Value |
| Scenario | Duplicate detection |
| SQL | GROUP BY business key HAVING COUNT(*) > 1 |
| Expected | No duplicates |
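The sample test case can be turned into an automated check. This is a minimal sketch, not a prescribed framework: `find_duplicates` is a hypothetical helper, and the `fact_orders` rows are seeded with a duplicate on purpose so the check has something to report.

```python
import sqlite3


def find_duplicates(conn, table, key_column):
    """Return (key, count) for every key value that appears more than once."""
    # Identifiers are interpolated directly, so pass trusted names only.
    query = (
        f"SELECT {key_column}, COUNT(*) FROM {table} "
        f"GROUP BY {key_column} HAVING COUNT(*) > 1 ORDER BY {key_column}"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, order_amount REAL)")
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?)",
    [(1, 100.0), (2, 50.0), (2, 50.0)],  # order_id 2 is loaded twice
)

duplicates = find_duplicates(conn, "fact_orders", "order_id")
test_passed = duplicates == []  # Expected result from the test case: no duplicates

print(duplicates, test_passed)  # [(2, 2)] False
```

Here the test fails, which is the behavior a tester wants when the defect described above is present. After the fix, the same call should return an empty list.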
7. Quick Revision Sheet (IBM Interview Ready)
- Validate counts, sums, duplicates
- Focus on SCD1 / SCD2
- Practice JOIN, GROUP BY, window functions
- Understand performance tuning
8. FAQs
Q1. What SQL skills are required for IBM ETL testing?
JOIN, GROUP BY, subqueries, window functions.
Q2. Is Informatica mandatory for IBM ETL roles?
Not mandatory, but widely used.
Q3. What is the most-asked ETL topic in IBM interviews?
SCD, reconciliation, SQL validation.
