The Data Transformation Problem
Data transformation is traditionally approached as a deterministic engineering problem: define mappings, apply transformations, validate results. The agentic approach fundamentally reframes this as a sequential decision-making problem under uncertainty — precisely the domain where reinforcement learning excels.
| RL Concept | Data Transformation Analog |
|---|---|
| State | Current schema/data structure representation |
| Action | Transformation operations (filter, join, reshape, etc.) |
| Reward | Progress toward target schema, data quality metrics |
| Policy | Strategy for selecting transformations |
| Value Function | Estimation of how promising a partial transformation is |
| Environment | Data systems, APIs, and transformation tools |
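The mapping in the table can be sketched as a minimal interface. This is an illustrative stand-in, not a prescribed API; all names (`SchemaState`, `TransformAction`, the field-overlap reward) are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaState:
    """State: a hashable summary of the current schema."""
    fields: tuple  # (name, dtype) pairs

@dataclass(frozen=True)
class TransformAction:
    """Action: one transformation operation plus its parameters."""
    operation: str   # e.g. "rename", "filter", "join"
    params: tuple = ()

def reward(state: SchemaState, target: SchemaState) -> float:
    """Reward: fraction of target fields already present (schema alignment)."""
    current = set(state.fields)
    return sum(f in current for f in target.fields) / len(target.fields)

s = SchemaState(fields=(("user_id", "int"), ("name", "str")))
t = SchemaState(fields=(("user_id", "int"), ("full_name", "str")))
# reward(s, t) == 0.5: one of the two target fields already matches
```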
Core RL Frameworks
Model-Free vs. Model-Based Approaches
Model-Free: The agent learns directly from experience without building an explicit model of transformation outcomes.
Model-Based: The agent builds a model of how transformations affect data structures, then uses this model to plan transformation sequences.
Hybrid Approach (Recommended): Combine model-based planning with model-free learning — use model-free methods initially while building transformation knowledge, gradually shift to model-based planning as confidence in predictions increases, and maintain exploration to discover new transformation patterns.

On-Policy vs. Off-Policy Learning
On-Policy: The agent learns only from transformations it executes itself, which limits it to its own experience.
Off-Policy (Recommended): The agent can learn from transformations performed by other systems or human experts — learn from existing ETL pipelines without replicating them exactly, incorporate human demonstrations, and make more efficient use of available data.
State and Action Representation
Multi-Resolution State Representation
A multi-level representation of data schemas:
- Schema-level: Graph representation of tables/fields, type distributions, completeness and quality metrics
- Field-level: Name, detected type, semantic type, constraints, distribution summary, embeddings
- Attention mechanism: To focus on relevant parts of large schemas using relevance scoring against the current goal
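The attention idea can be sketched with a simple relevance score. This is a toy stand-in for a learned attention mechanism, assuming token overlap between field names and the goal as the relevance signal (`FieldState` and `attention_weights` are hypothetical names):

```python
import math
from dataclasses import dataclass

@dataclass
class FieldState:
    name: str
    dtype: str
    completeness: float  # fraction of non-null values

def attention_weights(fields, goal_tokens):
    """Score each field's relevance to the current goal, then
    softmax-normalize so the agent focuses on the relevant parts
    of a large schema."""
    scores = [sum(tok in f.name for tok in goal_tokens) for f in fields]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

fields = [FieldState("order_date", "str", 0.98),
          FieldState("customer_id", "int", 1.0),
          FieldState("notes", "str", 0.40)]
w = attention_weights(fields, goal_tokens=["date"])
# "order_date" receives the highest weight for a date-normalization goal
```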
Hierarchical Action Space
Transformation operations organized by level of abstraction:
- High-level: Schema restructuring, type conversion strategies, entity resolution
- Mid-level: Table operations (join, split, pivot), column operations (rename, combine, derive), row operations (filter, deduplicate)
- Low-level: Tool selection (SQL, jq, Python), parameter configuration, validation
Compositional actions allow the agent to construct complex transformations from primitive operations while maintaining a manageable action space:
```python
action = {
    "tool": selected_tool,
    "operation": operation_type,
    "parameters": {param1: value1, param2: value2},
    "target": {fields: [...], constraints: [...]}
}
```

Value Functions and Reward Design
Multi-Objective Reward Function
Data transformation quality has multiple dimensions that must be balanced:
- Schema Alignment — how closely the current schema matches the target
- Data Integrity — preservation of relationships and constraints
- Information Preservation — minimizing information loss
- Efficiency — computational resources required
- Simplicity — complexity of the transformation pipeline
```
R(s,a,s') = w₁·schema_similarity(s', target)
          + w₂·data_integrity(s')
          + w₃·information_retention(s, s')
          + w₄·efficiency_score(a)
          + w₅·simplicity_score(transformation_history)
```

where the weights w₁–w₅ can be learned from expert demonstrations or adjusted based on specific use cases.
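The weighted combination can be sketched directly. The weight and metric values below are illustrative assumptions, not recommended settings:

```python
def multi_objective_reward(metrics, weights):
    """Weighted sum of the five quality dimensions; the weights could be
    learned from demonstrations or tuned per use case."""
    return sum(weights[k] * metrics[k] for k in weights)

weights = {"schema_similarity": 0.4, "data_integrity": 0.25,
           "information_retention": 0.2, "efficiency": 0.1,
           "simplicity": 0.05}
metrics = {"schema_similarity": 0.8, "data_integrity": 1.0,
           "information_retention": 0.9, "efficiency": 0.7,
           "simplicity": 0.6}
r = multi_objective_reward(metrics, weights)
# r ≈ 0.85
```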
Learning Strategies
Curriculum Learning for Progressive Mastery
Organize training to build capabilities incrementally:
- Stage 1: Simple transformations — field renaming, basic type conversions, simple filtering
- Stage 2: Structural transformations — flattening nested structures, joining, grouping
- Stage 3: Complex semantic transformations — entity resolution, semantic type inference
- Stage 4: Multi-step transformation planning — long-horizon planning, optimization, error recovery
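A minimal gating rule for moving through the four stages, assuming a success-rate threshold as the mastery criterion (the threshold value is an assumption):

```python
def next_stage(stage, success_rate, threshold=0.9, max_stage=4):
    """Advance to the next curriculum stage only once the agent
    masters the current one (success rate above threshold)."""
    return min(stage + 1, max_stage) if success_rate >= threshold else stage

# An agent solving 95% of Stage 1 tasks graduates to Stage 2;
# one at 50% on Stage 2 keeps training there.
```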
Imitation Learning from Human Demonstrations
Before full RL training, bootstrap with human demonstrations: record expert data engineers, mine existing ETL pipelines, generate synthetic examples. Then apply behavioral cloning and inverse reinforcement learning to infer reward functions from demonstrations.
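Behavioral cloning in its simplest tabular form: tally which action experts took in each state and imitate the majority choice. This is a toy stand-in for fitting a classifier on state-to-action pairs; the state and action labels are hypothetical:

```python
from collections import Counter, defaultdict

def behavioral_cloning(demonstrations):
    """Count expert actions per state; the cloned policy replays the
    most frequent expert action for each state it has seen."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

demos = [("nested_json", "flatten"), ("nested_json", "flatten"),
         ("nested_json", "rename"), ("wide_table", "pivot")]
policy = behavioral_cloning(demos)
# policy["nested_json"] == "flatten"
```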
Temporal Difference Learning for Progressive Transformation
Traditional ETL approaches evaluate success only after complete pipeline execution. TD learning enables incremental learning from partial progress:
```python
def update_value_estimate(Q, state, action, reward, next_state, alpha, gamma):
    # One-step Q-learning: move Q(s, a) toward the TD target
    # r + γ · max_a' Q(s', a'), using only the observed transition.
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
```

Credit Assignment with Eligibility Traces
TD(λ) properly credits the early transformations that enable later success, reflecting the fact that early decisions constrain every step that follows:

```python
def td_lambda_update(Q, E, state, action, reward, next_state, next_action,
                     alpha, gamma, lam):
    # SARSA(λ): compute the TD error, increment the eligibility trace for
    # the current state-action pair, then update every visited pair in
    # proportion to its (decaying) trace.
    delta = reward + gamma * Q[(next_state, next_action)] - Q[(state, action)]
    E[(state, action)] += 1
    for sa in E:
        Q[sa] += alpha * delta * E[sa]
        E[sa] *= gamma * lam
```

Options Framework for Complex Transformations
Temporal Abstraction
The Options framework learns reusable transformation sequences:
```python
option = {
    "initiation_condition": predicate_on_data_state,
    "termination_condition": predicate_on_data_state,
    "policy": mapping_from_states_to_actions
}
```

Example options: `flatten_nested_json(target_field)`, `normalize_denormalized_table(table, keys)`, `standardize_date_formats(fields)`, `resolve_entity_duplicates(entity_type)`.
This addresses the hierarchical nature of data transformations, where high-level goals (like "normalize this table") comprise multiple atomic operations.
Planning and Model-Based Approaches
Monte Carlo Tree Search for Transformation Planning
MCTS efficiently explores the space of possible transformation sequences:
- Selection: Navigate tree using UCB selection to balance exploration/exploitation
- Expansion: Add new transformation nodes to the search tree
- Simulation: Use model to simulate outcomes of transformation sequences
- Backpropagation: Update value estimates based on simulated outcomes
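The selection step above can be sketched with the standard UCB1 rule; the node layout (dicts with `visits`/`value`) and the operation names are illustrative assumptions:

```python
import math

def ucb_select(children, c=1.4):
    """Selection step of MCTS: pick the child transformation node
    maximizing UCB1 = mean value + c · sqrt(ln(total visits) / visits),
    balancing exploitation of good transformations with exploration."""
    total = sum(n["visits"] for n in children)
    def ucb(n):
        if n["visits"] == 0:
            return float("inf")  # always try unexplored transformations first
        return n["value"] / n["visits"] + c * math.sqrt(math.log(total) / n["visits"])
    return max(children, key=ucb)

children = [{"op": "join",   "visits": 10, "value": 6.0},
            {"op": "pivot",  "visits": 2,  "value": 1.8},
            {"op": "filter", "visits": 0,  "value": 0.0}]
best = ucb_select(children)
# the unvisited "filter" node is selected first
```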
Novel Ideas
Schema Entropy Reduction
Frame data transformation as progressively reducing entropy (disorder) in the data schema. Measure "schema entropy" as a function of type inconsistencies, structural irregularities, and semantic ambiguities. Each transformation step aims to reduce this entropy — selecting the transformation that maximizes entropy reduction at each step.
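One concrete way to measure a slice of this entropy is Shannon entropy over the types observed in a single field; this is a sketch of that component only, not a full schema-entropy metric:

```python
import math
from collections import Counter

def schema_entropy(type_observations):
    """Shannon entropy (bits) over a field's observed types: a field
    with one consistent type scores 0; mixed types score higher, so
    cleaning transformations should drive this toward zero."""
    counts = Counter(type_observations)
    n = len(type_observations)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

messy = ["int", "str", "int", "str"]   # inconsistent types
clean = ["int", "int", "int", "int"]
# schema_entropy(messy) == 1.0, schema_entropy(clean) == 0.0
```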
Inverse Transformation Learning
Instead of just learning to transform data forward, learn to infer what transformations might have produced a given data structure — enabling "reverse engineering" of data pipelines.
Curiosity-Driven Data Exploration
Implement intrinsic motivation to encourage discovery of unexpected data patterns. Reward the system for discovering surprising patterns or anomalies in the data — larger surprise means higher curiosity reward.
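A minimal sketch of this intrinsic reward, assuming surprise is measured as the error between a predicted and an observed data statistic (the null-rate example is hypothetical):

```python
def curiosity_reward(predicted, observed, scale=1.0):
    """Intrinsic reward proportional to prediction error: the further a
    data statistic lands from what the model expected, the larger the
    surprise and the higher the curiosity bonus."""
    return scale * abs(observed - predicted)

# Expected 2% nulls in a column but observed 30%: a surprising column
# earns a large exploration bonus.
r = curiosity_reward(predicted=0.02, observed=0.30)
```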
Meta-Learning for Transformation Strategy
Train a meta-model that predicts which transformation strategies will work best for different types of data — learning to adapt the transformation approach itself based on data characteristics.
Theoretical Connections
Transformation as State-Space Navigation
The process of transforming data can be framed as navigating through a state space where states are different possible representations of the data, actions are transformation operations, and reward measures progress toward the target. This reveals why many transformation problems are challenging — the state space is vast, partially observable, and has complex transition dynamics.
The Temporal Credit Assignment Problem in ETL
The classic temporal credit assignment problem in RL has a perfect analog in data transformation: was the successful result due to the early data cleaning step, the clever join operation in the middle, or the type conversion at the end? Eligibility traces provide a mechanism for properly attributing credit across multi-step transformations.
Practical "Aha Moments"
- Data Sampling is Like Experience Replay — strategically sampling subsets of data for transformation planning is analogous to experience replay in deep RL
- Pattern Libraries are Value Function Approximators — a library of transformation patterns with success rates is essentially a value function approximator
- Human Feedback as Reward Shaping — human input guides the system toward promising areas of the transformation space without specifying exact steps
- LLMs Already Implement Policy Networks — LLMs trained on code and data transformations already implement a kind of policy network
- The Target Network Concept Applied to Schema Evolution — using a separate "target network" for evaluating transformations can improve stability when transforming rapidly evolving schemas