
Appendix B: Reinforcement Learning for Data Transformation

Hrishi Olickel·February 26, 2026
  • The Data Transformation Problem
  • Core RL Frameworks
  • Model-Free vs. Model-Based Approaches
  • On-Policy vs. Off-Policy Learning
  • State and Action Representation
  • Multi-Resolution State Representation
  • Hierarchical Action Space
  • Value Functions and Reward Design
  • Multi-Objective Reward Function
  • Learning Strategies
  • Curriculum Learning for Progressive Mastery
  • Imitation Learning from Human Demonstrations
  • Temporal Difference Learning for Progressive Transformation
  • Credit Assignment with Eligibility Traces
  • Options Framework for Complex Transformations
  • Temporal Abstraction
  • Planning and Model-Based Approaches
  • Monte Carlo Tree Search for Transformation Planning
  • Novel Ideas
  • Schema Entropy Reduction
  • Inverse Transformation Learning
  • Curiosity-Driven Data Exploration
  • Meta-Learning for Transformation Strategy
  • Theoretical Connections
  • Transformation as State-Space Navigation
  • The Temporal Credit Assignment Problem in ETL
  • Practical "Aha Moments"
Note

This spec was written in Q2 2025, when we were exploring RL as a framework for teaching agents to perform data transformations. We abandoned the approach when we abandoned the idea of a singular intermediate format — but the thinking here about state representation, temporal credit assignment, and the options framework for temporal abstraction fed directly into how hanks are structured today. See the parent article for context.

The Data Transformation Problem

Data transformation is traditionally approached as a deterministic engineering problem: define mappings, apply transformations, validate results. The agentic approach fundamentally reframes this as a sequential decision-making problem under uncertainty — precisely the domain where reinforcement learning excels.

RL concepts and their data transformation analogs:

  • State: Current schema/data structure representation
  • Action: Transformation operations (filter, join, reshape, etc.)
  • Reward: Progress toward target schema, data quality metrics
  • Policy: Strategy for selecting transformations
  • Value Function: Estimate of how promising a partial transformation is
  • Environment: Data systems, APIs, and transformation tools

Core RL Frameworks

Model-Free vs. Model-Based Approaches

Model-Free: The agent learns directly from experience without building an explicit model of transformation outcomes.

Model-Based: The agent builds a model of how transformations affect data structures, then uses this model to plan transformation sequences.

Hybrid Approach (Recommended): Combine model-based planning with model-free learning — use model-free methods initially while building transformation knowledge, gradually shift to model-based planning as confidence in predictions increases, and maintain exploration to discover new transformation patterns.
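The switching logic in the hybrid approach can be sketched as a small controller that trusts the learned model only once its recent prediction error is low. This is a minimal sketch; the error threshold, window size, and mode names are illustrative assumptions, not part of the spec.

```python
import random

class HybridController:
    """Act model-free until the learned transition model becomes
    reliable, then prefer model-based planning, while keeping a
    small exploration budget throughout."""

    def __init__(self, error_threshold=0.1, window=50):
        self.error_threshold = error_threshold
        self.window = window
        self.recent_errors = []  # per-step model prediction errors

    def record_model_error(self, error):
        self.recent_errors.append(error)
        self.recent_errors = self.recent_errors[-self.window:]

    def model_is_trusted(self):
        if len(self.recent_errors) < self.window:
            return False  # not enough evidence about the model yet
        avg = sum(self.recent_errors) / len(self.recent_errors)
        return avg < self.error_threshold

    def choose_mode(self, epsilon=0.1):
        # Keep exploring occasionally even after the model is trusted.
        if random.random() < epsilon:
            return "explore"
        return "plan" if self.model_is_trusted() else "model_free"
```

The gradual shift falls out naturally: early on the error window is empty, so the controller stays model-free; as the model's predictions stabilize, planning takes over.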

On-Policy vs. Off-Policy Learning

On-Policy: The agent learns only from transformations it executes itself, which is simple but makes inefficient use of costly pipeline runs.

Off-Policy (Recommended): The agent can learn from transformations performed by other systems or human experts — learn from existing ETL pipelines without replicating them exactly, incorporate human demonstrations, and make more efficient use of available data.


State and Action Representation

Multi-Resolution State Representation

A multi-level representation of data schemas:

  1. Schema-level: Graph representation of tables/fields, type distributions, completeness and quality metrics
  2. Field-level: Name, detected type, semantic type, constraints, distribution summary, embeddings
  3. Attention mechanism: To focus on relevant parts of large schemas using relevance scoring against the current goal
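The three levels above can be sketched with plain dataclasses. This is an illustrative sketch: the field attributes, the keyword-overlap "attention" score, and the `top_k` cutoff are assumptions standing in for the embeddings and relevance scoring the spec describes.

```python
from dataclasses import dataclass, field

@dataclass
class FieldState:
    # Field-level resolution: what the agent sees about one column.
    name: str
    detected_type: str        # e.g. "string", "int"
    semantic_type: str = ""   # e.g. "email", "timestamp" (assumed labels)
    completeness: float = 1.0 # fraction of non-null values

@dataclass
class SchemaState:
    # Schema-level resolution: the fields plus aggregate quality metrics.
    fields: list = field(default_factory=list)

    def completeness(self):
        if not self.fields:
            return 0.0
        return sum(f.completeness for f in self.fields) / len(self.fields)

    def attend(self, goal_terms, top_k=3):
        # Toy attention: score each field by keyword overlap with the
        # goal, returning only the most relevant fields of a large schema.
        def score(f):
            terms = {f.name.lower(), f.detected_type, f.semantic_type}
            return len(terms & set(goal_terms))
        return sorted(self.fields, key=score, reverse=True)[:top_k]
```

In a real system the `score` function would compare embeddings against the current goal rather than matching literal keywords.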

Hierarchical Action Space

Transformation operations organized by level of abstraction:

  1. High-level: Schema restructuring, type conversion strategies, entity resolution
  2. Mid-level: Table operations (join, split, pivot), column operations (rename, combine, derive), row operations (filter, deduplicate)
  3. Low-level: Tool selection (SQL, jq, Python), parameter configuration, validation

Compositional actions allow the agent to construct complex transformations from primitive operations while maintaining a manageable action space:

Python
action = {
    "tool": selected_tool,        # e.g. SQL, jq, Python
    "operation": operation_type,  # e.g. "join", "filter"
    "parameters": {param1: value1, param2: value2},
    "target": {"fields": [...], "constraints": [...]},
}

Value Functions and Reward Design

Multi-Objective Reward Function

Data transformation quality has multiple dimensions that must be balanced:

  1. Schema Alignment — how closely the current schema matches the target
  2. Data Integrity — preservation of relationships and constraints
  3. Information Preservation — minimizing information loss
  4. Efficiency — computational resources required
  5. Simplicity — complexity of the transformation pipeline
Text
R(s,a,s') = w₁·schema_similarity(s',target) + w₂·data_integrity(s') + w₃·information_retention(s,s') + w₄·efficiency_score(a) + w₅·simplicity_score(transformation_history)

Where weights w₁–w₅ can be learned from expert demonstrations or adjusted based on specific use cases.
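A minimal implementation of the reward formula above, assuming each component metric has already been computed and normalized to [0, 1]. The default weights are placeholders; as noted, in practice they would be learned from demonstrations or tuned per use case.

```python
def transformation_reward(metrics, weights=None):
    """Weighted sum matching R(s,a,s'): each key corresponds to one
    term (schema similarity, integrity, information retention,
    efficiency, simplicity)."""
    weights = weights or {
        "schema_similarity": 0.4,     # w1: progress toward target schema
        "data_integrity": 0.2,        # w2: constraints preserved
        "information_retention": 0.2, # w3: information loss avoided
        "efficiency": 0.1,            # w4: cheap to execute
        "simplicity": 0.1,            # w5: short, legible pipeline
    }
    return sum(weights[k] * metrics[k] for k in weights)
```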


Learning Strategies

Curriculum Learning for Progressive Mastery

Organize training to build capabilities incrementally:

  1. Stage 1: Simple transformations — field renaming, basic type conversions, simple filtering
  2. Stage 2: Structural transformations — flattening nested structures, joining, grouping
  3. Stage 3: Complex semantic transformations — entity resolution, semantic type inference
  4. Stage 4: Multi-step transformation planning — long-horizon planning, optimization, error recovery
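The stage progression above can be driven by a simple rule: advance once the rolling success rate on the current stage clears a threshold. The threshold, window size, and stage names here are illustrative choices, not values from the spec.

```python
class Curriculum:
    """Advance through transformation stages as the agent masters each
    one, measured by success rate over a rolling window of episodes."""

    STAGES = ["simple", "structural", "semantic", "multi_step"]

    def __init__(self, threshold=0.8, window=20):
        self.stage_idx = 0
        self.threshold = threshold
        self.window = window
        self.results = []

    @property
    def stage(self):
        return self.STAGES[self.stage_idx]

    def record(self, success):
        self.results.append(bool(success))
        self.results = self.results[-self.window:]
        full = len(self.results) == self.window
        if full and sum(self.results) / self.window >= self.threshold:
            if self.stage_idx < len(self.STAGES) - 1:
                self.stage_idx += 1
                self.results = []  # fresh evidence for the new stage
```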

Imitation Learning from Human Demonstrations

Before full RL training, bootstrap with human demonstrations: record expert data engineers, mine existing ETL pipelines, generate synthetic examples. Then apply behavioral cloning and inverse reinforcement learning to infer reward functions from demonstrations.
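In its simplest form, behavioral cloning can be tabular: memorize the most frequent expert action per observed state. This sketch assumes states can be reduced to hashable keys; a real system would replace the lookup table with a learned model that generalizes to unseen states.

```python
from collections import Counter, defaultdict

def behavioral_clone(demonstrations):
    """Build a lookup policy from (state_key, action) pairs mined from
    expert demonstrations or existing ETL pipelines: for each state,
    pick the action the experts chose most often."""
    by_state = defaultdict(Counter)
    for state_key, action in demonstrations:
        by_state[state_key][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in by_state.items()}
```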


Temporal Difference Learning for Progressive Transformation

Traditional ETL approaches evaluate success only after complete pipeline execution. TD learning enables incremental learning from partial progress:

Python
def update_value_estimate(Q, state, action, reward, next_state, actions,
                          alpha=0.1, gamma=0.99):
    # One-step Q-learning: move the estimate toward the TD target,
    # rewarding partial progress without waiting for pipeline completion.
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

Credit Assignment with Eligibility Traces

TD(λ) properly credits early transformations that enable later success, reflecting the observation that early decisions carry the most weight in redirecting the transformation path:

Python
def td_lambda_update(Q, E, state, action, reward, next_state, next_action,
                     alpha=0.1, gamma=0.99, lam=0.9):
    # SARSA(λ): compute the TD error, bump the eligibility trace for the
    # current pair, then propagate the error to every visited pair in
    # proportion to its (decaying) trace.
    delta = reward + gamma * Q[(next_state, next_action)] - Q[(state, action)]
    E[(state, action)] += 1
    for key in E:
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam

Options Framework for Complex Transformations

Temporal Abstraction

The Options framework learns reusable transformation sequences:

Text
option = {
    "initiation_condition": predicate_on_data_state,
    "termination_condition": predicate_on_data_state,
    "policy": mapping_from_states_to_actions
}

Example options: flatten_nested_json(target_field), normalize_denormalized_table(table, keys), standardize_date_formats(fields), resolve_entity_duplicates(entity_type).

This addresses the hierarchical nature of data transformations, where high-level goals (like "normalize this table") comprise multiple atomic operations.
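A minimal executable version of the option structure above. The `flatten_nested_json` example option, its state dict, and the `unnest_one_level` primitive action are all hypothetical illustrations of how initiation and termination predicates bound an option's execution.

```python
class Option:
    """An option runs its own policy from states where it can initiate
    until its termination predicate fires, packaging a multi-step
    transformation behind a single high-level action."""

    def __init__(self, name, can_initiate, should_terminate, policy):
        self.name = name
        self.can_initiate = can_initiate
        self.should_terminate = should_terminate
        self.policy = policy

    def run(self, state, step, max_steps=100):
        # `step` applies one primitive action and returns the next state.
        if not self.can_initiate(state):
            raise ValueError(f"{self.name} cannot start in this state")
        for _ in range(max_steps):
            if self.should_terminate(state):
                break
            state = step(state, self.policy(state))
        return state

# Hypothetical option: flatten one level of nesting per primitive step.
flatten_nested = Option(
    name="flatten_nested_json",
    can_initiate=lambda s: s["nesting_depth"] > 1,
    should_terminate=lambda s: s["nesting_depth"] <= 1,
    policy=lambda s: "unnest_one_level",
)
```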


Planning and Model-Based Approaches

Monte Carlo Tree Search for Transformation Planning

MCTS efficiently explores the space of possible transformation sequences:

  1. Selection: Navigate tree using UCB selection to balance exploration/exploitation
  2. Expansion: Add new transformation nodes to the search tree
  3. Simulation: Use model to simulate outcomes of transformation sequences
  4. Backpropagation: Update value estimates based on simulated outcomes
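The selection and backpropagation steps above can be sketched with standard UCB1, assuming tree nodes are plain dicts carrying `visits` and `value` counters (an illustrative representation, not the spec's).

```python
import math

def ucb_score(node, parent_visits, c=1.4):
    """UCB1 for the selection step: exploit nodes with high average
    value, but boost rarely-visited transformation nodes."""
    if node["visits"] == 0:
        return float("inf")  # always try unvisited children first
    exploit = node["value"] / node["visits"]
    explore = c * math.sqrt(math.log(parent_visits) / node["visits"])
    return exploit + explore

def select_child(children, parent_visits):
    return max(children, key=lambda n: ucb_score(n, parent_visits))

def backpropagate(path, reward):
    # Step 4: update value estimates along the selected path using the
    # simulated outcome of the transformation sequence.
    for node in path:
        node["visits"] += 1
        node["value"] += reward
```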

Novel Ideas

Schema Entropy Reduction

Frame data transformation as progressively reducing entropy (disorder) in the data schema. Measure "schema entropy" as a function of type inconsistencies, structural irregularities, and semantic ambiguities. Each transformation step aims to reduce this entropy — selecting the transformation that maximizes entropy reduction at each step.
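One concrete way to score a single dimension of this: Shannon entropy over the types observed in a field. This is a toy instance of the idea; a full schema-entropy measure would also fold in structural irregularities and semantic ambiguities.

```python
import math
from collections import Counter

def schema_entropy(observed_types):
    """Entropy of a field's type distribution: a clean field with one
    consistent type scores 0 bits; mixed types score higher. A greedy
    transformation step would pick the action that reduces this most."""
    counts = Counter(observed_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```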

Inverse Transformation Learning

Instead of just learning to transform data forward, learn to infer what transformations might have produced a given data structure — enabling "reverse engineering" of data pipelines.

Curiosity-Driven Data Exploration

Implement intrinsic motivation to encourage discovery of unexpected data patterns. Reward the system for discovering surprising patterns or anomalies in the data — larger surprise means higher curiosity reward.
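A common way to operationalize "surprise" is prediction error: the intrinsic reward grows with the gap between what the agent's model predicted about the data and what it actually observed. The vector-of-statistics representation and the scaling factor here are assumptions for illustration.

```python
def curiosity_bonus(predicted, observed, scale=1.0):
    """Intrinsic reward: mean squared error between predicted and
    observed field statistics. Larger surprise yields a larger bonus,
    pushing the agent to probe unfamiliar parts of the data."""
    err = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return scale * err / len(predicted)
```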

Meta-Learning for Transformation Strategy

Train a meta-model that predicts which transformation strategies will work best for different types of data — learning to adapt the transformation approach itself based on data characteristics.


Theoretical Connections

Transformation as State-Space Navigation

The process of transforming data can be framed as navigating through a state space where states are different possible representations of the data, actions are transformation operations, and reward measures progress toward the target. This reveals why many transformation problems are challenging — the state space is vast, partially observable, and has complex transition dynamics.

The Temporal Credit Assignment Problem in ETL

The classic temporal credit assignment problem in RL has a perfect analog in data transformation: was the successful result due to the early data cleaning step, the clever join operation in the middle, or the type conversion at the end? Eligibility traces provide a mechanism for properly attributing credit across multi-step transformations.

Practical "Aha Moments"

  1. Data Sampling is Like Experience Replay — strategically sampling subsets of data for transformation planning is analogous to experience replay in deep RL
  2. Pattern Libraries are Value Function Approximators — a library of transformation patterns with success rates is essentially a value function approximator
  3. Human Feedback as Reward Shaping — human input guides the system toward promising areas of the transformation space without specifying exact steps
  4. LLMs Already Implement Policy Networks — LLMs trained on code and data transformations already implement a kind of policy network
  5. The Target Network Concept Applied to Schema Evolution — using a separate "target network" for evaluating transformations can improve stability when transforming rapidly evolving schemas
← Back to main article: RISC won: Building towards Data AGI