
Appendix B: Reinforcement Learning for Data Transformation

Hrishi Olickel·February 26, 2026
  • The Data Transformation Problem
  • Core RL Frameworks
  • Model-Free vs. Model-Based Approaches
  • On-Policy vs. Off-Policy Learning
  • State and Action Representation
  • Multi-Resolution State Representation
  • Hierarchical Action Space
  • Value Functions and Reward Design
  • Multi-Objective Reward Function
  • Learning Strategies
  • Curriculum Learning for Progressive Mastery
  • Imitation Learning from Human Demonstrations
  • Temporal Difference Learning for Progressive Transformation
  • Credit Assignment with Eligibility Traces
  • Options Framework for Complex Transformations
  • Temporal Abstraction
  • Planning and Model-Based Approaches
  • Monte Carlo Tree Search for Transformation Planning
  • Novel Ideas
  • Schema Entropy Reduction
  • Inverse Transformation Learning
  • Curiosity-Driven Data Exploration
  • Meta-Learning for Transformation Strategy
  • Theoretical Connections
  • Transformation as State-Space Navigation
  • The Temporal Credit Assignment Problem in ETL
  • Practical "Aha Moments"
Note

This spec was written in Q2 2025, when we were exploring RL as a framework for teaching agents to perform data transformations. We abandoned the approach when we abandoned the idea of a singular intermediate format — but the thinking here about state representation, temporal credit assignment, and the options framework for temporal abstraction fed directly into how hanks are structured today. See the parent article for context.

The Data Transformation Problem

Data transformation is traditionally approached as a deterministic engineering problem: define mappings, apply transformations, validate results. The agentic approach fundamentally reframes this as a sequential decision-making problem under uncertainty — precisely the domain where reinforcement learning excels.

RL concepts and their data transformation analogs:

  • State: Current schema/data structure representation
  • Action: Transformation operations (filter, join, reshape, etc.)
  • Reward: Progress toward target schema, data quality metrics
  • Policy: Strategy for selecting transformations
  • Value Function: Estimate of how promising a partial transformation is
  • Environment: Data systems, APIs, and transformation tools

Core RL Frameworks

Model-Free vs. Model-Based Approaches

Model-Free: The agent learns directly from experience without building an explicit model of transformation outcomes.

Model-Based: The agent builds a model of how transformations affect data structures, then uses this model to plan transformation sequences.

Hybrid Approach (Recommended): Combine model-based planning with model-free learning — use model-free methods initially while building transformation knowledge, gradually shift to model-based planning as confidence in predictions increases, and maintain exploration to discover new transformation patterns.
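The switching logic in the hybrid approach can be sketched as a small controller that trusts the learned model only once its recent prediction error is low. This is a minimal sketch; the error threshold, window size, and mode names are illustrative assumptions, not part of the spec.

```python
import random

class HybridController:
    """Act model-free until the learned transition model becomes
    reliable, then prefer model-based planning, while keeping a
    small exploration budget throughout."""

    def __init__(self, error_threshold=0.1, window=50):
        self.error_threshold = error_threshold
        self.window = window
        self.recent_errors = []  # per-step model prediction errors

    def record_model_error(self, error):
        self.recent_errors.append(error)
        self.recent_errors = self.recent_errors[-self.window:]

    def model_is_trusted(self):
        if len(self.recent_errors) < self.window:
            return False  # not enough evidence about the model yet
        avg = sum(self.recent_errors) / len(self.recent_errors)
        return avg < self.error_threshold

    def choose_mode(self, epsilon=0.1):
        # Keep exploring occasionally even after the model is trusted.
        if random.random() < epsilon:
            return "explore"
        return "plan" if self.model_is_trusted() else "model_free"
```

The gradual shift falls out naturally: early on the error window is empty, so the controller stays model-free; as the model's predictions stabilize, planning takes over.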

On-Policy vs. Off-Policy Learning

On-Policy: The agent learns only from transformations it executes itself, which is simple but makes inefficient use of costly pipeline runs.

Off-Policy (Recommended): The agent can learn from transformations performed by other systems or human experts — learn from existing ETL pipelines without replicating them exactly, incorporate human demonstrations, and make more efficient use of available data.


State and Action Representation

Multi-Resolution State Representation

A multi-level representation of data schemas:

  1. Schema-level: Graph representation of tables/fields, type distributions, completeness and quality metrics
  2. Field-level: Name, detected type, semantic type, constraints, distribution summary, embeddings
  3. Attention mechanism: To focus on relevant parts of large schemas using relevance scoring against the current goal
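The three levels above can be sketched with plain dataclasses. This is an illustrative sketch: the field attributes, the keyword-overlap "attention" score, and the `top_k` cutoff are assumptions standing in for the embeddings and relevance scoring the spec describes.

```python
from dataclasses import dataclass, field

@dataclass
class FieldState:
    # Field-level resolution: what the agent sees about one column.
    name: str
    detected_type: str        # e.g. "string", "int"
    semantic_type: str = ""   # e.g. "email", "timestamp" (assumed labels)
    completeness: float = 1.0 # fraction of non-null values

@dataclass
class SchemaState:
    # Schema-level resolution: the fields plus aggregate quality metrics.
    fields: list = field(default_factory=list)

    def completeness(self):
        if not self.fields:
            return 0.0
        return sum(f.completeness for f in self.fields) / len(self.fields)

    def attend(self, goal_terms, top_k=3):
        # Toy attention: score each field by keyword overlap with the
        # goal, returning only the most relevant fields of a large schema.
        def score(f):
            terms = {f.name.lower(), f.detected_type, f.semantic_type}
            return len(terms & set(goal_terms))
        return sorted(self.fields, key=score, reverse=True)[:top_k]
```

In a real system the `score` function would compare embeddings against the current goal rather than matching literal keywords.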

Hierarchical Action Space

Transformation operations organized by level of abstraction:

  1. High-level: Schema restructuring, type conversion strategies, entity resolution
  2. Mid-level: Table operations (join, split, pivot), column operations (rename, combine, derive), row operations (filter, deduplicate)
  3. Low-level: Tool selection (SQL, jq, Python), parameter configuration, validation

Compositional actions allow the agent to construct complex transformations from primitive operations while maintaining a manageable action space:

Python
action = {
    "tool": selected_tool,        # e.g. SQL, jq, Python
    "operation": operation_type,  # e.g. "join", "filter"
    "parameters": {param1: value1, param2: value2},
    "target": {"fields": [...], "constraints": [...]},
}

Value Functions and Reward Design

Multi-Objective Reward Function

Data transformation quality has multiple dimensions that must be balanced:

  1. Schema Alignment — how closely the current schema matches the target
  2. Data Integrity — preservation of relationships and constraints
  3. Information Preservation — minimizing information loss
  4. Efficiency — computational resources required
  5. Simplicity — complexity of the transformation pipeline
Text
R(s,a,s') = w₁·schema_similarity(s',target) + w₂·data_integrity(s') + w₃·information_retention(s,s') + w₄·efficiency_score(a) + w₅·simplicity_score(transformation_history)

Where weights w₁–w₅ can be learned from expert demonstrations or adjusted based on specific use cases.
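A minimal implementation of the reward formula above, assuming each component metric has already been computed and normalized to [0, 1]. The default weights are placeholders; as noted, in practice they would be learned from demonstrations or tuned per use case.

```python
def transformation_reward(metrics, weights=None):
    """Weighted sum matching R(s,a,s'): each key corresponds to one
    term (schema similarity, integrity, information retention,
    efficiency, simplicity)."""
    weights = weights or {
        "schema_similarity": 0.4,     # w1: progress toward target schema
        "data_integrity": 0.2,        # w2: constraints preserved
        "information_retention": 0.2, # w3: information loss avoided
        "efficiency": 0.1,            # w4: cheap to execute
        "simplicity": 0.1,            # w5: short, legible pipeline
    }
    return sum(weights[k] * metrics[k] for k in weights)
```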


Learning Strategies

Curriculum Learning for Progressive Mastery

Organize training to build capabilities incrementally:

  1. Stage 1: Simple transformations — field renaming, basic type conversions, simple filtering
  2. Stage 2: Structural transformations — flattening nested structures, joining, grouping
  3. Stage 3: Complex semantic transformations — entity resolution, semantic type inference
  4. Stage 4: Multi-step transformation planning — long-horizon planning, optimization, error recovery
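The stage progression above can be driven by a simple rule: advance once the rolling success rate on the current stage clears a threshold. The threshold, window size, and stage names here are illustrative choices, not values from the spec.

```python
class Curriculum:
    """Advance through transformation stages as the agent masters each
    one, measured by success rate over a rolling window of episodes."""

    STAGES = ["simple", "structural", "semantic", "multi_step"]

    def __init__(self, threshold=0.8, window=20):
        self.stage_idx = 0
        self.threshold = threshold
        self.window = window
        self.results = []

    @property
    def stage(self):
        return self.STAGES[self.stage_idx]

    def record(self, success):
        self.results.append(bool(success))
        self.results = self.results[-self.window:]
        full = len(self.results) == self.window
        if full and sum(self.results) / self.window >= self.threshold:
            if self.stage_idx < len(self.STAGES) - 1:
                self.stage_idx += 1
                self.results = []  # fresh evidence for the new stage
```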

Imitation Learning from Human Demonstrations

Before full RL training, bootstrap with human demonstrations: record expert data engineers, mine existing ETL pipelines, generate synthetic examples. Then apply behavioral cloning and inverse reinforcement learning to infer reward functions from demonstrations.
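In its simplest form, behavioral cloning can be tabular: memorize the most frequent expert action per observed state. This sketch assumes states can be reduced to hashable keys; a real system would replace the lookup table with a learned model that generalizes to unseen states.

```python
from collections import Counter, defaultdict

def behavioral_clone(demonstrations):
    """Build a lookup policy from (state_key, action) pairs mined from
    expert demonstrations or existing ETL pipelines: for each state,
    pick the action the experts chose most often."""
    by_state = defaultdict(Counter)
    for state_key, action in demonstrations:
        by_state[state_key][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in by_state.items()}
```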


Temporal Difference Learning for Progressive Transformation

Traditional ETL approaches evaluate success only after complete pipeline execution. TD learning enables incremental learning from partial progress:

Python
def update_value_estimate(Q, state, action, reward, next_state, actions,
                          alpha=0.1, gamma=0.99):
    # One-step Q-learning: move the estimate toward the TD target,
    # rewarding partial progress without waiting for pipeline completion.
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

Credit Assignment with Eligibility Traces

TD(λ) properly credits early transformations that enable later success, reflecting the observation that early decisions carry the most weight in redirecting the transformation path:

Python
def td_lambda_update(Q, E, state, action, reward, next_state, next_action,
                     alpha=0.1, gamma=0.99, lam=0.9):
    # SARSA(λ): compute the TD error, bump the eligibility trace for the
    # current pair, then propagate the error to every visited pair in
    # proportion to its (decaying) trace.
    delta = reward + gamma * Q[(next_state, next_action)] - Q[(state, action)]
    E[(state, action)] += 1
    for key in E:
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam

Options Framework for Complex Transformations

Temporal Abstraction

The Options framework learns reusable transformation sequences:

Text
option = {
    "initiation_condition": predicate_on_data_state,
    "termination_condition": predicate_on_data_state,
    "policy": mapping_from_states_to_actions
}

Example options: flatten_nested_json(target_field), normalize_denormalized_table(table, keys), standardize_date_formats(fields), resolve_entity_duplicates(entity_type).

This addresses the hierarchical nature of data transformations, where high-level goals (like "normalize this table") comprise multiple atomic operations.
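A minimal executable version of the option structure above. The `flatten_nested_json` example option, its state dict, and the `unnest_one_level` primitive action are all hypothetical illustrations of how initiation and termination predicates bound an option's execution.

```python
class Option:
    """An option runs its own policy from states where it can initiate
    until its termination predicate fires, packaging a multi-step
    transformation behind a single high-level action."""

    def __init__(self, name, can_initiate, should_terminate, policy):
        self.name = name
        self.can_initiate = can_initiate
        self.should_terminate = should_terminate
        self.policy = policy

    def run(self, state, step, max_steps=100):
        # `step` applies one primitive action and returns the next state.
        if not self.can_initiate(state):
            raise ValueError(f"{self.name} cannot start in this state")
        for _ in range(max_steps):
            if self.should_terminate(state):
                break
            state = step(state, self.policy(state))
        return state

# Hypothetical option: flatten one level of nesting per primitive step.
flatten_nested = Option(
    name="flatten_nested_json",
    can_initiate=lambda s: s["nesting_depth"] > 1,
    should_terminate=lambda s: s["nesting_depth"] <= 1,
    policy=lambda s: "unnest_one_level",
)
```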


Planning and Model-Based Approaches

Monte Carlo Tree Search for Transformation Planning

MCTS efficiently explores the space of possible transformation sequences:

  1. Selection: Navigate tree using UCB selection to balance exploration/exploitation
  2. Expansion: Add new transformation nodes to the search tree
  3. Simulation: Use model to simulate outcomes of transformation sequences
  4. Backpropagation: Update value estimates based on simulated outcomes
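The selection and backpropagation steps above can be sketched with standard UCB1, assuming tree nodes are plain dicts carrying `visits` and `value` counters (an illustrative representation, not the spec's).

```python
import math

def ucb_score(node, parent_visits, c=1.4):
    """UCB1 for the selection step: exploit nodes with high average
    value, but boost rarely-visited transformation nodes."""
    if node["visits"] == 0:
        return float("inf")  # always try unvisited children first
    exploit = node["value"] / node["visits"]
    explore = c * math.sqrt(math.log(parent_visits) / node["visits"])
    return exploit + explore

def select_child(children, parent_visits):
    return max(children, key=lambda n: ucb_score(n, parent_visits))

def backpropagate(path, reward):
    # Step 4: update value estimates along the selected path using the
    # simulated outcome of the transformation sequence.
    for node in path:
        node["visits"] += 1
        node["value"] += reward
```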

Novel Ideas

Schema Entropy Reduction

Frame data transformation as progressively reducing entropy (disorder) in the data schema. Measure "schema entropy" as a function of type inconsistencies, structural irregularities, and semantic ambiguities. Each transformation step aims to reduce this entropy — selecting the transformation that maximizes entropy reduction at each step.
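One concrete way to score a single dimension of this: Shannon entropy over the types observed in a field. This is a toy instance of the idea; a full schema-entropy measure would also fold in structural irregularities and semantic ambiguities.

```python
import math
from collections import Counter

def schema_entropy(observed_types):
    """Entropy of a field's type distribution: a clean field with one
    consistent type scores 0 bits; mixed types score higher. A greedy
    transformation step would pick the action that reduces this most."""
    counts = Counter(observed_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```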

Inverse Transformation Learning

Instead of just learning to transform data forward, learn to infer what transformations might have produced a given data structure — enabling "reverse engineering" of data pipelines.

Curiosity-Driven Data Exploration

Implement intrinsic motivation to encourage discovery of unexpected data patterns. Reward the system for discovering surprising patterns or anomalies in the data — larger surprise means higher curiosity reward.
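A common way to operationalize "surprise" is prediction error: the intrinsic reward grows with the gap between what the agent's model predicted about the data and what it actually observed. The vector-of-statistics representation and the scaling factor here are assumptions for illustration.

```python
def curiosity_bonus(predicted, observed, scale=1.0):
    """Intrinsic reward: mean squared error between predicted and
    observed field statistics. Larger surprise yields a larger bonus,
    pushing the agent to probe unfamiliar parts of the data."""
    err = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return scale * err / len(predicted)
```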

Meta-Learning for Transformation Strategy

Train a meta-model that predicts which transformation strategies will work best for different types of data — learning to adapt the transformation approach itself based on data characteristics.


Theoretical Connections

Transformation as State-Space Navigation

The process of transforming data can be framed as navigating through a state space where states are different possible representations of the data, actions are transformation operations, and reward measures progress toward the target. This reveals why many transformation problems are challenging — the state space is vast, partially observable, and has complex transition dynamics.

The Temporal Credit Assignment Problem in ETL

The classic temporal credit assignment problem in RL has a perfect analog in data transformation: was the successful result due to the early data cleaning step, the clever join operation in the middle, or the type conversion at the end? Eligibility traces provide a mechanism for properly attributing credit across multi-step transformations.

Practical "Aha Moments"

  1. Data Sampling is Like Experience Replay — strategically sampling subsets of data for transformation planning is analogous to experience replay in deep RL
  2. Pattern Libraries are Value Function Approximators — a library of transformation patterns with success rates is essentially a value function approximator
  3. Human Feedback as Reward Shaping — human input guides the system toward promising areas of the transformation space without specifying exact steps
  4. LLMs Already Implement Policy Networks — LLMs trained on code and data transformations already implement a kind of policy network
  5. The Target Network Concept Applied to Schema Evolution — using a separate "target network" for evaluating transformations can improve stability when transforming rapidly evolving schemas
← Back to main article: RISC won: Building towards Data AGI