Your Training Data Is Going Stale. Composition-RL Shows What to Do About It.
There is a problem hiding inside every reinforcement learning pipeline that trains language models on verifiable tasks. The training data becomes useless faster than anyone expects.
A paper accepted at ICML 2026 --- “Composition-RL” by Xu, Bai, Yang et al. --- documents this precisely. On the MATH12K dataset, within 250 training steps, the proportion of “solve_all” prompts (where every rollout produces the correct answer) rises from approximately 0% to over 75%. The model masters three-quarters of its training data before training is meaningfully underway.
This means the effective dataset collapses from 12,000 prompts to roughly 3,000. Not because the data was bad. Because the model got good. The better your training works, the faster your data stops teaching.
If you are spending money on data collection, labeling, and curation for RLVR training, this is the first number you should internalize.
The Standard Response Is Wrong
The obvious reaction is to collect more data. Harder problems. Broader domains. New benchmarks. This is how most organizations think about the problem: the model outgrew the data, so get more data.
The Composition-RL authors propose something different. Instead of acquiring new problems, they compose existing ones.
Sequential Prompt Composition (SPC) takes two problems the model has already solved individually and chains them together. To answer the second, the model must first solve the first. The answer to problem A becomes an input to problem B. If you get A wrong, you get B wrong. If you get both right, the final answer is verifiable because both component answers were verifiable.
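The chaining described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the template format, the `compose` helper, and the toy problems are all assumptions, chosen only to show how a composed answer stays verifiable.

```python
# Illustrative sketch of sequential composition (assumed API, not the paper's).
# Each problem is a prompt with a verifiable numeric answer; problem B takes
# A's answer as its input, so a wrong A forces a wrong B.

def compose(problem_a, problem_b):
    """Chain problem A into problem B; B's input is A's (unstated) answer."""
    prompt = (
        f"Step 1: {problem_a['prompt']}\n"
        f"Step 2: Let x be your answer to Step 1. "
        f"{problem_b['prompt'].replace('N', 'x')}\n"
        "Report only the final answer to Step 2."
    )
    # The composed answer is verifiable because both component answers are.
    answer = problem_b['solve'](problem_a['answer'])
    return {'prompt': prompt, 'answer': answer}

# Hypothetical toy problems with clean, checkable answers.
a = {'prompt': 'What is 7 * 6?', 'answer': 42}
b = {'prompt': 'What is N + 8?', 'solve': lambda n: n + 8}

composed = compose(a, b)
print(composed['answer'])  # 50
```

The point of the structure, not the arithmetic: the composed ground truth is derived mechanically from the two verified components, so no new labeling is needed.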
The effect is immediate and measurable. The solve_all rate drops from 81.5% to 41.4%. The effective training surface roughly doubles --- from existing data, with no new collection, no new labeling, and no new privacy exposure.
From 12,000 original prompts, the method generates 199,000 composed variants.
Why This Matters Beyond the Lab
The composition technique is elegant, but the strategic insight is what deserves attention. Here are the implications.
Data efficiency changes the economics of model training
Most organizations approaching AI training think in terms of three levers: more data, more compute, bigger models. Composition-RL demonstrates a fourth lever --- smarter data recycling --- that operates orthogonally to the other three.
This is not a marginal improvement. Doubling the effective training surface from existing data changes the cost structure of every training run. For enterprises that have invested in curating proprietary datasets, this means the asset they already own may be significantly underutilized.
Smaller models with better training strategy can beat larger models with naive training
The paper’s most provocative result: a 4-billion-parameter model trained with curriculum-depth Composition-RL achieves 37.9% on AIME24. For context, an 8B model trained with Beyond-80/20 achieves 34.6%. An 8B model trained with Alpha-RL achieves 28.3%.
A model half the size outperforming models twice its parameter count.
This is not a controlled comparison --- the models differ in architecture, training data, and algorithms, not just size. The authors of the paper do not claim otherwise. But the directional finding reinforces a pattern we see across the industry: training methodology is increasingly the differentiator, not raw parameter count.
For enterprises making build-versus-buy decisions about model training, this reframes the question. The answer may not be “get a bigger model.” It may be “train the model you have more intelligently.”
Cross-domain composition unlocks transfer that naive mixing destroys
Here is a result that should make anyone mixing training datasets pause.
Standard reinforcement learning with mixed physics and math data degrades math performance by 1.4%. The domains interfere with each other. This is well-known enough that most practitioners avoid naive data mixing.
But composing physics and math problems together --- chaining a physics problem into a math problem so solving one requires solving the other --- yields a 9.1% improvement on AIME24. Same data. Same domains. Different structure. Opposite result.
The difference between mixing and composing is the difference between dumping ingredients into a bowl and actually cooking. The structure of how data is presented to the model matters at least as much as what data is presented.
Composed prompts create implicit process supervision
This is the finding I find most interesting for governance.
Training on composed prompts forces the model to get intermediate steps right, even though the reward signal only checks the final answer. If problem B depends on the output of problem A, the model cannot get the right final answer without getting A right first. The intermediate correctness is enforced by the structure of the prompt, not by explicit step-by-step labels.
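A toy example makes the structural point concrete. Everything here is an illustrative assumption: the reward checks only the final answer, but because the second step is an injective function of the first step's output, a correct final answer implies the intermediate answer was right. (Real composed problems need not be injective, so the guarantee in practice is weaker than this toy suggests.)

```python
# Toy illustration (assumed setup, not the paper's): outcome-only reward
# that nonetheless constrains the intermediate step via composition.

def reward(final_answer, expected):
    """Outcome reward: checks only the final answer, never the steps."""
    return 1.0 if final_answer == expected else 0.0

step2 = lambda x: 3 * x + 1   # injective: distinct inputs give distinct outputs
expected_final = step2(42)     # ground truth built from step 1's answer, 42

# A trajectory with a wrong intermediate answer (41) cannot earn the reward:
print(reward(step2(41), expected_final))  # 0.0
print(reward(step2(42), expected_final))  # 1.0
```

The reward function never inspects the intermediate value; the prompt's structure does that work, which is exactly the "implicit process supervision" claim.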
This matters because process supervision --- verifying not just what a model concludes but how it reasons --- is one of the hardest problems in AI governance. Step-by-step labeling is expensive, slow, and does not scale. If compositional training achieves even a fraction of the auditability benefits of explicit process supervision, it changes the cost structure of building verifiable AI systems.
A caveat: the paper’s own evidence here is suggestive rather than definitive. The authors show that compositional training improves intermediate-step accuracy, but they do not demonstrate the mechanism. Alternative explanations, such as difficulty calibration or length-generalization effects, have not been ruled out. The signal is promising. The proof is incomplete.
Curriculum progression matters
Training with composed prompts of depth 1, then depth 2, then depth 3 outperforms jumping directly to depth 2 by 3.0%. The model needs to walk before it runs.
This finding is consistent with everything we know about human learning and increasingly about machine learning: graduated difficulty produces more robust capability than trial by fire. Organizations designing training pipelines should build in curriculum structure rather than presenting the hardest problems first.
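A depth curriculum is easy to express as a schedule. The sketch below is an assumption-laden illustration, not the paper's implementation: it takes "depth k" to mean a chain of k+1 component problems and simply emits batches in increasing depth order.

```python
# Sketch of a depth curriculum (illustrative; assumes depth k chains
# k+1 component problems, which may not match the paper's definition).
import random

def depth_curriculum(problems, steps_per_depth, max_depth=3):
    """Yield composed-prompt specs at depth 1, then 2, then 3."""
    for depth in range(1, max_depth + 1):
        for _ in range(steps_per_depth):
            # Pick depth+1 distinct problems to chain into one prompt.
            chain = random.sample(problems, depth + 1)
            yield {'depth': depth, 'chain': chain}

problems = list(range(10))  # stand-ins for verifiable prompts
schedule = list(depth_curriculum(problems, steps_per_depth=2))
print([s['depth'] for s in schedule])  # [1, 1, 2, 2, 3, 3]
```

The design choice worth noting is that difficulty is a property of the schedule, not the data: the same base problems serve every depth.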
What This Does Not Prove
Intellectual honesty requires noting what the paper does not demonstrate.
The data staleness problem is not new. DAPO, a prior method, already addresses the solve_all problem through Dynamic Sampling --- filtering out uninformative prompts during training. Composition-RL is a complementary approach, not a replacement for existing techniques. The paper positions itself against a problem that the field has already begun addressing through other means.
The method currently works for numeric, verifiable domains. Math and physics problems have clean right-or-wrong answers. Extension to tasks with subjective or non-numeric evaluation --- summarization, creative writing, open-ended reasoning --- has not been demonstrated. The verification mechanism that makes composition work depends on composability that many real-world tasks lack.
The cross-domain result uses overlapping domains. Physics and math are not distant domains. They share notation, reasoning patterns, and problem structures. Whether composition works across genuinely distant domains --- say, legal reasoning and code generation --- is an open question.
The 30B-A3B result has a weak baseline. The paper reports a 21.4% AIME24 improvement on a 30B-A3B model, but the baseline for that model is anomalously weak. The improvement is real but inflated by the starting point.
These caveats do not invalidate the work. They bound it. And bounding is what converts a research finding into actionable strategy.
The Governance Angle
For organizations deploying AI in regulated or high-stakes environments, Composition-RL points toward a model of training governance that is worth watching.
Data minimization. Composition generates harder training signal from existing data rather than requiring new data acquisition. In environments where data collection carries privacy, compliance, or cost burdens, this is directly valuable.
Verifiability by construction. Composed prompts create training problems where intermediate reasoning must be correct for the final answer to be correct. This is a structural property, not a post-hoc check. If this approach generalizes, it offers a path toward training processes that are auditable by design.
Smaller models, smaller risk surface. If a 4B model trained with compositional methods can match or exceed an 8B model trained without them, the operational implications are significant. Smaller models are cheaper to serve, faster at inference, easier to audit, and simpler to deploy in constrained environments. Every parameter you do not need is a parameter you do not have to govern.
None of these governance benefits are guaranteed by the paper alone. They are directions the research suggests. The distance between a conference paper and production deployment is real, and organizations should treat these as hypotheses worth testing, not conclusions worth betting on.
What to Do with This
If you train models: Evaluate whether your training data has a solve_all problem. If you are running RLVR on verifiable tasks, measure what percentage of your training prompts are fully solved at each checkpoint. If the number is above 50%, you are burning compute on data that teaches nothing. Composition is one approach to fixing this. Dynamic Sampling is another. Doing neither is the most expensive option.
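The audit itself is cheap to run. A minimal sketch, assuming you can plug in your own rollout function and answer checker (`generate` and `verify` below are hypothetical placeholders):

```python
# Sketch of a solve_all audit at a checkpoint: for each prompt, sample
# several rollouts and check whether every one hits the verified answer.
# `generate` and `verify` are placeholders for your rollout and checker.

def solve_all_rate(prompts, generate, verify, n_rollouts=8):
    """Fraction of prompts where every rollout is correct.

    Such prompts carry no gradient signal in GRPO-style training,
    since all rollouts in the group receive identical reward.
    """
    fully_solved = 0
    for p in prompts:
        rollouts = [generate(p['prompt']) for _ in range(n_rollouts)]
        if all(verify(r, p['answer']) for r in rollouts):
            fully_solved += 1
    return fully_solved / len(prompts)

# Toy demo: a "model" that always answers 4 fully solves only one prompt.
prompts = [{'prompt': '2+2', 'answer': 4}, {'prompt': '3+3', 'answer': 6}]
rate = solve_all_rate(prompts, generate=lambda q: 4, verify=lambda r, a: r == a)
print(rate)  # 0.5
```

Run this at every checkpoint and plot the trend; a rate climbing past 50% is the staleness signal the paper describes.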
If you buy models: Ask your vendors about training methodology, not just parameter count and benchmark scores. The Composition-RL results suggest that how a model was trained may matter more than how big it is. A well-trained small model may outperform a poorly trained large one --- and cost less to serve.
If you govern AI systems: Watch the implicit process supervision finding. If compositional training reliably produces models that get intermediate steps right without explicit step labels, it changes the cost curve for building auditable AI. This is early-stage evidence, not a proven technique. But it is the kind of structural approach to verification that governance frameworks should be designed to accommodate.
If you manage training budgets: The paper’s core insight is that your existing data is probably underutilized. Before investing in new data acquisition, investigate whether compositional or curriculum-based approaches can extract more training signal from what you already have. The ROI on smarter data recycling may exceed the ROI on new data collection.
Sources
- Xu, X., Bai, C., Yang, K., et al. “Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models.” ICML 2026. arXiv:2602.12036
- Yu, Q., et al. “DAPO: An Open-Source LLM Reinforcement Learning System.” 2024.
At Victorino Group, we help organizations design AI training and governance strategies that optimize for efficiency, auditability, and long-term value --- not just benchmark scores. If you are evaluating model training approaches or building governance frameworks for AI systems, reach out.