// Research / Custom Fine-tuning

Self-play RL for game-theoretic agents.

PPO + self-play + league play on Jaxpot, our open-source JAX-native RL framework. Vectorized environments running thousands of parallel games per GPU, snapshot leagues so the policy plays against past selves, ResNet/MLP function approximators, and Hydra-configured experiments. Built for board games, hidden-information games, and multi-agent decision problems where the optimal strategy emerges from self-play, not from demonstrations.

// What we see

When demonstrations don't exist, self-play is the only option that scales.

01

Imitation hits a ceiling on game-theoretic problems

SFT against expert play caps at the expert's level - and for adversarial games (Hex, Go variants, hidden-info card games, novel multi-agent decision problems) the expert pool is small or doesn't exist. Self-play sidesteps the data problem entirely: the agent generates its own training data by playing against itself or earlier versions of itself.

02

Generic RL frameworks aren't tuned for self-play throughput

RLlib and Stable-Baselines3 work, but their abstractions weren't designed around vectorized JAX rollouts and league management. Jaxpot runs thousands of parallel games per GPU on `pgx`-compatible environments and treats league play and archived-checkpoint evaluation as first-class concepts. Tic-tac-toe converges in minutes on CPU; Dark Hex 7×7 trains overnight on a single H100.

03

Reward shaping + checkpoint discipline is the actual engagement

PPO works. The harder problem is what to score, when to archive, when to swap opponents, and how to verify that the policy is genuinely improving rather than just specializing against its current self. ELO ladders against frozen checkpoints, win-rate vs random + scripted baselines, and held-out adversarial probes are the discipline that separates a converging run from a circular one.

// What we do

Three layers, all on top of Jaxpot.

Self-play training on our open-source JAX-native RL framework. PPO + league play + vectorized environments. Open code and reproducible Hydra configs - your engineering team can audit and extend the same stack we ship.

Self-play training loops

PPO with GAE, vectorized rollouts via JAX (2048+ parallel envs per GPU on a typical board game), ResNet or MLP function approximators chosen per environment. Self-play with snapshot opponents - periodically archive the current policy, sample from the archive as opponents during rollouts, and keep the policy from collapsing into a strategy that only beats its current self; a minimal sketch follows the list below.

  • PPO + self-play + GAE on the Jaxpot training loop
  • Vectorized pgx environments - 1000s of parallel games per GPU
  • ResNet (b6/c128 typical) or MLP backbones, configurable per env
  • Snapshot opponent archive with configurable archive cadence
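
A minimal sketch of that snapshot-opponent loop, assuming hypothetical `Policy` and `train_step` stand-ins rather than Jaxpot's actual API:

```python
import copy
import random

class Policy:
    """Placeholder for a ResNet/MLP policy; real training updates its parameters."""
    def __init__(self, version: int = 0):
        self.version = version

def train_step(policy: Policy, opponent: Policy) -> None:
    """Stand-in for one PPO update from vectorized self-play rollouts vs `opponent`."""
    policy.version += 1  # pretend the policy improved this iteration

ARCHIVE_EVERY = 50                   # archive cadence - configurable per experiment
policy = Policy()
archive = [copy.deepcopy(policy)]    # the untrained policy seeds the opponent pool

for iteration in range(1, 501):
    # Play against a sampled past self, not only the latest policy, so the agent
    # can't collapse into a strategy that merely beats its current self.
    opponent = random.choice(archive)
    train_step(policy, opponent)

    if iteration % ARCHIVE_EVERY == 0:
        archive.append(copy.deepcopy(policy))   # freeze a snapshot into the pool
```

Uniform sampling over past selves is the simplest cadence; the league layer below generalizes the sampling strategy.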

League play + multi-agent training

Multi-agent tournaments where multiple distinct policies train against each other concurrently - useful when a single self-play policy collapses to a narrow strategy. Configurable opponent sampling: latest checkpoint, random archived, or weighted by ELO (sketched after the list below). Built-in baseline evaluation against scripted and random opponents alongside the league.

  • League management with configurable opponent sampling strategies
  • Concurrent training of N policies with cross-policy rollouts
  • Population-based selection for non-transitive game structure
  • Automatic baseline eval (random, scripted, prior checkpoints)
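
A rough sketch of those sampling strategies over a plain list of frozen checkpoints and their ELO ratings (names are illustrative, not Jaxpot internals):

```python
import random

def sample_opponent(archive, elo, strategy="elo_weighted"):
    """archive: frozen checkpoints; elo: matching list of ELO ratings."""
    if strategy == "latest":
        return archive[-1]                         # always the newest snapshot
    if strategy == "random_archived":
        return random.choice(archive)              # uniform over past selves
    if strategy == "elo_weighted":
        # Stronger archived opponents get sampled more often; the 10**(R/400)
        # weighting is the usual ELO odds scale, one of several sensible choices.
        weights = [10 ** (r / 400) for r in elo]
        return random.choices(archive, weights=weights, k=1)[0]
    raise ValueError(f"unknown sampling strategy: {strategy}")

# Example: three snapshots, the strongest sampled most often.
opponent = sample_opponent(["ckpt_100", "ckpt_200", "ckpt_300"],
                           elo=[1000.0, 1080.0, 1150.0])
```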

Evaluation, ELO, reproducibility

Self-play loss curves don't tell you the policy is actually improving - you can train for a week and end up at a strategy that only beats its current self. ELO ladders against frozen checkpoints, win-rate vs scripted baselines, and held-out adversarial probes are the discipline; a sketch of the ladder arithmetic follows below. Every Jaxpot run logs to W&B / TensorBoard with hashed configs so the run is reproducible six months later.

  • ELO / TrueSkill ladders against frozen checkpoint pools
  • Win-rate vs random + scripted + archived adversaries
  • Held-out probes for non-transitive overfitting
  • Hydra config + W&B / TensorBoard / local - full reproducibility
Jaxpot on GitHub - open-source, audit-friendly
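
The ladder arithmetic behind that ELO check is ordinary ELO updating against a frozen pool - a sketch with example ratings and outcomes:

```python
def elo_update(rating, opponent_rating, score, k=16.0):
    """score: 1.0 win, 0.5 draw, 0.0 loss for the rated player."""
    expected = 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400))
    return rating + k * (score - expected)

# Only the live policy's rating moves; frozen checkpoints keep theirs fixed,
# so a rising rating reflects genuine progress rather than pool drift.
frozen_pool = {"ckpt_100": 1000.0, "ckpt_200": 1080.0, "ckpt_300": 1150.0}
results = {"ckpt_100": 1.0, "ckpt_200": 1.0, "ckpt_300": 0.5}   # example outcomes

policy_rating = 1150.0
for name, score in results.items():
    policy_rating = elo_update(policy_rating, frozen_pool[name], score)
```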

// Method fit

RL for agents fits when the problem is game-theoretic.

skip it if

  • You want to RL fine-tune an LLM

    RL fine-tuning of LLMs - DPO, KTO, GRPO, RLHF, RLAIF - is a different engagement entirely. The reward signals, infra (TRL, Unsloth, VeRL, vLLM rollouts), and failure modes (likelihood displacement, reward hacking on text) are distinct from self-play RL on game environments. We do that work, just on a different page.

    Preference Optimization (DPO / KTO / GRPO)
  • Your problem isn't multi-agent or game-theoretic

    Single-agent control on a fixed environment, prediction, classification - those are usually better solved with supervised learning, imitation learning from demonstrations, or offline RL. Self-play is the right tool when the difficulty comes from an opponent or from the agent's own past behavior, not from a fixed external environment.

  • You don't have a simulator (or can't afford to build one)

    Self-play needs a fast environment. Real-world data collection at the rates RL requires is usually impractical (slow, expensive, or unsafe). If you don't have a simulator and can't justify building one, the engagement is simulator-building first - happy to help, but it's a different scope.

  • You're at the prototype stage with no clear target

    Self-play training compounds value when there's a measurable target (beat the current best agent, hit ELO X, reach competence on N test scenarios). At the "explore what's possible" stage, prototyping in a Colab on toy environments is cheaper and faster than spinning up the full pipeline.

use it if

Your problem is game-theoretic, multi-agent, or has adversarial dynamics - board games, hidden-info card games, debate / negotiation, market making against other agents, robotic policies trained in self-play simulation.

You have a simulator or are willing to build one - pgx-compatible board games work out of the box on Jaxpot; custom environments need to expose roughly a `step / reset / observation` interface.
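
For orientation, the rough shape of that contract - the exact pgx signatures differ, so treat this as a conceptual sketch rather than a drop-in class:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    board: np.ndarray
    current_player: int
    terminated: bool
    rewards: np.ndarray      # one terminal reward per player

class MyGameEnv:
    """Two-player toy game exposing the step / reset / observation contract."""
    num_players = 2

    def reset(self, seed: int) -> State:
        """Return the initial state of a fresh game."""
        return State(board=np.zeros(9), current_player=0,
                     terminated=False, rewards=np.zeros(2))

    def step(self, state: State, action: int) -> State:
        """Apply `action` for the current player and return the next state,
        setting `terminated` and `rewards` when the game ends."""
        ...

    def observation(self, state: State, player: int) -> np.ndarray:
        """Player-specific view of the state - the place where hidden
        information (e.g. Dark Hex) gets masked out."""
        return state.board
```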

You want to ship on top of an open-source stack you can audit and extend - Jaxpot is permissively licensed, the code is yours to read, and we work in the open as much as possible.

// How we work

Environment first. Self-play next. League play and eval to close it out.

Every engagement starts with the environment and reward - the two things that decide whether self-play training will produce something useful. Then a baseline PPO + self-play run on Jaxpot. Then league play and evaluation discipline before handoff.

01

Environment, reward, and a working baseline

Wrap (or build) the environment in `pgx`-compatible form. Reward design - sparse terminal reward (win/lose), shaped intermediate signals where the math justifies them, adversarial probes hand-crafted before training so reward hacking surfaces early. Baseline scripted and random opponents available from day one for win-rate eval.
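
As a toy illustration of the reward and probe side (hypothetical function names, toy game):

```python
def terminal_reward(winner, player: int) -> float:
    """Sparse terminal reward: +1 for the winner, -1 for the loser, 0 for a draw."""
    if winner is None:
        return 0.0
    return 1.0 if winner == player else -1.0

def probe_blocks_forced_loss(chosen_action: int) -> bool:
    """Adversarial probe: in a hand-crafted position with exactly one defensive
    move, a healthy policy must pick it. Written before training starts so
    reward hacking or collapse shows up early in the eval logs."""
    FORCED_MOVE = 4       # hypothetical: the only move that blocks the loss
    return chosen_action == FORCED_MOVE
```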

02

PPO + self-play on Jaxpot

Hydra-configured experiment in Jaxpot's `config/experiment/<game>/<config>.yml` format. Vectorized JAX rollouts (1000s of parallel envs per GPU), ResNet/MLP backbone chosen per environment. Snapshot opponent archive with a configurable cadence so the policy plays against past selves. W&B + TensorBoard logging from the first iteration.
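
The entrypoint follows the standard Hydra pattern; module, field, and override names below are illustrative rather than Jaxpot's exact layout:

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="config", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))   # resolved config is hashed and logged per run
    # train(cfg)                    # e.g. cfg.num_envs, cfg.backbone, cfg.archive_every

if __name__ == "__main__":
    # Select an experiment file and override fields from the CLI, e.g.:
    #   python train.py +experiment=dark_hex/ppo_selfplay num_envs=4096
    main()
```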

03

League play, evaluation, handoff

League play if a single self-play policy collapses to a narrow strategy. ELO / TrueSkill ladder against archived checkpoints to verify genuine improvement. Win-rate vs scripted and random baselines as a sanity floor. Hand off the Jaxpot config + checkpoints + W&B project + a runbook so the team can re-run and extend the experiment as the environment evolves.


// Expert insight

Every successful self-play run we've shipped follows the same curve. First few hours: nothing. The reward is bouncing around the floor and the win-rate vs scripted baselines sits at chance. Then a few rollouts produce decent strategies - purely by chance - and the policy learns from them. Quality compounds exponentially. The unintuitive part isn't the algorithm. It's keeping the team's nerve through the warmup and trusting that exponentials start invisibly.

Karol Gawron

Head of R&D @ bards.ai

Jaxpot on GitHub

// Why bards.ai

Why us, instead of RLlib + a notebook.

Generic RL frameworks work for textbook problems. Self-play at production quality needs vectorized JAX environments, league management, and the eval discipline that separates a converging run from a circular one. We built Jaxpot because the existing tools didn't ship that combination - and we ship it open.

Jaxpot - our open-source RL framework

JAX-native, pgx-compatible environments, PPO + self-play + league play, Hydra configs, W&B/TensorBoard logging. Tic-tac-toe converges in minutes on CPU; Dark Hex 7×7 overnight on a single H100. Permissive license, public on GitHub at github.com/bards-ai/Jaxpot.

JAX-native vectorized environments

1000s of parallel games per GPU via `pgx`, with multi-GPU support out of the box. The rollout throughput that decides whether self-play converges in days or weeks.

League play + population-based training

Concurrent training of N policies, configurable opponent sampling (latest, random archived, ELO-weighted), automatic baseline eval. The patterns that prevent self-play collapse on non-transitive game structure.

ELO + held-out probe evaluation

Self-play training can look like it's improving while the policy circles. ELO ladders against frozen checkpoint pools, win-rate vs scripted opponents, and held-out adversarial probes catch the failure mode before it ships.

10+ peer-reviewed publications

CLARIN-PL spinoff. The team has reviewed RL papers in NeurIPS / ICML cycles and reproduced the ones worth reproducing - including the AlphaZero family and modern PPO variants.

We tell you when self-play is the wrong tool

Many decision-making problems aren't game-theoretic - imitation learning, behavioral cloning, or offline RL fits them better. We say so before you spend the GPU budget.

// FAQ

Common questions about RL for agents

How is Jaxpot different from RLlib, Stable-Baselines3, or Spinning Up?

RLlib and SB3 are general-purpose - they support a long list of algorithms, but their abstractions weren't designed around vectorized JAX rollouts and league management for self-play. Jaxpot is narrower (PPO + self-play + league play) but optimized for that combination: pgx-compatible vectorized envs, JAX throughput, snapshot archives, ELO eval as first-class. Spinning Up is for learning, not for shipping. If you want to extend RLlib to do what Jaxpot does, you can - we just shipped the result.

Which environments does Jaxpot support?

Anything implementing the `pgx.core.Env` interface - Go (9×9 demonstrated), Dark Hex (7×7 example), Tic-Tac-Toe (the tutorial), and the rest of pgx's library (chess, shogi, backgammon, various card games). Custom environments need to expose roughly a `step / reset / observation` interface - straightforward to add for any deterministic or stochastic-but-simulable game environment. For complex non-game environments (robotic sim, market sim), wrapping is part of the engagement.

How long does training take?

Tic-tac-toe converges in minutes on CPU with ~2,000 PPO iterations - the fastest sanity check we use. Dark Hex 7×7 trains overnight on a single H100 to a checkpoint that beats every scripted baseline. Go 9×9 is in the 1–3 day range on multi-GPU depending on the depth target. Production-quality strategy in larger games (Hex 11×11, complex multi-agent) takes serious clusters and 1–2 weeks of training time. The bottleneck is rollout throughput; vectorized JAX is the lever that matters.

How do you know the policy is actually improving?

Self-play loss curves are not a reliable signal - the agent can train for a week and end up at a strategy that only beats its current self. We track three things continuously. ELO / TrueSkill against a frozen pool of past checkpoints. Win-rate vs scripted baselines (random opponent, simple heuristics, prior production agent if applicable). And held-out adversarial probes - hand-crafted situations where a healthy agent should make a specific decision. Self-play that looks like it's improving but only against itself is the most common failure mode; these checks catch it.

Does self-play converge to optimal play?

For two-player zero-sum games with full information (Hex, Go, chess), the theory says self-play should converge to a Nash equilibrium given enough time and the right algorithm. In practice, modern PPO + self-play + league play gets within striking distance of that for the game sizes a realistic engagement covers. Non-transitive games (rock-paper-scissors-shaped dynamics) need population training to avoid cycles. Hidden-information games (Dark Hex, poker variants) need careful handling of belief states - the engagement is more involved, but the patterns are well-studied.

Can you work with a custom environment?

Yes - most engagements involve some custom environment work. Games already in pgx-compatible form are the easiest path; for non-game environments (robotic sim, market sim, scheduling, supply-chain) we adapt the rollout loop and reward hooks. The constraint is environment speed: if a single env step takes 100ms and the network forward is 1ms, you're rollout-bound and we'll spend time speeding the env up before we touch the algorithm.

If Jaxpot is open source, what am I paying for?

Jaxpot is open-source under a permissive license - free to use, audit, fork. The engagement is everything around it: environment design, reward shaping, training infra (multi-GPU, league management, eval discipline), debugging the runs that didn't converge, and handing off a reproducible Hydra config + checkpoint + runbook so your team can iterate after we leave. The code is the same code we ship in OSS; the engineering judgment around it is what you're paying for.

// Let's run it

Ship a self-play agent that actually wins.

Tell us your environment, your reward signal, and the target you'd accept. We'll come back with an algorithm choice (PPO + self-play, league play, or a different RL family entirely) and a number - usually within a business day.

Karol Gawron

Head of R&D @ bards.ai