Jaxpot: Train self-play RL agents FAST by parallelizing environments on GPU

Fast self-play RL with GPU environments - how we built Jaxpot for PPO, AlphaZero-style training, and imperfect-information games like Dark Hex.

Karol Kłusek

Apr 27, 2026 · 7 min read

Jaxpot: Train self-play RL agents FAST by parallelizing environments on GPU

We got hired to build a poker bot. We quickly realized we needed to run a ridiculous number of games. Around 1M+… so speeding up the process was worth it. That's why we choose Jax. For the training pipeline we needed self-play (PPO and AlphaZero style), league training, imperfect-information environments, and the ability to iterate fast without rewriting the training loop. So we built all of that. Now we're opensourcing it as Jaxpot.

In this post you will learn how to efficiently train agents on GPU like this:

here's the replay of selfplay between 1900 and 1800 iteration of the Dark Hex agent trained with Jaxpot.

Train your RL agent in Colab

We'll start with simple self-play agent. It takes about 90 seconds on a free Colab GPU. After that, you'll switch config and train on Dark Hex, a simple imperfect-information game.

Here's the notebook. Run the training pipeline and come back to read the rest!

Open Colab Notebook →

Code walkthrough

Below you can see four Hydra config files - this is the recipe for our experiment.

1Path("config/env/tic_tac_toe/default.yaml"),2Path("config/model/tic_tac_toe_mlp.yaml"),3Path("config/logger/wandb_public.yaml"),4Path("config/experiment/tic_tac_toe/colab.yaml"),

4 config files that defines our training

Let's first open the model definition from config/model/tic_tac_toe_mlp.yaml because it's a simple MLP (Multi Layer Perceptron) definition is very short! It's just 2 hidden layers of the shape 64.

1_target_: jaxpot.models.architectures.mlp.MLPModel2hidden_dims: [64, 64]

MLP model definition

Next let's inspect the config/experiment/tic_tac_toe/colab.yaml file. This is the tic tac toe experiment adjusted to fit on T4 GPU available for free on Colab. Let's analyze selected sections from this file.

1# ---- PPO trainer hyperparameters ----2trainer:3  num_epochs: 4 # PPO updates per batch4  batch_size: 2048 # Samples per optimization step; larger is smoother but slower5  auxiliary_losses: []6  clip_eps: 0.2 # PPO clipping strength; prevents too-large policy updates7  entropy_coeff: 0.001 # Final exploration bonus weight (lower = more greedy policy)8  entropy_coeff_start: 0.01 # Initial exploration bonus; to encourage trying new moves9  entropy_decay_iterations: 1000 # Number of iterations to decay exploration

PPO training parameters: rollout length, batch size, entropy schedule, and optimization settings.

1# ---- Core training setup ----2seed: 42 # Random seed for reproducibility.3lr: 1e-3 # Learning rate.4lr_schedule: "constant"5multi_gpu: false # Use one GPU/CPU process (simpler and Colab-friendly).6use_target_selfplay: false

Agent model training parameters

1# ---- Environment mix and rollout collection ----2selfplay_num_envs: 1024 # Parallel games against the current policy3random_num_envs: 1024 # Parallel games against random opponents for grounding4league_num_envs: 0 # No league opponents in this quickstart5archive_num_envs: 0 # No archived past policies sampled as opponents6random_warmup_iters: 2000 # Keep random-opponent mix for early stability7league_add_every: 0 # Never add models to a league (league disabled here)8base_unit: 64 # allocation chunk size for batched rollouts9num_steps: 9 # Rollout length per game (Tic-Tac-Toe max moves is 9)10total_iters: 1000 # Total train iterations (main training length)11grad_accum_steps: 1 # Gradient accumulation. Here: update every batch directly12gamma: 0.99 # Reward discount factor (how much future rewards matter)13gae_lambda: 0.95

Environment parameters: number of parallel environments and game-specific settings.

1# ---- Evaluation (faster settings for tutorial) ----2eval:3  - _target_: jaxpot.evaluator.random.RandomEvaluator4    eval_every: 250 # Run evaluation every 250 train iterations5    num_envs: 8192 # number of parallel envs6    num_steps: ${num_steps} # Use the same episode length as training7    deterministic: false # Sample actions stochastically during eval8    name: eval_vs_random # Metric name9  - _target_: jaxpot.evaluator.random.RandomEvaluator10    eval_every: 25011    num_envs: 819212    num_steps: ${num_steps}13    deterministic: true # Always take argmax action (no sampling noise)14    name: eval_vs_random_deterministic

Evaluation setup: how often we test the current agent and which opponents it plays against.

With the config ready we just need to log in to Weights & Biases to later inspect the training metrics. You can also use TensorBoard instead, but we prefer wandb.

!uv run wandb login

A short script that will trigger our training pipeline. You can see in the notebook that we will overwrite some of the experiment parameters here for the ease of use in the tutorial. Try changing these parameters and see how it impacts the training!

1TOTAL_ITERS = 1002SELFPLAY_NUM_ENVS = 10243RANDOM_NUM_ENVS = 10244!uv run python train_selfplay.py

Reading The Training Output

When the run starts, Jaxpot prints a TUI dashboard.

The dashboard is useful while the run is active, but curves are better for understanding training. You should see something like this after the run:

The model seems to have learned most of the information about the game during first 500 steps.

we can see it achieves 95% winrate against an opponent performing random moves. The draw rate changes due to the randomization.

We can see it achieves 95% winrate against an opponent performing random moves. The draw rate changes due to the randomization.

For Tic-Tac-Toe, a random opponent is very easy to exploit, so the win rate climbs quickly. It is a sanity check - the training loop works, and the evaluation pipeline is reporting.

You can start the training for Dark Hex from the same notebook and read the rest of article while it's training.

Jaxpot

It is built around three practical ideas:

PPO and AlphaZero-style training. PPO gives you a strong policy-gradient baseline for self-play. AlphaZero-style components are useful when you want search and value-guided planning.
JAX. Vectorized rollouts and training, compiled, and run efficiently on accelerators. Pushing the training speed to the hardware limit.
Hydra configs. Experiments are composed from small config files for the game, model, trainer, evaluator, and logger. Changing the game or training setup requires just changing the config.

Why It Is A Good Fit For Imperfect-Information Games

In perfect information games like Chess, Go, and standard Hex, both players see the full board. The game state is public.

Imperfect-information games split the state into public and private parts. Poker hides cards. Liar's Dice hides dice. Dark Hex hides opponent stones unless you collide with them.

That changes everything. The optimal strategy is not a single fixed policy. A poker bot that always plays the highest win-rate line becomes predictable and easy to exploit.

Self-play already helps: the opponent keeps changing, so the policy cannot overfit to a fixed strategy. Entropy scheduling keeps the policy exploring early in training, which matters when the game demands mixed strategies.

On top of that, Jaxpot supports League play. The agent trains against frozen snapshots of itself from earlier in the run, and opponents it struggles against get higher sampling weight. That prevents the policy from "forgetting" how to beat older versions of itself. When the league fills up, surplus opponents move to an archive that can reactivate if the agent starts losing to them again.

Challenge: Train On Dark Hex

In the video you can see a 1900 step model playing against 1800 step model both trained with Jaxpot. Keep in mind the black stones are not visible for the pink player. We can see them only in the replay. This is the difference between Hex and Dark Hex.

Now let's switch to a game that is still small enough to understand, but much more interesting.

Dark Hex is an imperfect-information version of Hex.

In normal Hex, both players see the whole board. In Dark Hex, each player sees:

empty cells according to their own view
their own stones
opponent stones only when revealed through failed placement attempts

The true board exists, but the agent does not get to see it. This is a much better demonstration of why environment design matters. The agent is acting under uncertainty.

Jaxpot already includes the Dark Hex environment so you can start the training in the same notebook as before. Just scroll to the bottom of the notebook. Let's see if you can improve the Dark Hex Agent winrate by changing the parameters? After some changes you should see a training run like this:

The yellow line shows the winrate against random opponent. You can see the model immediately started always winning agains it. Using random for training is a easy way to teach model how to exploit the opponent's mistakes.

The Red line shows the loss of the value head. It keeps improving despite the fact the winrate with random opponent is flat.

Let us know what game agent you would like to see next!

Written by

Karol Kłusek

ML Engineer @ bards.ai

Working on reinforcement learning and self-play systems. Author of Jaxpot.

Like this article?

Get practical insights on AI, product, and growth sent to your inbox.