// Custom Fine-tuning

Beating AI detectors with a GRPO-trained rewriter

We tried SFT. We tried DPO. Both produced outputs that still read as AI to detectors. GRPO trained from scratch hit near-perfect detection-evasion in a 20-hour run on 8xH100 - and the resulting model ships today as Surfer's AI Content Humanizer.

offices: USA, Poland
size: 60-200 employees
industry: Martech (SaaS tool)
revenue: $25M+ ARR

// Outcomes

The numbers that matter

  • 0% - AI-detection score post-humanize
  • 20h - GRPO training on 8xH100
  • 50K - words per humanize request

01 · How do you make LLM output that doesn't read as LLM output?

The Challenge

Surfer's content tools generate AI-written drafts at scale. The drafts are useful - but downstream, customers run them through AI-detection tools (Originality.ai, GPTZero, Copyleaks, Turnitin), and the same characteristics that make LLM output coherent also make it easy to flag.

The product question was direct: build a rewriter that takes a draft, produces an output that passes the major detectors, and keeps the meaning and structure intact. Treat it as a research problem, because the obvious approaches don't work.

02 · SFT and DPO both fell short

What didn't work

Attempt 1 - SFT on paraphrase pairs. We pulled pre-2021 human-written text from Common Crawl (old enough to be safely pre-ChatGPT), used GPT to paraphrase it into AI-style, and trained the model to reverse the paraphrase: given the AI-style input, output the human original.

It did not work. The model learned to produce text that was slightly different from GPT-style, but still recognizable as machine-generated by every detector we tested. SFT's loss function rewarded recovering the original tokens, not evading detection.
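
For illustration, here's a minimal sketch of that pair construction - helper names like paraphrase_with_gpt are hypothetical stand-ins, not our actual pipeline:

    def build_sft_pairs(human_texts, paraphrase_with_gpt):
        """Reverse-paraphrase pairs: AI-style input -> human original as target."""
        pairs = []
        for human in human_texts:                  # pre-2021 Common Crawl text
            ai_style = paraphrase_with_gpt(human)  # GPT rewrites it into AI style
            pairs.append({
                "prompt": f"Rewrite this so it reads as human-written:\n\n{ai_style}",
                "completion": human,               # loss: cross-entropy on these tokens only
            })
        return pairs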

Attempt 2 - DPO on chosen/rejected pairs. Chosen = human-style outputs, rejected = GPT-paraphrased. Margins improved on the training metric.

But the log-probability of the rejected outputs was still rising - just slightly more slowly than the chosen ones. Worse, the model started drifting toward a third style nobody wanted. Optimizing "prefer A over B" doesn't tell the model what A actually is. It just nudges. With ambiguous data, the nudge picks up reward-hacking patterns instead of the target style.
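
For reference, the standard DPO objective (Rafailov et al., 2023 - not our code) makes that failure concrete: only the margin relative to a frozen reference model is optimized.

    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
        """Inputs: sequence log-probs under the policy and the frozen reference.
        The loss sees only the margin, so 'rejected still rising' is a
        perfectly legal solution as long as chosen rises slightly faster."""
        margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
        return -F.logsigmoid(beta * margin).mean()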

03 · True RL with custom reward functions

GRPO from scratch

GRPO (Group Relative Policy Optimization) is true reinforcement learning: the model generates several rewrites for the same input, each one is scored by a reward function, and the relative scores update the policy. No demonstration data needed - the reward function is the supervision.
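
A minimal sketch of the group-relative part, following the published GRPO formulation rather than our internal trainer:

    import numpy as np

    def grpo_advantages(group_rewards):
        """One group = G rewrites of the same input, each scored by the reward
        function. Advantages are rewards normalized within the group, so every
        rollout is judged only against its siblings - no value network, no
        demonstration data."""
        r = np.asarray(group_rewards, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-8)

Rollouts above their group's mean get a positive advantage and get reinforced; the rest get pushed down. That relative signal is the entire supervision.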

Reward design (v1) - three signals, combined per rollout (see the sketch after this list).

  • Score from a custom AI-detector we trained ourselves - keeps us off the leaderboard treadmill of any single third-party tool.
  • Score from Originality.ai (the detector our customers actually run).
  • Format reward - output must follow a structured XML reasoning + answer schema.
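
A hypothetical composition of those three signals - the helper names and weights are illustrative, not our production values:

    def v1_reward(rewrite, custom_detector, originality_score, is_valid_schema):
        """Both detectors return P(AI-written) in [0, 1]; lower is better,
        so we invert them."""
        r_custom = 1.0 - custom_detector(rewrite)            # in-house detector
        r_orig = 1.0 - originality_score(rewrite)            # Originality.ai score
        r_format = 1.0 if is_valid_schema(rewrite) else 0.0  # XML reasoning + answer
        return 0.4 * r_custom + 0.4 * r_orig + 0.2 * r_format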

The training pattern was the textbook GRPO curve.

  • First ~2 hours: zero good outputs. Reward function returning near-floor scores. Looked like nothing was happening.
  • Around 2.5h: a few rollouts produced outputs that scored well. The model learned from those.
  • From there: exponential improvement. The model produces more good outputs → it learns from more good outputs → quality compounds.
  • By ~20h on 8xH100: near-perfect score on the held-out eval. Our dataset became the limiting factor, not the algorithm.

GRPO v2 - production-grade tuning.

We dropped the XML reasoning structure (the model didn't need it). Added two new rewards: structural similarity (preserve markdown headings, paragraphs, lists) and semantic similarity (the rewrite still means what the input means). Tightened the prompt to keep the original style register - formal stays formal, casual stays casual.
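
The two new rewards could look roughly like this - the markdown-skeleton comparison and the sentence encoder are assumptions, not our production implementation:

    import re
    import numpy as np

    def structural_reward(original_md, rewrite_md):
        """Compare the sequence of markdown block types (heading/list/paragraph)."""
        def skeleton(md):
            kinds = []
            for line in md.splitlines():
                if re.match(r"#{1,6}\s", line):
                    kinds.append("heading")
                elif re.match(r"\s*([-*+]|\d+\.)\s", line):
                    kinds.append("list")
                elif line.strip():
                    kinds.append("para")
            return kinds
        a, b = skeleton(original_md), skeleton(rewrite_md)
        return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b), 1)

    def semantic_reward(original, rewrite, embed):
        """Cosine similarity between embeddings; embed is any sentence encoder."""
        u, v = embed(original), embed(rewrite)
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))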

What's hard, today.

  • Reward hacking. The model finds loopholes - weird punctuation, run-on sentences, structures that score well on the detector but read poorly. KL regularization to the original model is a guardrail (sketched after this list), but on its own it's not strong enough.
  • Style preservation under aggressive humanization. The more you push detection-evasion, the more you risk drifting away from the original voice. Managing that tradeoff is the ongoing engineering work.
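
For context, the KL guardrail from the first bullet is the standard per-token penalty against the frozen base model; the coefficient here is illustrative:

    import torch.nn.functional as F

    def kl_penalty(policy_logits, ref_logits, coeff=0.05):
        """KL(policy || reference) per token, subtracted from the reward.
        Too small a coeff and the detector reward gets hacked with weird
        punctuation; too large and the rewrite barely changes the input."""
        logp = F.log_softmax(policy_logits, dim=-1)
        ref_logp = F.log_softmax(ref_logits, dim=-1)
        return coeff * (logp.exp() * (logp - ref_logp)).sum(dim=-1)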

04 · Shipped as Surfer's AI Content Humanizer

Result

The GRPO-trained rewriter ships in production today as Surfer's AI Content Humanizer. Paste up to 50,000 words of AI-generated text and the output passes major AI detectors at near-zero detection rates - while preserving meaning and markdown structure.

  • 100% AI-detected input → 0% AI-detected output (on the detectors customers actually run).
  • 20-hour GRPO training run on 8xH100 to reach the production-grade quality bar.
  • Custom AI-detector trained in-house - keeps the optimization signal honest and decouples it from any single third-party tool that might patch tomorrow.

// What they say

They do more than what we ask them to do. Bards.ai team always suggests solutions that we couldn't figure out without them.

Bartlomiej Korpus

CTO @ SurferSEO

// Ready to ship?

Let's build something that delivers numbers like these.

Book a meeting