March 16, 2026

How We Taught an AI to Nitpick Docs

Daniel Abdelsamed


Over the last couple of years, I’ve iterated on what we at Apollo call the “AI Librarian.” It’s a suite of AI tools that sits on top of our documentation platform. Of those tools, the one I’m most proud of (at least for now) is the AI Style Guide Review. This tool intelligently and contextually applies our style guide, then brings the results right to the authors in GitHub.

Let me tell you: I got a lot wrong at first. In this post, I explain the three things I did to finally get it right: 

  • Restructured the style guide so AI could make sense of it 
  • Wired the review into our existing workflow
  • Made it fast enough and cheap enough to run on every commit

But before I dive into the nitty-gritty details, let’s talk about the problem.

Scaling style

The problem we had was all too common for documentation teams in the software industry: Apollo’s docs were written by dozens of engineers across many teams, and although we had a style guide with clear rules about voice, tense, terminology, and formatting, enforcing those guidelines was almost impossible. Writers ended up reviewing changes against fast-approaching deadlines and, most often, spent their time fixing blatant typos instead of making the substantive edits that actually improve writing.

Some tools, like Vale, existed to help address that problem, but they were limited to traditional static analysis. They caught the patterns I told them to catch and nothing else. We needed something that could understand context, not simply match a regex. We wanted something that could ask, “Does this sentence read well?” and actually answer the question correctly. So I turned to AI.

Starting up an AI editor

My first attempt turned out how you’d probably expect. I took our existing style guide (the one written for humans), concatenated it with a prompt that said “review this diff against these rules,” and pointed it at incoming doc changes. The output was bad—not catastrophically bad, but chatty and so full of false positives that it was effectively useless. The model flagged issues that weren’t violations while missing obvious ones. It hallucinated rules that didn’t exist, and it gave vague feedback like “consider rewording for clarity” without saying what was wrong or how to fix it.

What I soon came to realize was that the issue wasn’t the AI; it was the style guide. The guide was written to explain rules to people who could infer patterns from explanations. AI can’t do that—at least, not in the same way. When you tell a human “avoid passive voice,” they apply years of linguistic knowledge. When you tell a model the same thing, it pattern-matches inconsistently.

We had to change the guide to match the way the AI thinks, so we rewrote the style guide as a pattern library. Every rule became a set of concrete do-don’t pairs, with logic described explicitly:

```md
### Verb Tense and Voice

*   **Present Tense:** Favor the present tense.
    Do this: "The client then sends a request to the server."
    Don't do this: "The client will then send a request to the server."
    Reason: Future tense is longer and rarely provides more clarity.

*   **Active Voice:** Favor the active voice for clarity and brevity.
    Do this: "Apollo Studio manages your graph."
    Don't do this: "Your graph is managed by Apollo Studio."
    Reason: Passive voice is less direct and longer.

*   **Passive Voice Exceptions:** Passive voice is acceptable to
    de-emphasize a subject or emphasize an object.
    Do this: "Over 50 conflicts were found in the file."
    Don't do this: "You created over 50 conflicts in the file."
    Reason: Placing responsibility on the reader might be discouraging.
```

We applied the pattern across the entire guide: voice, headings, product names, changesets. Here’s an example of how we encoded the opinionated Apollo voice:

```md
### Voice

The Apollo voice is:
*   Approachable
*   Positive
*   Encouraging
*   Helpful
*   Opinionated/Authoritative

Opinionated voice prescribes a specific "happy path" to accomplish a goal.

Do this: "To achieve optimal performance, configure your server with X."
Don't do this: "You can configure your server with X, Y, or Z."
Reason: This is unopinionated and lays out options rather than prescribing a path.
```

And if you’re wondering—yes—I used AI to help generate the initial rewrite. But our docs team reviewed and edited every example. The difference in output quality was immediately obvious. Same model. Same prompt structure. Wildly better results.

If there’s one takeaway for me from this entire project, it’s this: the way you encode knowledge for an AI can matter more than the model you choose. A mediocre model with a well-structured prompt often outperforms a frontier model with a vague prompt.

Getting AI into our workflow

A tool that produces good output but lives in the wrong place is a tool nobody uses. Our docs platform already had a custom deploy preview system, so I wired the style guide review into the same pipeline. Now, when a commit is pushed, the build forks: one path produces the deploy preview, and the other runs the style-guide analysis and surfaces results as GitHub status checks and annotations.

PullRequestReviewer orchestrates the whole flow. When a PR event comes in, the reviewer creates a GitHub check run, parses the diff to extract changed .mdx lines with their real line numbers, and hands the lines off to the review engine:

```ts
const statusWriter = new StatusWriter(octokit, pr.head);
const check = await statusWriter.create("AI Style Review");

// Parse the diff to extract changed lines with real line numbers
const files: ChangeRequestInput = new Map();
for await (const file of pr.getFiles()) {
  const { filename, patch } = file;
  if (!filename.endsWith(".mdx")) continue;
  files.set(filename, []);

  let realLineNumber = 0;
  for (const line of patch?.split("\n") ?? []) {
    if (line.startsWith("@@")) {
      // Parse hunk header: @@ -oldStart,oldLines +newStart,newLines @@
      const hunkMatch = line.match(/@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/);
      if (hunkMatch) realLineNumber = parseInt(hunkMatch[1], 10);
      continue;
    }
    if (line.startsWith("+")) {
      files.get(filename)?.push({
        lineNumber: realLineNumber,
        content: line.slice(1), // Strip the '+' prefix
      });
    }
    // Added and context lines both advance the new-file line counter
    if (!line.startsWith("-")) realLineNumber++;
  }
}
```

Results land as annotations on the GitHub check run, with each suggestion mapped to a specific file and line:

```ts
await statusWriter.complete(check, {
  conclusion,
  output: {
    title: "Style Review Completed",
    summary: `The pull request has ${changes.length} style issues.`,
    annotations: changes.map((change) => ({
      annotation_level: change.severity,
      path: change.file,
      start_line: change.lineNumber,
      end_line: change.lineNumber,
      message: `${change.reason}\n\n\`\`\`suggestion\n${change.suggestion}\n\`\`\``,
    })),
  },
});
```

Why not PR reviews?

I initially tried using GitHub Pull Request reviews to deliver suggestions. On paper, it’s the perfect UX. Engineers get inline suggestions they can accept with one click. In practice, on large documentation changes with dozens of suggestions, the PR review becomes so noisy it visibly degrades the performance of the “Files changed” tab. GitHub’s UI just isn’t built for that volume of automated feedback.

Annotations solved the noise problem but created a new one: they don’t support GitHub’s suggestion blocks. Engineers could see exactly what was wrong, but they had to implement the fix manually, when they should be able to just click an Accept button. That’s a real UX downgrade, and there’s no way around it with what GitHub provides today.

Building the dashboard

What I could control was the experience outside of GitHub. We already had an internal-only view of our docs site for other tooling, so I built a dashboard on top of it. Each review got its own page with comprehensive observability: cost breakdown, token usage, duration, and every suggestion grouped by file.

Each suggestion showed a visual diff between the original line and the proposed fix, color-coded by severity.

Now engineers can pick the suggestions they want, then choose between two paths. For local iteration, they generate a patch. It’s a curl | sh one-liner that fetches a server-generated script and applies it:

```ts
const curlCommand = useMemo(() => {
  const url = new URL(signedPatchBaseUrl);
  const encoded = encodeSuggestionsUrlSafe(selectedIndices, suggestions.length);
  url.searchParams.set("s", encoded);
  return `curl -sL "${url}" | sh`;
}, [selectedIndices, suggestions.length, signedPatchBaseUrl]);
```

The selection is encoded as a compact binary bitfield. Each suggestion is one bit, packed into bytes and base64-encoded for URL safety:

```ts
/**
 * Format:
 * - 16-bit unsigned int (big-endian) for total suggestion count
 * - 1 bit per suggestion index (1 = selected, 0 = not selected)
 * - Zero-padded to the byte boundary
 * - Base64url encoded
 *
 * Example: 12 suggestions with indices 1, 2, 4, 6, 8, 9 selected
 * - Bytes: [0x00, 0x0C, 0b01101010, 0b11000000]
 * - Base64: "AAxqwA"
 */
export function encodeSuggestions(
  selectedIndices: number[],
  totalCount: number
): string {
  const buffer = new ArrayBuffer(2 + Math.ceil(totalCount / 8));
  const view = new DataView(buffer);
  const uint8 = new Uint8Array(buffer);

  view.setUint16(0, totalCount, false); // Big-endian count header

  const selectedSet = new Set(selectedIndices);
  for (let i = 0; i < totalCount; i++) {
    if (selectedSet.has(i)) {
      const byteIndex = 2 + Math.floor(i / 8);
      const bitIndex = 7 - (i % 8); // MSB first
      uint8[byteIndex]! |= 1 << bitIndex;
    }
  }
  return uint8ToBase64(uint8);
}
```
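For illustration, here’s roughly what the decoding side looks like. This `decodeSuggestions` helper is a sketch, not the production code; it assumes a Node runtime (for `Buffer`’s base64url support) and mirrors the MSB-first bit layout of the encoder:

```typescript
// Decode the base64url bitfield back into a list of selected indices.
// Sketch only: decodeSuggestions and base64ToUint8 are illustrative names.
function base64ToUint8(encoded: string): Uint8Array {
  return new Uint8Array(Buffer.from(encoded, "base64url"));
}

function decodeSuggestions(encoded: string): number[] {
  const bytes = base64ToUint8(encoded);
  // First two bytes: big-endian total suggestion count
  const totalCount = (bytes[0]! << 8) | bytes[1]!;

  const indices: number[] = [];
  for (let i = 0; i < totalCount; i++) {
    const byteIndex = 2 + Math.floor(i / 8);
    const bitIndex = 7 - (i % 8); // MSB first, matching the encoder
    if ((bytes[byteIndex]! >> bitIndex) & 1) indices.push(i);
  }
  return indices;
}
```

Running it on the documented example string `"AAxqwA"` recovers indices 1, 2, 4, 6, 8, and 9.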

For faster iteration, there’s a Commit to PR button that applies selected suggestions as an atomic Git commit directly to the PR branch—no local checkout needed:

```ts
const handleCommitToPr = async () => {
  const response = await fetch(
    `${BASE_URL}/internal-api/ai-review/${reviewId}/commit`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ suggestionIndices: indices }),
    }
  );
  // On success: "Committed 5 suggestions! Commit SHA: a1b2c3d4"
};
```

Behind the scenes, SuggestionCommitter validates that the PR branch hasn’t moved (SHA check), fetches current file contents, applies the line replacements, and creates a single multi-file commit through the GitHub Git Data API:

```ts
export class SuggestionCommitter {
  async commit(options: SuggestionCommitOptions) {
    // Step 1: Validate PR state (SHA hasn't changed)
    const prValidation = await this.prClient.validateForCommit(
      prNumber, expectedSha
    );

    // Step 2: Group changes by file
    const changesByFile = this.groupSuggestionsByFile(suggestions);

    // Step 3: Validate content still matches
    const validation = await this.fileClient.validateContent(
      currentHeadSha, contentValidations
    );

    // Step 4: Fetch files and apply replacements
    const fileContents = await this.fileClient.getMultipleFiles(
      filePaths, currentHeadSha
    );

    // Step 5: Create atomic multi-file commit
    await this.gitCommitClient.createMultiFileCommit({
      branch: branchRef,
      parentSha: currentHeadSha,
      files: updatedFiles,
      message: commitMessage,
    });
  }
}
```

It’s not a one-click acceptance on GitHub, but it’s good enough for engineers to actually use it.

Making it fast and cheap at the same time

Here’s where things got really interesting. Netlify background functions (and AWS Lambda more broadly) are limited to 15 minutes of execution time. For a typical commit touching a handful of files, the review finished in a couple of minutes. But documentation doesn’t always change in small increments. A major release might touch hundreds of files with thousands of changed lines. On those commits, we were hitting the 15-minute wall. And even when we didn’t time out, nobody wanted to wait that long for results.

Parallelism was the solution: split the work across multiple concurrent calls. But parallelism multiplied cost, and the style-guide prompt is big. Really big. Every parallel call needed the full style guide as input context, and input tokens aren’t free.

So I ran the numbers on our historical usage and three optimizations became clear:

1. Input caching

The style guide and system prompt are identical across every call. CacheManager hashes the style guide content, checks for an existing cache, and creates one if needed. Subsequent calls reference the cache by name instead of including the full text:

```ts
export class CacheManager {
  async getStyleGuideCache(model: string): Promise<string> {
    const currentHash = this.getStyleGuideHash();
    const displayName = `style-guide-${currentHash}`;

    // If we already have a valid, unexpired cache, return it
    if (this.cachedContentName && this.styleGuideHash === currentHash) {
      const cache = await this.client.caches.get({
        name: this.cachedContentName,
      });
      if (new Date(cache.expireTime) > new Date()) {
        return cache.name;
      }
    }

    // Create new cache with style guide as system instruction
    return this.createStyleGuideCache(model, currentHash, displayName);
  }

  private async createStyleGuideCache(model, hash, displayName) {
    const styleGuideText = fs.readFileSync(STYLE_GUIDE_PATH, "utf-8");

    const systemInstruction = [
      "You are an expert technical writer reviewing a documentation PR.",
      "",
      "Here is our style guide:",
      "<style-guide>",
      styleGuideText,
      "</style-guide>",
      "",
      "If it looks like you are inside a code snippet, do not provide feedback.",
    ].join("\n");

    const cache = await this.client.caches.create({
      model,
      config: { displayName, ttl: "3600s", systemInstruction },
    });
    return cache.name;
  }
}
```

That alone cut costs dramatically for parallel workloads. You pay full price for the style guide tokens once; every subsequent call pays only the much cheaper cached-input rate.

2. Line-level granularity

Instead of sending the model a full-file diff, I send it one line at a time, with a few surrounding lines for context. Each line gets a structured prompt, with section-heading detection and code-block awareness:

```ts
const CONTEXT_LINES_BEFORE = 5;
const CONTEXT_LINES_AFTER = 5;
export const REVIEW_MODEL = "gemini-3-flash-preview";

// Lines are pre-filtered to skip trivial content
private shouldSkipLine(line: LineWithContext): SkipReason {
  if (line.content === "" || line.content.trim() === "") return "empty";
  if (line.isInsideCodeBlock) return "code-block";
  if (/^import\s+/.test(line.content)) return "import-statement";
  if (line.content === "---") return "frontmatter";
  if (/^<!--.*-->$/.test(line.content.trim())) return "html-comment";
  return null; // This line needs review
}

// Each line gets a focused prompt with context
private buildLineReviewPrompt(line: LineWithContext): string {
  const parts = [`Review this single line from file: ${line.filename}`];

  if (line.sectionHeading) {
    parts.push(`Section: ${line.sectionHeading}`);
  }

  parts.push("Context before (do not review these):");
  for (const ctx of line.contextBefore) {
    parts.push(`  ${ctx.lineNumber}: ${ctx.content}`);
  }

  parts.push(">>> LINE TO REVIEW <<<");
  parts.push(`  ${line.lineNumber}: ${line.content}`);
  parts.push(">>> END LINE TO REVIEW <<<");

  return parts.join("\n");
}
```

The review response uses structured output (a JSON schema that the model has to conform to), which constrains the output and makes it easier to parse:

```ts
const response = await thread.generateJsonWithUsage(
  g.object({
    needsChange: g.boolean(),
    suggestion: g.optional(
      g.object({
        newContent: g.string().description("The suggested new content"),
        reason: g.string().description(
          "Be succinct and direct. Do not reference 'the style guide'. " +
          "To the reader, you ARE the style guide."
        ),
        severity: g.enum("notice", "warning", "failure"),
      })
    ),
  })
);
```

Doing that made each individual call tiny—small enough that I could drop from Gemini 3 Pro to Gemini 3 Flash without losing quality, further reducing cost.

3. Batched parallel execution

With cached prompts and a cheaper model, parallelism becomes affordable. Lines get batched and reviewed concurrently with a worker pool:

```ts
const MAX_CONCURRENT_REVIEWS = 30;

// Review each line individually with concurrent execution
const lineResults = await this.runWithConcurrency(
  linesWithContext,
  MAX_CONCURRENT_REVIEWS,
  (line) => this.reviewLine(line)
);

// Simple worker pool implementation
private async runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  let index = 0;

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async () => {
      while (index < items.length) {
        const current = index++;
        results[current] = await task(items[current]!);
      }
    }
  );

  await Promise.all(workers);
  return results;
}
```

(A note of caution: most AI providers have dynamic rate limits, so handle 429s carefully—you will get them.)
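The usual answer is exponential backoff with jitter around each call. This `withRetry` helper is a sketch, not our production code, and it assumes your client surfaces the HTTP status as a `status` property on thrown errors:

```typescript
// Retry a task on 429s with jittered exponential backoff.
// Sketch only: withRetry, maxAttempts, and baseDelayMs are illustrative.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err: any) {
      const isRateLimit = err?.status === 429;
      // Rethrow non-rate-limit errors and final failures
      if (!isRateLimit || attempt + 1 >= maxAttempts) throw err;
      // Full jitter: random delay in [0, baseDelayMs * 2^attempt)
      const delay = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrapping each `reviewLine` call in something like this keeps a burst of 429s from failing the whole review while the provider’s dynamic limits settle.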

The result: what used to take up to 15 minutes now finishes in under 60 seconds, even on our largest commits. Cost is calculated in real time using live GCP pricing with sensible fallbacks:

```ts
// Fallback pricing per million tokens (USD)
const FALLBACK_PRICING = {
  "gemini-3-flash-preview": {
    inputPricePerMillionTokens: 0.50,
    outputPricePerMillionTokens: 3.00,
    cachedInputPricePerMillionTokens: 0.05,
  },
  "gemini-3-pro-preview": {
    inputPricePerMillionTokens: 2.00,
    outputPricePerMillionTokens: 12.00,
    cachedInputPricePerMillionTokens: 0.20,
  },
};
```
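The arithmetic behind the cost breakdown is simple: tokens times the per-million rate, with cached input billed at its own (much cheaper) rate. This `estimateCostUsd` helper and the `TokenUsage` shape are a sketch of the idea, not the production code:

```typescript
// Illustrative token-usage shape; the real record tracks more fields.
type TokenUsage = {
  inputTokens: number;
  cachedInputTokens: number;
  outputTokens: number;
};

// Fallback rates for the Flash model, per million tokens (USD)
const FLASH_PRICING = {
  inputPricePerMillionTokens: 0.5,
  outputPricePerMillionTokens: 3.0,
  cachedInputPricePerMillionTokens: 0.05,
};

// Convert token counts to dollars: tokens * (price / 1,000,000)
function estimateCostUsd(usage: TokenUsage, pricing = FLASH_PRICING): number {
  const perToken = (pricePerMillion: number) => pricePerMillion / 1_000_000;
  return (
    usage.inputTokens * perToken(pricing.inputPricePerMillionTokens) +
    usage.cachedInputTokens * perToken(pricing.cachedInputPricePerMillionTokens) +
    usage.outputTokens * perToken(pricing.outputPricePerMillionTokens)
  );
}
```

At these rates, a million uncached Flash input tokens costs $0.50 while the same million cached tokens costs $0.05, which is why the cache pays for itself almost immediately on parallel runs.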

At this point, the most expensive run has cost about ten cents. Most small changes have cost a fraction of one cent.

Discovering an accuracy bonus

What I was most surprised by during this whole process was this: line-by-line review was significantly more accurate than bulk review. When the AI analyzed an entire file, it seemed to lose focus, missing violations in one section while over-flagging in another. Narrowing the scope to a single line forced precision. False positives dropped. False negatives dropped. The quality improvement wasn’t marginal; it was drastic.

Every review also produces a comprehensive observability record: duration, token breakdown, cost, logs, and every suggestion stored for later analysis.

```ts
export type AIReviewRecord = {
  reviewId: string;
  prNumber: number;
  repo: string;
  sha: string;
  status: "running" | "completed" | "failed" | "timeout";
  startedAt: string;
  completedAt?: string;
  duration?: number;
  tokenUsage: TokenUsage;
  cost: CostBreakdown;
  linesReviewed: number;
  changesFound: number;
  filesReviewed: number;
  model: string;
  logs: LogEntry[];
  suggestions?: StoredSuggestion[];
};
```

If you’re building something like this

The implementation that worked for us wasn’t complicated: encode knowledge as examples, not explanations; surface results where engineers already work; and optimize for speed and cost at the same time. None of that is specific to documentation. I’m describing principles that apply to any situation in which you’re using AI to enforce standards at scale.

What’s more, the “non-AI” work mattered. I spent more time rewriting the style guide, picking the right GitHub integration surface, building a dashboard, and wiring up atomic commits through the Git Data API than I spent on anything related to the model itself. The work wasn’t glamorous, but it determined whether people changed their behavior based on what the AI told them. 

The AI Librarian isn’t finished, but I built it relying on what is perhaps the least glamorous insight of all: the prompt can matter more than the model.
