Why AI Crawlers Can't Read Next.js App Router Sites
Iwan Efendi · 3 min read
Next.js App Router outputs an RSC Flight payload, not plain HTML. Plain HTTP crawlers like Claude's web_fetch get the structure but miss the content. Here's the fix.
I was testing something with Claude — I asked it to fetch one of my SnipGeek articles directly from its URL. It came back with just the title tag. The article body was completely empty.
My first instinct was to blame my own code.
First Diagnosis: Client-Side Rendering?

The obvious suspect: maybe the article pages were still client-side rendered, sending only a shell HTML and injecting content via JavaScript after load. This is a classic Next.js mistake when "use client" ends up on a page component by accident.

I asked Antigravity to audit the full codebase. The result was surprisingly clean:

- [locale]/blog/[slug]/page.tsx → ✅ Server Component
- [locale]/notes/[slug]/page.tsx → ✅ Server Component
- MDX compiled server-side via next-mdx-remote/rsc → ✅
- generateStaticParams present → ✅

Second Diagnosis: RSC Flight Format

I ran a deeper diagnostic directly against the live URL:

```bash
curl -s https://snipgeek.com/notes/how-to-read-ai-build-failed-logs | grep -i 'article\|content\|body\|prose' | head -20
```

The response was 101 KB, not an empty shell. Keywords like content, article, and prose appeared hundreds of times. But when I dug into the actual content, this is what I found:

```json
{"className":"text-lg text-foreground/80 prose-content","children":"$L1d"}
```

$L1d is not article text. It's a reference to a React Server Component chunk in Next.js App Router's RSC Flight streaming format. The full article content is there, but encoded as a payload that requires the React runtime to decode into readable HTML.

Confirmation:

```bash
curl -s https://snipgeek.com/notes/how-to-read-ai-build-failed-logs | grep '<p>'
# Total <p> tags: 0
# Total <h2> tags: 0
```

Zero traditional HTML tags. The content is entirely inside the RSC payload.

Reproducible Fetch Example

To see this in action, you can run a simple Node.js script. It fetches the raw HTML and searches it for paragraph tags:

```javascript
// test-fetch.js
fetch("https://snipgeek.com/notes/how-to-read-ai-build-failed-logs")
  .then(res => res.text())
  .then(html => {
    console.log("HTML Size:", (html.length / 1024).toFixed(1) + " KB");
    console.log("Has Paragraphs (<p>):", html.includes("<p>"));
    console.log("Has RSC Payload data:", html.includes("__next_f"));
  });
```

When you execute this, you'll see:

```text
HTML Size: 101.2 KB
Has Paragraphs (<p>): false
Has RSC Payload data: true
```

This confirms that the HTML document is populated using the JSON-like __next_f stream script blocks rather than standard HTML paragraphs.

This Isn't a Bug: It's an Architecture Trade-off

The old Pages Router emitted raw HTML: <p>, <h2>, full readable content in the HTTP response. App Router switched to RSC Flight, a streaming format optimised for hydration performance but unreadable without the React runtime.

For SEO, this is fine:

| Crawler | Can Read Content? | Reason |
|---|---|---|
| Googlebot | ✅ | Headless Chrome, full JS render |
| Bingbot | ✅ | Same: full JS render |
| AI crawlers (GPTBot, ClaudeBot) | ⚠️ | Depends: some render JS, some don't |
| Claude via web_fetch | ❌ | Plain HTTP fetch, no JS execution |

Google can read everything. The problem is specific to crawlers that rely on plain HTTP without JavaScript rendering.

Mitigation Patterns for App Router Sites

When facing this issue, you have two primary mitigation patterns to choose from.

Pattern 1: JSON/Markdown API Endpoints (The SnipGeek Approach)

The cleanest fix is to offer an alternative, machine-readable endpoint. I added Route Handlers in Next.js that serve article content as plain JSON, with no RSC format and no JavaScript required:

```text
GET /api/posts/[slug]?locale=en → English article JSON
GET /api/posts/[slug]?locale=id → Indonesian article JSON
GET /api/notes/[slug]?locale=en → English note JSON
GET /api/notes/[slug]?locale=id → Indonesian note JSON
```

Here is a simplified version of the Next.js Route Handler (src/app/api/notes/[slug]/route.ts) implementing this pattern:

```typescript
import { NextResponse } from "next/server";
import { getNoteBySlug } from "@/lib/notes";

export async function GET(
  request: Request,
  { params }: { params: Promise<{ slug: string }> }
) {
  const { slug } = await params;
  const { searchParams } = new URL(request.url);
  // Default to English when no locale is given
  const locale = searchParams.get("locale") || "en";

  try {
    const note = await getNoteBySlug(slug, locale);
    if (!note) {
      return NextResponse.json({ error: "Not Found" }, { status: 404 });
    }

    return NextResponse.json(
      {
        slug: note.frontmatter.slug,
        title: note.frontmatter.title,
        description: note.frontmatter.description,
        content: note.content, // Raw MDX/Markdown string
      },
      {
        headers: {
          "X-Robots-Tag": "noindex", // Crucial: avoid SEO duplicate content issues
          "Cache-Control": "public, max-age=3600",
        },
      }
    );
  } catch (err) {
    return NextResponse.json({ error: "Internal Server Error" }, { status: 500 });
  }
}
```

Pattern 2: Headless Browser Rendering for AI User Agents

If you cannot expose a dedicated API, you can configure your server or reverse proxy to route AI user agents (like ChatGPT-User or ClaudeBot) through a prerendering layer (for example a Puppeteer-based renderer or Prerender.io). This spins up a headless browser, executes the React bundle, and returns the fully rendered HTML.

A few decisions I made during implementation of the API routes:

- Locale fallback: if an id version doesn't exist, it falls back to en with isFallback: true in the response.
- X-Robots-Tag: noindex prevents Google from indexing the API route as a duplicate of the main page.
- Cache-Control: public, max-age=3600 caches responses to avoid repeated serverless invocations.
- translationUrls is a field listing the API URL for each available locale, useful for tools consuming the API.

```bash
curl -s "https://snipgeek.com/api/posts/ubuntu-26-04-beta-sudah-bisa-didownload?locale=id"
```

Response:

```json
{
  "slug": "ubuntu-26-04-beta-sudah-bisa-didownload",
  "locale": "id",
  "isFallback": false,
  "translationAvailable": ["en", "id"],
  "translationUrls": {
    "en": "/api/posts/ubuntu-26-04-beta-sudah-bisa-didownload?locale=en",
    "id": "/api/posts/ubuntu-26-04-beta-sudah-bisa-didownload?locale=id"
  },
  "title": "Ubuntu 26.04 Beta Sudah Rilis — Tapi Jangan Buru-Buru Install",
  "description": "...",
  "date": "2026-03-30",
  "tags": ["ubuntu", "linux", "beta"],
  "content": "\nSaya nunggu beta Ubuntu 26.04 ini sambil setengah semangat..."
}
```

Full article content, readable as plain text. No browser, no JavaScript needed.

Safe Change

This API route lives entirely under /api/*, a separate namespace that cannot conflict with or break any existing page routing. It's a purely additive change.

What's Next

The next step I'm planning: implement llms.txt, an emerging proposal (similar to robots.txt but for AI) that lists all SnipGeek content URLs in a format that LLM crawlers can process easily.

For the curious, the relevant specs are in the Next.js Route Handlers docs and the React Server Components reference.

If you hit this same wall with your own Next.js site, adding a plain JSON API route is probably the fastest fix. If you ran into confusing build output along the way, How to Fix AI Build Failed Logs covers how to interpret what the toolchain is actually telling you. Let me know if it works for you.

FAQ

Q: Does this affect Googlebot's ability to index my Next.js App Router site?
No. Googlebot uses a headless Chromium renderer and executes JavaScript fully before indexing. It can read RSC Flight payloads without any workaround. This issue is specific to plain HTTP crawlers, including many AI tools and custom scrapers, that fetch the raw response without running JavaScript.

Q: Will adding the /api/posts/[slug] route hurt my SEO by creating duplicate content?
Not if you set the X-Robots-Tag: noindex response header on the API route, which prevents search engines from indexing it. The canonical page at /blog/[slug] remains the only indexed version. The API route is invisible to Google's ranking system.
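To confirm that a deployed API route really sends that header, you can read the X-Robots-Tag value from the response and check it for a blocking directive. The helper below is a minimal sketch, not part of the SnipGeek codebase: the name blocksIndexing is hypothetical, and it assumes the header is a comma-separated directive list with an optional crawler-name prefix.

```typescript
// Hypothetical helper: reports whether an X-Robots-Tag header value
// asks search engines not to index the response.
function blocksIndexing(headerValue: string | null): boolean {
  if (!headerValue) return false;
  return headerValue
    .split(",")
    .map((part) => part.trim().toLowerCase())
    // Drop an optional crawler prefix such as "googlebot: noindex".
    .map((part) => (part.includes(":") ? part.split(":").pop()!.trim() : part))
    // "none" is shorthand for "noindex, nofollow".
    .some((directive) => directive === "noindex" || directive === "none");
}
```

In practice you would pass it the value of response.headers.get("x-robots-tag") after fetching one of the /api/posts/[slug] URLs.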
Q: Why doesn't Next.js just output plain HTML for static pages?
It's a deliberate trade-off. RSC Flight format enables efficient streaming, partial hydration, and server/client component boundaries — all features that make App Router faster at runtime. Plain HTML output would sacrifice those gains. For most use cases the performance win is worth it, but it does create a blind spot for non-JS crawlers.
Q: Can I use this same approach for other frameworks like Remix or Astro?
The specific RSC Flight format is a Next.js App Router concern. Remix by default outputs plain HTML from loaders, and Astro's static output is already plain HTML. If you're on those frameworks and AI crawlers can't read your content, the cause is more likely to be JavaScript-injected content or a SPA-style client router rather than RSC encoding.
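Whatever the framework, a quick way to see what a non-JS crawler gets is to strip scripts, styles, and tags from the raw response and look at the text that remains. The function below is a rough sketch under stated assumptions: visibleText is a hypothetical name, and it uses regular expressions as a heuristic rather than a real HTML parser, so treat the result as an estimate.

```typescript
// Hypothetical helper: approximates the text a plain HTTP crawler can read
// from raw HTML without executing any JavaScript.
function visibleText(html: string): string {
  return html
    // Script bodies (including RSC payload pushes) are invisible to non-JS crawlers.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, " ")
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, " ")
    // Remove the remaining tags, keeping their text content.
    .replace(/<[^>]+>/g, " ")
    // Collapse whitespace so the length is meaningful.
    .replace(/\s+/g, " ")
    .trim();
}
```

If the returned string is little more than the page title while the response itself is large, the content is probably arriving via script blocks or client-side rendering.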