
The Complete AEO Technical Stack: robots.txt, llms.txt & AI Crawler Management for 2026

Your content strategy means nothing if AI crawlers can't find and understand your pages. This guide covers the complete technical infrastructure for Answer Engine Optimization — from controlling 30+ AI bot user agents in robots.txt to implementing llms.txt and structuring content for machine extraction.

SerpNap Team
March 10, 2026
20 min read
Executive Summary / AI Insights

The AEO Infrastructure Blueprint

Answer Engine Optimization has a content layer (what to write) and a technical infrastructure layer (how to serve it to AI). This guide covers the infrastructure: managing 30+ AI crawler user agents in robots.txt, implementing llms.txt for content discovery, and structuring schema markup for machine extraction. With Gartner forecasting 25% of search traffic shifting to AI by 2026, this infrastructure is no longer optional.

Key Takeaways

GPTBot (training) and ChatGPT-User (search) are separate — block one, allow the other.
30+ AI crawler user agents exist across 4 categories: training, search index, user assistant, and autonomous agent.
llms.txt is a curated content map for AI, NOT a replacement for robots.txt — they serve different purposes.
Article schema with dateModified is critical: the vast majority of ChatGPT citations come from recently updated content, typically within the last 10-12 months.
FAQPage schema still helps AI extract Q&A pairs even though Google no longer shows FAQ rich results for most sites.
AI crawlers don't execute JavaScript — server-side rendering is mandatory for AI visibility.
Source: Gartner · Search Volume Forecast, 2024-2028

We've implemented the AEO technical stack across dozens of websites — from SaaS companies to local service businesses — and the pattern is consistent: sites that properly configure their AI crawler access, provide an llms.txt, and implement structured schema see measurable increases in AI referral traffic within 4-8 weeks. The technical foundation isn't glamorous, but it's what separates sites that get cited from sites that get ignored.

This guide covers that technical layer: the 30+ AI bot user agents you need to manage in your robots.txt, the llms.txt specification that gives AI systems a curated map of your content, and the schema markup patterns that make your pages machine-readable. If the GEO Content Playbook is what to write, this guide is how to serve it to AI.

What Is Answer Engine Optimization (AEO)?

Answer Engine Optimization (AEO) is the technical infrastructure layer that enables AI answer engines — ChatGPT, Perplexity, Google AI Overviews, Gemini, and Claude — to discover, crawl, parse, and cite your content. It covers robots.txt AI crawler rules, llms.txt content maps, schema markup, and server-side rendering. With Gartner projecting 25% of search traffic shifting to AI by end of 2026, AEO is now a critical infrastructure investment.

AEO sits at the intersection of three concerns: access control (which AI bots can crawl your site), content discovery (how AI systems find your most important pages), and content structure (how your information is formatted for machine extraction). Each of these maps to a specific technical implementation: robots.txt, llms.txt, and schema markup, respectively.

Understanding AI Crawlers: The Complete 2026 Taxonomy

There are now over 30 distinct AI crawler user agents active on the web. Understanding the difference between them is essential for making the right access control decisions, because blocking the wrong bot can either give away your training data for free or cut off your AI search visibility entirely.

Category 1: AI Training Scrapers

Training scrapers collect web content to build and refine AI language models. Allowing these bots means your content may be used to train future models. Blocking them does not affect whether AI search products cite you — that's handled by separate crawler user agents.

| User Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training data collection |
| ClaudeBot | Anthropic | Training data for Claude models |
| Google-Extended | Google | Gemini/Vertex AI training |
| Applebot-Extended | Apple | Foundation LLM development |
| Bytespider | ByteDance | TikTok/Doubao LLM training |
| CCBot | Common Crawl | Open web data repository |
| meta-externalagent | Meta | Llama model training |
| cohere-training-data-crawler | Cohere | Enterprise AI training |
| Diffbot | Diffbot | Structured data for AI |
| FacebookBot | Meta | Speech recognition training |
| PanguBot | Huawei | Multimodal LLM training |
| ChatGLM-Spider | Zhipu AI | Chinese LLM training |
| omgili | Webz.io | Crawl data sold for AI |
| webzio-extended | Webz.io | Extended crawl for models |
| FirecrawlAgent | Firecrawl | Web-to-LLM conversion |

Category 2: AI Search Index Crawlers

Search index crawlers build the knowledge base that AI search products use to answer user queries. These are the bots you want to allow — they drive your AI search visibility. Blocking them is equivalent to blocking Googlebot for traditional search.

| User Agent | Company | Product |
|---|---|---|
| OAI-SearchBot | OpenAI | SearchGPT / ChatGPT search |
| Claude-SearchBot | Anthropic | Claude search feature |
| PerplexityBot | Perplexity | Perplexity answer engine |
| Bravebot | Brave | Brave Search AI answers |
| YouBot | You.com | You.com AI search |
| ExaBot | Exa | Semantic search |
| Amzn-SearchBot | Amazon | Alexa AI search |
| AzureAI-SearchBot | Microsoft | Azure AI search |
| meta-webindexer | Meta | Meta AI search |
| Google-CloudVertexBot | Google | Vertex AI Search |
| PetalBot | Huawei | Petal Search |
| LinkupBot | Linkup | Enterprise AI search |
| Cloudflare-AutoRAG | Cloudflare | RAG service indexing |

Category 3: AI User Assistants

User assistant bots fetch content in real-time when a user asks an AI assistant a question. These are user-initiated — a human is actively requesting information. Blocking these bots prevents your content from appearing when someone directly asks an AI about topics you cover.

| User Agent | Company | Context |
|---|---|---|
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT |
| Claude-User | Anthropic | Claude browsing for users |
| Gemini-Deep-Research | Google | Gemini research feature |
| Perplexity-User | Perplexity | User query responses |
| MistralAI-User | Mistral | Mistral browsing |
| Amzn-User | Amazon | Alexa information retrieval |
| DuckAssistBot | DuckDuckGo | AI-assisted answers |
| PhindBot | Phind | Developer answer engine |
| kagi-fetcher | Kagi | Kagi AI assistant |

Category 4: AI Agents (Autonomous)

AI agents are autonomous bots that perform multi-step tasks on behalf of users. These are the newest category and the most unpredictable — they navigate websites, fill forms, and interact with content autonomously.

| User Agent | Company | Purpose |
|---|---|---|
| ChatGPT Agent | OpenAI | Autonomous task completion |
| GoogleAgent-Mariner | Google | Browser-based interaction |
| NovaAct | Amazon | Multi-step task agent |
| Manus-User | Butterfly Effect | Autonomous navigation |
| Devin | Cognition AI | Software engineering agent |
Pro Tip: The critical distinction most people miss: GPTBot (training scraper) and ChatGPT-User (user assistant) are completely separate user agents. Blocking GPTBot prevents your content from being used for OpenAI's model training, but does not prevent ChatGPT from citing you when a user asks a question. Same for ClaudeBot vs. Claude-User. This distinction is the foundation of a smart AI crawler strategy.

The robots.txt Strategy for the AI Era

A modern robots.txt strategy follows one core principle: allow AI search and user-initiated bots (they drive visibility) while blocking pure training scrapers (they consume your content without driving traffic). This approach maximizes your AI search presence — a different concern from traditional SEO — while protecting your content from being used as training data without compensation.

Here is the complete, production-ready robots.txt configuration for 2026. You can also use our free robots.txt generator to create a customized version for your site. Copy and adapt for your domain:

# ============================================
# robots.txt — AI Crawler Policy (2026)
# ============================================

# === Standard Search Engines (ALLOW) ===
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Applebot
Allow: /

# === AI SEARCH CRAWLERS (ALLOW — drives visibility) ===
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bravebot
Allow: /

User-agent: YouBot
Allow: /

User-agent: Amzn-SearchBot
Allow: /

User-agent: AzureAI-SearchBot
Allow: /

User-agent: meta-webindexer
Allow: /

User-agent: Google-CloudVertexBot
Allow: /

# === AI USER ASSISTANTS (ALLOW — user-initiated) ===
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Gemini-Deep-Research
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: PhindBot
Allow: /

User-agent: Amzn-User
Allow: /

# === AI TRAINING SCRAPERS (BLOCK) ===
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-training-data-crawler
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: PanguBot
Disallow: /

User-agent: ChatGLM-Spider
Disallow: /

User-agent: omgili
Disallow: /

User-agent: webzio-extended
Disallow: /

User-agent: FirecrawlAgent
Disallow: /

User-agent: FacebookBot
Disallow: /

# === AI AGENTS (BLOCK by default) ===
User-agent: NovaAct
Disallow: /

User-agent: Manus-User
Disallow: /

# === ARCHIVE (ALLOW) ===
User-agent: archive.org_bot
Allow: /

# === DEFAULT ===
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
Important: robots.txt Is Voluntary

The robots.txt standard is a voluntary protocol. Reputable crawlers from major companies (Google, OpenAI, Anthropic, Perplexity) respect it. But not all bots do. For critical content protection, combine robots.txt with server-side user agent filtering. That said, for most websites, robots.txt is sufficient for managing legitimate AI crawlers.
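
If you want that server-side enforcement layer in a Next.js app, one option is an edge middleware check on the User-Agent header. The sketch below is illustrative, not an official pattern: the bot list and matcher are assumptions you would align with your own robots.txt policy.

// src/middleware.ts
import { NextRequest, NextResponse } from "next/server";

// Training scrapers disallowed in robots.txt; enforce the same policy server-side
const BLOCKED_BOTS = [
  "GPTBot",
  "ClaudeBot",
  "Bytespider",
  "CCBot",
  "meta-externalagent",
];

export function middleware(request: NextRequest) {
  const ua = (request.headers.get("user-agent") ?? "").toLowerCase();

  // Case-insensitive substring match against the blocklist
  if (BLOCKED_BOTS.some((bot) => ua.includes(bot.toLowerCase()))) {
    return new NextResponse(null, { status: 403 });
  }

  return NextResponse.next();
}

// Skip static assets so the check only runs on page routes
export const config = {
  matcher: ["/((?!_next/static|_next/image|favicon.ico).*)"],
};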

Alternative Strategies

The configuration above follows the "protect training data, allow search visibility" approach. But there are legitimate reasons to choose different strategies:

  • Maximum visibility (allow everything) — If you want your content to influence AI model training (building long-term brand presence in the models themselves), allow all crawlers including training scrapers. This is a valid strategy for brands that want to be "baked into" future AI knowledge.
  • Maximum protection (block everything except standard search) — For publishers concerned about AI using their content without licensing. Major publishers like NYT, WSJ, and Reuters have taken this approach. Note: this significantly reduces AI search visibility.
  • Selective content exposure — Allow AI crawlers on marketing pages and blog content but block premium/gated content. Use Disallow: /premium/ paths for content you want to protect.

Next.js robots.ts Implementation

For Next.js applications, implement robots.txt as a TypeScript route handler:

// src/app/robots.ts
import type { MetadataRoute } from "next";

const BASE_URL = process.env.NEXT_PUBLIC_APP_URL || "https://yoursite.com";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Standard search engines
      { userAgent: "Googlebot", allow: "/" },
      { userAgent: "Bingbot", allow: "/" },
      { userAgent: "Applebot", allow: "/" },

      // AI search crawlers (ALLOW)
      { userAgent: "OAI-SearchBot", allow: "/" },
      { userAgent: "Claude-SearchBot", allow: "/" },
      { userAgent: "PerplexityBot", allow: "/" },
      { userAgent: "Bravebot", allow: "/" },
      { userAgent: "ChatGPT-User", allow: "/" },
      { userAgent: "Claude-User", allow: "/" },
      { userAgent: "Gemini-Deep-Research", allow: "/" },
      { userAgent: "Perplexity-User", allow: "/" },

      // AI training scrapers (BLOCK)
      { userAgent: "GPTBot", disallow: "/" },
      { userAgent: "ClaudeBot", disallow: "/" },
      { userAgent: "Google-Extended", disallow: "/" },
      { userAgent: "CCBot", disallow: "/" },
      { userAgent: "Bytespider", disallow: "/" },
      { userAgent: "meta-externalagent", disallow: "/" },

      // Default
      { userAgent: "*", allow: "/" },
    ],
    sitemap: `${BASE_URL}/sitemap.xml`,
  };
}

WordPress robots.txt Setup

In WordPress, manage robots.txt through your SEO plugin or by creating a physical file:

  • Yoast SEO — Go to Yoast SEO → Tools → File editor. Paste the robots.txt content from the template above.
  • Rank Math — Go to Rank Math → General Settings → Edit robots.txt.
  • Manual file — Create a physical robots.txt file in your WordPress root directory (typically /var/www/html/robots.txt or /public_html/robots.txt).

llms.txt: The Treasure Map for AI Systems

The llms.txt file is a Markdown file served at your domain root (https://yoursite.com/llms.txt) that provides LLM-friendly documentation about your site. While robots.txt controls access (what bots can crawl), llms.txt controls understanding (what your site is about and where to find key content). Think of it as a curated table of contents designed specifically for AI context windows.

llms.txt vs. robots.txt: Different Jobs

A common misconception is that llms.txt is "robots.txt for AI." They serve fundamentally different purposes:

| Aspect | robots.txt | llms.txt |
|---|---|---|
| Purpose | Access control | Content discovery and curation |
| Format | Custom syntax | Markdown |
| Controls | Which bots can crawl which paths | What content is most important |
| Analogy | A security gate | A treasure map |
| Adoption | Universal (1994 standard) | Emerging (2024+) |
| Enforcement | Voluntary but widely respected | Optional, no enforcement |

Google's Gary Illyes has compared llms.txt to the keywords meta tag — implying limited direct search impact. However, the SEO community's consensus is more nuanced: while Google may not directly use llms.txt, AI systems with smaller context windows (like API-based RAG applications) benefit significantly from a curated content index. And as more AI systems adopt it, early implementation builds a competitive advantage.
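
To make the RAG use case concrete, here is a hypothetical sketch of how an API-connected tool with a small context window might consume llms.txt as a retrieval index. The function and type names are our own, not part of the llms.txt spec:

// load-llms-txt.ts — hypothetical helper, not part of any official spec
type LlmsTxtEntry = { title: string; url: string; description: string };

// Fetch the curated content map and turn "- [Title](URL): Description" lines
// into a list of candidate documents for retrieval
export async function loadLlmsTxt(origin: string): Promise<LlmsTxtEntry[]> {
  const res = await fetch(new URL("/llms.txt", origin));
  if (!res.ok) return [];

  const linkPattern = /^-\s*\[(.+?)\]\((.+?)\)(?::\s*(.*))?$/;
  const text = await res.text();

  return text
    .split("\n")
    .map((line) => line.trim().match(linkPattern))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map((m) => ({ title: m[1], url: m[2], description: m[3] ?? "" }));
}

A tool like this fetches llms.txt once, then retrieves only the pages relevant to the user's question, which is exactly the scenario where a curated 20-30 page index beats a full sitemap.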

The llms.txt File Format

The specification (from llmstxt.org) defines a simple Markdown structure:

# Your Company Name

> Brief one-line description of what your company does.

Additional context about your company, expertise, and what
makes your content authoritative. This section can include
paragraphs and lists but NO headings.

## Core Pages

- [About Us](https://yoursite.com/about): Company background and team
- [Services](https://yoursite.com/services): What we offer
- [Pricing](https://yoursite.com/pricing): Plans and pricing

## Documentation

- [Getting Started](https://yoursite.com/docs/start): Step-by-step guide
- [API Reference](https://yoursite.com/docs/api): API documentation
- [Glossary](https://yoursite.com/glossary): Key term definitions

## Blog

- [AI Implementation Guide](https://yoursite.com/blog/ai-guide): Complete guide
- [Case Studies](https://yoursite.com/case-studies): Client results

## Optional

- [Press Kit](https://yoursite.com/press): Brand assets
- [Careers](https://yoursite.com/careers): Open positions

Key Rules for llms.txt

  1. H1 title (required) — your site or company name only. Nothing else in the H1.
  2. Blockquote (optional but recommended) — a single-line summary of what your company does and its core value proposition.
  3. Body text (optional) — additional context with paragraphs and lists, but no headings in this section.
  4. H2 sections (optional) — organize your links into logical categories.
  5. Link format — each entry follows the pattern - [Page Name](URL): Brief description
  6. "Optional" section — AI systems can skip this section when working with limited context, so put non-essential pages here.

Best Practices for Effective llms.txt

  • Curate ruthlessly — don't list every page. Include your 20-30 most important and authoritative pages. Quality over quantity.
  • Lead with your strongest content — put your most cited, most comprehensive pages in the first section after the header.
  • Use descriptive link text — the link text should tell the AI what the page is about. "Complete Guide to X" is better than "Learn More."
  • Include your glossary and FAQ pages — these are the highest-value pages for AI citation. Always include them.
  • Update regularly — when you publish significant new content, add it to your llms.txt. Remove pages that are outdated.
  • Consider a companion llms-full.txt — for larger sites, provide an expanded version with more pages and fuller descriptions.

Next.js llms.txt Implementation

In Next.js, serve llms.txt as a route handler or as a static file in the public directory:

Option 1: Static file (simplest)

Create public/llms.txt with your Markdown content. It will be served at https://yoursite.com/llms.txt automatically.

Option 2: Dynamic route handler

// src/app/llms.txt/route.ts
const BASE = "https://yoursite.com";

export function GET() {
  const content = `# Your Company Name

> One-line description of what your company does.

We specialize in [expertise] and have served [audience]
since [year].

## Core Pages

- [About Us](${BASE}/about): Our team, mission, and credentials
- [Services](${BASE}/services): Complete service offerings
- [Blog](${BASE}/blog): Industry insights and guides
- [Glossary](${BASE}/glossary): Key term definitions
- [Tools](${BASE}/tools): Free tools and resources

## Top Resources

- [SEO Guide](${BASE}/blog/seo/guide): Complete guide
- [AI Search Guide](${BASE}/blog/seo/ai-search): AI optimization
`;

  return new Response(content, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "public, max-age=86400",
    },
  });
}

WordPress llms.txt Setup

For WordPress, create a physical file at your site root:

# SSH into your server
nano /var/www/html/llms.txt

# Or upload via FTP to your root directory
# The file should be accessible at: yoursite.com/llms.txt

If your WordPress installation uses .htaccess rewrites that interfere with static file serving, add this rule before the WordPress rewrite block:

# Serve llms.txt directly
RewriteRule ^llms\.txt$ - [L]
RewriteRule ^llms-full\.txt$ - [L]

Schema Markup That Maximizes AI Extraction

Schema.org structured data gives AI systems machine-readable context about your content that goes beyond what they can infer from the HTML alone. The Princeton GEO study (arXiv:2311.09735) found that citing sources improved AI visibility by 27.8% — and schema markup is the machine-readable equivalent of citing sources. It tells AI systems exactly what your content is, who wrote it, when it was published, and why it's authoritative.

If you need to generate JSON-LD schema quickly, our free schema generator creates valid LocalBusiness and FAQPage markup you can paste directly into your page. For a deeper dive on all schema types, see our complete schema markup guide.

The Schema Priority Matrix for AEO

| Schema Type | Where to Place | Priority | AEO Impact |
|---|---|---|---|
| Organization | Homepage | Critical | Establishes entity in AI knowledge graphs |
| WebSite | Homepage | Critical | Site identity for AI systems |
| Article + Person | Every blog post | Critical | Author E-E-A-T for citation confidence |
| BreadcrumbList | Every page | High | Content hierarchy for AI understanding |
| FAQPage | Pages with FAQ sections | High | Directly extractable Q&A pairs |
| HowTo | Tutorial/guide pages | High | Step-by-step extraction for AI |
| Speakable | Key articles | Medium | Voice search and AI assistant extraction |
| DefinedTerm | Glossary pages | Medium | Definition extraction for AI |
| Dataset | Research/data pages | Medium | Original data discovery |

Article Schema with Author E-E-A-T

Article schema is the most important schema type for content sites. AI systems use it to verify author credentials, check publication dates, and assess content authority. The dateModified field is especially critical — analysis suggests that the vast majority of ChatGPT citations come from recently updated content (typically within the last 10-12 months), and pages with clear timestamps appear to receive significantly more citations than undated pages.

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Your Article Headline (Max 110 Characters)",
  "description": "150-160 character description as the answer capsule.",
  "datePublished": "2026-03-10T08:00:00+00:00",
  "dateModified": "2026-03-10T12:00:00+00:00",
  "author": {
    "@type": "Person",
    "name": "Author Name",
    "url": "https://yoursite.com/authors/author-name",
    "jobTitle": "Senior SEO Strategist",
    "worksFor": {
      "@type": "Organization",
      "name": "Your Company"
    },
    "sameAs": [
      "https://linkedin.com/in/authorname",
      "https://twitter.com/authorname"
    ],
    "knowsAbout": ["SEO", "AI Search", "Content Strategy"]
  },
  "publisher": {
    "@type": "Organization",
    "name": "Your Company",
    "logo": {
      "@type": "ImageObject",
      "url": "https://yoursite.com/logo.png"
    }
  },
  "isAccessibleForFree": true,
  "inLanguage": "en-US"
}

FAQPage Schema for AI Q&A Extraction

Google now limits FAQ rich results to government and health authority sites. However, FAQPage schema still helps AI systems extract Q&A pairs, making it valuable for AEO even without the visual rich result in Google Search. Every page with a FAQ section should have matching FAQPage schema.

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is answer engine optimization?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Answer engine optimization (AEO) is the practice of structuring website content and technical infrastructure so AI-powered answer engines can discover, parse, and cite your content. It covers robots.txt configuration for AI crawlers, llms.txt implementation, schema markup, and content formatting for machine extraction."
      }
    },
    {
      "@type": "Question",
      "name": "Do I need both robots.txt and llms.txt?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes — they serve different purposes. robots.txt controls which AI bots can access your site (access control). llms.txt tells AI systems what your most important content is and where to find it (content discovery). Using both together gives AI systems both the permission and the guidance to cite your content effectively."
      }
    }
  ]
}

Speakable Schema for Voice + AI Assistants

Speakable markup identifies content optimized for text-to-speech and AI assistant extraction. It tells AI systems which paragraphs on your page contain the most important, quotable information.

{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [
      ".article-headline",
      ".article-summary",
      ".key-takeaway"
    ]
  }
}

Speakable rules: Target 2-3 sentences per speakable section (~20-30 seconds of audio). Mark headlines and summaries, not captions. Rewrite marked content to read clearly aloud. Use cssSelector over xPath for easier maintenance. For a full guide, check our structured data implementation guide.

Organization Schema: Establishing Your Entity

Every site needs Organization schema on the homepage. It establishes your entity in AI knowledge graphs and provides the cross-reference points (via sameAs) that AI systems use to verify your identity — this is foundational for getting cited by AI answer engines. Critical properties for AI citation (a code sketch follows the list):

  • sameAs — links to Wikipedia, Wikidata, Crunchbase, LinkedIn, and social profiles. This is how AI systems verify your entity across platforms.
  • knowsAbout — explicitly declares your expertise areas so AI systems know what topics you're authoritative on.
  • publishingPrinciples — links to your editorial policy page, signaling editorial standards to AI quality assessors.
  • founder + foundingDate — verifiable company information that builds entity confidence.
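
As a reference point, here is a minimal sketch of what that Organization markup might look like when rendered from a Next.js homepage. Every property value below is a placeholder to replace with your own verified details:

// src/app/page.tsx (fragment); placeholder values throughout
const organizationSchema = {
  "@context": "https://schema.org",
  "@type": "Organization",
  name: "Your Company",
  url: "https://yoursite.com",
  logo: "https://yoursite.com/logo.png",
  sameAs: [
    "https://www.linkedin.com/company/yourcompany",
    "https://www.crunchbase.com/organization/yourcompany",
    "https://www.wikidata.org/wiki/Q0000000",
  ],
  knowsAbout: ["SEO", "Answer Engine Optimization", "AI Search"],
  publishingPrinciples: "https://yoursite.com/editorial-policy",
  founder: { "@type": "Person", name: "Founder Name" },
  foundingDate: "2015-06-01",
};

export default function HomePage() {
  return (
    <>
      {/* JSON-LD rendered into the page as a script tag */}
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: JSON.stringify(organizationSchema) }}
      />
      {/* ...homepage content... */}
    </>
  );
}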

Content Structure Patterns for AI Extraction

Beyond schema markup, the HTML structure of your content determines how effectively AI systems can extract and cite specific information. Content structure for AI extraction follows three principles: clear heading hierarchy, self-contained sections, and machine-parseable data formats. Each principle maps to specific HTML patterns that you can implement today.

Heading Hierarchy Rules

  • One H1 per page — the page title, matching or closely related to the headline in your Article schema.
  • H2 for major sections — each H2 should be phrased as or closely match a search query. "What is answer engine optimization?" is better than "Overview."
  • H3 for subsections — break detailed sections into scannable subsections.
  • Never skip levels — don't jump from H2 to H4. AI systems use heading hierarchy to understand content relationships.

Answer Blocks After Every H2

Place a 40-60 word direct answer immediately after every H2 heading. This is the highest-impact structural change you can make for AI extraction. The answer block should be self-contained — AI systems should be able to extract just that paragraph as a complete, accurate answer. For detailed patterns and examples, see our GEO Content Playbook.

Table Formatting for AI Parsing

AI systems extract tabular data effectively when formatted cleanly. Use semantic HTML tables (not CSS grid or flexbox layouts) with clear header rows:

  • Use clear, descriptive column names in the header row
  • Keep cell content under 50 characters
  • Use consistent units within columns
  • Include a descriptive H3 heading above every table
  • Avoid merged cells, nested tables, or complex layouts

Server-Side Rendering for AI Crawlers

Most AI crawlers do not execute JavaScript. If your content is rendered client-side (React SPA, Vue SPA without SSR), it's invisible to the majority of AI crawlers. Every content page must be server-side rendered (SSR) or statically generated (SSG).

  • Next.js — use Server Components (default in App Router) and generateStaticParams() for pre-rendering (see the sketch after this list).
  • WordPress — inherently server-rendered, no additional work needed.
  • Verify — test with curl https://yoursite.com/page to confirm content appears in the raw HTML without JavaScript execution.
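
For the Next.js case, a minimal SSG sketch looks like the following; getAllPosts and getPost are placeholders for whatever data layer you use:

// src/app/blog/[slug]/page.tsx; illustrative sketch, "@/lib/posts" is a placeholder module
import { getAllPosts, getPost } from "@/lib/posts";

// Pre-render every post at build time so the full HTML exists before any JavaScript runs
export async function generateStaticParams() {
  const posts = await getAllPosts();
  return posts.map((post) => ({ slug: post.slug }));
}

export default async function BlogPostPage({ params }: { params: { slug: string } }) {
  const post = await getPost(params.slug);

  // Server Component: this markup ships in the initial HTML response,
  // which is what non-JavaScript AI crawlers see
  return (
    <article>
      <h1>{post.title}</h1>
      <div dangerouslySetInnerHTML={{ __html: post.html }} />
    </article>
  );
}

The curl test in the last bullet confirms the result: if the article body appears in the raw response, AI crawlers can read it.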

Verifying AI Crawler Activity on Your Site

After implementing your robots.txt and llms.txt, verify that AI crawlers are actually visiting your site and accessing the right content. Server log analysis is the most reliable method for confirming AI crawler activity and identifying which pages they prioritize.

Server Log Analysis Commands

Run these commands against your server access logs to monitor AI crawler activity:

# Count total AI crawler hits
grep -E "(GPTBot|ClaudeBot|PerplexityBot|OAI-SearchBot|Claude-SearchBot|ChatGPT-User)" access.log | wc -l

# See which pages Perplexity is crawling most
grep "PerplexityBot" access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20

# Check if ChatGPT users are browsing your site
grep "ChatGPT-User" access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20

# Count unique IPs from training scrapers still crawling despite robots.txt
grep -E "(GPTBot|ClaudeBot|Bytespider)" access.log | awk '{print $1}' | sort -u | wc -l

GA4 Referral Traffic Tracking

In Google Analytics 4, check for AI platform referrals:

  • chatgpt.com (formerly chat.openai.com) — traffic from ChatGPT citations
  • perplexity.ai — traffic from Perplexity citations
  • gemini.google.com — traffic from Google Gemini
  • claude.ai — traffic from Claude citations

Create a custom channel group in GA4 called "AI Search" that groups all AI platform referrals together. This gives you a single metric for AI-driven traffic that you can trend over time.

The Complete AEO Technical Checklist

The full AEO technical stack can be implemented in four weekly phases: robots.txt AI crawler configuration (week 1), llms.txt creation and deployment (week 2), schema markup for Article, FAQ, and Organization types (week 3), and server-side rendering verification with log monitoring (week 4). Complete each phase in order — they build on each other.

Phase 1: Access Control (Week 1)

  • Configure robots.txt with AI crawler rules (allow search bots, block training scrapers)
  • Verify robots.txt is accessible at yoursite.com/robots.txt
  • Reference your sitemap in robots.txt
  • Ensure XML sitemap has accurate lastmod dates
  • Verify every page has a self-referencing canonical tag

Phase 2: Content Discovery (Week 2)

  • Create llms.txt at your domain root with your 20-30 most important pages
  • Optionally create llms-full.txt with expanded content
  • Add Organization schema on your homepage
  • Add WebSite schema on your homepage
  • Populate sameAs with all your business profiles

Phase 3: Content Structure (Week 3)

  • Add Article + Person schema to every blog post with author credentials
  • Add BreadcrumbList schema to every page
  • Add FAQPage schema to all pages with FAQ sections
  • Add HowTo schema to tutorial and guide pages
  • Add Speakable markup to your top 5-10 articles

Phase 4: Verification (Week 4)

  • Test all schemas with Google Rich Results Test
  • Validate with Schema.org Validator
  • Verify SSR: run curl on your key pages to confirm content is in the raw HTML
  • Check server logs for AI crawler activity
  • Set up GA4 custom channel group for AI referral traffic
  • Run baseline GEO audit across ChatGPT, Perplexity, and Gemini

Frequently Asked Questions

What is Answer Engine Optimization (AEO)?

Answer Engine Optimization (AEO) is the practice of configuring your website's technical infrastructure so AI-powered answer engines like ChatGPT, Perplexity, Google AI Overviews, and Claude can discover, crawl, understand, and cite your content. AEO covers three core areas: access control via robots.txt for AI crawlers, content discovery via llms.txt, and content structure via schema markup and HTML formatting patterns.

Should I block GPTBot in robots.txt?

It depends on your goals. Blocking GPTBot prevents your content from being used to train future OpenAI models, but it does not prevent ChatGPT from citing you. ChatGPT uses ChatGPT-User and OAI-SearchBot for real-time retrieval — those are separate user agents. Most sites benefit from blocking GPTBot (training) while allowing ChatGPT-User and OAI-SearchBot (search visibility). The same logic applies to ClaudeBot vs. Claude-User.

Is llms.txt required for AI search visibility?

No — llms.txt is not required, and many AI platforms don't currently read it directly. However, it is increasingly adopted by RAG-based applications, API-connected AI tools, and enterprise AI systems. Implementing it now is a low-effort investment (a single Markdown file) that positions your site ahead of competitors as adoption grows. Think of it like implementing schema markup in 2015 — early movers gained significant advantages.

Does schema markup directly improve AI citations?

Schema markup helps AI systems understand your content but doesn't guarantee citations on its own. The primary value is accuracy: Article schema tells AI the publication date, author credentials, and topic. FAQPage schema provides structured Q&A pairs that AI can extract directly. Organization schema establishes your entity in knowledge graphs. Combined with well-structured content, schema markup significantly increases citation probability — especially for Google AI Overviews, which use the same index as Google Search.

How often should I update my robots.txt and llms.txt?

Update robots.txt whenever new AI crawlers emerge — roughly quarterly in the current landscape. Update llms.txt whenever you publish significant new content or restructure your site. A good cadence is monthly for llms.txt updates. For the latest AI crawler user agent list, monitor the ai-robots-txt GitHub repository which tracks new crawler additions.

Do local businesses need AEO?

Yes — and local businesses should take a simpler approach. Allow all AI crawlers (including training scrapers) for maximum visibility. Create a basic llms.txt listing your services, service areas, and contact information. Implement LocalBusiness schema with areaServed and service-specific schemas. For local businesses, queries like "find a plumber near me" in AI assistants are increasingly how customers find providers — blocking any AI crawler reduces that visibility.

Putting It All Together

The AEO technical stack is three layers working together: robots.txt controls which AI bots can access your site, llms.txt tells them where to find your best content, and schema markup helps them understand what they find. Without all three, you're leaving AI visibility on the table.

The good news: implementing the complete stack takes 2-4 weeks for most sites. The robots.txt configuration is a one-time setup with quarterly updates. The llms.txt file is a single Markdown file. Schema markup is a one-time per-template implementation. Once these foundations are in place, your content is positioned to earn citations from every major AI platform.

For the content strategy that sits on top of this technical foundation, see our GEO Content Playbook — the companion guide covering answer capsules, statistics formatting, and citation-optimized writing patterns.

Need a quick health check? Start with our free SEO checker to identify technical gaps, generate your robots.txt and schema markup, then run a Neural Audit to measure your current AI search visibility. The full toolkit is free at SerpNap.com/tools.

Want to check your site's SEO? Run a free SEO audit or get a quote.
