Earlier this year, a friend of mine — also an AI engineer — mentioned that when he asked ChatGPT to name AI Solution Architects in India, his profile came up but mine didn't. We had comparable LinkedIn profiles, similar GitHub activity, and both had a personal site. So I sat down and did something I should have done much earlier: a proper GEO audit of my own site.

What I found surprised me. Not because the issues were exotic, but because they were obvious in retrospect — the kind of mistakes you make when you're optimising for Google's 2018 crawler instead of the LLM-based answer engines that now handle a significant share of information queries. This post documents the six real problems I found, the exact fixes I applied, and ends with a reusable checklist you can run on your own portfolio or consulting site today.

What is GEO and why should you care in 2026?

Generative Engine Optimisation (GEO) is the practice of structuring content, schema markup, and entity signals so that AI-powered answer engines — ChatGPT, Perplexity, Google AI Overviews, Claude — are more likely to surface and cite your site when answering relevant queries.

Traditional SEO gets you ranked in the blue-link results below the fold. GEO gets you cited in the AI-generated summary at the top of the page — or mentioned when someone asks an AI assistant a direct question. The traffic mechanics are different: instead of click-through rates on ranked results, you compete for inclusion in a synthesised answer that may not require the user to click anything at all.

A 2023 research paper by Aggarwal et al. — "GEO: Generative Engine Optimization" — found that adding statistics and cited sources to a page boosted its visibility in LLM-generated answers by up to 40% compared to equivalent prose with no quantified claims.

For personal brands — AI engineers, consultants, speakers — GEO matters because it determines whether an AI assistant recommends you by name when someone asks "Who are the top AI Solution Architects in India?" or "Who should I hire to build a GenAI system for my enterprise?". Those are high-intent queries with real commercial value.

My audit methodology

I ran a structured checklist across three layers:

  • Crawlability — Can AI bots access and fully read the page? robots.txt, rendering, JS-dependency.
  • Entity signals — Does the page unambiguously identify the subject as a known real-world entity? Schema, sameAs, knowledge graph links.
  • Citation signals — Does the content give AI answer engines something worth quoting? Stats with sources, Q&A patterns, authoritative references, E-E-A-T.

Tools used: Google Rich Results Test, Schema.org validator, manual View Source (for JS rendering check), and direct prompting of ChatGPT, Perplexity, and Claude to ask about me by name and specialty.

Note on AI prompt testing: Querying an LLM about yourself is a noisy signal — model knowledge cutoffs, training data sampling, and recency bias all affect results. But the direction of change after fixes was consistent across all three models I tested.
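The crawlability layer of this checklist is easy to automate. A minimal sketch in Python that fetches a page the way a non-rendering crawler sees it — raw HTML, no JavaScript — and reports which identity strings are missing (the URL and expected strings in the usage comment are placeholders):

```python
import urllib.request

def missing_identity_strings(html: str, expected: list[str]) -> list[str]:
    """Return the expected identity strings absent from the raw (pre-JS) HTML."""
    return [s for s in expected if s.lower() not in html.lower()]

def audit_page(url: str, expected: list[str]) -> list[str]:
    # Fetch raw HTML only, the way a crawler that skips JS execution reads it.
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return missing_identity_strings(html, expected)

# Hypothetical usage:
# audit_page("https://yourdomain.com", ["Neil Dave", "AI Solution Architect"])
```

If the function returns anything, that string is invisible to every crawler that does not render JavaScript.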
1. JS-only hero text — invisible to AI crawlers (High severity)

My hero section used a typing-animation library to cycle through my titles: "AI Solution Architect / Generative AI Engineer / ML Research Engineer." It looked great in a browser. To a crawler — especially one that doesn't execute JavaScript — the <h1> tag was effectively empty.

Here's what View Source returned for my hero heading before the fix:

<h1 class="typed-heading"><span id="typed-text"></span></h1>

An empty span. The single most important identity statement on the entire site — name, title, value proposition — was invisible to every AI crawler that doesn't execute JavaScript.

The fix: I added a visually-hidden static element inside the hero that duplicates what the animation displays. The animation still runs for human visitors; the static text is indexed by bots.

<span class="visually-hidden">
  Neil Dave — AI Solution Architect, Generative AI Engineer,
  and Machine Learning Research Engineer based in Bangalore, India.
</span>

Bootstrap's .visually-hidden class (formerly .sr-only) renders the element off-screen but keeps it in the DOM and in the accessibility tree — which means crawlers pick it up without showing duplicate text to users.
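You can catch this class of problem automatically by parsing the raw HTML and flagging headings that are empty before JavaScript runs. A small sketch using Python's standard-library HTML parser; checking only h1 and h2 is an assumption — adjust the tag set to your markup:

```python
from html.parser import HTMLParser

class EmptyHeadingFinder(HTMLParser):
    """Count <h1>/<h2> elements whose static HTML contains no text."""

    def __init__(self):
        super().__init__()
        self.in_heading = 0      # nesting depth inside a heading
        self.has_text = False    # did the current heading yield any text?
        self.empty_headings = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.in_heading += 1
            self.has_text = False

    def handle_data(self, data):
        if self.in_heading and data.strip():
            self.has_text = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2") and self.in_heading:
            self.in_heading -= 1
            if not self.has_text:
                self.empty_headings += 1
```

Feed it your homepage's View Source output; a non-zero count means a crawler without JS execution sees a blank heading.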

2. No entity disambiguation — missing Wikidata (High severity)

My Person schema had sameAs links to LinkedIn and GitHub. Those are useful, but they are not entity-disambiguation links — they are social profiles. What was missing was a pointer to a structured knowledge graph entry that proves to an LLM: "this is a verified, real-world named entity, not just a string of text."

Wikidata is the canonical open knowledge graph used by Google, Wikipedia, and — through training data — by the major LLMs. If you have a Wikidata entry (even a minimal one), linking to it from your Person schema is the single highest-leverage GEO action you can take on a personal site.

Before:

"sameAs": [
  "https://www.linkedin.com/in/neil-dave/",
  "https://github.com/theneildave/"
]

After:

"sameAs": [
  "https://www.linkedin.com/in/neil-dave/",
  "https://github.com/theneildave/",
  "https://www.wikidata.org/wiki/Q138716797",
  "https://scholar.google.com/citations?user=93Ybux4AAAAJ&hl=en",
  "https://twitter.com/theneildave"
]

The Wikidata link is the critical addition. Google uses Wikidata entity IDs extensively in its Knowledge Graph, and several AI training pipelines ingest Wikidata as a ground-truth entity reference. Don't have a Wikidata entry? Create a minimal one — it requires a verifiable claim (IEEE publication, media mention, employer record) and takes about 20 minutes.
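Once the block is in place, it's worth verifying it programmatically rather than by eye. A minimal sketch that checks a Person JSON-LD string for a Wikidata entity link in sameAs (the inline schema in the test is illustrative):

```python
import json

def has_wikidata_link(person_jsonld: str) -> bool:
    """Check whether a Person schema block carries a Wikidata entity link."""
    data = json.loads(person_jsonld)
    same_as = data.get("sameAs", [])
    # Wikidata entity pages all live under /wiki/Q<id>.
    return any("wikidata.org/wiki/Q" in url for url in same_as)
```

Run it against the JSON-LD you actually serve, not the file in your repo — build steps sometimes strip or reorder script blocks.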

3. Zero extractable Q&A — no FAQPage schema (High severity)

AI answer engines work by extracting a relevant passage from your content, synthesising it, and citing the source. The easiest content format for an LLM to extract from is a structured question-answer pair, because it maps directly to the question the user asked.

My site had none. The About section was good prose, but it was not structured around the questions someone would actually ask an AI. I had no FAQPage schema anywhere.

The fix was two-part: first, I added a FAQPage JSON-LD block to my index page and to every blog post. Second, I rewrote one section of the About copy to explicitly answer the top three questions I wanted to rank for:

  • "Who is Neil Dave?"
  • "What does an AI Solution Architect do?"
  • "What AI projects has Neil Dave worked on?"

A minimal FAQPage block looks like this:

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What does Neil Dave specialise in?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Neil Dave is an AI Solution Architect specialising in Generative AI, LLM Engineering, and Computer Vision systems. He has 7+ years of experience deploying ML systems in healthcare, sports analytics, and enterprise automation."
      }
    }
  ]
}

The acceptedAnswer.text field is what gets pulled verbatim into AI citations. Write it like a Wikipedia lead paragraph — complete sentence, third person, specific and verifiable.
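A useful self-check is to extract exactly those acceptedAnswer.text fields and read them in isolation — that is the material an answer engine can quote. A sketch, assuming the block is well-formed JSON-LD:

```python
import json

def extract_faq_answers(faq_jsonld: str) -> dict:
    """Map each FAQPage question to the answer text an engine could quote."""
    data = json.loads(faq_jsonld)
    return {
        q["name"]: q["acceptedAnswer"]["text"]
        for q in data.get("mainEntity", [])
        if q.get("@type") == "Question"
    }
```

If an extracted answer doesn't stand alone as a complete, third-person statement, rewrite it until it does.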

4. Thin citation density — no quantified claims with sources (Medium severity)

The Aggarwal et al. GEO paper I referenced earlier identified a consistent pattern: AI systems preferentially cite pages that contain statistics with attributed sources. The intuition is straightforward — if you can point to a number from McKinsey, WEF, or an IEEE paper, your content is more likely to be treated as authoritative than equivalent prose without sources.

My About section made claims like "AI is transforming industries" and "demand for AI skills is growing" — true, but unverifiable from my text alone. Zero citations. Zero numbers.

The fix was concrete replacements:

Before: "AI is transforming every industry."
After: "McKinsey estimates generative AI could add $2.6–4.4 trillion annually to the global economy (McKinsey Global Institute, 2023)."

Before: "Demand for AI talent is growing fast."
After: "The World Economic Forum projects 97 million new roles emerging by 2025 as work shifts between humans, machines, and algorithms (WEF Future of Jobs Report, 2020)."

Before: "I have published AI research."
After: "Google Scholar-indexed publications in computer vision and ML, cited in peer-reviewed venues."

None of these changes required fabricating anything — the numbers are real and the sources are public. The discipline is in making the connection explicit rather than leaving it implicit.

5. Generic page title — no job-title signal (Medium severity)

My <title> tag read: "Neil Dave - AI Leadership & Solution Architect". The problem is the conjunction. "AI Leadership" is not a job title anyone searches for. More critically, the title was inconsistent with my Person schema jobTitle field, which said "AI Solution Architect."

LLMs extract job titles through a combination of schema fields and surface-level text patterns. When the <title>, <h1>, schema jobTitle, and About body copy all say the same thing, the signal is reinforced. When they conflict, the model has to guess.

Rule of thumb: Your <title>, Person.jobTitle, first <h1>, and the first sentence of your About paragraph should all contain the identical canonical job title string. Any variation is noise for an LLM trying to extract your role.
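That rule of thumb is easy to enforce mechanically. A small sketch that checks the canonical title string against each surface; the surface names in the usage comment are just labels for illustration:

```python
def title_inconsistencies(canonical: str, surfaces: dict) -> list[str]:
    """Return the names of surfaces missing the canonical job title string."""
    return [name for name, text in surfaces.items()
            if canonical.lower() not in text.lower()]

# Hypothetical usage:
# title_inconsistencies("AI Solution Architect", {
#     "title_tag": "Neil Dave — AI Solution Architect | Generative AI Engineer",
#     "h1": "Neil Dave",                       # missing the title → flagged
#     "schema_jobTitle": "AI Solution Architect",
# })
```

Any surface the function returns is a place where an LLM has to guess your role instead of reading it.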

The fix:

<!-- Before -->
<title>Neil Dave - AI Leadership & Solution Architect</title>

<!-- After -->
<title>Neil Dave — AI Solution Architect | Generative AI Engineer</title>

I also audited every page for title consistency: speaking.html, enterprise.html, collaborate.html, and all blog posts. Each one now leads with "Neil Dave — AI Solution Architect" before the secondary descriptor.

6. robots.txt silently blocking AI crawlers (High severity)

This one is easy to overlook because the default robots.txt for most site generators and templates only explicitly allows or disallows Googlebot. The AI crawlers — GPTBot, PerplexityBot, ClaudeBot, anthropic-ai, Google-Extended — are not Googlebot. Under a permissive default they can crawl. But some templates ship with restrictive defaults, and many site owners add Disallow: / rules without realising they apply to all bots.

My audit found my robots.txt was fine, but when I cross-checked with friends who had similar sites, two of them had unknowingly blocked all non-Google bots. Their sites were effectively invisible to every AI answer engine.

The correct robots.txt for a personal brand site that wants AI citation:

User-agent: *
Allow: /

# Explicitly allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Explicit Allow rules are belt-and-suspenders. They override any ambiguity from wildcard rules and signal — to the crawlers that respect robots.txt meta-information — that you actively welcome AI indexing.
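You can verify the final file with Python's standard-library robots.txt parser, which answers the same question a compliant crawler would ask. A quick sketch:

```python
from urllib.robotparser import RobotFileParser

def agents_blocked(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the user agents this robots.txt would block for the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [a for a in agents if not parser.can_fetch(a, url)]

AI_AGENTS = ["GPTBot", "ChatGPT-User", "PerplexityBot",
             "ClaudeBot", "anthropic-ai", "Google-Extended"]
```

Test it against your live robots.txt, not just the file in your repo — hosting platforms sometimes inject their own rules at deploy time.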

Before vs. after: schema comparison

Here's the condensed delta for my index.html Person schema — the highest-leverage single schema block on a personal site:

  • @type: Person → Person (unchanged)
  • jobTitle: "AI Solution Architect" → "AI Solution Architect", now consistent across all pages
  • sameAs: LinkedIn, GitHub → LinkedIn, GitHub, Wikidata Q138716797, Google Scholar, Twitter
  • knowsAbout: not present → ["Generative AI", "LLM Engineering", "MLOps", "Computer Vision", "NLP"]
  • hasOccupation: not present → Occupation with name, occupationLocation, skills
  • alumniOf: not present → university name + degree
  • FAQPage: not present → 4 question–answer pairs targeting top persona queries
  • ProfessionalService: not present → present, with serviceType array and areaServed

The cumulative effect of these changes is that an AI crawler now has a dense, internally consistent, and externally anchored (Wikidata) description of who I am and what I do — without having to infer anything from prose.

The GEO checklist for your own site

Below is the audit checklist I now run on every page I publish. Copy it, adapt the schema examples, and work through it top to bottom before you go live.

Layer 1 — Crawlability

  • robots.txt explicitly allows GPTBot, PerplexityBot, ClaudeBot, anthropic-ai, Google-Extended
  • sitemap.xml is present and submitted to Google Search Console
  • Hero heading text is in static HTML (not JS-only) — use .visually-hidden if you need animation
  • View Source on your homepage shows your name, title, and a sentence about what you do in plain text

Layer 2 — Entity signals

  • Person schema is present on homepage with name, jobTitle, url, image, sameAs
  • sameAs includes a Wikidata entity URL (create one if you don't have it)
  • jobTitle in schema exactly matches the title in <title>, <h1>, and the first sentence of the About paragraph
  • knowsAbout array lists your actual skill domains as discrete strings
  • Google Scholar, ORCID, or ResearchGate link in sameAs (even one publication counts)
  • ProfessionalService schema on homepage with serviceType, areaServed, provider

Layer 3 — Citation signals

  • FAQPage schema with at least 3 Q&A pairs targeting persona-level questions
  • About section contains at least 2 statistics with attributed sources (McKinsey, WEF, IEEE, etc.)
  • Blog posts each have BlogPosting schema with wordCount, timeRequired, keywords
  • Any blog post making a factual claim links to a primary source
  • Each page has a BreadcrumbList schema (minimal but universal)
  • "Last updated: Month YYYY" is visible in About or footer (recency signal)

Final thoughts

The irony of this whole exercise is that none of the fixes were technically complex. There was no code refactor, no new API, no infrastructure change. It was mostly a discipline problem: consistently applying structured markup, keeping identity signals aligned across every page, and writing content the way an LLM would want to extract it rather than the way a 2018 content marketer would write it.

GEO for personal brands is still early. Most portfolio sites I've looked at have at least four of the six issues I described above. That means the competitive moat is not that high right now — getting the basics right puts you ahead of the majority. In twelve months, when everyone has caught up, the bar will be higher. The time to fix this is before that happens.

If you run this audit on your own site and find something interesting, I'd like to hear about it. Drop me a message on LinkedIn or reply to the newsletter.