An AI visibility score is a single number, usually on a 0 to 100 scale, that estimates how often and how prominently AI engines like ChatGPT, Google AI Overviews, and Perplexity mention your brand when people ask about your topic.
It is a useful number and an easy one to misread. There is no official version of it, every tool builds it differently, and the same brand can score 40 in one tool and 70 in another.
What Is an AI Visibility Score?
An AI visibility score estimates how present your brand is inside AI-generated answers. It rolls up, into one 0 to 100 number, how often engines like ChatGPT, Perplexity, Gemini, Claude, and Google AI Overviews name you when someone asks a question in your category.
The shift it tries to capture is real. Traditional SEO asks where you rank on a page of links. AI search rarely shows that page. It returns a written answer, and with those answers pulling clicks away from blue links, the question that matters is whether you are in the answer. A visibility score is an attempt to put a number on that.
Two things about it matter from the start.
First, it is not an official metric. OpenAI, Google, Anthropic, and Perplexity do not publish a visibility score, and none of them endorse one. Google's own documentation says that appearing in its AI features requires no special files or schema, and Google offers no score for doing so. Every number you see is built by a third-party tool that picks its own prompts, engines, and math.
Second, it is not your ranking. You can rank first in Google for a term and still be absent from the AI answer to the same question, because the engine pulled from sources that explained the topic more plainly. Visibility measures presence in the answer, not position on a page. Watching that presence over time is the job of ongoing AI visibility tracking; the score is the snapshot it produces.
How an AI Visibility Score Is Calculated
Under the hood, almost every tool follows the same four steps, even when the marketing language differs.
- Build a prompt set. Pick the questions a real buyer would ask in your category, from broad ("best project management software") to specific ("project management tool for agencies").
- Run them across engines, more than once. Send each prompt to ChatGPT, Perplexity, Gemini, and Google AI Overviews, and repeat the runs, because the answers move.
- Score the signals. Record whether you were mentioned, whether you were cited, where you appeared, and how you were described.
- Normalize to a 0 to 100 scale. Weight those signals and compress them into one comparable number.
The signals themselves are where tools agree. Across the vendors that publish their methodology, including Semrush and Amplitude, the same handful shows up.
The Signals That Go Into the Score
| Signal | What it measures |
|---|---|
| Mention rate | How often you appear at all across the prompt set |
| Citation rate | How often the answer links or attributes a claim to your site |
| Position | Whether you show up first, in the middle, or last in the answer |
| Coverage / consistency | Whether you appear across many engines and repeated runs, not just once |
| Sentiment | Whether the model describes you positively, neutrally, or negatively |
| Competitor share | Your slice of the mentions versus named rivals, also called share of voice |
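As a minimal sketch of how the first few signals in the table might be computed from logged runs: the `Run` record shape, the brand names, and the example domains below are all hypothetical, not any vendor's schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Run:
    """One engine's answer to one prompt (hypothetical record shape)."""
    engine: str
    prompt: str
    brands_mentioned: list = field(default_factory=list)
    domains_cited: list = field(default_factory=list)


def mention_rate(runs, brand):
    """Fraction of answers that name the brand at all."""
    return sum(brand in r.brands_mentioned for r in runs) / len(runs)


def citation_rate(runs, domain):
    """Fraction of answers that cite or link the domain."""
    return sum(domain in r.domains_cited for r in runs) / len(runs)


def share_of_voice(runs, brand):
    """The brand's slice of all brand mentions, counted once per answer."""
    counts = Counter(b for r in runs for b in set(r.brands_mentioned))
    total = sum(counts.values())
    return counts[brand] / total if total else 0.0


# Made-up sample data for illustration only.
runs = [
    Run("chatgpt", "best project management software", ["Acme", "Rival"], ["acme.example"]),
    Run("perplexity", "best project management software", ["Rival"], ["rival.example"]),
    Run("gemini", "project management tool for agencies", ["Acme"], []),
    Run("chatgpt", "project management tool for agencies", ["Acme", "Rival"], ["rival.example"]),
]

print(mention_rate(runs, "Acme"))
print(citation_rate(runs, "acme.example"))
print(share_of_voice(runs, "Acme"))
```

In this toy data the brand is mentioned in three of four answers but cited in only one, which is the kind of gap that mention rate and citation rate are meant to separate.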
A Simplified Example
Most tools weight those signals and add them up. One common shape: 40% mention rate, 25% citation rate, 15% position, 10% sentiment, 10% cross-engine consistency. Run 100 prompts, get mentioned in 70, cited in 40, recommended first in 25, with positive sentiment across most, and you might land near 60 out of 100.
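The arithmetic in that example can be sketched as a weighted sum. The weights and the mention, citation, and position counts come from the paragraph above; the 0.80 values for sentiment and cross-engine consistency are assumptions standing in for "positive sentiment across most", since the example gives no exact figures for them.

```python
# Weights from the example above; each signal is expressed as a 0-1 fraction.
WEIGHTS = {
    "mention": 0.40,
    "citation": 0.25,
    "position": 0.15,
    "sentiment": 0.10,
    "consistency": 0.10,
}


def visibility_score(signals, weights=WEIGHTS):
    """Weighted sum of 0-1 signals, scaled to 0-100."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 100 * sum(weights[name] * signals[name] for name in weights)


signals = {
    "mention": 70 / 100,    # mentioned in 70 of 100 prompts
    "citation": 40 / 100,   # cited in 40 of 100
    "position": 25 / 100,   # recommended first in 25 of 100
    "sentiment": 0.80,      # assumed value, not from the example
    "consistency": 0.80,    # assumed value, not from the example
}

score = visibility_score(signals)
print(round(score, 2))
```

Under those assumptions the result comes out just under 58, in line with "near 60 out of 100". Changing either assumed value moves the total by at most a few points, because each carries only a 10% weight.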
The exact weights are the part no two vendors share. That design choice alone is enough to make the same inputs produce different numbers depending on who runs the math.
What Counts as a Good AI Visibility Score?
The honest answer is that there is no universal benchmark, and any tool that quotes you one is overselling its own number.
A score only means something relative to the prompt set and engines behind it. A 60 built from 30 broad questions is not the same as a 60 built from 200 narrow, high-intent ones. Because every tool picks its own inputs, no cross-tool "good" threshold can exist.
Tools do publish ranges, and they are worth knowing as rough orientation. Most treat 0 to 30 as very low visibility, 30 to 50 as limited, 50 to 70 as an emerging presence, 70 to 85 as strong, and 85 and up as category leadership. In a competitive niche, even a 40 to 50 can mean you are showing up regularly.
Read those as orientation, not goals. What actually tells you something is your own trend and your share against named competitors. A score that climbs from 35 to 50 over a quarter while your closest rival slips is a real signal. The same 50, read in isolation with no history and no competitor column, tells you almost nothing.
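One hedged way to turn "read your own trend" into a rule: report a direction only when several consecutive readings move the same way and the total move is larger than a noise band. The function name, the two-move requirement, and the 3-point band are illustrative assumptions, not a standard.

```python
def trend(readings, min_moves=2, noise_band=3.0):
    """Return 'up', 'down', or 'flat' for a series of scores, oldest first.

    A direction is reported only if the last `min_moves` changes all share
    a sign and the total change exceeds `noise_band` points.
    """
    if len(readings) < min_moves + 1:
        return "flat"  # not enough history to call a direction
    recent = readings[-(min_moves + 1):]
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    if abs(recent[-1] - recent[0]) <= noise_band:
        return "flat"
    if all(d > 0 for d in deltas):
        return "up"
    if all(d < 0 for d in deltas):
        return "down"
    return "flat"


print(trend([35, 41, 50]))  # steady climb over a quarter
print(trend([73, 76, 74]))  # small wobble
print(trend([50]))          # a single reading with no history
```

The same function can be run on a competitor's series, so a rising score can be read against a rival that is flat or slipping.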
Why Two Tools Give You Two Different Scores
Run the same brand through two AI visibility tools and you will often get two very different numbers. That is not a bug in either tool. It is the direct result of four choices each one makes on its own.
The prompt set. This is the biggest one. One tool tests 30 prompts, another tests 300, and they rarely overlap. Your brand might dominate the questions one tool happens to ask and be invisible in the other's. Same brand, different exam.
The engine mix. One score blends ChatGPT, Perplexity, and Gemini evenly. Another leans on ChatGPT because it has the most users. If you are strong in Perplexity and weak in ChatGPT, those two choices alone push your score in opposite directions.
The weighting. Vendors split the signals differently. A tool that rewards raw mentions flatters a brand that gets name-dropped often; one that rewards citations favors a brand whose pages actually get linked. Same data, different verdict.
The sampling. Some tools run each prompt once. Others run it repeatedly and average. Because AI answers shift between runs, a one-shot tool and an averaging tool can disagree on the same day.
The rule that follows: never compare a score from one tool against a score from another. A 72 in one product is not "better" than a 65 in a different one. Pick a single tool, learn how it builds its number, and compare it only to itself over time.
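The weighting effect is easy to reproduce. The sketch below scores two hypothetical brands under two made-up weightings, using only mention rate and citation rate to keep it small; none of the numbers describe a real tool.

```python
def score(signals, weights):
    """Weighted sum of 0-1 signals, scaled to 0-100."""
    return 100 * sum(weights[name] * signals[name] for name in weights)


# Two hypothetical vendors that weight the same two signals differently.
MENTION_HEAVY = {"mention": 0.7, "citation": 0.3}
CITATION_HEAVY = {"mention": 0.3, "citation": 0.7}

# Two hypothetical brands: one name-dropped often, one linked often.
name_dropped = {"mention": 0.80, "citation": 0.20}
well_linked = {"mention": 0.50, "citation": 0.60}

for label, weights in [("mention-heavy", MENTION_HEAVY), ("citation-heavy", CITATION_HEAVY)]:
    a = score(name_dropped, weights)
    b = score(well_linked, weights)
    leader = "name_dropped" if a > b else "well_linked"
    print(f"{label}: name_dropped={a:.0f} well_linked={b:.0f} leader={leader}")
```

The underlying data never changes, yet the leader flips between the two weightings, which is why a number from one tool says nothing about a number from another.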
The Catch: AI Answers Are Probabilistic
There is one property of AI search that every score has to wrestle with. The same question does not always get the same answer.
Ask ChatGPT one prompt ten times and your brand might appear in seven responses, then four, then nine. That is not measurement error; it is how the models work. Research on large language models found accuracy swings of up to 15% across repeated runs even at deterministic settings, so a tool that samples a prompt once is reading noise as if it were signal. The instability shows up at the brand level too: AirOps's 2026 State of AI Search report found only about 30% of brands stay visible from one answer to the next, and just 20% hold across five consecutive runs.
This is why a credible score samples many prompts, across multiple engines, repeated over time. A recent paper on measuring AI search visibility, bluntly titled "Don't Measure Once," argues that visibility should be treated as a distribution rather than a single reading, because one-off observations are unreliable.
That includes us: geotoolbox's own score is built from a single GEO Scan pass across up to seven engines, one sample rather than a census, which is why we tell users to read the trend across scheduled scans instead of any single reading.
The takeaway for your own score: the last few points are noise. Whether you sit at 73 or 76 this week does not matter. Read the number as a direction, not a precise measurement.
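To put a rough size on that noise, a mention rate can be reported with an interval instead of a point value. The sketch below uses the Wilson score interval, a standard choice for proportions from small samples. It is an illustration of the idea, not the method any particular tool or the cited paper uses, and it treats runs as independent, which real answers may not be.

```python
import math


def wilson_interval(hits, runs, z=1.96):
    """Approximate 95% Wilson score interval for a mention rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return center - half, center + half


low_10, high_10 = wilson_interval(7, 10)       # 7 mentions in 10 runs
low_100, high_100 = wilson_interval(70, 100)   # same rate, ten times the runs

print(f"10 runs:  {low_10:.2f} to {high_10:.2f}")
print(f"100 runs: {low_100:.2f} to {high_100:.2f}")
```

With ten runs, seven mentions is compatible with a true rate anywhere from roughly 40% to 89%. With a hundred runs at the same rate, the range narrows to roughly 60% to 78%.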
Vanity Score vs Credible Score
Two tools can both hand you a confident 74, and only one earned it. The difference is method.
Here is the test to run on any AI visibility score before you trust it.
| What to check | Vanity score | Credible score |
|---|---|---|
| Prompt set | Hidden, or a handful of prompts | Disclosed, sized to your category, real buyer questions |
| Engines | One engine, usually ChatGPT | Multiple engines, weighted to where your audience asks |
| Sampling | Each prompt run once | Prompts repeated and averaged over time |
| Grounding | Assumes the AI could read you | Confirms AI crawlers can actually fetch and render your pages |
| Output | A single number | A number plus per-engine breakdown, competitor share, and a trend |
A score that matches the vanity column is not useless, but it is fragile. It moves when the tool quietly changes its prompts, and it cannot tell you why you rose or fell. A score that matches the credible column survives scrutiny: you can see which prompts you lost, on which engine, against which competitor, and decide what to fix.
One row gets skipped more than any other, and it quietly caps every other signal. That row is grounding.
Reachability: The Signal Most Scores Skip
Every signal in that table assumes one thing: that the AI engine could actually read your page in the first place. If it could not, the score is measuring a ceiling you set yourself.
This is the part most visibility scores ignore. They test the answer side, whether you got mentioned, and never check the supply side, whether the model could fetch and render your content at all. Those are different failures with the same symptom of a low number.
Three things commonly block the supply side. Your robots.txt may disallow an AI crawler, often by accident in a blanket rule, and each AI crawler is its own user agent; OpenAI's documentation confirms GPTBot and OAI-SearchBot are controlled through robots.txt. A firewall or bot filter may return a block to the crawler while serving a browser normally. Or your content may render only after JavaScript the crawler does not run, so the bot indexes an empty shell.
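The first of those three blocks can be checked with Python's standard-library `urllib.robotparser`. The sketch below parses an inline robots.txt that contains an accidental blanket rule. The file contents and the example URL are made up. It covers only robots.txt rules and does not detect a firewall block or a JavaScript render gap.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one search crawler is allowed, everything else is
# blocked by a blanket rule, which also catches the AI crawlers.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# The two OpenAI user agents named above.
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot"]


def crawler_access(robots_txt, url, agents):
    """Map each user agent to whether robots.txt lets it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(ROBOTS_TXT, "https://example.com/pricing", AI_CRAWLERS + ["Googlebot"])
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Here both AI crawlers come back blocked while Googlebot is allowed, the pattern a blanket `Disallow` tends to produce. To check a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead of the inline string.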
In the scans geotoolbox runs, that trio is the most common cause of a zero-visibility reading, well ahead of weak content. None of it shows up in a score that only counts mentions. A brand can spend months "improving its content" while a single firewall rule keeps every AI engine from seeing the page.
A trustworthy score has to be grounded in reachability: confirm the engines can fetch and render you before you read anything into how often they cite you.
How to Read and Use Your Score
Once you trust how a score is built, using it is straightforward.
Pull the per-engine breakdown first, because a healthy blended number can hide a real gap, strong on ChatGPT and absent on Perplexity, that only the split reveals. Track your share against named competitors, since visibility in AI answers is a near zero-sum fight for a few citation slots.
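A small sketch of why the split matters: the per-engine rates below are invented, and the "less than half the blended rate" cutoff is an arbitrary assumption chosen for illustration.

```python
from statistics import mean


def engine_gaps(rates, ratio=0.5):
    """Return the blended rate and the engines falling below ratio * blended."""
    blended = mean(rates.values())
    gaps = sorted(engine for engine, rate in rates.items() if rate < ratio * blended)
    return blended, gaps


# Hypothetical mention rates per engine.
rates = {"chatgpt": 0.85, "gemini": 0.70, "perplexity": 0.05}

blended, gaps = engine_gaps(rates)
print(f"blended mention rate: {blended:.2f}")
print(f"engines to investigate: {gaps}")
```

The blended rate here is a respectable-looking 53%, yet one engine almost never mentions the brand, and only the breakdown shows it.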
Then connect the number to action. A score is a thermometer, not a treatment. When it tells you something is wrong, a one-time AI visibility audit finds the specific cause, the on-page and reachability fixes live in our AI search playbook, and an AI rank tracker confirms whether the fix moved the number. Score to spot the problem, audit to diagnose it, track to prove the fix worked.
Frequently Asked Questions
What is a good AI visibility score? Treat any threshold a vendor quotes as marketing until you have seen their prompt list. The practical test before trusting a "good": ask for the prompts behind the number. If the vendor will not show them, the benchmark is unfalsifiable and belongs in a pitch deck, not a report.
How is an AI visibility score calculated? Prompts run across engines, signals weighted, normalized to 0-100; the weights are the vendor's secret sauce. The practical consequence: if you ever switch tools, export your prompt set and re-baseline, because the score will jump for methodology reasons alone. Annotate the switch in any report a client will see.
Why does my AI visibility score change every time I check it? Single-run swings of a few points are baseline noise, not a trend. Before acting on a drop, wait for two or three consecutive readings moving the same direction; that is the cheapest way to separate model randomness from something you actually broke or fixed.
Why do two tools give my brand different AI visibility scores? Different prompts, engines, weights, and sampling. The actionable version: pick the tool whose prompt set you can see and edit, then stay with it, because switching resets the only comparison that works, your own history.
Is an AI visibility score a vanity metric? A quick pressure test: trigger the same report twice in one week. If the number comes back identical to the decimal, the tool is not resampling a probabilistic system, and the precision is cosmetic. Real sampling shows wobble; honest tools show it to you.
Is an AI visibility score the same as my Google ranking? No, and the correlation is looser than most expect. Ranking helps because engines lean on search indexes to pick sources, but a page-two site with quotable, well-sourced answers can out-appear a page-one site that buries its answer under preamble.
Where to Start
A score is only worth as much as the method behind it, so start with the part most tools skip: whether AI engines can reach your pages at all. geotoolbox's free AI-Readiness Score flags a crawler block in seconds, and the paid AI search checker fetches your page as the major AI crawlers, flags a render gap, and grades how citable the page is. Run that first, fix what it surfaces, then track your visibility over time so the number you watch is one you can trust.
Sources
- Don't Measure Once: Measuring Visibility in AI Search (GEO) - Schulte, Bleeker, Kaufmann, arXiv, 2026
- Non-Determinism of "Deterministic" LLM Settings - Atil et al., arXiv, 2024
- AI features and your website - Google Search Central
- Overview of OpenAI Crawlers - OpenAI developer documentation
- Google's AI Overviews are hurting clicks: Pew study - Pew Research Center, reported by Search Engine Land, 2025