AI Search Collapses Jurisdictions — Operators Must Push Back

May 22, 2026 · 9 MIN READ

TL;DR: AI search systems don’t just translate queries — they decide which version of reality gets surfaced based on which corpus is largest. Multilingual regions like Catalonia expose the same retrieval collapse that threatens operators running campaigns across U.S. states, regulated verticals, and sub-national markets. If your entity signals don’t encode jurisdiction explicitly, the model defaults to the dominant corpus — and your ads, content, and leads follow.

Language Identification Is Already Broken Under the AI Layer

Before any operator worries about AI Overviews, there is a more fundamental problem worth understanding. Google’s language-identification infrastructure has misclassified Catalan — a language with roughly 9 million speakers and co-official status in Catalonia — as Occitan, a language with around 200,000 speakers in southern France. This happens from a Barcelona IP, on a Barcelona-anchored query, at a company that has operated in Barcelona for over 20 years. Google’s own Search Liaison account acknowledged the Catalan demotion problem publicly in January 2023, posted its response in Catalan, and committed to fixes. Classical SERPs improved later that year. The language-identification layer underneath was never structurally repaired.

When AI Overviews arrived, they were built on top of that same pipeline. Every misclassification that existed before the synthesis layer now gets amplified: the model produces a synthesized answer and cites sources that were pulled from the wrong corpus to begin with. A Catalan-language query about a 600-year-old local tradition returns citations to state tourism portals and hotel chains instead of the regional government office that has formally administered the tradition for centuries. The model isn’t fabricating — it’s faithfully synthesizing the corpus it retrieved. The corpus was wrong before the model touched it.

This matters outside Catalonia because the failure mode is architectural, not geographic. The same semantic collapse mechanism — retrieval embeddings that can’t separate sub-national signals and default to the dominant corpus — operates in any market where jurisdiction and language diverge. Which, for operators running performance media across multiple U.S. states, is every market they touch.

What the Catalan Experiments Show About Retrieval Behavior

Running paired queries in Catalan and Spanish from the same Barcelona IP across ChatGPT and Google AI Overviews produced four consistent findings. First, vocabulary and source plurality diverged: the Catalan-language query about Catalan independence surfaced the concept of dret a decidir (“the right to decide”) alongside historical institutional references, while the Spanish-language version anchored on the 1978 Constitution and the 2017 referendum’s illegality. Same engine, same geography, two non-overlapping retrieval pools triggered purely by the language of the query.

Second, commercial retrieval shifted and the engine expressed doubt about the minority language. The Catalan version of a “best accountants for freelancers in Barcelona” query was autocorrected to a query about ice cream shops. The Spanish version returned paid ads from Talenom, Declarando, and Horus Firm. The Catalan version returned zero paid ads. The absence of SEM bids is itself a training signal: the model reads the lack of commercial activity as evidence that the language isn’t commercially serious and weights retrieval accordingly. Less bidding, less visibility, less signal — the mechanism teaches itself.

Third, cultural authority was reassigned based on language. A recipe search for calçots — a vegetable that exists only in Catalonia and has no other-language name — returned a suggestion to switch results to Spanish, with no AI Overview generated at all. The system decided that a search for a Catalan-only food product, written in Catalan, from Catalonia, was better answered in a different language.

Fourth, inconsistency was worse than consistent error. The same query returned correct Catalan answers in some sessions and Spanish answers in others, for no surfaced reason. A site owner cannot fix intermittent failures the system doesn’t explain.

The Slop Loop Closing on Minority Corpora

There is a second, slower mechanism compounding this problem. LLMs generate low-quality content in minority languages — through direct translation features and through downstream SEO tools producing automated articles. That content gets indexed, crawled, and fed back into the next training corpus. The model that doesn’t understand Catalan well produces the Catalan content that trains the next model. A 2024 Princeton study found over 5% of newly created English Wikipedia articles showed signs of AI generation. MIT Technology Review reported in September 2025 that volunteers on four African-language Wikipedia editions estimated 40–60% of their articles were uncorrected machine translations. The Greenlandic Wikipedia edition was recommended for closure in 2025 after AI tools had produced content native speakers rated as incomprehensible.

Wikipedia’s response is instructive. In March 2026, the English Wikipedia community voted to prohibit LLM-generated article content across its 7.1 million articles. If a platform with strong volunteer governance and explicit neutrality policies concluded that AI-generated text damages knowledge integrity, operators should not assume that retrieval pipelines downstream of Wikipedia produce better answers than Wikipedia itself was willing to publish.

Sub-national governments in Catalonia and the Basque Country have already started training their own foundation models — the Aina Project and the Latxa models — because standard global LLMs perform measurably worse on those languages than on Spanish. When governments fund their own LLMs to counter retrieval collapse, the underlying mechanism is real and structural, not theoretical.

What This Means for High-CAC Verticals

Every vertical DIGI MIRROR operates in runs into a version of this problem. The Catalonia case is the clearest demonstration because two languages share one geography, making the collapse visible to anyone who switches languages and watches the system reassign authority. But the same dynamics surface wherever jurisdiction and corpus weight diverge.

In legal marketing, California’s CCPA and Texas’s data privacy regime are written in the same language but represent different jurisdictional realities. The privacy literature is heavily California-weighted. When an AI Overview synthesizes a generic “what privacy rights do I have” answer, it defaults toward whichever jurisdiction has more authority signals — typically the larger corpus, not the user’s state. Law firm marketing programs that don’t explicitly encode state-level regulatory identifiers into schema, copy, and knowledge-graph entries will find their pages collapsed into national defaults. A mass tort firm running Texas-specific intake campaigns needs the model to surface Texas — and that requires deterministic hooks, not folder structure alone.

In iGaming, licensing regimes vary dramatically by state. A New Jersey operator and a Michigan operator are operating under genuinely different regulatory environments. iGaming marketing programs that treat “US” as a single retrieval target are building on the same flawed assumption Google’s pipeline makes about Catalan — that the dominant corpus represents the correct answer for every user in the geography.

In forex and crypto, geo-identification drift is already documented. Operators running forex acquisition campaigns across multiple jurisdictions — say, UK FCA-regulated versus offshore — cannot afford the model conflating their regulatory positioning. The same applies to crypto lead generation programs targeting users in distinct regulatory environments. If the training corpus weights one jurisdiction over another, synthesis answers will reflect the dominant corpus, not the user’s actual regulatory context.

In trucking recruitment, state-specific CDL licensing requirements, hours-of-service (HOS) rules, and pay structures vary enough that a generic “CDL jobs in the US” synthesis answer cannot stand in for a Texas-specific or California-specific one. CDL recruitment marketing programs should treat each major operating state as a distinct entity in structured data, not as a variant of a national page.

The Fixes Are Structural, Not Cosmetic

The diagnostic for sub-national retrieval collapse is the same as the diagnostic for international retrieval collapse. Run the same prompt from multiple jurisdictions. Ask the model what the relevant regulation is. Ask it who the authority is. Ask it what the commercial options are. If the answer collapses to the dominant jurisdiction regardless of the stated location, the content has a fragmentation problem inside what looked like a single market.
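The comparison step of that diagnostic can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive method: it assumes the answers have already been collected by hand or with whatever tooling you use, and the function name `detect_collapse`, the marker lists, and the sample answers are all hypothetical. Substring matching is deliberately crude; it only flags answers worth reading more closely.

```python
def detect_collapse(answers, markers, dominant):
    """Flag jurisdictions whose answer names only the dominant jurisdiction.

    answers:  {jurisdiction: answer text collected for that stated location}
    markers:  {jurisdiction: terms that identify it (regulator, statute, place)}
    dominant: the jurisdiction suspected of having the largest corpus
    """
    report = {}
    for jurisdiction, answer in answers.items():
        text = answer.lower()
        mentions_own = any(m.lower() in text for m in markers[jurisdiction])
        mentions_dominant = any(m.lower() in text for m in markers[dominant])
        if jurisdiction == dominant:
            report[jurisdiction] = "baseline"
        elif mentions_own:
            report[jurisdiction] = "ok"
        elif mentions_dominant:
            report[jurisdiction] = "collapsed"
        else:
            report[jurisdiction] = "unclear"
    return report


# Hypothetical answers to "what privacy rights do I have", one per stated location.
answers = {
    "California": "Under the CCPA, California residents can request deletion of personal data.",
    "Texas": "Under the CCPA, California residents can request deletion of personal data.",
}
# Illustrative marker lists; replace with the regulators and statutes that matter to you.
markers = {
    "California": ["CCPA", "California"],
    "Texas": ["TDPSA", "Texas"],
}

report = detect_collapse(answers, markers, dominant="California")
print(report)
```

An answer tagged "collapsed" names the dominant jurisdiction and never the one the prompt stated, which is the pattern described above. "unclear" means neither set of markers appeared and a human should read the answer.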

Start with a full marketing audit that specifically examines how AI Overviews are currently synthesizing answers about your regulated product or service. Screenshot the citation panels. Identify which authorities the model credits for your category. If those authorities are the wrong jurisdiction, you have an entity-signal gap, not a content quality gap.
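One way to turn those citation screenshots into a number is to tally cited domains by the jurisdiction they belong to. The sketch below assumes you have transcribed the cited URLs yourself and maintain your own domain-to-jurisdiction mapping; the function name `citation_share` and every domain shown are hypothetical placeholders, not real sources.

```python
from collections import Counter
from urllib.parse import urlparse


def citation_share(cited_urls, domain_jurisdiction):
    """Return each jurisdiction's share of the citations in one answer panel."""
    counts = Counter()
    for url in cited_urls:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # Domains missing from the mapping are counted separately, not dropped.
        counts[domain_jurisdiction.get(host, "unmapped")] += 1
    total = sum(counts.values())
    return {jurisdiction: n / total for jurisdiction, n in counts.items()}


# Placeholder URLs transcribed from a citation panel for a Texas-targeted query.
cited = [
    "https://www.regulator-ca.example/privacy",
    "https://regulator-ca.example/faq",
    "https://news-ca.example/story",
    "https://regulator-tx.example/rights",
]
# Hand-maintained mapping from domain to jurisdiction (an assumption of this sketch).
mapping = {
    "regulator-ca.example": "California",
    "news-ca.example": "California",
    "regulator-tx.example": "Texas",
}

shares = citation_share(cited, mapping)
print(shares)
```

If most citations on a Texas-targeted query map to another state, that is the entity-signal gap described above, and it can be tracked over time as the structural fixes roll out.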

The structural fixes require encoding jurisdiction explicitly at every layer. Schema.org’s areaServed operates at any geographic granularity; use it down to the state, county, or municipality where it matters. Pair it with explicit copy markers: regulator names, state-specific license numbers, region-specific terminology the model can use as a deterministic hook. Reinforce sub-national grounding through Wikidata’s jurisdiction property (P1001) and explicit language properties. Knowledge graphs feed directly into the entity-context layer AI systems use when deciding what corpus to retrieve from. If your entity is modeled at national granularity, the model has no reason to surface sub-national specificity.
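As an illustration of state-level encoding, the sketch below builds a JSON-LD object in Python. `LegalService`, `State`, `areaServed`, `identifier`, `PropertyValue`, and `sameAs` are existing Schema.org terms; the firm name, URL, regulator label, and license number are hypothetical placeholders, and carrying the license in `identifier` is one possible modeling choice, not an established requirement. The Wikidata URL is intended to point at the item for Texas and should be verified before publishing.

```python
import json


def jurisdiction_jsonld(name, url, state_name, state_wikidata_url,
                        regulator, license_number):
    """Build a Schema.org description scoped to one U.S. state."""
    return {
        "@context": "https://schema.org",
        "@type": "LegalService",
        "name": name,
        "url": url,
        # areaServed pins the entity to a sub-national area, not the country.
        "areaServed": {
            "@type": "State",
            "name": state_name,
            "sameAs": state_wikidata_url,
        },
        # Assumption: the state license is expressed as a generic identifier.
        "identifier": {
            "@type": "PropertyValue",
            "propertyID": regulator,
            "value": license_number,
        },
    }


doc = jurisdiction_jsonld(
    name="Example Injury Law (Texas)",            # placeholder
    url="https://example.com/texas/",             # placeholder
    state_name="Texas",
    state_wikidata_url="https://www.wikidata.org/wiki/Q1439",
    regulator="Example State Regulator",          # placeholder
    license_number="TX-000000",                   # placeholder
)
print(json.dumps(doc, indent=2))
```

The printed JSON would go inside a `script` element of type `application/ld+json` on the state-specific page. Generating one object per state from a single function keeps the regulator name and license number consistent between the markup and the visible copy.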

Watch the secondary signals the same way the Catalan case revealed them. If no one bids on your state-specific terminology in paid search, that absence trains the model to treat your jurisdiction as commercially unserious. Paid media programs that maintain state-specific bidding and ad copy provide a commercial signal the retrieval layer actually reads. Operators who consolidate to national campaigns to reduce complexity are, in effect, voting to be collapsed into the dominant corpus.
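A quick coverage check makes the bidding gap visible. The sketch below assumes a campaign export reduced to a list of dictionaries with `state` and `active` keys; that shape, the function name `bidding_gaps`, and the sample data are hypothetical and do not match any ad platform's export format.

```python
def bidding_gaps(operating_states, campaigns):
    """List operating states that have no active state-specific campaign."""
    covered = {
        c["state"]
        for c in campaigns
        if c.get("state") and c.get("active", True)
    }
    return sorted(set(operating_states) - covered)


# Hypothetical export: a national campaign has no "state" value.
campaigns = [
    {"name": "US brand", "state": None, "active": True},
    {"name": "NJ casino intake", "state": "New Jersey", "active": True},
    {"name": "MI casino intake", "state": "Michigan", "active": False},
]

gaps = bidding_gaps(["New Jersey", "Michigan"], campaigns)
print(gaps)
```

Any state in the result is being represented only by national campaigns, which is the consolidation pattern the paragraph above warns against.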

Finally, apply precision targeting by jurisdiction at the entity level, not just the campaign level. The same audience segmentation logic that separates a Spanish speaker in Mexico from a Spanish speaker in Spain applies to a Texas user versus a California user in any regulated vertical. The model’s retrieval pipeline sees those signals, or it sees nothing and defaults to the largest available corpus.

The Model Can Tell — If You Give It Enough to Work With

The question every operator running multi-jurisdictional campaigns needs to answer is not whether their content is localized. It’s whether the model can tell their jurisdictions apart when the query itself does not spell out which one applies. In Catalonia, the model sometimes gets it right: formal, erudite Catalan queries for cultural topics are correctly identified and answered from Catalan-language sources. The failures concentrate in commercial and popular queries, exactly where the cost of getting it wrong is highest. The same pattern will appear in U.S. regulated verticals: synthesis answers will be correct for brand-name queries and wrong for the commercial intent queries that actually drive intake and account opens.

Multilingual regions are the leading indicator, not the edge case. The architectural flaw they expose — a vector space that can’t reliably separate jurisdiction from meaning, sitting on a language-identification layer that gets things wrong intermittently — is present in every synthesis pipeline. The brands operating well across Spain and Mexico already built the fix for languages. The same techniques are now required for any pair of jurisdictions, in any language combination, where corpus weight and user reality diverge.

Originally reported by Search Engine Land, May 2026.
