Training Data: The Long Game
Session 9.2 · ~5 min read
How Models Learn About Entities
Large language models like GPT-4, Gemini, and Claude are trained on massive datasets: web crawls (Common Crawl), Wikipedia, books, news archives, academic papers, and more. During training, the model processes billions of pages and learns patterns, facts, and associations.
If your entity appears frequently and consistently across these training sources, the model "knows" you. When a user asks about your industry, your company name may surface from the model's parametric memory. If your entity does not appear in training data, the model has no knowledge of you. Real-time retrieval may find your website, but without training data knowledge, the model assigns less confidence to you as an entity.
Training data is the long game. By the time a model is trained, it is too late to add yourself. You must already be present in the sources that will be included in next year's training runs.
Training Data Sources and Their Weight
Not all training data sources carry equal weight. Sources that are curated, high-quality, and frequently included in training datasets have more influence on what the model knows.
| Source | Training Data Likelihood | Entity Signal Strength | Your Control Level |
|---|---|---|---|
| Wikipedia | Very high (explicitly included) | Very strong | Low (must meet notability criteria) |
| Wikidata | Very high (structured facts) | Strong | Moderate (can create entry if eligible) |
| Major news sites | High | Strong | Low (requires press coverage) |
| Government / regulatory databases | High | Strong | Low (must be registered/listed) |
| Industry publications | Moderate | Moderate | Moderate (guest articles, interviews, citations) |
| Common Crawl (general web) | Moderate (filtered) | Varies | High (your own website) |
| Social media | Low to moderate | Weak | High (your own profiles) |
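One of these sources, Common Crawl, is directly inspectable: its public index API lets you check whether your own pages were actually captured in a given crawl. The sketch below assumes a recent crawl ID (`CC-MAIN-2024-33` is a placeholder; check index.commoncrawl.org for current crawls) and uses only the standard library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder crawl ID -- see https://index.commoncrawl.org for current crawls.
CRAWL_ID = "CC-MAIN-2024-33"

def cc_index_url(domain: str, crawl_id: str = CRAWL_ID) -> str:
    """Build a Common Crawl index API query for all captures under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

def count_captures(domain: str) -> int:
    """Count successfully crawled pages from `domain` (0 if absent)."""
    try:
        with urlopen(cc_index_url(domain), timeout=30) as resp:
            # The API streams one JSON record per line; keep only HTTP 200 captures.
            return sum(1 for line in resp if json.loads(line).get("status") == "200")
    except OSError:
        return 0  # index unreachable, or domain not in this crawl (API returns 404)
```

Being present in Common Crawl does not guarantee inclusion in a filtered training set, but absence is a strong negative signal worth fixing first.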
The Wikipedia and Wikidata Effect
Wikipedia is the single most influential training data source for entity recognition in AI models. Every major LLM is trained on Wikipedia. If your entity has a Wikipedia article, the model has processed detailed, structured information about who you are, what you do, and how you relate to other entities.
Wikidata is equally important but in a different way. Wikidata provides structured facts (not prose) that models can process as clean entity relationships: "Organization X, founded in 2005, headquartered in Jakarta, industry: industrial equipment." This structured representation makes entity facts precise and unambiguous in the model's knowledge.
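You can see exactly what structured facts Wikidata exposes about an entity by querying its public SPARQL endpoint. This is a minimal sketch assuming your entity already has a Wikidata QID; the three properties queried (P571 inception, P159 headquarters, P452 industry) are real Wikidata properties, and the user-agent string is a hypothetical label of our own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def entity_facts_query(qid: str) -> str:
    """SPARQL asking for the founding date, HQ, and industry of a Wikidata item."""
    return f"""
    SELECT ?foundedLabel ?hqLabel ?industryLabel WHERE {{
      OPTIONAL {{ wd:{qid} wdt:P571 ?founded. }}   # inception
      OPTIONAL {{ wd:{qid} wdt:P159 ?hq. }}        # headquarters location
      OPTIONAL {{ wd:{qid} wdt:P452 ?industry. }}  # industry
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}"""

def fetch_facts(qid: str) -> list:
    """Run the query against the public Wikidata SPARQL endpoint."""
    url = "https://query.wikidata.org/sparql?" + urlencode(
        {"query": entity_facts_query(qid), "format": "json"})
    req = Request(url, headers={"User-Agent": "entity-audit-sketch/0.1"})
    with urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return data["results"]["bindings"]
```

If the result comes back empty or wrong, that is precisely the ambiguity a model will inherit, so correcting the Wikidata item pays off twice.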
[Diagram: Wikipedia, Wikidata, and your own website feed into the training data, which builds the model's knowledge of your entity and, in turn, shapes AI search responses.]
Press Mentions as Training Data Entries
Major news publications (Reuters, Bloomberg, Kompas, Tempo, industry trade publications) are almost certainly included in training datasets. A single mention in Reuters carries more training-data weight than 100 mentions on small blogs, because Reuters is a high-confidence, curated source that models are explicitly trained on.
This means press coverage has a dual purpose: traditional PR value (brand awareness, credibility) and training data value (AI models learn about you). When planning press strategy, prioritize publications that are likely training data sources over publications that merely have web traffic.
Building a Training Data Presence
You cannot control what is included in a model's training dataset. But you can increase the probability that your entity appears by being present in the right sources.
| Action | Training Data Source | Difficulty | Timeline |
|---|---|---|---|
| Create a Wikidata entry (if eligible) | Wikidata | Moderate | 1 to 2 weeks |
| Publish on industry trade publications | Industry pubs, Common Crawl | Moderate | 1 to 3 months |
| Get mentioned in major news | News archives | Hard | Variable |
| Publish original research that gets cited | Common Crawl via citations | Hard | 3 to 6 months |
| Work toward Wikipedia notability | Wikipedia | Very hard | 6 to 24 months |
The Training Data Cycle
Major AI models are retrained or updated periodically. Exact schedules are not public, but training data cutoffs for leading models have historically advanced every several months to a year. This means actions you take today will typically begin appearing in AI model knowledge within 3 to 12 months, depending on the source and the model's update cycle.
This is why training data is the long game. There is no quick fix. The entities that invest in training data presence now will have a compounding advantage as AI search becomes the dominant interface.
AI visibility has a lag measured in months. The entity that appears in next year's AI training data is the one building its presence in authoritative sources today.
Further Reading
- Answer Engine Optimization: Getting Cited by AI - Frase.io on training data and AI citations
- How to Optimize for ChatGPT, Perplexity, and Gemini - ZipTie.dev on platform-specific optimization
- AI Search Optimization: ChatGPT SEO Strategies - Passionfruit on AI search content strategies
Assignment
Search for your company on sources likely included in AI training data: Wikipedia, Wikidata, 3 major national news sites, 3 industry publications, and government/regulatory databases. Count how many mentions exist. For each mention, note whether the information is accurate and consistent with your master entity data. Identify the highest-value source where you could realistically gain a presence in the next 6 months.
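For the Wikipedia portion of this audit, the MediaWiki search API gives a quick mention count without manual searching. A minimal sketch, assuming only the public `api.php` endpoint; the user-agent string is a hypothetical label of our own, and the search term stands in for your company name.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def wikipedia_search_url(term: str, lang: str = "en") -> str:
    """MediaWiki search API query for articles mentioning `term`."""
    params = urlencode({
        "action": "query", "list": "search", "srsearch": term,
        "format": "json", "srlimit": 10,
    })
    return f"https://{lang}.wikipedia.org/w/api.php?{params}"

def mention_count(term: str, lang: str = "en") -> int:
    """Total number of Wikipedia articles matching the search term."""
    req = Request(wikipedia_search_url(term, lang),
                  headers={"User-Agent": "entity-audit-sketch/0.1"})
    with urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return data["query"]["searchinfo"]["totalhits"]
```

Run it per language edition relevant to your market (e.g. `lang="id"` for Indonesian Wikipedia), then verify each hit by hand for accuracy against your master entity data.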