The Research-Then-Write Pipeline
Session 7.4 · ~5 min read
Most people use AI by saying "write me an article about X." The model then generates from its training data, a compressed, averaged, and potentially outdated representation of everything it saw during training. The result reads like a summary of summaries, because that is essentially what it is.
The research-then-write pipeline flips the order. First, gather sources. Then, feed those sources to the AI as context. The AI writes from your curated sources, not from its internal compression of the internet.
The Two Approaches
```mermaid
flowchart TD
    subgraph "Direct Generation"
        A["Prompt: write about X"] --> B["Model generates from training data"]
        B --> C["Output: averaged, generic, possibly outdated"]
    end
    subgraph "Research-Then-Write"
        D["Research: gather sources about X"] --> E["Feed sources as context"]
        E --> F["Model writes from your sources"]
        F --> G["Output: current, specific, citable"]
    end
    style C fill:#2a2a28,stroke:#c47a5a,color:#ede9e3
    style G fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
```
The difference is not subtle. Direct generation produces content that sounds informed but often is not. Research-then-write produces content that is informed because the information is right there in the context window.
The Pipeline Architecture
A research-then-write pipeline has two distinct stages, each with its own tools, parameters, and quality checks.
| Stage | Input | Process | Output | Quality Check |
|---|---|---|---|---|
| 1. Research | Topic + research questions | Search APIs (Tavily, Google), extract key data | Research brief (sources, data, quotes) | Sufficient coverage? Sources reliable? |
| 2. Write | Research brief + system prompt + outline | LLM API with sources as context | Draft content | Claims match sources? Voice correct? |
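The two stages can be wired together as a pair of plain functions. This is a minimal sketch, not a specific SDK: `search_fn` and `generate_fn` are hypothetical callables standing in for whatever search backend (Tavily, Google) and LLM API you use, and the `ResearchBrief` shape is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchBrief:
    """Stage 1 output: the factual backbone handed to stage 2."""
    topic: str
    findings: list[tuple[str, str]]  # (fact, source URL) pairs
    gaps: list[str] = field(default_factory=list)


def research_stage(topic: str, questions: list[str], search_fn) -> ResearchBrief:
    """Stage 1: answer each research question via a search backend.
    `search_fn(query)` is assumed to return dicts with 'content' and 'url' keys."""
    findings = []
    for question in questions:
        for result in search_fn(question):
            findings.append((result["content"], result["url"]))
    return ResearchBrief(topic=topic, findings=findings)


def write_stage(brief: ResearchBrief, system_prompt: str, generate_fn) -> str:
    """Stage 2: pass the brief as context. `generate_fn(system, user)` wraps
    whatever LLM API you call (a hypothetical adapter, not a real client)."""
    sources = "\n".join(f"- {fact} (source: {url})" for fact, url in brief.findings)
    user_prompt = (
        f"Topic: {brief.topic}\n\nSources:\n{sources}\n\n"
        "Write only from the sources above; flag anything they do not cover."
    )
    return generate_fn(system_prompt, user_prompt)
```

Swapping in a real search client and LLM call is a matter of implementing the two callables; the quality checks in the table above sit between the stages.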
Stage 1: Research
The research stage is not "search for the topic and see what comes up." It is targeted querying based on specific research questions you define before the search begins.
```mermaid
flowchart TD
    B["Define specific research questions"] --> C["For each question: query Tavily or Google Search API"]
    C --> D["Collect results: 10-20 sources"]
    D --> E["Filter: remove unreliable, irrelevant, duplicate sources"]
    E --> F["Extract: pull key data points, quotes, statistics from each"]
    F --> G["Assemble research brief: organized by subtopic with citations"]
    style B fill:#2a2a28,stroke:#c8a882,color:#ede9e3
    style G fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
```
The research questions matter. "What is remote work?" produces generic results. "What percentage of Fortune 500 companies have permanent remote work policies as of 2025?" produces specific, usable data. Write your research questions the way a journalist would: specific enough that the answer is a fact, not an overview.
The quality of your research stage determines the ceiling of your writing stage. No amount of prompt engineering can compensate for thin research. Invest the time upfront.
Stage 2: Write
The writing stage takes the research brief as context and your system prompt as voice constraints, then generates content that synthesizes the gathered information into your format and voice.
The system prompt for this stage includes a critical instruction: write based only on the provided sources. Do not add information from training data unless explicitly instructed. This constraint prevents the model from filling gaps with hallucinated data. If the research brief does not cover a point, the model should either skip it or flag it as needing additional research.
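One way to encode that constraint is to append the grounding instruction to your voice fingerprint when building the system prompt. This is a sketch of the idea, not a proven prompt recipe; the `[NEEDS RESEARCH]` flag is an illustrative convention.

```python
def grounded_system_prompt(voice_notes: str) -> str:
    """Build a system prompt that restricts the model to the supplied sources.
    `voice_notes` is your voice fingerprint from earlier sessions."""
    return (
        f"{voice_notes}\n\n"
        "Write based only on the provided sources. "
        "Do not add information from your training data unless explicitly instructed. "
        "If the sources do not cover a point, skip it or mark it "
        "[NEEDS RESEARCH] instead of filling the gap."
    )
```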
The Research Brief Format
A research brief is a structured document, not a dump of search results. It organizes findings by subtopic, includes source citations for every data point, and separates facts from interpretations.
| Section | Contents | Purpose |
|---|---|---|
| Topic summary | One paragraph overview of the topic | Orients the model |
| Key findings | Bullet points with data, each cited | The factual backbone of the content |
| Source details | Full list of sources with URL, date, credibility notes | Enables citation in the output |
| Gaps identified | Questions the research did not answer | Prevents hallucination to fill gaps |
| Contradictions | Where sources disagree | Alerts the writer (human or AI) to handle nuance |
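A brief in this format can be rendered from a plain dict into the five sections above. The dict keys here are an assumed schema for illustration, not a fixed standard:

```python
def render_brief(brief: dict) -> str:
    """Render a research brief dict into the five sections of the format table."""
    lines = [f"# Research Brief: {brief['topic']}", "", "## Topic summary",
             brief["summary"], "", "## Key findings"]
    for fact, url in brief["findings"]:
        lines.append(f"- {fact} (source: {url})")
    lines += ["", "## Source details"]
    for url, note in brief["sources"].items():
        lines.append(f"- {url}: {note}")
    lines += ["", "## Gaps identified"]
    lines += [f"- {g}" for g in brief.get("gaps", [])] or ["- none"]
    lines += ["", "## Contradictions"]
    lines += [f"- {c}" for c in brief.get("contradictions", [])] or ["- none"]
    return "\n".join(lines)
```

Keeping gaps and contradictions as explicit sections, even when empty, means the writing stage always sees them and never silently invents material to cover them.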
Results Comparison
Content generated through research-then-write is measurably different from direct generation. It contains specific data points instead of vague claims. It cites sources instead of saying "according to experts." It reflects current information instead of training data snapshots. And it is defensible, because every claim can be traced back to a verifiable source.
The trade-off is time. A research-then-write pipeline takes longer per piece than direct generation. The research stage adds 5-15 minutes per article (automated) or 30-60 minutes (manual). For content where accuracy and credibility matter, this investment pays for itself in trust and authority.
Further Reading
- Tavily Search API Reference (Tavily Documentation)
- Grounding with Google Search (Gemini API Documentation)
- Prompt Engineering Overview (Anthropic Documentation)
Assignment
- Build a two-step pipeline for one piece of content. Step 1: use Tavily or Google Search to research the topic, collecting 5-10 sources. Assemble these into a research brief following the format above.
- Step 2: feed the research brief as context to your LLM API, with instructions to write based only on the provided sources. Include your voice fingerprint and formatting requirements in the system prompt.
- Compare this output to a direct generation (same topic, same system prompt, no research brief). Which is more accurate? More specific? More trustworthy? Document the differences.