The Research-Then-Write Pipeline
Session 7.4 · ~5 min read
Most people use AI by saying "write me an article about X." The model then generates from its training data, a compressed, averaged, and potentially outdated representation of everything it saw during training. The result reads like a summary of summaries, because that is essentially what it is.
The research-then-write pipeline flips the order. First, gather sources. Then, feed those sources to the AI as context. The AI writes from your curated sources, not from its internal compression of the internet.
The Two Approaches
```mermaid
flowchart TD
    subgraph "Direct Generation"
        A["Prompt: write about X"] --> B["Model generates from training data"]
        B --> C["Output: averaged, generic, possibly outdated"]
    end
    subgraph "Research-Then-Write"
        D["Research: gather sources about X"] --> E["Feed sources as context"]
        E --> F["Model writes from your sources"]
        F --> G["Output: current, specific, citable"]
    end
    style C fill:#2a2a28,stroke:#c47a5a,color:#ede9e3
    style G fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
```
The difference is not subtle. Direct generation produces content that sounds informed but often is not. Research-then-write produces content that is informed because the information is right there in the context window.
The Pipeline Architecture
A research-then-write pipeline has two distinct stages, each with its own tools, parameters, and quality checks.
| Stage | Input | Process | Output | Quality Check |
|---|---|---|---|---|
| 1. Research | Topic + research questions | Search APIs (Tavily, Google), extract key data | Research brief (sources, data, quotes) | Sufficient coverage? Sources reliable? |
| 2. Write | Research brief + system prompt + outline | LLM API with sources as context | Draft content | Claims match sources? Voice correct? |
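The two stages can be wired together as a pair of plain functions. This is a minimal sketch, not a specific SDK: `search_fn` and `generate_fn` are hypothetical callables standing in for whatever search backend (Tavily, Google) and LLM API you use, and the `ResearchBrief` shape is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchBrief:
    """Stage 1 output: the factual backbone handed to stage 2."""
    topic: str
    findings: list[tuple[str, str]]  # (fact, source URL) pairs
    gaps: list[str] = field(default_factory=list)


def research_stage(topic: str, questions: list[str], search_fn) -> ResearchBrief:
    """Stage 1: answer each research question via a search backend.
    `search_fn(query)` is assumed to return dicts with 'content' and 'url' keys."""
    findings = []
    for question in questions:
        for result in search_fn(question):
            findings.append((result["content"], result["url"]))
    return ResearchBrief(topic=topic, findings=findings)


def write_stage(brief: ResearchBrief, system_prompt: str, generate_fn) -> str:
    """Stage 2: pass the brief as context. `generate_fn(system, user)` wraps
    whatever LLM API you call (a hypothetical adapter, not a real client)."""
    sources = "\n".join(f"- {fact} (source: {url})" for fact, url in brief.findings)
    user_prompt = (
        f"Topic: {brief.topic}\n\nSources:\n{sources}\n\n"
        "Write only from the sources above; flag anything they do not cover."
    )
    return generate_fn(system_prompt, user_prompt)
```

Swapping in a real search client and LLM call is a matter of implementing the two callables; the quality checks in the table above sit between the stages.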
Stage 1: Research
The research stage is not "search for the topic and see what comes up." It is targeted querying based on specific research questions you define before the search begins.
```mermaid
flowchart TD
    B["Define specific research questions"] --> C["For each question: query Tavily or Google Search API"]
    C --> D["Collect results: 10-20 sources"]
    D --> E["Filter: remove unreliable, irrelevant, duplicate sources"]
    E --> F["Extract: pull key data points, quotes, statistics from each"]
    F --> G["Assemble research brief: organized by subtopic with citations"]
    style B fill:#2a2a28,stroke:#c8a882,color:#ede9e3
    style G fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
```
The research questions matter. "What is remote work?" produces generic results. "What percentage of Fortune 500 companies have permanent remote work policies as of 2025?" produces specific, usable data. Write your research questions the way a journalist would: specific enough that the answer is a fact, not an overview.
The quality of your research stage determines the ceiling of your writing stage. No amount of prompt engineering can compensate for thin research. Invest the time upfront.
Stage 2: Write
The writing stage takes the research brief as context and your system prompt as voice constraints, then generates content that synthesizes the gathered information into your format and voice.
The system prompt for this stage includes a critical instruction: write based only on the provided sources. Do not add information from training data unless explicitly instructed. This constraint prevents the model from filling gaps with hallucinated data. If the research brief does not cover a point, the model should either skip it or flag it as needing additional research.
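One way to encode that constraint is to append the grounding instruction to your voice fingerprint when building the system prompt. This is a sketch of the idea, not a proven prompt recipe; the `[NEEDS RESEARCH]` flag is an illustrative convention.

```python
def grounded_system_prompt(voice_notes: str) -> str:
    """Build a system prompt that restricts the model to the supplied sources.
    `voice_notes` is your voice fingerprint from earlier sessions."""
    return (
        f"{voice_notes}\n\n"
        "Write based only on the provided sources. "
        "Do not add information from your training data unless explicitly instructed. "
        "If the sources do not cover a point, skip it or mark it "
        "[NEEDS RESEARCH] instead of filling the gap."
    )
```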
The Research Brief Format
A research brief is a structured document, not a dump of search results. It organizes findings by subtopic, includes source citations for every data point, and separates facts from interpretations.
| Section | Contents | Purpose |
|---|---|---|
| Topic summary | One paragraph overview of the topic | Orients the model |
| Key findings | Bullet points with data, each cited | The factual backbone of the content |
| Source details | Full list of sources with URL, date, credibility notes | Enables citation in the output |
| Gaps identified | Questions the research did not answer | Prevents hallucination to fill gaps |
| Contradictions | Where sources disagree | Alerts the writer (human or AI) to handle nuance |
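A brief in this format can be rendered from a plain dict into the five sections above. The dict keys here are an assumed schema for illustration, not a fixed standard:

```python
def render_brief(brief: dict) -> str:
    """Render a research brief dict into the five sections of the format table."""
    lines = [f"# Research Brief: {brief['topic']}", "", "## Topic summary",
             brief["summary"], "", "## Key findings"]
    for fact, url in brief["findings"]:
        lines.append(f"- {fact} (source: {url})")
    lines += ["", "## Source details"]
    for url, note in brief["sources"].items():
        lines.append(f"- {url}: {note}")
    lines += ["", "## Gaps identified"]
    lines += [f"- {g}" for g in brief.get("gaps", [])] or ["- none"]
    lines += ["", "## Contradictions"]
    lines += [f"- {c}" for c in brief.get("contradictions", [])] or ["- none"]
    return "\n".join(lines)
```

Keeping gaps and contradictions as explicit sections, even when empty, means the writing stage always sees them and never silently invents material to cover them.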
Results Comparison
Content generated through research-then-write is measurably different from direct generation. It contains specific data points instead of vague claims. It cites sources instead of saying "according to experts." It reflects current information instead of training data snapshots. And it is defensible, because every claim can be traced back to a verifiable source.
The trade-off is time. A research-then-write pipeline takes longer per piece than direct generation. The research stage adds 5-15 minutes per article (automated) or 30-60 minutes (manual). For content where accuracy and credibility matter, this investment pays for itself in trust and authority.
Further Reading
- Tavily Search API Reference (Tavily Documentation)
- Grounding with Google Search (Gemini API Documentation)
- Prompt Engineering Overview (Anthropic Documentation)
Assignment
- Build a two-step pipeline for one piece of content. Step 1: use Tavily or Google Search to research the topic, collecting 5-10 sources. Assemble these into a research brief following the format above.
- Step 2: feed the research brief as context to your LLM API, with instructions to write based only on the provided sources. Include your voice fingerprint and formatting requirements in the system prompt.
- Compare this output to a direct generation (same topic, same system prompt, no research brief). Which is more accurate? More specific? More trustworthy? Document the differences.