Session 10.2: Batch Architecture: CSV as Input, Structured Output

Session 2 of 8

The Production Manifest

Batch processing starts with a structured input file. Not a list of topics in a text document. Not a folder of notes. A proper manifest: a CSV or spreadsheet where each row is one piece of content to produce, and each column is a parameter your pipeline needs.

The manifest is your production order. It defines everything that gets built, how it gets built, and what constraints apply. Your pipeline script reads it row by row, runs the agent chain for each row, and saves the output in a structured folder. Human involvement drops to reviewing outputs, not configuring each piece.

Manifest Structure

The columns in your manifest correspond to the inputs your pipeline requires. At minimum:

Column	Purpose	Example Value
id	Unique identifier for tracking	blog-042
topic	Content topic or title	Why remote onboarding fails
audience	Target reader	HR directors at mid-size SaaS companies
angle	Specific thesis or perspective	The problem is not the tools; it is the absence of informal trust-building
word_count	Target length	1200
voice_variant	Which voice profile to use	professional
research_questions	Semicolon-separated questions	What studies exist on remote onboarding?; What is the attrition rate...
required_elements	What the piece must include	At least 2 data points; one case study
forbidden_elements	What the piece must not include	No bullet lists; no rhetorical questions in headings
status	Pipeline tracking	pending / researched / drafted / reviewed / published
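
As a rough illustration of how a script might turn one manifest row into pipeline inputs, here is a minimal sketch in Python using the standard csv module. The column names come from the table above; the helper names (split_list, load_manifest), the inline sample manifest, and the second research question are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical one-row manifest; column names follow the table above.
# The second research question is invented for illustration.
MANIFEST_TEXT = """id,topic,audience,angle,word_count,voice_variant,research_questions,required_elements,forbidden_elements,status
blog-042,Why remote onboarding fails,HR directors at mid-size SaaS companies,The problem is not the tools; it is the absence of informal trust-building,1200,professional,What studies exist on remote onboarding?; What do exit interviews report?,At least 2 data points; one case study,No bullet lists; no rhetorical questions in headings,pending
"""

LIST_COLUMNS = ("research_questions", "required_elements", "forbidden_elements")


def split_list(cell: str) -> list[str]:
    """Split a semicolon-separated cell into trimmed, non-empty items."""
    return [part.strip() for part in cell.split(";") if part.strip()]


def load_manifest(handle) -> list[dict]:
    """Read manifest rows and convert the numeric and list-like columns."""
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        row["word_count"] = int(row["word_count"])
        for column in LIST_COLUMNS:
            row[column] = split_list(row[column])
        rows.append(row)
    return rows


rows = load_manifest(io.StringIO(MANIFEST_TEXT))
print(rows[0]["id"], rows[0]["word_count"], len(rows[0]["research_questions"]))
# blog-042 1200 2
```

Note that the angle column also contains a semicolon but is deliberately left as a single string; only the columns defined as lists are split.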

The Processing Flow

flowchart TD
    A["CSV Manifest"] --> B["Script reads row"]
    B --> C["Build pipeline inputs from row columns"]
    C --> D["Run agent chain"]
    D --> E["Save output to structured folder"]
    E --> F["Update status column"]
    F --> G{"More rows?"}
    G -- Yes --> B
    G -- No --> H["Batch complete"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#6b8f71,color:#ede9e3
    style C fill:#222221,stroke:#8a8478,color:#ede9e3
    style D fill:#222221,stroke:#c8a882,color:#ede9e3
    style E fill:#222221,stroke:#6b8f71,color:#ede9e3
    style F fill:#222221,stroke:#8a8478,color:#ede9e3
    style G fill:#222221,stroke:#c47a5a,color:#ede9e3
    style H fill:#222221,stroke:#c8a882,color:#ede9e3

The script skips any row whose status is no longer pending, so if you stop the batch and restart it, it picks up where it left off. This makes processing idempotent: running the script again does not re-process items that have already been handled.
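
The skip-and-resume behavior can be sketched as follows. This is a minimal illustration, not the course's actual script: run_agent_chain is a stand-in for your real pipeline, the helper names are hypothetical, and the status written after a successful run ("drafted") is an assumption about which stage the script completes.

```python
import csv
import tempfile
from pathlib import Path


def run_agent_chain(row: dict) -> str:
    """Stand-in for the real agent chain (assumption); returns dummy text."""
    return f"Draft for {row['id']}"


def read_rows(path: Path) -> list[dict]:
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def write_rows(path: Path, rows: list[dict]) -> None:
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def process_batch(path: Path) -> int:
    """Process only pending rows; return how many were handled in this run."""
    rows = read_rows(path)
    processed = 0
    for row in rows:
        if row["status"] != "pending":
            continue  # handled in an earlier run, so skip it
        run_agent_chain(row)
        row["status"] = "drafted"
        write_rows(path, rows)  # persist after every row so a restart can resume
        processed += 1
    return processed


# Demo manifest in a temporary folder: one row already drafted, one pending.
manifest_path = Path(tempfile.mkdtemp()) / "manifest.csv"
write_rows(manifest_path, [
    {"id": "blog-042", "topic": "Why remote onboarding fails", "status": "drafted"},
    {"id": "blog-043", "topic": "Placeholder topic", "status": "pending"},
])
first_run = process_batch(manifest_path)
second_run = process_batch(manifest_path)
print(first_run, second_run)  # 1 0
```

Writing the status back after each row, rather than once at the end, is what lets a restarted batch resume; a more defensive version might write to a temporary file and rename it so an interruption cannot leave a half-written manifest.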

Output Folder Structure

Each manifest row produces outputs across multiple pipeline stages. Organize them consistently:

output/
├── blog-042/
│   ├── research-brief.json
│   ├── outline.md
│   ├── draft-v1.md
│   ├── review.json
│   ├── draft-final.md
│   ├── output.html
│   ├── output.pdf
│   └── metadata.json
├── blog-043/
│   └── ...
└── batch-log.csv

Every intermediate artifact is preserved. If the final output has a problem, you can trace it back to the specific pipeline stage where the problem originated. The batch log records timestamps, token usage, costs, and error counts per item.
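
One way to keep this layout consistent is to route every write through two small helpers, sketched below under stated assumptions: the function names (save_artifact, append_log) and the log column names are hypothetical, and the token and cost values in the demo are dummy zeros, not measurements.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Assumed log columns, based on the fields the batch log is said to record.
LOG_FIELDS = ["id", "timestamp", "tokens", "cost_usd", "errors"]


def save_artifact(output_root: Path, item_id: str, filename: str, content: str) -> Path:
    """Write one stage's output into the item's own folder and return its path."""
    item_dir = output_root / item_id
    item_dir.mkdir(parents=True, exist_ok=True)
    path = item_dir / filename
    path.write_text(content, encoding="utf-8")
    return path


def append_log(output_root: Path, item_id: str, tokens: int, cost_usd: float, errors: int) -> None:
    """Append one line per item to batch-log.csv, writing the header on first use."""
    output_root.mkdir(parents=True, exist_ok=True)
    log_path = output_root / "batch-log.csv"
    write_header = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "id": item_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tokens": tokens,
            "cost_usd": cost_usd,
            "errors": errors,
        })


# Demo in a temporary folder so nothing in your project is touched.
output_root = Path(tempfile.mkdtemp()) / "output"
save_artifact(output_root, "blog-042", "outline.md", "# Outline\n")
save_artifact(output_root, "blog-042", "draft-v1.md", "First draft text\n")
append_log(output_root, "blog-042", tokens=0, cost_usd=0.0, errors=0)
print(sorted(p.name for p in (output_root / "blog-042").iterdir()))
# ['draft-v1.md', 'outline.md']
```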

Batch Validation

Before running a batch, validate the manifest itself:

Check	What It Catches
No empty required columns	Missing topics, missing audiences
No duplicate IDs	Two rows that would overwrite each other's output
Word counts within range	Typos (120 instead of 1200) or unrealistic targets
Voice variant exists	References to voice profiles that have not been created
Research questions parseable	Malformed semicolon-separated lists

Run validation before processing. A manifest error on row 47 that crashes the script after rows 1 through 46 have completed is a waste of 46 rows of processing time.
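
The checks in the table could be implemented along these lines. Treat this as a sketch: the list of required columns, the set of known voice profiles, and the accepted word-count range are all assumptions you would replace with your own values, and validate_manifest is a hypothetical name.

```python
# Assumed settings; adjust to match your own pipeline.
REQUIRED_COLUMNS = ("id", "topic", "audience", "angle", "word_count",
                    "voice_variant", "research_questions")
KNOWN_VOICES = {"professional"}
WORD_COUNT_RANGE = (300, 5000)


def validate_manifest(rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the manifest passed."""
    errors = []
    seen_ids = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 of the CSV is the header
        cell = {key: (value or "").strip() for key, value in row.items()}

        for column in REQUIRED_COLUMNS:
            if not cell.get(column):
                errors.append(f"line {line_no}: empty {column}")

        item_id = cell.get("id", "")
        if item_id:
            if item_id in seen_ids:
                errors.append(f"line {line_no}: duplicate id {item_id}")
            seen_ids.add(item_id)

        word_count = cell.get("word_count", "")
        if word_count:
            try:
                count = int(word_count)
            except ValueError:
                errors.append(f"line {line_no}: word_count is not a whole number")
            else:
                low, high = WORD_COUNT_RANGE
                if not low <= count <= high:
                    errors.append(f"line {line_no}: word_count {count} outside {low}-{high}")

        voice = cell.get("voice_variant", "")
        if voice and voice not in KNOWN_VOICES:
            errors.append(f"line {line_no}: unknown voice_variant {voice}")

        questions = cell.get("research_questions", "")
        if questions and any(not q.strip() for q in questions.split(";")):
            errors.append(f"line {line_no}: empty item in research_questions")
    return errors


good = {"id": "blog-042", "topic": "Why remote onboarding fails",
        "audience": "HR directors at mid-size SaaS companies",
        "angle": "The problem is not the tools", "word_count": "1200",
        "voice_variant": "professional",
        "research_questions": "What studies exist on remote onboarding?"}
# Same id, a 120-word typo, an undefined voice, and a doubled semicolon.
bad = dict(good, word_count="120", voice_variant="casual",
           research_questions="First question?;;Second question?")

for problem in validate_manifest([good, bad]):
    print(problem)
```

Collecting every problem instead of stopping at the first one means a single validation pass shows everything that needs fixing before the batch starts.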

Incremental Batching

You do not have to process the entire manifest at once. Incremental batching means processing 5 to 10 rows, reviewing the outputs, adjusting prompts or manifest entries if needed, and then processing the next 5 to 10. This catches systematic issues early instead of discovering after 100 rows that the voice variant was wrong.
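
A small helper is enough to select the next increment. The sketch below assumes rows are dictionaries with a status field, as in the manifest; next_increment and the generated IDs are hypothetical.

```python
from itertools import islice


def next_increment(rows: list[dict], size: int = 5) -> list[dict]:
    """Return up to `size` rows that are still pending, in manifest order."""
    pending = (row for row in rows if row["status"] == "pending")
    return list(islice(pending, size))


# Demo: 12 rows, the first 3 already processed and reviewed.
rows = [{"id": f"blog-{n:03d}", "status": "pending"} for n in range(1, 13)]
for row in rows[:3]:
    row["status"] = "reviewed"

increment = next_increment(rows, size=5)
print([row["id"] for row in increment])
# ['blog-004', 'blog-005', 'blog-006', 'blog-007', 'blog-008']
```

After reviewing those outputs and making any adjustments, calling the helper again returns the following rows, provided the processed ones have had their status updated.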

The manifest is the single source of truth for what your pipeline produces. If it is not in the manifest, it does not get built. If it is in the manifest with incorrect parameters, it gets built incorrectly. Invest time in the manifest before pressing Enter on the batch.

Assignment

Create a production manifest for a batch of 10 pieces of content:

Define columns for every parameter your pipeline needs.
Fill in all 10 rows with real content specifications (not placeholders).
Run manifest validation: check for empty fields, duplicate IDs, and parseable research questions.
Process the first 2 rows through your pipeline. Review the outputs.

If the first 2 outputs meet your quality standards, process the remaining 8. If not, identify the issue, fix the manifest or pipeline, and re-run the first 2 before proceeding. This is your first real batch run.
