Complete Workflow Guide

End-to-end workflow: Documents → Preprocess → QA Generation → Verification → Human Annotation → Retriever Training → RAG Deployment

Prerequisites

Install Docs2Synth: pip install -e ".[dev]"
Copy config.example.yml to config.yml
Add your API keys to config.yml, and make sure config.yml is listed in .gitignore so the keys are never committed
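
One way to check the last prerequisite is a small script that looks for config.yml in .gitignore. The sketch below is illustrative and not part of Docs2Synth: the helper name is_listed_in_gitignore is hypothetical, and it only matches exact pattern lines, not globs such as *.yml.

```python
from pathlib import Path


def is_listed_in_gitignore(gitignore_text: str, filename: str = "config.yml") -> bool:
    """Return True if `filename` appears as its own pattern line in .gitignore text."""
    for raw_line in gitignore_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Accept "config.yml" and the root-anchored form "/config.yml".
        if line.lstrip("/") == filename:
            return True
    return False


def check_repo(repo_root: str = ".") -> bool:
    """Read <repo_root>/.gitignore (if present) and run the check on it."""
    path = Path(repo_root) / ".gitignore"
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    return is_listed_in_gitignore(text)
```

Call check_repo() from the project root before committing. For an authoritative answer that also honors glob patterns, `git check-ignore config.yml` asks Git itself.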

🚀 Quick Start: Automated Pipeline

For a fully automated end-to-end pipeline, simply run:

docs2synth run

This single command automatically executes all stages:

Preprocess documents from the input directory set in config.yml
Generate QA pairs using configured strategies
Verify QA pairs for quality and correctness
Preprocess retriever training data
Train retriever model
Validate trained model performance
Reset RAG vector store
Ingest documents into RAG system

When to use automated pipeline:

Quick prototyping and testing
Automated CI/CD workflows
When you don't need manual annotation
Processing datasets with high-quality auto-generated QA pairs

When to use manual workflow:

Need fine-grained control over each stage
Want to manually review and annotate QA pairs
Experimenting with different configurations
Processing complex documents requiring human oversight

Manual Workflow

For more control over each stage, follow the manual step-by-step workflow below.

Stage 1: Data Setup

mkdir -p data/raw/my_documents
cp /path/to/your/documents/* data/raw/my_documents/

Configure config.yml:

preprocess:
  input_dir: ./data/raw/my_documents/
  output_dir: ./data/processed/
  processor: docling

qa:
  strategies:
    - strategy: semantic
      provider: openai
      model: gpt-4o-mini
      temperature: 0.7

  verifiers:
    - strategy: meaningful
      provider: openai
      model: gpt-4o-mini
    - strategy: correctness
      provider: openai
      model: gpt-4o-mini

agent:
  keys:
    openai_api_key: "sk-..."  # Add your keys directly here
    anthropic_api_key: "sk-ant-..."
    # Keep config.yml in .gitignore so these keys are never committed

Stage 2: Preprocess

docs2synth preprocess data/raw/my_documents/

Output: data/processed/*.json files containing the extracted text, bounding boxes, and reading order
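
Before moving on to QA generation, it can help to confirm that preprocessing produced readable files. The sketch below only assumes that the outputs are JSON files in one directory, as stated above; summarize_processed is a hypothetical helper, and it does not inspect the internal schema, which this guide does not document.

```python
import json
from pathlib import Path


def summarize_processed(processed_dir: str) -> dict:
    """Count the JSON files in `processed_dir` and list any that fail to parse."""
    paths = sorted(Path(processed_dir).glob("*.json"))
    unreadable = []
    for path in paths:
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            unreadable.append(path.name)
    return {"total": len(paths), "unreadable": unreadable}
```

For example, summarize_processed("data/processed") returns the file count and the names of any files that need to be regenerated.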

Stage 3: QA Generation

# Batch generate QA pairs
docs2synth qa batch

# Single document
docs2synth qa run data/processed/document.json

Output: each processed JSON file is updated with a qa field containing the generated questions and answers

Stage 4: Verification

# Verify all documents
docs2synth verify batch

# Single document
docs2synth verify run data/processed/document.json

Output: each processed JSON file is updated with the verification results (meaningful, correctness)
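
Stages 2 to 4 can also be chained from a script so that a failure in one stage stops the rest. This is a minimal sketch, not a Docs2Synth feature: it reuses the commands shown above verbatim, and it assumes the CLI exits with a non-zero status on failure, which this guide does not confirm. The names run_stages and STAGE_COMMANDS are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

# Commands copied from Stages 2-4 of this guide; adjust the input path to your data.
STAGE_COMMANDS: list[list[str]] = [
    ["docs2synth", "preprocess", "data/raw/my_documents/"],
    ["docs2synth", "qa", "batch"],
    ["docs2synth", "verify", "batch"],
]


def run_stages(
    commands: Sequence[Sequence[str]] = STAGE_COMMANDS,
    runner: Callable[..., object] = subprocess.run,
) -> int:
    """Run each command in order and stop at the first failure.

    Returns the number of stages that completed successfully.
    """
    completed = 0
    for argv in commands:
        print("running:", " ".join(argv))
        try:
            # check=True makes subprocess.run raise on a non-zero exit code.
            runner(list(argv), check=True)
        except subprocess.CalledProcessError as exc:
            print(f"stage failed with exit code {exc.returncode}")
            break
        completed += 1
    return completed
```

The runner parameter exists so the loop can be exercised with a stand-in function before it is pointed at real documents and a paid API.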

Stage 5: Human Annotation

docs2synth annotate

Opens the Streamlit UI at http://localhost:8501, where you can:

Review QA pairs with document images
Approve/reject pairs
Add notes
Track progress

Stage 6: Retriever Training

Preprocess Data

docs2synth retriever preprocess \
  --json-dir data/processed/ \
  --image-dir data/raw/my_documents/ \
  --output data/retriever/train.pkl \
  --require-all-verifiers

Output: data/retriever/train.pkl, a serialized DataLoader ready for training

Train Model

docs2synth retriever train \
  --data-path data/retriever/train.pkl \
  --mode standard \
  --lr 1e-5 \
  --epochs 10 \
  --device cuda

Output:

Checkpoints: models/retriever/checkpoints/checkpoint_epoch_N.pth
Final model: models/retriever/final_model.pth
Training curves: models/retriever/checkpoints/training_curves.png

Validate Model

docs2synth retriever validate \
  --model models/retriever/final_model.pth \
  --data data/retriever/val.pkl

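Output: ANLS scores, metric plots, and a detailed analysis. Note that this command reads data/retriever/val.pkl, which the preprocessing command above does not create; prepare it separately from held-out data, for example by rerunning the preprocess step with a different --output path.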
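
ANLS (Average Normalized Levenshtein Similarity) is the metric commonly used for document question answering benchmarks such as DocVQA. For each question, the prediction is compared with every accepted answer using normalized edit distance; a score of 1 - distance is kept when the distance is below a threshold (0.5 in the usual definition), otherwise the score is 0, and the best score per question is averaged. The sketch below follows that common definition. Whether Docs2Synth applies the same normalization (lowercasing, whitespace stripping) and threshold is an assumption.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    threshold: float = 0.5,
) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            ref = answer.strip().lower()
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            # Distances at or above the threshold count as a miss.
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

The threshold keeps near-misses such as a single OCR character error from being scored as zero, while still giving no credit to answers that are mostly wrong.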

Stage 7: RAG Deployment

Ingest Documents

docs2synth rag ingest \
  --processed-dir data/processed/ \
  --processor docling

Query

# CLI query
docs2synth rag run -q "What is the invoice total?" -s iterative

# Show iterations
docs2synth rag run -q "Your question" --show-iterations

Launch Demo

docs2synth rag app

Opens at http://localhost:8501

Configuration Reference

Minimal `config.yml`

data:
  root_dir: ./data
  processed_dir: ./data/processed

preprocess:
  processor: docling
  input_dir: ./data/raw/my_documents/
  output_dir: ./data/processed/

qa:
  strategies:
    - strategy: semantic
      provider: openai
      model: gpt-4o-mini
      temperature: 0.7
  verifiers:
    - strategy: meaningful
      provider: openai
      model: gpt-4o-mini
    - strategy: correctness
      provider: openai
      model: gpt-4o-mini

retriever:
  preprocessed_data_path: ./data/retriever/train.pkl
  checkpoint_dir: ./models/retriever/checkpoints
  learning_rate: 1e-5
  epochs: 10

rag:
  vector_store:
    type: chroma
    persist_directory: ./data/rag/vector_store
  embedding:
    model: sentence-transformers/all-MiniLM-L6-v2
  strategies:
    naive:
      retriever:
        k: 5
      generator:
        provider: openai
        model: gpt-4o-mini
    iterative:
      retriever:
        k: 5
      generator:
        provider: openai
        model: gpt-4o-mini
      max_iterations: 3

agent:
  keys:
    # Option 1: Direct keys (config.yml is in .gitignore)
    openai_api_key: "sk-..."
    anthropic_api_key: "sk-ant-..."

    # Option 2: Environment variables
    # openai_api_key: "${OPENAI_API_KEY}"
    # anthropic_api_key: "${ANTHROPIC_API_KEY}"
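
Option 2 relies on ${NAME} placeholders being replaced with environment variable values when the config is loaded. How Docs2Synth performs that substitution is not shown in this guide; the sketch below illustrates one plausible approach under that assumption, using a hypothetical expand_env helper that fails loudly when a variable is missing rather than passing the literal placeholder to an API client.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in `value` with env[NAME]; raise KeyError if unset."""
    source = os.environ if env is None else env

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in source:
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _PLACEHOLDER.sub(substitute, value)
```

With OPENAI_API_KEY exported in the shell, expand_env("${OPENAI_API_KEY}") returns the key, and a value without placeholders is returned unchanged.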

Tips & Troubleshooting

Data Preparation

Start with 10-20 documents for testing
Use high-quality images (300 DPI+)
Test with docs2synth datasets download docs2synth-dev

QA Generation

Use multiple strategies for diversity
Monitor API costs (use gpt-4o-mini for dev)
Aim for >80% verification pass rate
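
To track the pass-rate target, one option is to count a QA pair as passing only when every verifier accepts it, which mirrors the --require-all-verifiers flag used in Stage 6. The sketch below is hypothetical: the layout of verification results inside the processed JSON is not shown in this guide, so it takes one mapping of verifier name to boolean per QA pair and leaves the extraction step to you.

```python
from typing import Iterable, Mapping


def pass_rate(verdicts: Iterable[Mapping[str, bool]]) -> float:
    """Fraction of QA pairs for which every verifier returned True."""
    total = 0
    passed = 0
    for verdict in verdicts:
        total += 1
        # An empty verdict counts as a failure: nothing verified the pair.
        if verdict and all(verdict.values()):
            passed += 1
    return passed / total if total else 0.0
```

A result below 0.8 on a small test batch is a signal to revisit the generation strategy before paying for a full run.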

Training

Minimum 500-1000 annotated QA pairs
Use 80/20 train/validation split
Monitor validation ANLS (target >0.7)
Try different modes: standard, layout, layout-gemini
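
The 80/20 split can be made reproducible by shuffling with a fixed seed. The sketch below is a generic helper, not a Docs2Synth API: train_val_split is a hypothetical name, and how the two halves are then turned into train.pkl and val.pkl is not covered by this guide.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def train_val_split(
    items: Sequence[T],
    val_fraction: float = 0.2,
    seed: int = 0,
) -> tuple[list[T], list[T]]:
    """Shuffle deterministically and return (train, validation) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(items)
    # A private Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

If several QA pairs come from the same document, passing document identifiers as items (and then gathering each document's pairs) keeps near-identical context out of both sets.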

Common Issues

GPU OOM:

# Reduce batch size
docs2synth retriever train --batch-size 4

# Or use CPU
docs2synth retriever train --device cpu

Low verification pass rate:

# Review failed QA pairs
docs2synth verify batch --verifier-type meaningful

# Adjust QA generation prompts in code

RAG returns irrelevant results:

# Increase retrieval candidates
# Edit config.yml: set rag.strategies.naive.retriever.k to 10

# Or use re-ranking strategy
docs2synth rag run -q "question" -s rerank
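
To see what the k setting controls, the sketch below ranks documents by cosine similarity to a query vector and keeps the k best. It is a plain-Python illustration of top-k retrieval in general; the vector store in this guide handles the search internally, and the function names here are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    docs: Sequence[Sequence[float]],
    k: int = 5,
) -> list[tuple[int, float]]:
    """Return (document index, similarity) for the k most similar documents."""
    scored = [(index, cosine(query, doc)) for index, doc in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Raising k gives the generator more candidate passages at the cost of a longer prompt, which is why re-ranking is the usual next step when a larger k adds noise.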

Next Steps

Scale to larger document collections
Fine-tune with more annotated data
Customize QA strategies for your domain
Deploy RAG API for production
Benchmark on standard datasets (FUNSD, CORD, DocVQA)

Additional Resources

Document Processing - OCR processor details
QA Generation - Strategy customization
Retriever Training - Model architecture
RAG Deployment - RAG strategies
CLI Reference - All commands
GitHub Issues