Docs2Synth

A Synthetic Data Tuned Retriever Framework for Visually Rich Documents Understanding

A complete pipeline for converting, synthesizing, and training retrievers for your document datasets

Docs2Synth Framework


✨ Key Features

📄 Document Processing

Extract structured text and layout from PDFs and images using Docling, PaddleOCR, or PDFPlumber. Support for complex layouts and 80+ languages.

Learn more →

🤖 Agent-Based QA Generation

Automatically generate high-quality question-answer pairs using LLMs. Built-in verification with meaningfulness and correctness checkers.

Learn more →

🧠 Retriever Training

Train custom document retrievers using LayoutLMv3 or BERT on your annotated data. Support for layout-aware and semantic retrieval.

Learn more →

🚀 RAG Deployment

Deploy RAG systems instantly with naive, iterative, or custom strategies. Vector store integration with semantic search.

Learn more →

📈 Benchmarking

Comprehensive evaluation with ANLS, Hit@K, MRR, and NDCG metrics. Track training progress and model performance.

Learn more →
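As a rough illustration of what the four metrics named above measure, here is a minimal, self-contained sketch. It is not Docs2Synth's implementation: the function names, the binary-relevance assumption for NDCG, and the 0.5 ANLS threshold (the value commonly used in document VQA benchmarks) are all assumptions made for this example.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of answer strings.
    references: list of lists of acceptable answers, one list per prediction.
    """
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


def hit_at_k(ranked, relevant, k):
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item; MRR is the mean over queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance (an item is either relevant or not)."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For a ranking `["d3", "d1", "d2"]` where only `d1` is relevant, Hit@1 is 0, Hit@2 is 1, and the reciprocal rank is 0.5, because the first relevant item sits at rank 2.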

⚙ Extensible Pipeline

Modular architecture with pluggable components. Easy customization of QA strategies, verifiers, and retrieval methods.

Get Started →


🔌 MCP Integration

Run Docs2Synth as a remote MCP server (SSE transport) for AI agents like Claude Desktop, ChatGPT, and Cursor. Access document processing capabilities from your AI tools.

# Start remote MCP server
docs2synth-mcp sse --host 0.0.0.0 --port 8009

Learn more about MCP Integration →
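Before pointing an AI client at the server, it can help to confirm that something is listening on the chosen port. The sketch below only checks TCP reachability with the Python standard library; it does not speak the MCP protocol. The helper name `port_is_open` is hypothetical, and the port matches the `--port 8009` example above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open("127.0.0.1", 8009)` on the machine running the server should return `True` once the server has started. A `False` result usually means the server is not running or a firewall is blocking the port.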

Installation

CPU Version (includes all features + MCP server):

pip install "docs2synth[cpu]"

GPU Version (includes all features + MCP server):

# Standard GPU installation (no vLLM)
pip install "docs2synth[gpu]"

# With vLLM for local LLM inference (requires CUDA GPU)
# 1. Install PyTorch with CUDA first:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# 2. Install docs2synth with vLLM:
pip install "docs2synth[gpu,vllm]"

# 3. Uninstall paddlex to avoid conflicts with vLLM:
pip uninstall -y paddlex

vLLM and PaddleX Conflict

PaddleX conflicts with vLLM. If you need vLLM for local LLM inference, uninstall PaddleX after installing Docs2Synth: pip uninstall -y paddlex
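To check whether an environment still has both packages installed, a small standard-library script is enough. This is a sketch, not part of Docs2Synth: the helper names are hypothetical, and it assumes the distributions are registered under the names `vllm` and `paddlex`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_conflict(lookup=installed_version):
    """Return a warning message if vLLM and PaddleX are both installed."""
    vllm = lookup("vllm")
    paddlex = lookup("paddlex")
    if vllm and paddlex:
        return (
            f"vllm {vllm} and paddlex {paddlex} are both installed; "
            "run: pip uninstall -y paddlex"
        )
    return None
```

`find_conflict()` returns `None` when at most one of the two packages is present, so it can be used as a quick pre-flight check before starting local inference.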

Minimal Install (CLI only, no ML/MCP features):

pip install docs2synth

From GitHub (Development)

pip install git+https://github.com/AI4WA/Docs2Synth.git

Quick Start

Automated setup (recommended):

git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
./setup.sh         # Unix/macOS/WSL
# setup.bat        # Windows

The script installs uv and sets up everything automatically.

Manual setup:

git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
pip install -e ".[dev]"
cp config.example.yml config.yml
# Edit config.yml and add your API keys

See the README for details.

Workflow

graph LR
    A[Documents] --> B[Preprocess]
    B --> C[QA Generation]
    C --> D[Verification]
    D --> E[Human Annotation]
    E --> F[Retriever Training]
    F --> G[RAG Deployment]

🚀 Quick Start: Automated Pipeline

Run the complete end-to-end pipeline with a single command:

docs2synth run

This automatically chains: preprocessing → QA generation → verification → retriever training → validation → RAG deployment.

Manual Step-by-Step Workflow

For more control, run each step individually:

# 1. Preprocess documents
docs2synth preprocess data/raw/my_documents/

# 2. Generate QA pairs
docs2synth qa batch

# 3. Verify quality
docs2synth verify batch

# 4. Annotate (opens UI)
docs2synth annotate

# 5. Train retriever
docs2synth retriever preprocess
docs2synth retriever train --mode standard --lr 1e-5 --epochs 10

# 6. Deploy RAG
docs2synth rag ingest
docs2synth rag app

Complete Workflow Guide →
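The non-interactive steps above can also be scripted so that a failure stops the run early. The following is a minimal sketch that uses only the commands listed in this section. It deliberately leaves out `docs2synth annotate` (which opens a UI) and `docs2synth rag app` (assumed here to be a long-running process); `STEPS` and `run_steps` are hypothetical names for this example.

```python
import subprocess

# Commands copied from the manual workflow; adjust the input path as needed.
STEPS = [
    ["docs2synth", "preprocess", "data/raw/my_documents/"],
    ["docs2synth", "qa", "batch"],
    ["docs2synth", "verify", "batch"],
    ["docs2synth", "retriever", "preprocess"],
    ["docs2synth", "retriever", "train",
     "--mode", "standard", "--lr", "1e-5", "--epochs", "10"],
    ["docs2synth", "rag", "ingest"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order; stop at the first non-zero exit code.

    Returns 0 if every step succeeded, otherwise the 1-based index of the
    step that failed.
    """
    for index, command in enumerate(steps, start=1):
        print(f"[{index}/{len(steps)}] {' '.join(command)}")
        result = runner(command)
        if result.returncode != 0:
            print(f"Step {index} failed with exit code {result.returncode}")
            return index
    return 0
```

Calling `run_steps(STEPS)` from the project directory runs the steps in sequence. The `runner` parameter exists so the sequencing logic can be exercised without invoking the real CLI.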

Architecture

Docs2Synth/
├── integration/    # Integration utilities
├── preprocess/     # Document preprocessing
├── qa/            # QA generation and verification
├── retriever/     # Retriever training and inference
├── rag/           # RAG strategies
└── utils/         # Logging, timing, and utilities

Contributing

We welcome contributions! Please see our GitHub repository for guidelines.

License

This project is licensed under the MIT License.

Support