# Reinsurance RAG Assistant

CLI prototype for asking questions over a synthetic internal reinsurance advisory
knowledge base.

The corpus contains fictional documents for a reinsurance consulting and analytics
company: meeting notes, proposals, actuarial reports, team notes, CSV extracts, and JSON
summaries.

## What Is Included

Implemented:

- 30 synthetic corpus documents in `data/corpus/`
- 15 evaluation questions in `data/eval/questions.jsonl`
- ChromaDB index builder
- Local embeddings with `sentence-transformers/all-MiniLM-L6-v2`
- CLI question answering with one-shot and interactive chat modes
- OpenRouter-backed answer generation
- Retrieval evaluation
- LLM-as-judge answer evaluation

## Setup

Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Create a local `.env` file:

```text
OPENROUTER_API_KEY=your_key_here
GENERATION_MODEL=openai/gpt-5-mini
JUDGE_MODEL=openai/gpt-5-mini
```

Both the generation model and the judge model default to `openai/gpt-5-mini`, accessed
through OpenRouter.
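
As an illustration of how these settings could be read, here is a minimal sketch. It assumes the variables are already present in the process environment (for example after the `.env` file has been loaded); `load_settings` and the returned dictionary keys are hypothetical names, not the project's actual code.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "openai/gpt-5-mini"  # default stated in this README


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read OpenRouter settings, falling back to the default model (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "")
    if not api_key:
        # Fail early with a clear message instead of failing on the first API call.
        raise RuntimeError("OPENROUTER_API_KEY is not set; add it to .env")
    return {
        "api_key": api_key,
        "generation_model": env.get("GENERATION_MODEL", DEFAULT_MODEL),
        "judge_model": env.get("JUDGE_MODEL", DEFAULT_MODEL),
    }
```

With this shape, omitting `GENERATION_MODEL` or `JUDGE_MODEL` from `.env` still yields a working configuration, while a missing key is reported immediately.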

## Build The Index

The ChromaDB index is not committed. Build it locally from the corpus:

```bash
python scripts/build_index.py
```

This creates `data/index/`.

## Ask Questions

Ask one question:

```bash
python -m src.app ask "What attachment point was finally selected for Triglav Adriatic's property catastrophe renewal?"
```

To inspect which chunks were retrieved:

```bash
python -m src.app ask "Was there conflicting guidance about Triglav Adriatic's attachment point?" --show-retrieved
```

Run interactive chat mode:

```bash
python -m src.app chat
```

Type `exit` or `quit` to stop.

Interactive mode also supports retrieval debugging:

```bash
python -m src.app chat --show-retrieved
```

## Run Retrieval Evaluation

```bash
python scripts/run_eval.py --retrieval-only --top-k 7
```

Current retrieval result:

```text
pass: 13
partial: 0
fail: 0
review: 2
```

The two `review` cases are intentionally unanswerable questions, so they require manual
answer review.

The retrieval evaluation checks required evidence coverage. A question passes when all
expected source documents appear in the retrieved set. Extra retrieved documents are
allowed because retrieval provides context for the answer generator; the final LLM prompt
is responsible for citing only directly supporting sources.
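
The coverage rule above can be sketched as follows. This is an illustration, not the code in `scripts/run_eval.py`: the function name is hypothetical, and both the `partial`/`fail` split and the use of an empty expected-source list to mark `review` cases are assumptions.

```python
def grade_retrieval(expected_sources: list, retrieved_sources: list) -> str:
    """Grade one question by required-evidence coverage (hypothetical helper)."""
    expected = set(expected_sources)
    if not expected:
        # Assumption: unanswerable questions list no expected sources
        # and are routed to manual review.
        return "review"
    found = expected & set(retrieved_sources)
    if found == expected:
        # Extra retrieved documents do not count against the question.
        return "pass"
    # Assumption: some expected sources found is "partial", none is "fail".
    return "partial" if found else "fail"
```

For example, a question expecting two documents passes when both appear anywhere in the retrieved set, regardless of how many other documents were retrieved alongside them.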

## Run Answer Evaluation

```bash
python scripts/run_eval.py --answers
```

Answer evaluation runs the full RAG pipeline and then uses an OpenRouter judge model to
grade the generated answer. The judge receives the question, expected answer, key facts,
expected sources, retrieved sources, and actual answer.
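
A minimal sketch of how the judge input could be assembled from those six fields. The function name, the payload keys, and the instruction wording are assumptions for illustration; the actual prompt used by the evaluation script may differ.

```python
import json


def build_judge_prompt(
    question: str,
    expected_answer: str,
    key_facts: list,
    expected_sources: list,
    retrieved_sources: list,
    actual_answer: str,
) -> str:
    """Bundle the six judge inputs into one prompt string (hypothetical helper)."""
    payload = {
        "question": question,
        "expected_answer": expected_answer,
        "key_facts": key_facts,
        "expected_sources": expected_sources,
        "retrieved_sources": retrieved_sources,
        "actual_answer": actual_answer,
    }
    # Assumed instruction text; labels mirror the pass/partial/fail summary.
    return (
        "Grade the actual answer against the expected answer and key facts.\n"
        "Reply with exactly one label: pass, partial, or fail.\n\n"
        + json.dumps(payload, indent=2, ensure_ascii=False)
    )
```

Sending the fields as structured JSON rather than free text makes it easier for the judge to tell the expected answer apart from the generated one.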

Current answer evaluation result:

```text
pass: 15
partial: 0
fail: 0
```

Evaluation runs also save machine-readable JSON files:

```text
data/eval/results/retrieval_eval_latest.json
data/eval/results/answer_eval_latest.json
```

## Example Questions

```text
What attachment point was finally selected for Triglav Adriatic's property catastrophe renewal?
Was there conflicting guidance about Triglav Adriatic's attachment point?
Which consultant should answer bordereaux quality questions for Balkan Motor Pool and why?
What cyber catastrophe treaty did Nova Kredit buy?
```

## RAG Pipeline

1. Load documents from `data/corpus/`.
2. Split documents into chunks.
3. Embed chunks with `sentence-transformers/all-MiniLM-L6-v2`.
4. Store embeddings and metadata in ChromaDB.
5. Embed the user question.
6. Retrieve a moderately broad candidate set from ChromaDB.
7. Rerank candidates using normalized, capped keyword-overlap scoring.
8. Select final chunks for generation.
9. Send the question and retrieved context to OpenRouter.
10. Return a grounded answer with source citations.

The keyword rerank uses only meaningful terms from the user question. It calculates the
share of question terms that appear in each candidate chunk, then applies a capped
boost to that chunk's Chroma distance. Because the score is normalized by the number of
question terms and the boost is capped, exact wording can help without letting long
questions overpower vector similarity.
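
A minimal sketch of this kind of rerank, under stated assumptions: the stopword list, the `MAX_BOOST` value, and the choice to apply the boost by subtracting it from the distance (lower is better) are all illustrative, not taken from the project code.

```python
import re

# Hypothetical stopword list; the project's definition of "meaningful" may differ.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to",
             "was", "what", "which"}
MAX_BOOST = 0.15  # assumed cap on how much keyword overlap can lower a distance


def meaningful_terms(text: str) -> set:
    """Lowercased alphanumeric tokens with stopwords removed (simplified tokenizer)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def rerank(question: str, candidates: list) -> list:
    """Rerank (chunk_text, distance) pairs; a lower adjusted distance ranks first."""
    terms = meaningful_terms(question)
    scored = []
    for text, distance in candidates:
        # Normalized overlap in [0, 1]: share of question terms found in the chunk.
        overlap = len(terms & meaningful_terms(text)) / len(terms) if terms else 0.0
        # The adjustment can never exceed MAX_BOOST, however long the question is.
        scored.append((text, distance - MAX_BOOST * overlap))
    return sorted(scored, key=lambda pair: pair[1])
```

Because the overlap is a ratio rather than a raw count, a twenty-term question and a three-term question can each shift a distance by at most `MAX_BOOST`.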

Date and numeric tokens are included in keyword matching because exact values such as
meeting dates, percentages, limits, deductibles, and attachment points matter in this
domain.
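
One way to keep such values intact during tokenization is sketched below. The regular expression and the function name are assumptions for illustration; the project's tokenizer may treat dates and numbers differently.

```python
import re

# Assumed pattern, tried in order: ISO dates, then numbers with optional
# thousands/decimal separators and an optional percent sign, then plain words.
TOKEN_RE = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:[.,]\d+)*%?|[a-z]+")


def keyword_tokens(text: str) -> list:
    """Tokenize text while keeping dates and numeric values as single tokens."""
    return TOKEN_RE.findall(text.lower())


tokens = keyword_tokens("Meeting on 2024-03-14: 12.5% rate, EUR 25,000,000 limit")
# tokens == ['meeting', 'on', '2024-03-14', '12.5%', 'rate', 'eur', '25,000,000', 'limit']
```

Keeping `2024-03-14` or `12.5%` as one token means a chunk only matches when the full value matches, rather than matching on fragments such as `12` or `03`.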

## Chunking Strategy

Markdown documents shorter than 900 words are kept as one full-document chunk.

Longer Markdown documents are split by headings. If a section is still too long, it is
split by paragraphs.

CSV, JSON, and TXT files are kept as full-document chunks.

This keeps short notes readable as complete evidence while still splitting long reports
into focused sections.
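
The strategy can be sketched roughly as follows. This is a simplified illustration, not the index builder's code: the function names are hypothetical, reusing the 900-word threshold as the per-section limit is an assumption, and a single paragraph longer than the limit is left whole.

```python
import re
from pathlib import Path

MAX_WORDS = 900  # full-document threshold described above


def word_count(text: str) -> int:
    return len(text.split())


def split_by_paragraphs(text: str, max_words: int) -> list:
    """Greedily pack paragraphs into chunks of at most max_words words."""
    chunks, current = [], []
    for para in re.split(r"\n\s*\n", text.strip()):
        if current and word_count("\n\n".join(current + [para])) > max_words:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_document(filename: str, text: str, max_words: int = MAX_WORDS) -> list:
    # CSV, JSON, TXT, and short Markdown files stay as one full-document chunk.
    if Path(filename).suffix.lower() != ".md" or word_count(text) < max_words:
        return [text]
    # Split before each Markdown heading line so a heading stays with its section.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks = []
    for section in sections:
        if word_count(section) <= max_words:
            chunks.append(section.strip())
        else:
            # Section is still too long: fall back to paragraph boundaries.
            chunks.extend(split_by_paragraphs(section, max_words))
    return chunks
```

Splitting on headings first keeps each chunk aligned with a topic the author already delimited; the paragraph fallback only applies to sections that exceed the limit on their own.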

## Known Limitations

- The answer generator requires OpenRouter to be configured (`OPENROUTER_API_KEY` in
  `.env`).
- The corpus is synthetic, so the evaluation results should not be read as evidence of
  production quality.
- Exact CSV/table-cell questions can still be brittle when rows contain sparse or blank
  values.
- The CLI prints only text output; there is no web interface.