PLAN.md

# Execution Plan

This document defines the proposed implementation plan for the 5element candidate task:
building a prototype conversational assistant over a synthetic internal knowledge base.

No implementation should start until this plan is reviewed and approved.

## 1. Assignment Summary

The final solution must provide:

- A Python-based conversational assistant for querying a synthetic internal document corpus.
- At least 30 synthetic documents.
- At least 3 different document types or structures.
- Realistic noisy data, including outdated and partially contradictory information.
- At least 5 documents containing tables or structured data.
- Answers with source citations.
- Uncertainty or refusal when the corpus does not contain enough information.
- At least one successful multi-hop answer using information from multiple documents.
- An evaluation set with at least 15 questions and expected answers.
- Automatic evaluation and reported results.
- A concise `REPORT.md` of no more than 2 pages.
- A clean repository with setup and usage instructions.

## 2. Proposed Product Scope

The prototype will be a command-line RAG assistant for a fictional reinsurance advisory
and analytics company.

### Domain

Reinsurance consulting and analytics for insurance companies.

This domain is a strong fit because it is specific, realistic, and familiar to the
candidate. It also naturally supports the assignment requirements:

- Treaty renewals.
- Facultative placements.
- Catastrophe exposure analysis.
- Solvency II capital modelling.
- Claims trend analysis.
- Portfolio profitability reviews.
- Pricing support.
- Underwriting guideline reviews.
- Retrocession strategy.

### Interface

The solution will use a CLI as the only user interface.

### Language

The main corpus and evaluation set will be in English.

Slovenian question support is not part of the planned scope. The target is a reliable
English-language prototype.

## 3. Four-Phase Roadmap

The whole project will be organized around the four main assignment sections from the PDF.

### Phase 1: Dataset and Question Generation

Create the synthetic reinsurance corpus and the evaluation questions before building the
RAG system.

Main outputs:

- `scripts/generate_corpus.py` with 30 document specs and 30 detailed prompts.
- At least 30 generated documents in `data/corpus/`.
- Manually written `data/eval/questions.jsonl` with at least 15 questions.

Key rules:

- The 30 document prompts must be written in accordance with the 15 evaluation questions.
- The corpus must include at least 3 clearly different document types or structures.
- To satisfy the PDF requirement safely, the corpus should use mixed formats where useful:
  mostly Markdown documents, plus a small number of CSV or JSON structured files.
- At least 5 documents must contain tables or structured data.
- The corpus must include realistic noise: contradictions, outdated facts, short documents,
  and longer reports.
- At least 2 generated documents should be long enough to credibly represent reports of
  more than 5 pages, with multiple sections, tables, and appendices.

### Phase 2: RAG System

Build the CLI assistant over the generated corpus.

Main outputs:

- Document loader.
- Chunker.
- Local embedding wrapper using `sentence-transformers/all-MiniLM-L6-v2`.
- ChromaDB index under `data/index/`.
- Retriever.
- OpenRouter-backed answer generator.
- CLI-only question answering flow.

### Phase 3: Evaluation

Run the assistant against the planned evaluation set and report results.

Main outputs:

- `scripts/run_eval.py`.
- Deterministic scoring for expected facts, expected sources, refusal behavior, and
  contradiction handling.
- Optional OpenRouter LLM-as-judge if an API key is available.
- Failure examples for the final report.

### Phase 4: Report and Final Packaging

Prepare the project for review.

Main outputs:

- `README.md`.
- `REPORT.md`.
- Clean final repo.
- Rebuilt index.
- Final evaluation run.
- Demo video outline.

## 4. Proposed Folder Structure

```text
.
├── PLAN.md
├── README.md
├── REPORT.md
├── requirements.txt
├── .env.example
├── data/
│   ├── corpus/
│   │   ├── meeting_minutes/
│   │   ├── proposals/
│   │   ├── project_reports/
│   │   ├── actuarial_reports/
│   │   ├── structured_data/
│   │   └── team_notes/
│   ├── eval/
│   │   └── questions.jsonl
│   └── index/
├── scripts/
│   ├── generate_corpus.py
│   ├── build_index.py
│   ├── run_eval.py
│   └── inspect_retrieval.py
├── src/
│   ├── __init__.py
│   ├── app.py
│   ├── config.py
│   ├── documents.py
│   ├── chunking.py
│   ├── embeddings.py
│   ├── index_store.py
│   ├── retrieval.py
│   ├── generation.py
│   ├── rag.py
│   └── evaluation.py
```

## 5. File Responsibilities

### Root Files

#### `PLAN.md`

Planning document created before implementation.

It describes the intended repository structure, libraries, architecture, RAG pipeline,
evaluation approach, and execution phases.

#### `README.md`

User-facing setup and usage guide.

Expected contents:

- Project overview.
- Installation instructions.
- Environment variables.
- How to build the index.
- How to ask questions.
- How to run evaluation.
- Example questions.
- Known limitations.

#### `REPORT.md`

Required final report, limited to approximately 2 pages.

Expected contents:

- Chosen architecture and tradeoffs.
- Changes made compared with the initial plan.
- Where the system fails, with concrete examples.
- What would be improved with a full week.
- How AI assistants were used during development.

#### `requirements.txt`

Python dependency list.

This keeps setup simple for reviewers.

#### `.env.example`

Template for environment variables.

Likely variables:

- `OPENROUTER_API_KEY`, for corpus generation, answer generation, and optional LLM judging.
- `GENERATION_MODEL`, defaulting to `openai/gpt-5-mini`.
- `JUDGE_MODEL`, defaulting to `openai/gpt-5-mini`.

The real `.env` file should not be committed.

## 6. Data Folder Responsibilities

### `data/corpus/`

Contains the synthetic knowledge base.

The corpus will include at least 30 documents across multiple document types:

- Meeting minutes.
- Project proposals.
- Project reports.
- Actuarial or technical reinsurance reports.
- Team notes or lessons learned.
- Structured CSV or JSON files for data-heavy artifacts such as bordereaux extracts,
  treaty comparison tables, exposure summaries, or claims triangles.

Each document should include realistic metadata, such as:

- Document title.
- Client name.
- Date.
- Document type.
- Project name.
- Author or team.

The documents should include deliberate data imperfections:

- Outdated information.
- Contradictory facts.
- Very short documents.
- At least 2 long multi-section documents that credibly represent reports of more than
  5 pages.
- Tables or structured data.

The corpus should not rely only on filenames for citations. Documents should contain
section headings or structured fields that make section-level citations possible where
practical.

### `data/eval/questions.jsonl`

Manually written evaluation set with at least 15 carefully controlled questions.

The questions should be created in accordance with the 30 document specs in
`scripts/generate_corpus.py`, not generated randomly by a separate script.

Each line will contain one JSON object:

```json
{
  "id": "q001",
  "category": "simple_lookup",
  "question": "Which attachment point was selected for Triglav Adriatic's property catastrophe renewal?",
  "expected_answer": "The final renewal recommendation selected a EUR 7.5m attachment point.",
  "expected_sources": ["2024-09-18_triglav_adriatic_property_cat_renewal_final.md"],
  "must_refuse": false
}
```

Planned categories:

- Simple lookup questions.
- Multi-hop questions.
- Unanswerable questions.
- Contradiction or trick questions.
- Temporal questions involving outdated versus newer documents.

### `data/index/`

Local persisted vector index and metadata.

This directory can be regenerated from `data/corpus/`.

## 7. Script Responsibilities

### `scripts/generate_corpus.py`

Generates the synthetic document corpus.

Responsibilities:

- Define 30 document specs.
- Define one detailed generation prompt per document.
- Use reusable prompt templates for document types such as meeting minutes, proposals,
  actuarial reports, claims reviews, bordereaux notes, and team notes.
- Ensure the generated corpus covers the manually planned evaluation questions.
- Ensure at least 3 clearly different document types or structures.
- Ensure at least 5 documents contain tables or structured data.
- Generate mostly Markdown documents, with selected CSV or JSON files for structured
  artifacts when useful.
- Include planned contradictions, outdated facts, short documents, and at least 2 long
  report-style documents that credibly represent more than 5 pages.
- Save generated documents into `data/corpus/`.

This script acts as the practical dataset design artifact. The relationship between
documents, facts, contradictions, and evaluation questions should be visible in the
document specs and prompts.

### `scripts/build_index.py`

Builds or rebuilds the retrieval index.

Steps:

1. Load documents from `data/corpus/`.
2. Parse metadata.
3. Split documents into chunks.
4. Generate embeddings.
5. Save vector index and chunk metadata to `data/index/`.

### `scripts/run_eval.py`

Runs the evaluation set.

Steps:

1. Load `data/eval/questions.jsonl`.
2. Ask each question through the same RAG pipeline used by the CLI.
3. Compare answer against expected answer and expected sources.
4. Score results.
5. Print summary by category and overall score.

### `scripts/inspect_retrieval.py`

Debugging helper for inspecting retrieved chunks for a question.

This is useful for understanding failure cases and writing an honest report.

Example usage:

```bash
python scripts/inspect_retrieval.py "Which clients had property catastrophe treaty renewal work?"
```

## 8. Source File Responsibilities

### `src/app.py`

CLI entry point.

Expected commands:

```bash
python -m src.app ask "What retention did we recommend for Triglav Adriatic's property catastrophe renewal?"
python -m src.app chat
```

Responsibilities:

- Parse command-line arguments.
- Load configuration.
- Initialize the RAG pipeline.
- Print answers, confidence, and sources.

### `src/config.py`

Central configuration.

Responsibilities:

- Read environment variables.
- Define default paths.
- Define retrieval defaults.
- Define model names.

### `src/documents.py`

Document loading and parsing.

Responsibilities:

- Read Markdown, text, JSON, or CSV files if included.
- Extract document metadata.
- Preserve source path and document type.
- Return normalized document objects.

### `src/chunking.py`

Chunking logic.

Responsibilities:

- Split documents into semantically useful chunks.
- Prefer section-aware splitting for Markdown.
- Preserve chunk metadata:
  - source document
  - section title
  - document date
  - client
  - document type

### `src/embeddings.py`

Embedding model wrapper.

Responsibilities:

- Load the embedding model.
- Embed documents.
- Embed user queries.
- Hide provider-specific details from the rest of the code.

### `src/index_store.py`

Vector index persistence.

Responsibilities:

- Create a vector index.
- Save index and metadata.
- Load index and metadata.
- Search the index by query embedding.

### `src/retrieval.py`

Retrieval logic.

Responsibilities:

- Retrieve top-k relevant chunks.
- Optionally apply simple metadata filtering.
- Optionally rerank results if time allows.
- Return chunks in a format suitable for answer generation.

### `src/generation.py`

Answer generation.

Responsibilities:

- Build the final LLM prompt.
- Include retrieved context.
- Instruct the model to cite sources.
- Instruct the model to refuse unsupported answers.
- Instruct the model to flag contradictions.
- Return a structured answer.

### `src/rag.py`

Main orchestration layer.

Responsibilities:

1. Accept a user question.
2. Retrieve relevant chunks.
3. Pass chunks to the generator.
4. Return answer, citations, confidence, and retrieved evidence.

This keeps the CLI and evaluation using the same core pipeline.

### `src/evaluation.py`

Evaluation implementation.

Responsibilities:

- Load evaluation questions.
- Run the RAG pipeline.
- Score answer quality.
- Score source citation accuracy.
- Detect whether refusal behavior was correct.
- Produce category-level metrics.

## 9. Proposed Libraries

### Core Python

- `python-dotenv`
  - Loads configuration from `.env`.

- `argparse`
  - Built-in CLI parsing.
  - Used instead of adding `typer`.

- Plain dictionaries and small helper functions.
  - Used instead of Pydantic models.
  - Keeps the prototype easy to read and avoids unnecessary dependencies.

Recommended choice: built-in `argparse`, no Pydantic, no Typer.

### Document Processing

- `markdown`
  - Optional, only if we need Markdown parsing.

- Built-in file handling and regexes
  - Enough for simple frontmatter-like metadata and section splitting.

Recommended choice: keep parsing simple and transparent.

### Embeddings and Vector Search

Recommended local option:

- `sentence-transformers`
  - Local embeddings.
  - Good enough for a small, controlled English corpus.

- `chromadb`
  - Local vector database with simple persistence.
  - Stores embeddings, chunk text, and metadata together.
  - Easier to inspect and debug than managing a separate vector index and metadata file.

Recommended embedding model:

```text
sentence-transformers/all-MiniLM-L6-v2
```

Possible stronger fallback if retrieval quality is weak:

```text
sentence-transformers/all-mpnet-base-v2
```

Recommended choice: `sentence-transformers/all-MiniLM-L6-v2` + ChromaDB.

Reasoning:

- Works locally.
- Reviewers can run it without creating an embedding API account.
- ChromaDB keeps chunk text and metadata easy to inspect.
- MiniLM is small, fast, common, and sufficient for a controlled 30-document corpus.

### LLM Answer Generation

Preferred configurable option:

- `openai`
  - Used as an OpenAI-compatible client pointed at OpenRouter.
  - OpenRouter makes it easy to switch generation and judge models without changing code.

Expected configuration:

```text
OPENROUTER_API_KEY=...
GENERATION_MODEL=openai/gpt-5-mini
JUDGE_MODEL=openai/gpt-5-mini
```

Default generation model:

```text
openai/gpt-5-mini
```

Reasoning:

- Better quality than cheaper nano or older mini models for realistic reinsurance
  documents.
- Much cheaper than full GPT-5.
- Good fit for generating the corpus without spending more than a few dollars.

Fallback option:

- A retrieval-only answer mode that prints the most relevant chunks and says generation
  requires an API key.

Reasoning:

- The task is about an AI assistant, so LLM generation is valuable.
- The repository should still be understandable and partially runnable without secrets.
- OpenRouter keeps model choice flexible while preserving a simple OpenAI-compatible API.

### Evaluation

Simple deterministic evaluation first:

- Expected source match.
- Refusal correctness.
- Keyword or fact overlap.
- Category-level scoring.

Optional LLM-as-judge:

- Use OpenRouter only if an API key is available.
- Store judge prompt and scores transparently.

Recommended approach:

- Implement deterministic evaluation as the baseline.
- Add optional LLM-as-judge only if time allows.

### Validation Instead of Unit Tests

No pytest unit test suite is planned for this 2-day prototype.

Validation effort should go into:

- corpus generation dry-runs,
- JSONL parsing checks,
- ChromaDB index rebuild checks,
- retrieval inspection,
- automatic RAG evaluation over `questions.jsonl`,
- manual review of failed cases for `REPORT.md`.

## 10. RAG Pipeline Step by Step

### Step 1: Corpus Creation

Create at least 30 synthetic documents in `data/corpus/`.

Corpus generation will be handled by `scripts/generate_corpus.py`.

The script should contain 30 planned document specs and one detailed prompt per document.
The specs should be written in accordance with the 15 manual evaluation questions so the
corpus intentionally supports simple lookup, multi-hop, temporal, contradiction, and
unanswerable test cases.

Each document should include metadata and realistic content.

Example metadata:

```markdown
---
title: Triglav Adriatic Property Catastrophe Renewal Workshop Notes
client: Triglav Adriatic
date: 2024-03-12
document_type: meeting_minutes
project: Property Catastrophe Treaty Renewal
---
```

### Step 2: Document Loading

`src/documents.py` reads every supported file from `data/corpus/`.

It returns normalized `Document` objects containing:

- document id
- source path
- title
- date
- type
- client
- raw text
- metadata

### Step 3: Chunking

`src/chunking.py` splits each document into chunks.

Chunking strategy:

- Prefer Markdown headings as natural section boundaries.
- Keep tables inside the same chunk as their surrounding explanation.
- Use a maximum chunk size to prevent overly large prompts.
- Add small overlap if needed.

Each chunk keeps citation metadata.

### Step 4: Embedding

`src/embeddings.py` converts chunks into numeric vectors.

The same model is used to embed user questions.

Initial recommended model:

```text
sentence-transformers/all-MiniLM-L6-v2
```

Possible stronger alternative:

```text
sentence-transformers/all-mpnet-base-v2
```

### Step 5: Indexing

`src/index_store.py` stores chunks, embeddings, and metadata in ChromaDB.

ChromaDB persistence should live under `data/index/`.

The index should be reproducible:

```bash
python scripts/build_index.py
```

### Step 6: Query Input

The user asks a natural language question through the CLI.

Example:

```bash
python -m src.app ask "Which projects involved catastrophe exposure modelling and what did we learn?"
```

### Step 7: Retrieval

`src/retrieval.py` embeds the question and searches the vector index.

It returns the top-k most relevant chunks with metadata.

Current default:

```text
TOP_K=7
```

### Step 8: Prompt Construction

`src/generation.py` builds a prompt containing:

- The user question.
- Retrieved context chunks.
- Source names and section labels.
- Strict answer rules.

The prompt should instruct the model to:

- Answer only from the provided context.
- Cite sources for every factual claim.
- Say when the context is insufficient.
- Mention contradictions or outdated information.
- Prefer newer dated documents when the same fact conflicts, while still naming the conflict.

### Step 9: Answer Generation

The LLM produces a concise grounded answer.

Expected output shape:

```text
Answer:
...

Sources:
- 2024-09-18_triglav_adriatic_property_cat_renewal_final.md, section "Outcome"
- 2024-03-12_triglav_adriatic_renewal_workshop.md, section "Exposure Notes"

Confidence: high
```

### Step 10: Refusal and Uncertainty Handling

The assistant should refuse or express uncertainty when:

- Retrieved chunks do not contain the answer.
- The question asks about a client, project, or technology absent from the corpus.
- Evidence is contradictory and no newer or more authoritative document resolves it.

Example response:

```text
The corpus does not contain enough information to answer this confidently.
The retrieved documents mention property catastrophe renewals, but none mention
retrocession ownership for this client.
```

### Step 11: Multi-Hop Answering

At least one evaluation question will require combining evidence from multiple documents.

Example:

Question:

```text
Which team members had prior catastrophe modelling experience before the Triglav Adriatic renewal?
```

Required evidence:

- A team note listing employee experience.
- A Triglav Adriatic project document listing assigned team members.

The answer should cite both documents.

### Step 12: Evaluation

`scripts/run_eval.py` runs all evaluation questions.

For each question:

1. Call the RAG pipeline.
2. Store the generated answer.
3. Check whether expected sources were cited.
4. Check whether refusal behavior was correct.
5. Check whether key expected facts appear.
6. Assign a score.

The final output should include:

- Overall score.
- Score by category.
- Failed examples.
- Missing source citations.
- Refusal mistakes.

## 11. Evaluation Design

Minimum required evaluation set: 15 questions.

The evaluation set should be designed together with the corpus, before the RAG pipeline is
implemented.

Reasoning:

- The corpus needs known facts, known contradictions, and known missing information.
- Multi-hop questions are easier to support when the relevant evidence is deliberately
  distributed across documents.
- Designing evaluation early prevents tuning the system only around accidental examples.
- The final report can honestly describe what the system was expected to answer from the
  beginning.

Planned split:

- 5 simple lookup questions.
- 4 multi-hop questions.
- 3 unanswerable questions.
- 2 temporal/outdated-data questions.
- 1 contradiction/trick question.

Potential scoring:

```text
answer_contains_expected_facts: 0-2 points
expected_sources_cited: 0-2 points
refusal_correctness: 0-2 points
contradiction_handling: 0-2 points, only when applicable
```

The report should include not only average scores but also examples of failures.

## 12. Synthetic Corpus Design Requirements

The corpus and evaluation questions should be designed before RAG implementation to ensure
the system can be evaluated fairly.

The dataset and evaluation questions should be designed directly in `PLAN.md` and
`scripts/generate_corpus.py`, without a separate blueprint file.

The dataset creation process should produce two implementation artifacts before coding the
RAG pipeline:

- The document corpus in `data/corpus/`.
- The evaluation set in `data/eval/questions.jsonl`.

Workflow order:

1. Create and approve `PLAN.md`.
2. Create `scripts/generate_corpus.py` with 30 document specs and 30 detailed prompts.
3. Manually create `data/eval/questions.jsonl` with 15 planned evaluation questions.
4. Generate the 30+ synthetic documents from the corpus generation script.
5. Build the RAG pipeline.
6. Run evaluation and tune.
7. Write `README.md` and `REPORT.md`.

Planned clients:

- Triglav Adriatic Insurance.
- Sava Danube Re.
- Merkur Mutual.
- Adriatic Health Insurance.
- Balkan Motor Pool.
- Ljubljana Specialty Underwriters.

Planned recurring team members:

- Maja Novak, senior treaty analyst.
- Luka Horvat, actuarial pricing consultant.
- Sara Kovač, catastrophe modelling specialist.
- Tim Zupan, claims analytics consultant.
- Eva Kralj, reinsurance placement manager.

Planned reinsurance topics:

- Property catastrophe excess-of-loss treaties.
- Motor quota share treaties.
- Health stop-loss covers.
- Facultative property placements.
- Solvency II capital modelling.
- Loss triangles and claims development.
- Bordereaux data quality.
- Exposure accumulation by CRESTA zone.
- Event limits and reinstatements.
- Retrocession protection.

Planned contradictions:

- One early renewal memo recommends a EUR 5m attachment point, while a later final report
  says EUR 7.5m was selected.
- One meeting note lists an estimated reinstatement premium that differs from the signed
  placement summary.
- One team note says a catastrophe modelling specialist is available, while a later
  staffing update says they were reassigned to another renewal.

Planned outdated data:

- Initial treaty renewal assumptions superseded by final exposure data.
- Early claims triangles replaced by updated quarter-end claims data.
- Old market pricing guidance superseded by signed placement terms.

## 13. Development Phases

The development phases should mirror the assignment sections from the PDF.

### Phase 1: Dataset and Question Generation

- Define the reinsurance clients, projects, contradictions, outdated facts, and expected
  evaluation facts in `PLAN.md`.
- Create `scripts/generate_corpus.py`.
- Include 30 document specs and 30 detailed prompts in the corpus generation script.
- Write the prompts in accordance with the 15 evaluation questions.
- Manually create `data/eval/questions.jsonl` with at least 15 questions and expected
  answers.
- Ensure the evaluation set includes simple lookup, multi-hop, unanswerable, temporal, and
  contradiction questions.
- Generate at least 30 documents.
- Ensure at least 5 documents contain tables.
- Ensure at least 3 clearly different document types or structures.
- Include selected CSV or JSON files for structured data where useful.
- Include at least 2 long report-style documents that credibly represent more than 5 pages.

### Phase 2: RAG System

- Implement document loading.
- Implement chunking.
- Implement local embeddings with `sentence-transformers/all-MiniLM-L6-v2`.
- Implement ChromaDB vector persistence.
- Implement retrieval.
- Implement answer generation through OpenRouter using the OpenAI-compatible client.
- Implement CLI.

### Phase 3: Evaluation

- Implement evaluation runner.
- Run evaluation.
- Inspect failures.
- Tune chunking, retrieval, or prompts.
- Optionally use OpenRouter for LLM-as-judge if an API key is available.

### Phase 4: Report and Final Packaging

- Write `README.md`.
- Write `REPORT.md`.
- Include honest limitations and concrete failures.
- Rebuild index from scratch.
- Run sample questions.
- Run full evaluation.
- Check repository cleanliness.
- Prepare demo video outline.

## 14. Risks and Mitigations

### Risk: Retrieval misses relevant chunks

Mitigation:

- Use source inspection script.
- Tune chunk size and top-k.
- Keep document sections well structured.

### Risk: LLM hallucinates unsupported answers

Mitigation:

- Use strict prompts.
- Include refusal examples.
- Evaluate unanswerable questions.

### Risk: Contradictions are ignored

Mitigation:

- Include contradiction-specific instructions in the generation prompt.
- Add at least one contradiction evaluation case.

### Risk: Reviewers cannot run the project

Mitigation:

- Provide clear setup instructions.
- Keep dependencies reasonable.
- Make index rebuilding straightforward.
- Include `.env.example`.

### Risk: Too much time spent on UI

Mitigation:

- Use CLI only.
- Focus on scoring categories from the assignment.

## 15. Approval Checkpoint

Implementation should not begin until this plan is approved.

After approval, the next concrete step should be corpus design:

1. Define the full synthetic document map.
2. Define intentional contradictions and outdated facts.
3. Define the evaluation questions before or alongside the corpus.