
# Report

## Architecture

The corpus contains 30 synthetic internal documents for a fictional reinsurance advisory
company. The documents include Markdown reports and notes, CSV files, and JSON files,
with contradictions, outdated facts, structured tables, and unanswerable topics. The
evaluation set contains 15 planned questions.

The assistant uses a simple RAG pipeline: it loads the local corpus, chunks documents,
embeds chunks with `sentence-transformers/all-MiniLM-L6-v2`, stores them in ChromaDB,
retrieves and reranks relevant chunks, then sends the selected context to
`openai/gpt-5-mini` through OpenRouter to generate an answer with citations.

The main architecture choices and compromises were:

I used RAG because it is a good fit for a relatively small internal document database.
The task asks for answers grounded in source documents, with citations and refusal when
the information is missing. RAG makes that behavior easier to inspect than a pure chat
model because the retrieved evidence can be checked directly.

I used `sentence-transformers/all-MiniLM-L6-v2` because it is local, fast, free to run,
and good enough for an English prototype. The compromise is that it is not the strongest
embedding model for larger repositories, exact table lookup, or multilingual usage.

Chunking was tuned after early retrieval tests. Short Markdown documents up to 900 words
are kept as full-document chunks, because short meeting notes often need their full local
context. Longer Markdown documents are split by headings and overlong sections are split
by paragraphs. CSV and JSON files are kept as full-document chunks because they are short.
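
The chunking rule for Markdown can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and reusing the 900-word threshold as the limit for sections and paragraph groups is an assumption, since only the full-document threshold is stated above.

```python
import re

MAX_FULL_DOC_WORDS = 900  # full-document threshold stated in this report


def word_count(text: str) -> int:
    return len(text.split())


def chunk_markdown(text: str, max_words: int = MAX_FULL_DOC_WORDS) -> list[str]:
    """Keep short documents whole; split long ones by heading, then by paragraph.

    Assumption: the same word limit is reused for sections and paragraph groups.
    """
    if word_count(text) <= max_words:
        return [text.strip()]

    # Split in front of every Markdown heading line (#, ##, ...).
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if word_count(section) <= max_words:
            chunks.append(section)
            continue
        # Overlong section: pack paragraphs greedily up to the word limit.
        # A single paragraph longer than the limit is kept as one chunk.
        current: list[str] = []
        current_words = 0
        for paragraph in re.split(r"\n\s*\n", section):
            n = word_count(paragraph)
            if current and current_words + n > max_words:
                chunks.append("\n\n".join(current))
                current, current_words = [], 0
            current.append(paragraph.strip())
            current_words += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

Under this sketch, a short meeting note comes back as a single chunk, while a long report yields one chunk per heading section unless a section itself exceeds the limit.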

Retrieval uses ChromaDB vector search first, then reranks a moderately broad candidate
pool with normalized keyword overlap. The keyword boost is capped so exact terms can
nudge ranking without overpowering vector similarity. Date and numeric tokens are included
because reinsurance questions often depend on exact meeting dates, percentages, limits,
deductibles and attachment points.

The rerank formula uses `overlap_ratio * 0.4`, where `overlap_ratio` is the share of
meaningful question terms found in the chunk. This lets exact terms improve a chunk's
score, but the maximum boost is capped at `0.4`, so keyword overlap cannot completely
override vector similarity. This was important because exact dates and numbers mattered,
while an uncapped word-count boost could over-rank long or keyword-heavy chunks.
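
A minimal sketch of the capped boost is shown below. Only `overlap_ratio * 0.4` comes from the description above. Adding the boost to the vector similarity, the tokenizer, the stopword list, and all function names are assumptions made for illustration.

```python
import re

KEYWORD_WEIGHT = 0.4  # maximum boost, as described in this report

# Illustrative stopword list; the real list is not specified here.
STOPWORDS = {
    "a", "an", "and", "are", "for", "in", "is", "of", "on",
    "the", "to", "was", "were", "what", "which", "who",
}


def meaningful_terms(text: str) -> set[str]:
    """Lowercased word, date, and numeric tokens with stopwords removed."""
    # Keeps tokens such as "7.5m" or "2022-03-14" in one piece.
    tokens = re.findall(r"[a-z0-9]+(?:[.\-/][a-z0-9]+)*", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def rerank_score(vector_similarity: float, question: str, chunk: str) -> float:
    """Vector similarity plus a keyword boost that never exceeds KEYWORD_WEIGHT."""
    question_terms = meaningful_terms(question)
    if not question_terms:
        return vector_similarity
    overlap_ratio = len(question_terms & meaningful_terms(chunk)) / len(question_terms)
    return vector_similarity + overlap_ratio * KEYWORD_WEIGHT


def rerank(question: str, candidates: list[tuple[str, float, str]]) -> list[tuple[str, float]]:
    """candidates are (chunk_id, vector_similarity, text); best score first."""
    scored = [(cid, rerank_score(sim, question, text)) for cid, sim, text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because `overlap_ratio` is a share of question terms rather than a raw count, a long chunk that repeats a keyword many times gains nothing over a short chunk that mentions it once.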

For evaluation I used an LLM-as-judge because it is more realistic than exact string
matching for generated answers. RAGAS would be useful for a larger benchmark, but for this
prototype a small custom evaluation script was easier to understand and explain.

## Evaluation

Evaluation uses the questions in `data/eval/questions.jsonl`.

Retrieval evaluation checks whether the required source documents are retrieved. Extra
retrieved documents are allowed, because the goal is evidence coverage rather than an
exact source list match.

```text
Retrieval result:
pass: 13
partial: 0
fail: 0
review: 2
```

The two review cases are intentionally unanswerable questions with no expected source
documents. They are checked through answer evaluation to confirm that the assistant
refuses instead of hallucinating.
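
The coverage check and the `review` status described above can be sketched as follows. The function name and file names are hypothetical, and defining `partial` as "some but not all required sources retrieved" is an assumption, since no partial cases occurred in this run.

```python
def grade_retrieval(expected_sources: list[str], retrieved_sources: list[str]) -> str:
    """Grade evidence coverage; extra retrieved documents are not penalized."""
    expected = set(expected_sources)
    if not expected:
        # Unanswerable question: nothing to retrieve, defer to answer evaluation.
        return "review"
    found = expected & set(retrieved_sources)
    if found == expected:
        return "pass"
    # Assumption: "partial" means some, but not all, required sources were found.
    return "partial" if found else "fail"


if __name__ == "__main__":
    # Hypothetical file names, used only to show the call shape.
    print(grade_retrieval(["a.md"], ["a.md", "b.csv"]))
    print(grade_retrieval([], ["b.csv"]))
```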

Answer evaluation runs the full RAG pipeline and uses an OpenRouter LLM-as-judge to grade
correctness, grounding, citations, refusal behavior, and overall answer quality. The judge
receives the question, expected answer, key facts, expected sources, retrieved sources,
and actual answer, then returns structured JSON.
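
As a rough illustration of the judge's input and output handling, the sketch below builds the payload and validates the returned JSON. The field names, the allowed grade values, and both function names are assumptions; the actual schema used by the evaluation script is not reproduced here, and the OpenRouter call itself is omitted.

```python
import json

# Assumed output fields, mirroring the graded aspects listed above.
REQUIRED_KEYS = {"correctness", "grounding", "citations", "refusal", "overall"}
ALLOWED_GRADES = {"pass", "partial", "fail"}


def build_judge_payload(question: str, expected_answer: str, key_facts: list[str],
                        expected_sources: list[str], retrieved_sources: list[str],
                        actual_answer: str) -> str:
    """Serialize everything the judge receives into one JSON string."""
    return json.dumps({
        "question": question,
        "expected_answer": expected_answer,
        "key_facts": key_facts,
        "expected_sources": expected_sources,
        "retrieved_sources": retrieved_sources,
        "actual_answer": actual_answer,
    }, ensure_ascii=False, indent=2)


def parse_judge_verdict(raw: str) -> dict:
    """Parse the judge's reply and reject anything that is not well-formed."""
    verdict = json.loads(raw)  # raises ValueError on invalid JSON
    if not isinstance(verdict, dict):
        raise ValueError("judge reply must be a JSON object")
    missing = REQUIRED_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"judge reply is missing keys: {sorted(missing)}")
    if verdict["overall"] not in ALLOWED_GRADES:
        raise ValueError(f"unexpected overall grade: {verdict['overall']!r}")
    return verdict
```

Validating the reply before counting it keeps a malformed judge response from being silently recorded as a pass or a fail.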

```text
Answer result:
pass: 15
partial: 0
fail: 0
```

This result should be interpreted as a controlled synthetic benchmark, not a production
accuracy claim. The corpus and evaluation questions were created by the same development
process, so the evaluation has less real-world noise than an independently collected
internal document set.

Machine-readable outputs are saved in:

```text
data/eval/results/retrieval_eval_latest.json
data/eval/results/answer_eval_latest.json
```

Representative successful cases:

- Simple lookup: final Triglav Adriatic attachment point was EUR 7.5m.
- Contradiction: kickoff EUR 5m provisional guidance was superseded by EUR 7.5m final guidance.
- Multi-hop: Tim Zupan is identified using the Balkan Motor audit plus the team capability matrix.
- Unanswerable: Nova Kredit cyber treaty question is refused with `Sources: none`.

## Changes during the process

The reranking logic based on word similarity was added during testing. Without it, the
assistant had more trouble with short questions that asked for specific dates, names, or
numeric values.

Chunking was originally planned with a smaller limit, but testing showed that many short
documents did not need to be chunked at all. A 900-word threshold turned out to be a good
tradeoff for this corpus.

For evaluation I first considered deterministic answer checks based on expected answers.
I then chose LLM-as-judge because it can recognize correct answers even when the wording
does not match verbatim. A deterministic judge would also need many hand-written edge
cases.

## Weaknesses

LLM-as-judge is useful for semantic grading, but it is model-based and may vary slightly
across judge models or repeated runs. The deterministic retrieval coverage check remains
a stable metric.

The corpus is synthetic and smaller than a real internal document repository.

Retrieval has good recall, but sometimes returns extra documents beyond the minimum
needed, reducing precision.

## What fails

Exact CSV/table-cell questions that ask about a specific value can still fail when rows
contain sparse or blank values. The assistant may retrieve the correct table, but the LLM
still has to read the cell positions correctly.

One manual example was the question: "for accident year 2022 what was CaseReserve_12".
The expected answer is that the corpus does not contain that value, because the
`CaseReserve_12` cell is blank. In one run the assistant incorrectly answered with
`380000`, which is the value from `CaseReserve_24`. On retry it refused correctly. This
shows that retrieval can be correct while table interpretation is still brittle.
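
To make the failure mode concrete, the sketch below parses a hypothetical table. The `AccidentYear` column name and the 2021 row are invented for illustration; only the blank `CaseReserve_12` cell and the `380000` in `CaseReserve_24` for 2022 come from the example above. Reading by column name shows the cell is empty, whereas a model reading the raw text has to count commas to tell the two columns apart. Rendering each cell with its column name is one possible way to reduce that ambiguity, and it was not tested in this project.

```python
import csv
import io

# Hypothetical layout; only the 2022 blank cell and 380000 are taken from the report.
SAMPLE_CSV = """AccidentYear,CaseReserve_12,CaseReserve_24
2021,410000,395000
2022,,380000
"""


def load_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def render_row(row: dict[str, str], blank_marker: str = "<blank>") -> str:
    """Render a row as explicit column=value pairs, marking empty cells."""
    return "; ".join(
        f"{column}={(value or '').strip() or blank_marker}"
        for column, value in row.items()
    )


if __name__ == "__main__":
    for row in load_rows(SAMPLE_CSV):
        print(render_row(row))
```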

## What I would do differently with more time

- Explore a hybrid RAG plus agentic document-reading workflow, where the assistant can
  run follow-up searches or open full documents when retrieval confidence is low.
- Improve source filtering so final answers cite fewer but more directly supporting
  sources.
- Add more Slovenian or bilingual questions if required by target users. For that version
  I would test a stronger multilingual embedding model instead of relying only on
  `all-MiniLM-L6-v2`.

## Use of AI Assistance

I used Codex with GPT-5.5. First I gave the task to Codex and brainstormed how the task
could be done. I created `PLAN.md` and went over it in detail until I understood and
confirmed the implementation steps. During planning, Codex did not clearly separate
dataset creation from building the actual RAG assistant. I pushed for this separation
because it made the process more structured.

After that I instructed Codex to start making files. I reviewed and changed important
parts along the way. For corpus generation, I created specific prompts and evaluation
questions, generated the corpus, checked whether the information was consistent, and
regenerated documents where needed. I then moved on to the RAG assistant. I manually
inspected the corpus, evaluation set, chunking logic, and retrieval behavior so that the
system was not simply tuned to the generated answers after the fact.

During the evaluation phase, Codex tried to change parts of the corpus after failures. I
had explicitly ruled this out, because editing the corpus in response to failed questions
would overfit the results.

Codex also suggested injecting extra words when a question was about a person, with the
goal of finding chunks that described team capability. I rejected this because it was too
specific to this corpus and felt like overfitting.

I also used Codex to help generate the README and to correct English and typos in this
report.