Knowledge graph + benchmark for epilepsy AI

EpiGraph

Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

An interactive knowledge graph and plug-and-play benchmark for testing general-purpose AI systems on clinical evidence, EEG findings, genes, treatments, and patient outcomes.


48,166 papers
24,324 entities
32,009 triplets
5 tasks

Why EpiGraph

Epilepsy reasoning is graph-shaped.

Clinical decisions often require moving across multiple evidence layers: syndrome, EEG pattern, genetic mechanism, medication choice, contraindication, and outcome. EpiGraph makes those links explicit, then EpiBench tests whether models can use them.

Syndrome Phenotype Gene Treatment Outcome
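To make the "graph-shaped" claim concrete, here is a minimal sketch of multi-hop reasoning over (head, relation, tail) triplets. The entity names, relation names, and the `find_path` helper are hypothetical placeholders for illustration; they are not taken from the released graph or its code.

```python
from collections import defaultdict, deque

# Hypothetical (head, relation, tail) triplets. All names are placeholders,
# not entries from the released EpiGraph data.
TRIPLETS = [
    ("syndrome_A", "has_phenotype", "eeg_pattern_B"),
    ("syndrome_A", "associated_gene", "gene_C"),
    ("gene_C", "informs_treatment", "drug_D"),
    ("drug_D", "leads_to", "outcome_E"),
]


def build_adjacency(triplets):
    """Index triplets by head entity so edges can be followed forward."""
    adjacency = defaultdict(list)
    for head, relation, tail in triplets:
        adjacency[head].append((relation, tail))
    return adjacency


def find_path(triplets, start, goal):
    """Breadth-first search; returns the shortest chain of triplets, or None."""
    adjacency = build_adjacency(triplets)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


path = find_path(TRIPLETS, "syndrome_A", "outcome_E")
for head, relation, tail in path:
    print(f"{head} --{relation}--> {tail}")
```

Each hop in the returned path crosses one evidence layer, which is the kind of chain a model has to reconstruct when the links are not given explicitly.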

Interactive demo

Explore a compact EpiGraph subgraph.

Search or click a preset query. Select any node or edge to inspect its layer, relation type, paper count, and supporting paper IDs.
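The fields shown in the inspector (layer, relation type, paper count, supporting paper IDs) suggest a simple record layout. The sketch below is one possible shape, assuming hypothetical `Node` and `Edge` classes and a naive substring search; the demo's actual JSON schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    layer: str  # e.g. "Syndrome", "Phenotype", "Gene", "Treatment", "Outcome"


@dataclass
class Edge:
    head: str
    relation: str
    tail: str
    paper_ids: list = field(default_factory=list)  # supporting paper IDs

    @property
    def paper_count(self):
        """Number of papers supporting this edge."""
        return len(self.paper_ids)


def search(nodes, edges, query):
    """Case-insensitive substring match on node names, plus incident edges."""
    needle = query.lower()
    hits = [node for node in nodes if needle in node.name.lower()]
    names = {node.name for node in hits}
    incident = [e for e in edges if e.head in names or e.tail in names]
    return hits, incident


# Placeholder data for illustration only.
nodes = [Node("gene_C", "Gene"), Node("drug_D", "Treatment")]
edges = [Edge("gene_C", "informs_treatment", "drug_D", ["paper_1", "paper_2"])]

hits, incident = search(nodes, edges, "GENE")
for edge in incident:
    print(edge.relation, edge.paper_count, edge.paper_ids)
```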


EpiBench

Five tasks for evaluating epilepsy reasoning.

Each task can be run with or without Graph-RAG, making it easy to test your own model against the same clinical inputs.
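The difference between the two settings comes down to what the model sees in its prompt. The following is a minimal sketch, assuming a naive string-match retriever and a mode named `baseline` for the no-retrieval case; both are stand-ins, and the repository's Graph-RAG retriever and prompt templates may work differently.

```python
# Placeholder triplets for illustration only.
TRIPLETS = [
    ("syndrome_A", "associated_gene", "gene_C"),
    ("gene_C", "informs_treatment", "drug_D"),
    ("drug_D", "leads_to", "outcome_E"),
]


def retrieve_triplets(question, triplets, top_k=5):
    """Keep triplets whose head or tail is mentioned in the question."""
    text = question.lower()
    hits = [t for t in triplets if t[0].lower() in text or t[2].lower() in text]
    return hits[:top_k]


def build_prompt(question, triplets=None, mode="baseline"):
    """Return the same question with or without retrieved graph evidence."""
    if mode == "graph_rag":
        facts = retrieve_triplets(question, triplets or [])
        context = "\n".join(f"- {h} {r} {t}" for h, r, t in facts)
        return f"Evidence from the knowledge graph:\n{context}\n\nQuestion: {question}"
    return f"Question: {question}"


question = "Which treatment fits a patient with gene_C?"
baseline_prompt = build_prompt(question)
rag_prompt = build_prompt(question, TRIPLETS, mode="graph_rag")
print(rag_prompt)
```

Because only the prompt changes, the same clinical input and the same scoring apply to both runs, which keeps the comparison fair.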

Clinical Decision Accuracy

MCQ and open-ended epilepsy QA over diagnosis, treatment, outcomes, and reasoning.

accuracy · ROUGE-L · Token-F1
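For reference, the two text-overlap metrics can be sketched as below. This assumes lowercased whitespace tokenization and the balanced F1 form of ROUGE-L; the repository's metric code may tokenize or weight precision and recall differently.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simple tokenizer)."""
    return text.lower().split()


def token_f1(prediction, reference):
    """Bag-of-tokens F1: order is ignored, repeated tokens are counted."""
    pred, ref = tokenize(prediction), tokenize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence, using one row of memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Token-F1 rewards shared vocabulary regardless of order, while ROUGE-L only credits tokens that appear in the same order, so a reordered answer scores lower on ROUGE-L than on Token-F1.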

Clinical Report Generation

Generates a neurologist-style clinical impression from an EEG description and patient context.

ROUGE-L · report alignment

Biomarker Precision Medicine

Selects an antiseizure medication from a gene variant and phenotype.

Top-1 · Drug Safety
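Top-1 here can be read as the fraction of cases where the model's first-ranked medication matches the reference answer. A minimal sketch, assuming case-insensitive exact matching and a hypothetical `top1_accuracy` helper; the Drug Safety metric is not defined on this page and is not sketched.

```python
def top1_accuracy(ranked_predictions, gold_labels):
    """Fraction of cases whose first-ranked candidate equals the gold label."""
    if not gold_labels:
        return 0.0
    correct = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if ranked and ranked[0].strip().lower() == gold.strip().lower()
    )
    return correct / len(gold_labels)


# Placeholder drug names for illustration only.
predictions = [["drug_A", "drug_B"], ["drug_C", "drug_B"]]
gold = ["drug_a", "drug_B"]
print(top1_accuracy(predictions, gold))
```

In the second case the correct drug appears only in second place, so it does not count toward Top-1.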

Treatment Recommendation

Guideline-consistent therapy choice under patient-specific constraints.

Top-1 · KG Coverage

Deep Research Planning

Literature-grounded research question and feasible study-plan generation.

judge score · feasibility

Task 1 comparison

Clinical Decision Accuracy

Graph-RAG improves epilepsy MCQ accuracy and open-ended reasoning quality across all six evaluated LLMs.

+11.3% avg. MCQ accuracy lift

+0.51 avg. LLM-as-judge gain

75.0% best Graph-RAG score

Figure: Baseline vs. Graph-RAG (higher is better).

T1 reports MCQ accuracy plus open-ended QA judge scores; values are from the paper's EpiBench results.

Run your model

Clone, install, evaluate.

EpiBench scripts accept local JSON datasets and an OpenRouter-compatible model name. For private Harvard EEG data, use the local JSONL adapter.

git clone https://github.com/LabRAI/EEG-KG.git
cd EEG-KG
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export OPENROUTER_API_KEY="your_key_here"

python tasks/t1_clinical_decision_accuracy.py \
  --dataset data/epibench/t1/mcq.json \
  --triplets data/epikg/triplets.json \
  --model openai/gpt-4o \
  --mode graph_rag
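For the local JSONL adapter mentioned above, each line of the input file is one JSON object. The sketch below shows a generic JSONL reader; the field names `eeg_description` and `patient_context` and the file name `t2_local.jsonl` are hypothetical, so consult the released local schema for the actual format.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err


# Write a small placeholder file and read it back. Field names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "t2_local.jsonl"
    sample.write_text(
        '{"eeg_description": "placeholder EEG text", "patient_context": "placeholder context"}\n'
        "\n"
        '{"eeg_description": "second record", "patient_context": "more context"}\n',
        encoding="utf-8",
    )
    records = list(read_jsonl(sample))

print(len(records), sorted(records[0]))
```

Reading records lazily and reporting the failing line number makes malformed private data easier to fix without loading the whole file.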

Release plan

Code, graph, tasks, and restricted-data adapters.

Code release: Task scripts, Graph-RAG retriever, metrics, and examples
Manifest: Paper-to-code mapping for every task and metric
Harvard EEG local schema: Private-data adapter format for T2 report generation
Demo graph JSON: Compact KG subset used by this project page
Apache-2.0 license: Open-source license for this code release

Citation

Cite EpiGraph

@article{dai2026epigraph,
  title={EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild},
  author={Dai, Yuyang and Chen, Zheng and Pradeepkumar, Jathurshan and Matsubara, Yasuko and Sun, Jimeng and Sakurai, Yasushi and Dong, Yushun},
  journal={arXiv preprint arXiv:2605.09505},
  eprint={2605.09505},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.09505},
  year={2026}
}