Getting started
Install
git clone https://github.com/gallantlab/literature-review-toolkit.git
cd literature-review-toolkit
# Python deps (spreadsheet writer)
pip install xlsxwriter
# Only needed for the opt-in PDF reconciliation in Phase 4:
brew install poppler # macOS
# or: sudo apt-get install poppler-utils
The tools are plain, standalone Python 3 scripts with a tiny dependency footprint. They are meant to be read and adapted — scaffolding, not a framework.
Configure your contact email
NCBI and CrossRef ask API callers to identify themselves with a contact email (identified callers get the more lenient "polite" rate limits rather than being throttled). Set it once:
…or pass --email you@institution.edu to each tool invocation.
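For readers adapting the scripts, here is a minimal sketch of what "identifying yourself" looks like at the HTTP level. It relies on the `mailto` query parameter that CrossRef documents for its REST API and on the `tool` and `email` parameters that NCBI documents for E-utilities. The function names and the default `tool` value are hypothetical, and this is not necessarily how the toolkit's own scripts build their requests.

```python
from urllib.parse import quote, urlencode


def crossref_works_url(doi: str, email: str) -> str:
    """Build a CrossRef REST URL that carries the contact email as `mailto`."""
    # DOIs contain "/", which is kept as-is in the path.
    path = quote(doi, safe="/")
    return f"https://api.crossref.org/works/{path}?" + urlencode({"mailto": email})


def ncbi_esearch_url(term: str, email: str,
                     tool: str = "literature-review-toolkit") -> str:
    """Build an NCBI E-utilities esearch URL with `tool` and `email` set.

    The default `tool` string is an illustrative assumption.
    """
    params = {"db": "pubmed", "term": term, "tool": tool, "email": email}
    return ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
            + urlencode(params))


if __name__ == "__main__":
    print(crossref_works_url("10.1000/xyz123", "you@institution.edu"))
    print(ncbi_esearch_url("cerebellum visual", "you@institution.edu"))
```

The functions only build URLs, so they can be checked without touching the network.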
Optional: a Semantic Scholar API key
Citation counts (Phase 5b) query OpenAlex first and Semantic Scholar (S2) as a cross-check. S2 rate-limits anonymous callers (HTTP 429) on large corpora. If you have a key, export it to avoid the throttle:
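Separately from the key setup, a caller that still hits HTTP 429 usually backs off and retries. The sketch below shows that pattern under stated assumptions: `s2_headers` and `fetch_with_backoff` are hypothetical helpers, not functions from the toolkit, and the key is passed as an argument rather than read from any particular environment variable. The `x-api-key` header is the one the Semantic Scholar API documents for authenticated requests.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def s2_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Return request headers; anonymous callers send no key header."""
    return {"x-api-key": api_key} if api_key else {}


def fetch_with_backoff(fetch: Callable[[], Tuple[int, str]],
                       max_retries: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep
                       ) -> Tuple[int, str]:
    """Call `fetch` and retry on HTTP 429 with exponential backoff.

    `fetch` is any zero-argument callable returning (status_code, body).
    The delay doubles after each throttled attempt: 1s, 2s, 4s, ...
    """
    status, body = fetch()
    for attempt in range(max_retries):
        if status != 429:
            break
        sleep(base_delay * 2 ** attempt)
        status, body = fetch()
    return status, body
```

Injecting `fetch` and `sleep` keeps the retry logic testable without network access or real waiting.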
How you'll run it
There are two ways to drive the toolkit. The intended one is to let a Claude Code agent orchestrate the whole pipeline; the manual path exists for when you don't have an agent or want to run a single step by hand.
On the agent path, open Claude Code in the directory where you keep your reviews (the bibliography root) and describe the review in plain English. The agent reads PLAYBOOK.md, picks a slug, creates a subdirectory, runs the phases, and drops the .xlsx inside.
Prompts that work:
i want a literature review on the anatomical connections between the visual
system and the cerebellum. any anatomy papers from primate or human, using
any tractography method. go back as far as the 1970s.
do a fresh lit review on language learning in adults — both L1 and L2,
behavioral and neuroimaging, last 15 years.
extend the existing visual_cerebellum review with another 20 papers focused
on cerebello-thalamic projections.
now group the world_models bibliography into a few theoretical families,
and let's iterate on a lineage figure.
turn the world_models bibliography into a written review article — author it,
run the priority audit, and render the .docx.
The agent confirms scope only when something is genuinely ambiguous, and reports when it's done.
Rough cost
For ~40 search-added + ~30 cross-citation papers, the no-PDF workflow takes roughly 1–2M tokens and 5–10 minutes wall-clock. The cross-citation pass (Phase 6) is the slowest step.
On the manual path, every phase is a single script you can run yourself. You supply the search results (as rows.json); the toolkit does the verification and bookkeeping.
# Pick a topic slug and make a subdir.
mkdir my_topic && cd my_topic
# Phase 2 / 2b: run the search agent with tools/search_prompt_template.md
# filled in (forward search + a REQUIRED antecedents pass). Save the
# returned papers as rows.json — links MUST be DOI URLs.
# Phase 3: verify everything before trusting any of it.
python3 ../tools/verify.py --citations rows.json --out verify_report.json
# Phase 3f: rebuild every reference into canonical APA-7, then gate.
python3 ../tools/references.py --rows rows.json --out rows.json
python3 ../tools/references.py --rows rows.json --audit
# Phase 5: build the xlsx.
python3 ../tools/spreadsheet.py --rows rows.json --out my_topic_bibliography.xlsx
# Phase 5b: citation counts (attach to rows, rerun spreadsheet.py).
python3 ../tools/citations.py --rows rows.json --out citation_counts.json
# Phase 6: cross-citation pass; pick additions; repeat 3 + 5 for the batch.
python3 ../tools/xref.py --papers verified.json --exclude existing_dois.json \
--out xref_my_topic.json --min-cites 4 --resolve-unknown
The optional Phases 6b (families + figure) and 7 (review article) are described on the Phases in detail page. PDF download (Phase 4) is opt-in.
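Because the manual path requires every link in rows.json to be a DOI URL, a quick pre-check before Phase 3 can catch stray PubMed or publisher links. The following is a minimal sketch, assuming rows.json is a JSON list of objects and that the link lives in a field named `link`; both the schema and the field name are assumptions, so adjust them to match your file. It is a convenience check and does not replace tools/verify.py.

```python
import json
import re
import sys

# Accepts https://doi.org/10.xxxx/... (and the older dx.doi.org host).
DOI_URL = re.compile(r"^https?://(dx\.)?doi\.org/10\.\d{4,9}/\S+$", re.IGNORECASE)


def non_doi_links(rows, link_field="link"):
    """Return (index, value) pairs for rows whose link is not a DOI URL.

    `link_field` is a hypothetical field name.
    """
    bad = []
    for i, row in enumerate(rows):
        value = row.get(link_field)
        if not isinstance(value, str) or not DOI_URL.match(value.strip()):
            bad.append((i, value))
    return bad


if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as fh:
        problems = non_doi_links(json.load(fh))
    for index, value in problems:
        print(f"row {index}: not a DOI URL: {value!r}")
    sys.exit(1 if problems else 0)
```

Saved under a name of your choosing, it would be run as `python3 your_script.py rows.json`, exiting non-zero if any row fails the check.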
Where things land
Each review lives in its own subdirectory under your bibliography root. The
JSON files are the source of truth; the .xlsx is rendered from them.
<bibliography_root>/
├── literature-review-toolkit/ <- this repo, cloned once
├── visual_cerebellum/ <- one review topic, one subdir
│ ├── visual_cerebellum_bibliography.xlsx <-- THE DELIVERABLE
│ ├── topic_definition.md (scope you & the agent agreed on)
│ ├── rows.json (the LIVE table — everything renders from it)
│ ├── verify_report.json (Phase 3: OK / MISMATCH / NOT-FOUND per cite)
│ ├── citation_counts.json (Phase 5b: OpenAlex + S2 counts, cached)
│ ├── xref_visual_cerebellum.json (Phase 6: cross-citation frequency table)
│ ├── families.json / families.md (Phase 6b: grouping, if run)
│ ├── visual_cerebellum_families.html (Phase 6b: interactive figure; +svg/png/pdf)
│ ├── content.json (Phase 7: authored prose, if run)
│ └── Visual_Cerebellum_review.docx (Phase 7: AI-authored review, if run)
└── attention/ <- a different topic, separate subdir
└── attention_bibliography.xlsx
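The layout above follows a predictable naming scheme, so it is easy to check which artifacts a review already has. The sketch below is a hypothetical helper (not part of the toolkit) that derives the core file names from the slug, using only names shown in the tree, and reports which ones exist.

```python
from pathlib import Path


def review_artifacts(root, slug):
    """Map each core artifact of a review to whether its file exists.

    File names follow the layout shown above; optional Phase 6b/7
    outputs are left out to keep the sketch short.
    """
    review = Path(root) / slug
    expected = {
        "deliverable": f"{slug}_bibliography.xlsx",
        "rows": "rows.json",
        "verify_report": "verify_report.json",
        "citation_counts": "citation_counts.json",
        "xref": f"xref_{slug}.json",
    }
    return {name: (review / filename).is_file()
            for name, filename in expected.items()}


if __name__ == "__main__":
    for name, present in review_artifacts(".", "visual_cerebellum").items():
        print(f"{name:16s} {'present' if present else 'missing'}")
```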
After Phase 3f, rows.json is the live table
Edit rows.json directly for any later change. Re-running an upstream
row-emitter is destructive — it wipes the canonical references and the
citation counts. To share results, send the .xlsx (or the .docx). To
extend or re-run later, the JSON files are what the toolkit reads from.
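Since rows.json is edited in place from this point on, it is worth making hand edits hard to botch. Here is a minimal sketch of one way to do that, assuming rows.json is a JSON list of objects: load the file, apply a change, and write the result through a temporary file so an interrupted write cannot leave a half-written table. `edit_rows` and the `notes` field in the usage example are hypothetical, not part of the toolkit.

```python
import json
import os
import tempfile
from pathlib import Path


def edit_rows(path, edit):
    """Apply `edit(rows) -> rows` to a rows.json file and save atomically."""
    path = Path(path)
    rows = edit(json.loads(path.read_text(encoding="utf-8")))
    # Write next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(rows, fh, indent=2, ensure_ascii=False)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return rows


def add_note_to_first(rows):
    """Example edit: attach a note to the first row (hypothetical field)."""
    rows[0]["notes"] = "checked by hand"
    return rows


if __name__ == "__main__":
    edit_rows("rows.json", add_note_to_first)
```

After an edit like this, re-run spreadsheet.py to re-render the .xlsx from the updated table.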