feat: add UBERON brain anatomy matching for all species #1825

Open
bendichter wants to merge 6 commits into add-brain-area-anatomy from add-uberon-anatomy

@bendichter
Member

@bendichter bendichter commented Mar 29, 2026

Summary

  • Adds UBERON ontology-based brain region matching, extending anatomy extraction to all species (not just mouse)
  • For mice: tries Allen CCF first, falls back to UBERON per-token
  • For other species: tries UBERON directly
  • Synonym scope is a settable parameter (frozenset[str]), defaulting to EXACT only
  • Bundles ~2,400 nervous system descendant terms from UBERON as a compact JSON (~479KB)
  • Generation script parses the OBO file directly with no library dependencies
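
The matching order in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in brain_areas.py: the function name `match_tokens`, the two toy lookup tables, and the placeholder identifiers are all assumptions made for the example.

```python
from typing import Dict, List

# Toy lookup tables standing in for the bundled Allen CCF and UBERON data.
# The identifiers are placeholders, not real ontology IDs.
ALLEN_CCF: Dict[str, str] = {"ca1": "ALLEN:placeholder-1"}
UBERON: Dict[str, str] = {
    "ca1": "UBERON:placeholder-1",
    "cerebellum": "UBERON:placeholder-2",
}


def match_tokens(tokens: List[str], species: str) -> List[str]:
    """Match each location token; mice try Allen CCF before UBERON."""
    use_allen = species == "mouse"
    matches = []
    for token in tokens:
        key = token.strip().lower()
        # Allen CCF is consulted only for mice.
        hit = ALLEN_CCF.get(key) if use_allen else None
        # Per-token fallback: a miss in Allen CCF falls through to UBERON.
        if hit is None:
            hit = UBERON.get(key)
        if hit is not None:
            matches.append(hit)
    return matches


print(match_tokens(["CA1", "cerebellum", "unknown"], "mouse"))
print(match_tokens(["CA1", "cerebellum", "unknown"], "rat"))
```

Because the fallback is applied per token, a single mouse location string can yield a mix of Allen CCF and UBERON terms.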

New files

  • dandi/data/generate_uberon_structures.py — downloads and parses uberon.obo, extracts nervous system descendants
  • dandi/data/uberon_brain_structures.json — bundled lookup data (2,408 structures, 9,717 synonyms)
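
A dependency-free OBO parse of the kind the generator performs might look like the sketch below. It reuses the synonym regex quoted later in the review, but everything else is an assumption: the helper names `parse_obo` and `descendants`, the tiny inline sample, and the choice to follow only is_a edges (the real script may follow other relations as well).

```python
import re
from collections import defaultdict

SYNONYM_RE = re.compile(r'synonym:\s+"(.+?)"\s+(EXACT|RELATED|NARROW|BROAD)')

# Made-up sample in OBO syntax; the IDs are not real UBERON terms.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: X:1
name: nervous system

[Term]
id: X:2
name: brain
synonym: "encephalon" EXACT []
is_a: X:1 ! nervous system

[Term]
id: X:3
name: heart

[Typedef]
id: part_of
name: part of
"""


def parse_obo(text):
    """Return {term_id: {"name", "synonyms", "parents"}} for [Term] stanzas."""
    terms = {}
    current = None  # None while in the header or a non-Term stanza
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            current = (
                {"name": "", "synonyms": [], "parents": []}
                if line == "[Term]"
                else None
            )
        elif current is None:
            continue
        elif line.startswith("id: "):
            terms[line[4:]] = current
        elif line.startswith("name: "):
            current["name"] = line[6:]
        elif m := SYNONYM_RE.match(line):
            current["synonyms"].append((m.group(1), m.group(2)))
        elif line.startswith("is_a: "):
            # Drop the trailing "! label" comment, keep the parent ID.
            current["parents"].append(line[6:].split(" ! ")[0])
    return terms


def descendants(terms, root):
    """Collect all term IDs reachable from root via is_a edges."""
    children = defaultdict(list)
    for term_id, term in terms.items():
        for parent in term["parents"]:
            children[parent].append(term_id)
    seen, stack = set(), [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


terms = parse_obo(SAMPLE_OBO)
print(descendants(terms, "X:1"))
```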

Modified files

  • dandi/metadata/brain_areas.py — UBERON loading, lookup, and matching functions; locations_to_mouse_anatomy() with Allen→UBERON fallback
  • dandi/metadata/util.py — _extract_brain_anatomy() now handles all species
  • dandi/tests/test_brain_areas.py — tests for UBERON matching, synonym scope control, and Allen/UBERON fallback
  • .pre-commit-config.yaml, pyproject.toml — codespell exclusions for new JSON

Test plan

  • All 56 brain area tests pass locally
  • All 179 metadata tests pass (including existing brain anatomy integration tests)
  • Pre-commit passes (black, isort, codespell, flake8)
  • CI passes on all platforms

🤖 Generated with Claude Code

For mice, location tokens are first matched against Allen CCF, then
fall back to UBERON.  For all other species, UBERON is tried directly.
Synonym scope (EXACT, RELATED, NARROW, BROAD) is a settable parameter,
defaulting to EXACT only.

- Add generate_uberon_structures.py to parse the UBERON OBO file and
  produce a bundled JSON of ~2,400 nervous-system descendants
- Add UBERON lookup/matching functions to brain_areas.py
- Update _extract_brain_anatomy in util.py to handle non-mouse species
- Add comprehensive tests for UBERON matching and Allen/UBERON fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bendichter bendichter added the minor Increment the minor version when merged label Mar 29, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bendichter bendichter added and removed the enhancement (New feature or request) and minor (Increment the minor version when merged) labels Mar 29, 2026
@codecov

codecov bot commented Mar 29, 2026

Codecov Report

❌ Patch coverage is 93.43629% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.73%. Comparing base (3c7ba2d) to head (69a328e).
⚠️ Report is 11 commits behind head on add-brain-area-anatomy.

Files with missing lines                    Patch %   Lines
dandi/metadata/brain_areas.py               84.61%    12 Missing ⚠️
dandi/cli/cmd_service_scripts.py            60.00%    2 Missing ⚠️
dandi/metadata/util.py                      66.66%    2 Missing ⚠️
dandi/data/generate_uberon_structures.py    98.90%    1 Missing ⚠️
Additional details and impacted files
@@                    Coverage Diff                     @@
##           add-brain-area-anatomy    #1825      +/-   ##
==========================================================
+ Coverage                   75.34%   75.73%   +0.39%     
==========================================================
  Files                          87       89       +2     
  Lines                       12259    12514     +255     
==========================================================
+ Hits                         9237     9478     +241     
- Misses                       3022     3036      +14     
Flag Coverage Δ
unittests 75.73% <93.43%> (+0.39%) ⬆️


bendichter and others added 2 commits March 29, 2026 12:44
Instead of passing a flat set of scopes, use a max_synonym_scope
parameter (default "EXACT").  Matching tries tiers in precision order:
EXACT > NARROW > BROAD > RELATED, up to the specified maximum.
Term names are always tried first before any synonym tier.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thread the scope parameter through so callers can control how
permissive UBERON synonym matching is.  Defaults to EXACT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
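
The tiered lookup described in the two commits above could be sketched like this. Only the parameter name max_synonym_scope, its "EXACT" default, and the tier order come from the commit message; the function name `lookup`, the dictionary layout, and the sample data are assumptions for illustration.

```python
from typing import Dict, Optional

# Precision order from the commit message: EXACT > NARROW > BROAD > RELATED.
SCOPE_ORDER = ("EXACT", "NARROW", "BROAD", "RELATED")


def lookup(
    token: str,
    names: Dict[str, str],
    synonyms_by_scope: Dict[str, Dict[str, str]],
    max_synonym_scope: str = "EXACT",
) -> Optional[str]:
    """Try term names first, then synonym tiers up to max_synonym_scope."""
    key = token.strip().lower()
    if key in names:
        return names[key]
    # Raises ValueError for an unknown scope name.
    last = SCOPE_ORDER.index(max_synonym_scope)
    for scope in SCOPE_ORDER[: last + 1]:
        if (term_id := synonyms_by_scope.get(scope, {}).get(key)) is not None:
            return term_id
    return None


# Placeholder data; the ID and synonyms are invented for the example.
names = {"brain": "ID:1"}
synonyms = {
    "EXACT": {"encephalon": "ID:1"},
    "BROAD": {"head ganglion": "ID:1"},
}

print(lookup("Encephalon", names, synonyms))
print(lookup("head ganglion", names, synonyms))
print(lookup("head ganglion", names, synonyms, max_synonym_scope="BROAD"))
```

Expressing the setting as a single maximum rather than a set of scopes means a caller cannot enable a loose tier while skipping a stricter one.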
Member

@yarikoptic yarikoptic left a comment

most likely my requests could be handled by an agent quite sensibly

Comment on lines +65 to +66
m = re.match(r'synonym:\s+"(.+?)"\s+(EXACT|RELATED|NARROW|BROAD)', line)
if m:

I tend to instruct my AIs to do walrus for such... I guess we need to adjust DEVELOPMENT.md and/or .lad for that to be auto-picked up

Suggested change
- m = re.match(r'synonym:\s+"(.+?)"\s+(EXACT|RELATED|NARROW|BROAD)', line)
- if m:
+ if (m := re.match(r'synonym:\s+"(.+?)"\s+(EXACT|RELATED|NARROW|BROAD)', line)):


def main() -> None:  # pragma: no cover
    url = "http://purl.obolibrary.org/obo/uberon.obo"
    print(f"Downloading {url} ...")

especially when moved into service command - use logging to gain logging control/archival etc

- Move UBERON generator to service-scripts CLI subcommand
- Use logging instead of print in generator
- Use walrus operators in generator and brain_areas.py
- Pretty-print UBERON JSON (indent=1) for better diffs
- Use glob patterns for codespell/pre-commit excludes
- Exclude *_structures.json from large-file check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bendichter
Member Author

Addressed all review feedback in e40872c:

  1. Service-scripts CLI: UBERON generator is now a dandi service-scripts generate-uberon-structures subcommand (Allen CCF can follow in a separate commit on the parent PR).

  2. Logging: Replaced all print() calls with lgr.info().

  3. Walrus operators: Applied in both the generator (if m := re.match(...)) and brain_areas.py (if (s := _lookup_in_dicts(...)) is not None:).

  4. Pretty-printed JSON: uberon_brain_structures.json now uses indent=1 for human readability and better diffs. Excluded *_structures.json from the large-file check since the indented file is ~700KB.

  5. Glob patterns for codespell/pre-commit: Both pyproject.toml (*_structures.json) and .pre-commit-config.yaml (dandi/data/.*_structures\.json) now use globs so future structure files are automatically covered.

Not yet addressed: weekly CI smoke test for the generator — happy to add a pytest marker for that if you'd like it in this PR.
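
Items 2 and 4 in the list above (logging instead of print, and indent=1 output) might combine along these lines. This is a sketch under assumptions: the logger name, the helper `write_structures`, the trailing newline, and the sample record are all hypothetical and not taken from the PR.

```python
import json
import logging
import tempfile
from pathlib import Path

lgr = logging.getLogger("example.uberon")  # hypothetical logger name


def write_structures(structures, path: Path) -> int:
    """Write structures as indent=1 JSON and return the byte count."""
    # indent=1 puts each field on its own line, so regenerating the file
    # produces line-level diffs, at the cost of a larger file.
    data = (json.dumps(structures, indent=1) + "\n").encode("utf-8")
    path.write_bytes(data)
    lgr.info("Wrote %d structures (%d bytes) to %s", len(structures), len(data), path)
    return len(data)


logging.basicConfig(level=logging.INFO)
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "example_structures.json"
    write_structures([{"id": "X:1", "name": "brain"}], out)
    print(out.read_text(encoding="utf-8"))
```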

@yarikoptic
Member

> weekly CI smoke test for the generator — happy to add a pytest marker for that if you'd like it in this PR.

develop the test itself here, and mark with a new (to be added) data_regeneration marker. We will look into adding a dedicated CI (and moving out from the main loop) later for it. also mark with needing network.

Adds a regression test that re-downloads the UBERON OBO file and
verifies the generated output matches the committed JSON.  Marked
with data_regeneration and skipif_no_network for scheduled CI runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bendichter
Member Author

Added in 69a328e:

  • Regression test in dandi/tests/test_data_regeneration.py: re-downloads the UBERON OBO file, runs the generator to a temp path, and asserts the output matches the committed JSON byte-for-byte. This catches both code regressions and upstream UBERON changes.
  • Marked with @pytest.mark.data_regeneration (new marker, registered in pytest_plugin.py) and @mark.skipif_no_network.
  • Run with pytest -m data_regeneration for a scheduled CI job.
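
The core of such a regression check, regenerating to a temporary path and comparing bytes, can be sketched with the standard library alone. The `generate` function here is a hypothetical stand-in that writes fixed data instead of downloading the OBO file, and `regenerated_matches` is an invented helper name; the actual test lives in test_data_regeneration.py and is not reproduced here.

```python
import json
import tempfile
from pathlib import Path


def generate(dest: Path) -> None:
    """Hypothetical stand-in for the real generator; no network access."""
    dest.write_text(
        json.dumps([{"id": "X:1", "name": "brain"}], indent=1),
        encoding="utf-8",
    )


def regenerated_matches(committed: Path) -> bool:
    """Regenerate into a temp directory and compare byte-for-byte."""
    with tempfile.TemporaryDirectory() as tmp:
        fresh = Path(tmp) / committed.name
        generate(fresh)
        return fresh.read_bytes() == committed.read_bytes()


with tempfile.TemporaryDirectory() as repo:
    committed = Path(repo) / "example_structures.json"
    generate(committed)
    print(regenerated_matches(committed))  # file is up to date
    committed.write_text("[]", encoding="utf-8")
    print(regenerated_matches(committed))  # file has drifted
```

In a pytest suite this comparison would sit inside a test function carrying the markers named above. A custom marker is typically registered with config.addinivalue_line("markers", "data_regeneration: ...") in a pytest_configure hook; the description string there is a placeholder.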

@satra
Member

satra commented Apr 2, 2026

@bendichter
Member Author

thanks for the pointer, @satra. I'm looking at the URLs of the individual terms inside this owl file, and the ones I tried did not resolve to a term page, e.g. https://purl.brain-bican.org/ontology/HOMBA_12230 and https://purl.brain-bican.org/ontology/HOMBA_12233. Is that intentional?

@satra
Member

satra commented Apr 2, 2026

url and uri are two different things. an owl doesn't have to resolve to a page. i believe they will fix that though soonish (in a couple of weeks).

@yarikoptic
Member

@bendichter @satra -- please clarify what exactly is yet to be done (by @bendichter and his agents), or should I re-review this?

@bendichter
Member Author

See new design doc: dandi/dandi-archive#2768, an attempt to understand and plan how we can use all these ontologies together

Labels

enhancement New feature or request minor Increment the minor version when merged

3 participants