year=-1 sentinel value in lab cross samples causes crashes and silent NaN values in sample_metadata() downstream analysis #1092

@sallykinyua

Description

The Problem
year=-1 marks samples from mosquitoes bred in the laboratory (lab crosses). They were assigned the value -1 because no real collection date exists for them.
Neither the API documentation nor the docstrings mention that year=-1 exists or what it means. A user looking at the data for the first time has no way to know that -1 is a sentinel value; it looks like a corrupted or impossible date.

The root cause is in malariagen_data/anoph/sample_metadata.py. The year column is defined as int64 in _parse_general_metadata() (line 165) with no validation, and nothing filters out or warns about year=-1 values when they are returned.

Steps to Reproduce with Code:

import malariagen_data
ag3 = malariagen_data.Ag3()
df_samples = ag3.sample_metadata(sample_sets="3.0")

# Crashes with ValueError: "-1-01-01" is not a parseable date

import pandas as pd
df_samples["collection_date"] = pd.to_datetime(
    df_samples["year"].astype(str) + "-01-01"
)
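
A possible workaround until the API handles this, sketched on a small synthetic DataFrame rather than real Ag3 data (the sample_id values and years are made up for illustration). It blanks out the date string for any row whose year is not positive, so those rows end up as NaT instead of raising:

```python
import pandas as pd

# Synthetic stand-in for the sample_metadata() output; values are illustrative only.
df_samples = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "year": [2012, -1, 2015],
    }
)

# Build the date string for every row, then blank it out where the year
# is the -1 sentinel so it never reaches the date parser.
date_str = df_samples["year"].astype(str) + "-01-01"
date_str = date_str.where(df_samples["year"] > 0)

# Missing strings are parsed as NaT rather than raising an error.
df_samples["collection_date"] = pd.to_datetime(date_str, format="%Y-%m-%d")
print(df_samples)
```

Masking the sentinel explicitly, rather than passing errors="coerce", keeps genuinely malformed years visible as errors.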

The Outputs:


Null values in cohort metadata caused by year=-1:
country_iso: 297 NaN
admin1_name: 297 NaN
cohort_admin1_year: 300 NaN
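
Counts like the ones above can be obtained roughly as follows. This is a minimal sketch on synthetic rows, not the actual Ag3 data; only the column names are taken from this issue, and the values are invented:

```python
import pandas as pd

# Synthetic rows; only the column names come from the issue.
df_samples = pd.DataFrame(
    {
        "year": [2012, -1, -1, 2015],
        "country_iso": ["KEN", None, None, "GHA"],
        "admin1_name": ["Kilifi", None, None, "Ashanti"],
        "cohort_admin1_year": ["KE-14_2012", None, None, "GH-AH_2015"],
    }
)

# Select the rows carrying the sentinel and count missing values per column.
is_lab_cross = df_samples["year"] == -1
cohort_cols = ["country_iso", "admin1_name", "cohort_admin1_year"]
null_counts = df_samples.loc[is_lab_cross, cohort_cols].isna().sum()
print(null_counts)
```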

Expected vs Actual Behaviour
Expected: The API should either exclude lab cross samples by default, or warn users that year=-1 samples are present and may affect temporal analysis.
Actual: year=-1 samples are silently included in all results with no warning, causing crashes and distorted outputs in downstream analysis.

The Proposed Fix
Add an exclude_lab_crosses=False parameter to sample_metadata() that filters out samples with year=-1 when set to True. Additionally, add a UserWarning when results contain year=-1 values, and document the sentinel value in the docstring.
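
A rough sketch of the behaviour I have in mind, written as a standalone helper so it runs without malariagen_data. The function name filter_lab_crosses and the constant LAB_CROSS_YEAR are hypothetical, and how this would actually be wired into sample_metadata() is still open:

```python
import warnings

import pandas as pd

LAB_CROSS_YEAR = -1  # sentinel described in this issue


def filter_lab_crosses(df, exclude_lab_crosses=False):
    """Hypothetical helper: drop or warn about rows with year == -1."""
    is_lab_cross = df["year"] == LAB_CROSS_YEAR
    if exclude_lab_crosses:
        # Drop the lab cross rows and return a clean index.
        return df.loc[~is_lab_cross].reset_index(drop=True)
    n_lab_cross = int(is_lab_cross.sum())
    if n_lab_cross > 0:
        warnings.warn(
            f"{n_lab_cross} samples have year=-1 (lab crosses, no collection "
            "date); pass exclude_lab_crosses=True to drop them.",
            UserWarning,
            stacklevel=2,
        )
    return df


# Illustrative data only.
demo = pd.DataFrame({"sample_id": ["S1", "S2", "S3"], "year": [2012, -1, 2015]})
kept = filter_lab_crosses(demo, exclude_lab_crosses=True)
print(kept)
```

Defaulting to False keeps existing results unchanged, so the only visible difference for current users would be the warning.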

Happy to submit a PR!
