Description
The Problem
year=-1 marks samples from mosquitoes bred in the laboratory (lab crosses). These samples were assigned year=-1 because no real collection date exists for them.
There is no mention in the API documentation or docstrings that year=-1 exists or what it means. A user looking at the data for the first time has no way to know that -1 is a special code — it just looks like a corrupted or impossible date.
The root cause is in malariagen_data/anoph/sample_metadata.py. The year column is defined as int64 with no validation in _parse_general_metadata() at line 165. No filtering or warning exists when year=-1 values are returned.
Steps to Reproduce with Code:
import malariagen_data
ag3 = malariagen_data.Ag3()
df_samples = ag3.sample_metadata(sample_sets="3.0")
# Crashes with ValueError: rows with year=-1 produce the unparseable string "-1-01-01"
import pandas as pd
df_samples["collection_date"] = pd.to_datetime(
df_samples["year"].astype(str) + "-01-01"
)
The Outputs:
Null values in the cohort metadata caused by year=-1:
country_iso: 297 NaN
admin1_name: 297 NaN
cohort_admin1_year: 300 NaN
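Until the API handles this, one possible workaround is to mask the sentinel before building dates. The sketch below uses a small synthetic DataFrame in place of the real `ag3.sample_metadata()` output (the sample IDs and years are made up for illustration) and assumes that year=-1 is the only sentinel value in the column.

```python
import pandas as pd

# Synthetic stand-in for ag3.sample_metadata(); only the "year" column matters here.
df_samples = pd.DataFrame(
    {"sample_id": ["S1", "S2", "S3"], "year": [2012, -1, 2014]}
)

# Replace the year=-1 sentinel (lab cross, no collection date) with a missing value
# so that no "-1-01-01" string is ever passed to the date parser.
year_str = df_samples["year"].astype(str).where(df_samples["year"] != -1)

# Missing entries stay missing through the concatenation and become NaT.
df_samples["collection_date"] = pd.to_datetime(year_str + "-01-01", errors="coerce")

print(df_samples)
```

Lab cross rows end up with `NaT` as their collection date instead of raising, so they can be dropped or handled explicitly in temporal analyses.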
Expected vs Actual Behaviour
Expected: The API should either exclude lab cross samples by default, or warn users that year=-1 samples are present and may affect temporal analysis.
Actual: year=-1 samples are silently included in all results with no warning, causing crashes and distorted outputs in downstream analysis.
The Proposed Fix
Add an exclude_lab_crosses=False parameter to sample_metadata() that filters out samples with year=-1 when set to True. Additionally, add a UserWarning when results contain year=-1 values, and document the sentinel value in the docstring.
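A minimal sketch of how the filtering and warning could behave, written as a standalone helper rather than as a change to `sample_metadata()` itself. The helper name `filter_lab_crosses`, the constant `LAB_CROSS_YEAR`, and the warning text are hypothetical and not part of malariagen_data; only the `exclude_lab_crosses` parameter name and the year=-1 sentinel come from the proposal above.

```python
import warnings

import pandas as pd

LAB_CROSS_YEAR = -1  # sentinel described in this issue: no real collection date


def filter_lab_crosses(df, exclude_lab_crosses=False):
    """Hypothetical helper: drop or flag samples whose year is the -1 sentinel."""
    is_lab_cross = df["year"] == LAB_CROSS_YEAR
    if exclude_lab_crosses:
        # Drop lab cross samples and return a fresh, re-indexed frame.
        return df.loc[~is_lab_cross].reset_index(drop=True)
    if is_lab_cross.any():
        # Keep the samples but tell the user they are present.
        warnings.warn(
            f"{int(is_lab_cross.sum())} sample(s) have year=-1 (lab crosses with no "
            "collection date); pass exclude_lab_crosses=True to drop them.",
            UserWarning,
            stacklevel=2,
        )
    return df


# Synthetic data for illustration only.
demo = pd.DataFrame({"sample_id": ["S1", "S2", "S3"], "year": [2012, -1, 2014]})
filtered = filter_lab_crosses(demo, exclude_lab_crosses=True)
print(filtered)
```

Defaulting to `exclude_lab_crosses=False` keeps existing results unchanged, while the warning makes the sentinel visible to users who have not read the docstring.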
Happy to submit a PR!

