This project uses machine learning and spatial regression techniques to predict red squirrel populations across Scottish local authorities using mixed-source citizen science data. The system includes a two-stage duplicate detection algorithm and a Random Forest Regressor model to process and analyse data from the Scottish Squirrel Database.
- Two-Stage Duplicate Detection: Identifies and removes duplicate observations using both exact spatial matching and proximity-based detection that accounts for coordinate uncertainty
- Spatial-Temporal Modelling: Predicts future squirrel populations using Random Forest Regressor with spatial and temporal features
- Interactive Visualisation: Streamlit web application with dynamic maps, charts, and prediction tools
- Uncertainty Quantification: Provides confidence intervals for predictions to support conservation decision-making
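The two-stage duplicate detection described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the `Sighting` fields, the same-date requirement, and the rule that two records count as proximity duplicates when their distance is within the sum of their coordinate uncertainties are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Sighting:
    # Hypothetical record layout; the real database fields may differ.
    record_id: str
    date: str             # ISO date, e.g. "2019-05-14"
    lat: float            # decimal degrees
    lon: float            # decimal degrees
    uncertainty_m: float  # reported coordinate uncertainty in metres


def haversine_m(a: Sighting, b: Sighting) -> float:
    """Great-circle distance between two sightings, in metres."""
    earth_radius_m = 6_371_000.0
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_m * asin(sqrt(h))


def remove_duplicates(sightings: list[Sighting]) -> list[Sighting]:
    # Stage 1: exact spatial matching on identical date and coordinates.
    seen: set[tuple[str, float, float]] = set()
    stage_one = []
    for s in sightings:
        key = (s.date, s.lat, s.lon)
        if key not in seen:
            seen.add(key)
            stage_one.append(s)

    # Stage 2: proximity-based matching. Assumed rule: same date and a
    # distance no larger than the two uncertainty radii combined.
    kept: list[Sighting] = []
    for s in stage_one:
        is_duplicate = any(
            k.date == s.date
            and haversine_m(k, s) <= k.uncertainty_m + s.uncertainty_m
            for k in kept
        )
        if not is_duplicate:
            kept.append(s)
    return kept


sample = [
    Sighting("a", "2019-05-14", 56.0, -4.0, 10.0),
    Sighting("b", "2019-05-14", 56.0, -4.0, 10.0),     # exact duplicate of a
    Sighting("c", "2019-05-14", 56.0001, -4.0, 10.0),  # about 11 m from a
    Sighting("d", "2019-05-14", 56.01, -4.0, 10.0),    # about 1.1 km from a
    Sighting("e", "2019-05-15", 56.0, -4.0, 10.0),     # different date
]
unique = remove_duplicates(sample)
print([s.record_id for s in unique])  # ['a', 'd', 'e']
```

The pairwise comparison in stage 2 is quadratic in the number of records; for a dataset of this size a spatial index would likely be preferable, but the simple loop keeps the idea visible.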
- Python: Primary programming language
- MongoDB: Database storage for squirrel observation records
- Streamlit: Web application framework for interactive visualisation
- Scikit-learn: Implementation of Random Forest Regressor model
- GeoPandas: Spatial data processing and analysis
- Folium: Interactive mapping
- Plotly: Statistical visualisations
The system uses the Scottish Squirrel Database obtained via the NBN Atlas, containing over 93,000 squirrel sightings recorded between 1905 and 2021. The data is filtered, cleaned, and processed to focus on red squirrels in Scotland from 2010-2021.
- Successfully identified and removed 5.36% of records as potential duplicates
- Achieved an R² score of 0.849 during cross-validation and 0.79 on the test set
- Identified current year density (50.8%) and previous year count (35.5%) as the most influential predictors
- Created an intuitive application for exploring population trends and generating future predictions
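The modelling and evaluation steps behind these results might look something like the sketch below. It runs on synthetic stand-in data, so its scores will not match the figures above. The feature names echo the predictors mentioned in the results, while the `year` feature, the hyperparameters, the 80/20 split, and the 5-fold cross-validation are assumptions. The interval at the end uses the spread of per-tree predictions, which is one common heuristic and not necessarily how the project derives its confidence intervals.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; the real features come from the cleaned sightings.
rng = np.random.default_rng(0)
n = 300
current_year_density = rng.uniform(0.0, 5.0, n)
previous_year_count = rng.integers(0, 200, n).astype(float)
year = rng.integers(2010, 2022, n).astype(float)

feature_names = ["current_year_density", "previous_year_count", "year"]
X = np.column_stack([current_year_density, previous_year_count, year])
y = (30.0 * current_year_density
     + 0.5 * previous_year_count
     + rng.normal(0.0, 5.0, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed hyperparameters, chosen only for illustration.
model = RandomForestRegressor(n_estimators=200, random_state=0)

# R^2 under 5-fold cross-validation on the training portion.
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

# Fit on the full training portion and score on the held-out test set.
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# Impurity-based feature importances; they sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))

# Heuristic interval: 2.5th and 97.5th percentiles across individual trees.
per_tree = np.stack([tree.predict(X_test) for tree in model.estimators_])
lower, upper = np.percentile(per_tree, [2.5, 97.5], axis=0)

print(f"CV R^2: {cv_r2.mean():.3f}, test R^2: {test_r2:.3f}")
for name, value in importances.items():
    print(f"{name}: {value:.1%}")
```

Intervals built this way reflect disagreement between trees rather than a calibrated confidence level, so they are best read as a rough indication of prediction stability.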
- Clone this repository
- Install required packages:
  pip install -r requirements.txt
- Configure the MongoDB connection in config.py
- Run the application:
  streamlit run app.py
This project adheres to GDPR principles by:
- Anonymising all personal data
- Implementing appropriate security measures
- Using data only for the specific purpose of wildlife population modelling
- Displaying reduced precision coordinates in visualisations
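As an illustration of the last point, coordinates could be coarsened before they reach a map layer. The helper name `reduce_precision` and the choice of two decimal places (roughly 1 km in latitude) are assumptions for this sketch; the precision actually used by the application is not stated here.

```python
def reduce_precision(lat: float, lon: float, decimals: int = 2) -> tuple[float, float]:
    """Round coordinates so displayed points no longer pinpoint a sighting.

    Hypothetical helper: two decimal places is an assumed default.
    """
    return round(lat, decimals), round(lon, decimals)


display_point = reduce_precision(56.123456, -4.987654)
print(display_point)  # (56.12, -4.99)
```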
Potential enhancements include:
- Dynamic coordinate uncertainty based on environmental factors
- Observer identity integration for improved duplicate detection
- Environmental covariates to improve prediction accuracy
- Real-time data collection interface for citizen scientists
- The citizen scientists who contributed observations to the Scottish Squirrel Database
- NBN Atlas for providing access to the dataset
Click the thumbnail to view the demonstration video.
