Add sample weights, block bootstrap, and permutation importance #43

Open
kmedved wants to merge 3 commits into main from claude/sift-feature-improvements-ID3Ab

Conversation

@kmedved
Owner

@kmedved kmedved commented Jan 1, 2026

This PR implements three major improvements:

  1. Sample Weights in Filter Methods:

    • Add weighted F-regression and F-classif in relevance.py
    • Add weighted rank-Gaussian transforms and correlations in copula.py
    • Add weighted mRMR selection in loops.py
    • Update all public API functions to accept sample_weight parameter
  2. Block Bootstrap in StabilitySelector:

    • Add block bootstrap functions (moving, circular, stationary)
    • Update StabilitySelector to use block bootstrap when groups/time provided
    • Update stability_regression and stability_classif convenience functions
  3. Block Permutation Importance:

    • Create new importance.py module
    • Support global, within_group, block, and circular_shift permutation methods
    • Auto-select appropriate method based on available group/time data

All new parameters are optional with sensible defaults.
Tests added for all new functionality (47 tests, all passing).
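For context, a weighted univariate F-statistic along the lines of the relevance.py change can be sketched as follows. This is an illustrative standalone sketch, not the library's actual implementation; the function name and the use of the Kish effective sample size for the degrees of freedom are assumptions:

```python
import numpy as np

def f_regression_weighted(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Sketch of a weighted F-statistic per column: weighted Pearson
    correlation converted to an F score using an effective sample size."""
    w = np.asarray(w, dtype=np.float64)
    w = w / w.sum()                                   # normalize weights
    Xc = X - (w[:, None] * X).sum(axis=0)             # weighted centering
    yc = y - (w * y).sum()
    cov = (w[:, None] * Xc * yc[:, None]).sum(axis=0)
    var_x = (w[:, None] * Xc ** 2).sum(axis=0)
    var_y = (w * yc ** 2).sum()
    r = cov / np.sqrt(var_x * var_y + 1e-12)          # weighted correlation
    n_eff = 1.0 / (w ** 2).sum()                      # Kish effective sample size
    return r ** 2 / (1.0 - r ** 2 + 1e-12) * (n_eff - 2.0)
```

With uniform weights this reduces to the ordinary F-regression score up to the small numerical-stability terms.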

claude added 2 commits January 1, 2026 06:36
Copilot AI review requested due to automatic review settings January 1, 2026 07:05
Copilot AI left a comment

Pull request overview

This PR adds three major enhancements to the sift feature selection library:

Purpose: Add support for sample weights, block bootstrap for time-series data, and block-aware permutation importance.

Key Changes:

  • Sample weights integrated into all filter methods (mRMR, JMI, JMIM, CEFS+) with weighted F-statistics, correlations, and rank transforms
  • Block bootstrap methods (moving, circular, stationary) added to StabilitySelector for panel/time-series data
  • New permutation_importance module with time-series-aware permutation strategies (within_group, block, circular_shift)

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| sift/importance.py | New module implementing time-series-aware permutation importance with multiple strategies |
| sift/stability.py | Added block bootstrap functions and integration into StabilitySelector with automatic method selection |
| sift/estimators/relevance.py | Implemented weighted F-regression and F-classification statistics |
| sift/estimators/copula.py | Added weighted rank-Gaussian transforms, correlations, and weighted subsampling in cache building |
| sift/selection/loops.py | Implemented weighted mRMR selection with weighted correlation calculations |
| sift/selection/cefsplus.py | Updated CEFS+ to use weighted transforms from cache |
| sift/api.py | Added sample_weight parameter to all public selection functions |
| sift/__init__.py | Exported new permutation_importance function |
| tests/test_weights.py | Comprehensive tests for sample weights across all selectors |
| tests/test_importance.py | Tests for all permutation importance methods and edge cases |
| tests/test_block_bootstrap.py | Tests for block bootstrap with different methods and configurations |
| .gitignore | Standard Python gitignore patterns |


Comment on lines +173 to +179
```python
# Prepare sample weights
n = X_arr.shape[0]
w = rel_est._ensure_weights(sample_weight, n) if sample_weight is not None else np.ones(n, dtype=np.float64)
if subsample is not None and len(w) > n:
    # weights were passed in for original data, need to subsample them too
    # This shouldn't happen as subsample happens in _prepare_xy_classic
    pass
```
Copilot AI Jan 1, 2026
Sample weights are not properly subsampled when the data is subsampled. When subsample_xy is called in _prepare_xy_classic, it subsamples X and y, but the sample_weight array is not subsampled to match. This means the weight array length will not match the subsampled data length, leading to a dimension mismatch.

The weights should be subsampled using the same indices as X and y. Consider modifying _prepare_xy_classic to accept and return sample_weight, or subsample the weights here using the same random state and logic.

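A sketch of what the reviewer proposes: subsample X, y, and the weights with the same indices so the weight vector stays aligned with the surviving rows. The helper name and signature are hypothetical; the PR's actual `subsample_xy`/`_prepare_xy_classic` functions may differ:

```python
import numpy as np

def subsample_xy_with_weights(X, y, sample_weight=None, max_rows=10_000, seed=0):
    """Subsample X, y, and sample_weight with the *same* row indices,
    keeping the weight vector aligned with the subsampled data."""
    n = X.shape[0]
    if n <= max_rows:
        return X, y, sample_weight
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=max_rows, replace=False)  # one index draw for all three
    w = None if sample_weight is None else np.asarray(sample_weight)[idx]
    return X[idx], y[idx], w
```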
Comment on lines +391 to +393
```python
# Prepare sample weights
n = X_arr.shape[0]
w = rel_est._ensure_weights(sample_weight, n) if sample_weight is not None else np.ones(n, dtype=np.float64)
```
Copilot AI Jan 1, 2026
Same issue as in _mrmr_classic: sample weights are not properly subsampled when the data is subsampled. When subsample_xy is called in _prepare_xy_classic, it subsamples X and y, but the sample_weight array is not subsampled to match. This means if the original data had more rows than the subsample limit, the weight array length will not match the subsampled data length, leading to a dimension mismatch.

Comment on lines +48 to +49
```python
length = min(rng.geometric(p), n - start)
result.extend(sorted_idx[start:start + length].tolist())
```
Copilot AI Jan 1, 2026
The stationary block bootstrap implementation has a potential issue with the geometric distribution. The geometric distribution with parameter p in numpy generates values starting from 1, but when n - start is less than 1, min(rng.geometric(p), n - start) could result in 0 length blocks when start equals n-1. This would cause an issue with the slicing sorted_idx[start:start + length].

Additionally, when the geometric sample is larger than remaining space, the logic min(rng.geometric(p), n - start) doesn't properly handle wrapping for stationary bootstrap. The standard stationary bootstrap should allow wrapping around the time series, similar to circular bootstrap.

Suggested change:

```python
length = rng.geometric(p)
indices = [(start + i) % n for i in range(length)]
result.extend(sorted_idx[indices].tolist())
```

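For reference, a minimal stationary bootstrap with circular wrap-around, in the spirit of the suggestion above, could look like this. It is a standalone sketch under the Politis-Romano scheme (geometric block lengths, mean block length `expected_block`), not the PR's code:

```python
import numpy as np

def stationary_bootstrap_indices(n: int, expected_block: float = 5.0, seed: int = 0) -> np.ndarray:
    """Stationary bootstrap: random starts, geometric block lengths,
    circular wrap-around so every start position is valid."""
    rng = np.random.default_rng(seed)
    p = 1.0 / expected_block          # geometric parameter => mean length expected_block
    out: list[int] = []
    while len(out) < n:
        start = int(rng.integers(n))  # uniform random block start
        length = int(rng.geometric(p))
        out.extend((start + i) % n for i in range(length))  # wrap past the end
    return np.asarray(out[:n])
```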
```python
def f_classif_weighted(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weighted F-statistic for classification (weighted ANOVA)."""
    n, p = X.shape
    n_classes = int(y.max()) + 1
```
Copilot AI Jan 1, 2026
The weighted F-statistic for classification assumes that class labels are 0-indexed integers (line 68: n_classes = int(y.max()) + 1). If the class labels don't start at 0 or have gaps (e.g., labels are [1, 2] or [0, 2]), this will create arrays with incorrect dimensions or access incorrect indices, potentially causing index out of bounds errors.

Consider using a mapping from class labels to contiguous indices, similar to how sklearn handles this with LabelEncoder, or document this requirement clearly.

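The label-mapping fix the comment describes can be done with `np.unique(..., return_inverse=True)`, which maps arbitrary labels to contiguous 0-based indices:

```python
import numpy as np

# Labels need not start at 0 or be contiguous:
y = np.array([2, 5, 2, 9, 5])

# classes holds the sorted distinct labels; y_idx re-encodes y as 0..n_classes-1
classes, y_idx = np.unique(y, return_inverse=True)
n_classes = len(classes)  # safe replacement for int(y.max()) + 1
```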
Comment on lines +177 to +179
```python
# weights were passed in for original data, need to subsample them too
# This shouldn't happen as subsample happens in _prepare_xy_classic
pass
```
Copilot AI Jan 1, 2026
The comment states "This shouldn't happen as subsample happens in _prepare_xy_classic" but this is misleading. The actual issue is that when data is subsampled, the sample_weight array is NOT subsampled, leading to a length mismatch. The condition len(w) > n will always be true when subsample is triggered (since n is the subsampled size and w is the original size), yet there's no corrective action in the pass statement.

This dead code should either subsample the weights properly or raise an error indicating the problem.

Suggested change:

```python
# weights were passed in for original data, but we don't have access to the
# subsampling indices used in _prepare_xy_classic/subsample_xy, so we cannot
# correctly subsample the weights here. Using misaligned weights would be
# incorrect, so we fail fast with a clear error.
raise ValueError(
    "sample_weight has length {w_len}, but after subsampling X has only {n_samples} "
    "rows. When using the 'subsample' option, sample_weight must correspond to the "
    "subsampled data (length equal to X.shape[0]) or subsample must be disabled."
    .format(w_len=len(w), n_samples=n)
)
```

```python
X_arr = X.values.copy()
n = len(y)

w = np.ones(n, dtype=np.float64) if sample_weight is None else np.asarray(sample_weight, dtype=np.float64)
```
Copilot AI Jan 1, 2026
The permutation importance function doesn't validate that sample_weight has the correct length matching X and y. If a user passes a sample_weight array with a different length, this will cause errors later when trying to compute weighted scores. Consider adding validation: if sample_weight is not None and len(sample_weight) != n: raise ValueError(f"sample_weight must have length {n}, got {len(sample_weight)}")

Suggested change:

```python
if sample_weight is None:
    w = np.ones(n, dtype=np.float64)
else:
    w = np.asarray(sample_weight, dtype=np.float64)
    if len(w) != n:
        raise ValueError(f"sample_weight must have length {n}, got {len(w)}")
```


```python
def test_permute_global(self):
    """Global permutation should shuffle all values."""
    rng = np.random.default_rng(42)
```
Copilot AI Jan 1, 2026
Variable rng is not used.

Suggested change: delete the unused line `rng = np.random.default_rng(42)`.

```python
import numpy as np
import pandas as pd
import pytest
from sklearn.linear_model import Ridge, LogisticRegression
```
Copilot AI Jan 1, 2026
Import of 'LogisticRegression' is not used.

Replace Python 3.10+ union type syntax (int | str, type | None)
with Union[int, str] and Optional[type] from the typing module
to support Python 3.9.
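A minimal illustration of that rewrite (the function is hypothetical, not from the PR):

```python
# Python 3.10+ only:
#   def pick(x: int | str, default: int | None = None) -> int | None: ...
#
# Python 3.9-compatible equivalent using the typing module:
from typing import Optional, Union

def pick(x: Union[int, str], default: Optional[int] = None) -> Optional[int]:
    """Return x if it is already an int, otherwise fall back to default."""
    return x if isinstance(x, int) else default
```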
