Add sample weights, block bootstrap, and permutation importance#43
Conversation
This commit implements three major improvements:

1. **Sample weights in filter methods**
   - Add weighted F-regression and F-classif in `relevance.py`
   - Add weighted rank-Gaussian transforms and correlations in `copula.py`
   - Add weighted mRMR selection in `loops.py`
   - Update all public API functions to accept a `sample_weight` parameter
2. **Block bootstrap in `StabilitySelector`**
   - Add block bootstrap functions (moving, circular, stationary)
   - Update `StabilitySelector` to use block bootstrap when groups/time are provided
   - Update the `stability_regression` and `stability_classif` convenience functions
3. **Block permutation importance**
   - Create new `importance.py` module
   - Support `global`, `within_group`, `block`, and `circular_shift` permutation methods
   - Auto-select the appropriate method based on available group/time data

All new parameters are optional with sensible defaults. Tests added for all new functionality (47 tests, all passing).
Pull request overview
This PR adds three major enhancements to the sift feature selection library:
Purpose: Add support for sample weights, block bootstrap for time-series data, and block-aware permutation importance.
Key Changes:
- Sample weights integrated into all filter methods (mRMR, JMI, JMIM, CEFS+) with weighted F-statistics, correlations, and rank transforms
- Block bootstrap methods (moving, circular, stationary) added to StabilitySelector for panel/time-series data
- New `permutation_importance` module with time-series-aware permutation strategies (within_group, block, circular_shift)
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| sift/importance.py | New module implementing time-series-aware permutation importance with multiple strategies |
| sift/stability.py | Added block bootstrap functions and integration into StabilitySelector with automatic method selection |
| sift/estimators/relevance.py | Implemented weighted F-regression and F-classification statistics |
| sift/estimators/copula.py | Added weighted rank-Gaussian transforms, correlations, and weighted subsampling in cache building |
| sift/selection/loops.py | Implemented weighted mRMR selection with weighted correlation calculations |
| sift/selection/cefsplus.py | Updated CEFS+ to use weighted transforms from cache |
| sift/api.py | Added sample_weight parameter to all public selection functions |
| sift/__init__.py | Exported new permutation_importance function |
| tests/test_weights.py | Comprehensive tests for sample weights across all selectors |
| tests/test_importance.py | Tests for all permutation importance methods and edge cases |
| tests/test_block_bootstrap.py | Tests for block bootstrap with different methods and configurations |
| .gitignore | Standard Python gitignore patterns |
```python
# Prepare sample weights
n = X_arr.shape[0]
w = rel_est._ensure_weights(sample_weight, n) if sample_weight is not None else np.ones(n, dtype=np.float64)
if subsample is not None and len(w) > n:
    # weights were passed in for original data, need to subsample them too
    # This shouldn't happen as subsample happens in _prepare_xy_classic
    pass
```
Sample weights are not properly subsampled when the data is subsampled. When subsample_xy is called in _prepare_xy_classic, it subsamples X and y, but the sample_weight array is not subsampled to match. This means the weight array length will not match the subsampled data length, leading to a dimension mismatch.
The weights should be subsampled using the same indices as X and y. Consider modifying _prepare_xy_classic to accept and return sample_weight, or subsample the weights here using the same random state and logic.
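One way to keep weights aligned is to draw the subsample indices once and apply them to `X`, `y`, and the weights together. A minimal sketch of that approach — the helper name `subsample_xy_weighted` and its `max_rows` parameter are illustrative, not sift's actual API:

```python
import numpy as np

def subsample_xy_weighted(X, y, sample_weight=None, max_rows=10_000, random_state=0):
    """Subsample X, y, and sample_weight using one shared index draw.

    Illustrative sketch: sift's subsample_xy draws indices internally for
    X and y only; this variant applies the same indices to the weights so
    all three arrays stay aligned.
    """
    n = X.shape[0]
    if n <= max_rows:
        return X, y, sample_weight
    rng = np.random.default_rng(random_state)
    idx = rng.choice(n, size=max_rows, replace=False)
    w = None if sample_weight is None else np.asarray(sample_weight)[idx]
    return X[idx], y[idx], w
```

Returning (or accepting) the indices from `_prepare_xy_classic` would achieve the same alignment without a second random draw.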
```python
# Prepare sample weights
n = X_arr.shape[0]
w = rel_est._ensure_weights(sample_weight, n) if sample_weight is not None else np.ones(n, dtype=np.float64)
```
Same issue as in _mrmr_classic: sample weights are not properly subsampled when the data is subsampled. When subsample_xy is called in _prepare_xy_classic, it subsamples X and y, but the sample_weight array is not subsampled to match. This means if the original data had more rows than the subsample limit, the weight array length will not match the subsampled data length, leading to a dimension mismatch.
```python
length = min(rng.geometric(p), n - start)
result.extend(sorted_idx[start:start + length].tolist())
```
The stationary block bootstrap implementation clamps each block with `min(rng.geometric(p), n - start)`, which truncates any block that would run past the end of the series. (Note that NumPy's `rng.geometric(p)` returns values starting from 1, so as long as `start <= n - 1` the length is at least 1 and the slice is never empty; the real problem is the truncation, not zero-length blocks.)
Because blocks starting near the end are systematically shortened, the sampled block lengths no longer follow the intended Geometric(p) distribution. The standard stationary bootstrap (Politis and Romano, 1994) instead wraps around the end of the time series, similar to the circular bootstrap.
Suggested change:

```python
# before
length = min(rng.geometric(p), n - start)
result.extend(sorted_idx[start:start + length].tolist())

# after
length = rng.geometric(p)
indices = [(start + i) % n for i in range(length)]
result.extend(sorted_idx[indices].tolist())
```
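For reference, a self-contained sketch of the wrapping behaviour the suggestion describes, following the Politis–Romano stationary bootstrap (uniform block starts, Geometric(p) block lengths, circular wrap). The function name and signature are illustrative, not sift's actual internals:

```python
import numpy as np

def stationary_bootstrap_indices(n, mean_block, rng):
    """Draw n resampled time indices via the stationary bootstrap.

    Blocks start at a uniform position, have Geometric(p) length with
    p = 1 / mean_block, and wrap circularly past the end of the series,
    so block lengths are never truncated.
    """
    p = 1.0 / mean_block
    result = []
    while len(result) < n:
        start = rng.integers(n)
        length = rng.geometric(p)  # >= 1 by definition
        result.extend((start + i) % n for i in range(length))
    return np.asarray(result[:n])
```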
```python
def f_classif_weighted(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weighted F-statistic for classification (weighted ANOVA)."""
    n, p = X.shape
    n_classes = int(y.max()) + 1
```
The weighted F-statistic for classification assumes that class labels are 0-indexed integers (line 68: n_classes = int(y.max()) + 1). If the class labels don't start at 0 or have gaps (e.g., labels are [1, 2] or [0, 2]), this will create arrays with incorrect dimensions or access incorrect indices, potentially causing index out of bounds errors.
Consider using a mapping from class labels to contiguous indices, similar to how sklearn handles this with LabelEncoder, or document this requirement clearly.
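A minimal mapping with `np.unique(..., return_inverse=True)` gives contiguous codes without depending on sklearn's LabelEncoder (the helper name here is illustrative):

```python
import numpy as np

def encode_labels(y):
    """Map arbitrary class labels to contiguous codes 0..k-1.

    np.unique returns the sorted distinct labels; return_inverse gives,
    for each element of y, the index of its label in that sorted array.
    """
    classes, y_encoded = np.unique(y, return_inverse=True)
    return classes, y_encoded
```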
```python
# weights were passed in for original data, need to subsample them too
# This shouldn't happen as subsample happens in _prepare_xy_classic
pass
```
The comment states "This shouldn't happen as subsample happens in _prepare_xy_classic" but this is misleading. The actual issue is that when data is subsampled, the sample_weight array is NOT subsampled, leading to a length mismatch. The condition len(w) > n will always be true when subsample is triggered (since n is the subsampled size and w is the original size), yet there's no corrective action in the pass statement.
This dead code should either subsample the weights properly or raise an error indicating the problem.
Suggested change:

```python
# before
# weights were passed in for original data, need to subsample them too
# This shouldn't happen as subsample happens in _prepare_xy_classic
pass

# after
# weights were passed in for original data, but we don't have access to the
# subsampling indices used in _prepare_xy_classic/subsample_xy, so we cannot
# correctly subsample the weights here. Using misaligned weights would be
# incorrect, so we fail fast with a clear error.
raise ValueError(
    "sample_weight has length {w_len}, but after subsampling X has only {n_samples} "
    "rows. When using the 'subsample' option, sample_weight must correspond to the "
    "subsampled data (length equal to X.shape[0]) or subsample must be disabled."
    .format(w_len=len(w), n_samples=n)
)
```
```python
X_arr = X.values.copy()
n = len(y)

w = np.ones(n, dtype=np.float64) if sample_weight is None else np.asarray(sample_weight, dtype=np.float64)
```
The permutation importance function doesn't validate that `sample_weight` has the correct length matching `X` and `y`. If a user passes a `sample_weight` array with a different length, this will cause errors later when computing weighted scores. Consider adding validation: `if sample_weight is not None and len(sample_weight) != n: raise ValueError(f"sample_weight must have length {n}, got {len(sample_weight)}")`
Suggested change:

```python
# before
w = np.ones(n, dtype=np.float64) if sample_weight is None else np.asarray(sample_weight, dtype=np.float64)

# after
if sample_weight is None:
    w = np.ones(n, dtype=np.float64)
else:
    w = np.asarray(sample_weight, dtype=np.float64)
    if len(w) != n:
        raise ValueError(f"sample_weight must have length {n}, got {len(w)}")
```
```python
def test_permute_global(self):
    """Global permutation should shuffle all values."""
    rng = np.random.default_rng(42)
```
Variable `rng` is not used. Suggested change (delete the line):

```python
rng = np.random.default_rng(42)
```
```python
import numpy as np
import pandas as pd
import pytest
from sklearn.linear_model import Ridge, LogisticRegression
```
Import of `LogisticRegression` is not used.
Replace Python 3.10+ union type syntax (`int | str`, `type | None`) with `Union[int, str]` and `Optional[type]` from the `typing` module to support Python 3.9.
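A sketch of the Python 3.9-compatible spelling; the function and its parameters are illustrative, not sift's actual signatures:

```python
from typing import Optional, Union

import numpy as np

# Python 3.9-compatible annotations: Union[...] / Optional[...] instead of
# the 3.10+ forms `int | str` and `np.ndarray | None`.
def select_features(
    k: Union[int, str] = "auto",
    sample_weight: Optional[np.ndarray] = None,
) -> int:
    """Illustrative stub showing the annotation style only."""
    if sample_weight is None:
        return 0 if k == "auto" else int(k)
    return len(sample_weight)
```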