Add comprehensive documentation and contributing guidelines#55
Conversation
- Rewrite README.md with complete feature overview, installation guide,
  quick start examples, detailed algorithm documentation, API reference,
  and algorithm selection guide
- Add CONTRIBUTING.md with development setup, code style guidelines,
  testing instructions, and contribution workflow
- Add docs/ directory with detailed documentation:
  - API.md: Complete API reference for all public functions and classes
  - ALGORITHMS.md: Mathematical foundations and comparisons of all algorithms
  - ADVANCED.md: Guide for time series, panel data, auto-K selection,
    smart sampling, and other advanced features
Pull request overview
Adds substantial end-user and contributor documentation for SIFT, including an expanded README and new deep-dive docs for API, algorithms, and advanced usage.
Changes:
- Introduces new documentation pages: API reference, algorithm explanations, and advanced features guide.
- Rewrites/expands `README.md` with installation instructions, usage examples, and an API/feature overview.
- Adds a comprehensive `CONTRIBUTING.md` with development, testing, and PR process guidance.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 25 comments.
| File | Description |
|---|---|
| docs/API.md | New API reference document; currently contains multiple signature/default/return-type mismatches vs the implemented public API. |
| docs/ALGORITHMS.md | New algorithm overview and theory guide for included selectors and MI estimators. |
| docs/ADVANCED.md | New advanced usage guide (time series, grouped data, auto-k, smart sampling); several examples currently don’t match actual callable signatures/requirements. |
| README.md | Major rewrite/expansion of project README; includes some inaccurate installation/extras info and a few example snippets that will error as written. |
| CONTRIBUTING.md | New contributor guide; includes a few dependency/instructions mismatches vs declared packaging extras. |
### select_k_auto

```python
sift.select_k_auto(
    X: pd.DataFrame,
    y: np.ndarray,
    feature_path: List[str],
    config: AutoKConfig,
    *,
    groups: Optional[np.ndarray] = None,
    time: Optional[np.ndarray] = None,
    task: Literal["regression", "classification"] = "regression"
) -> Tuple[int, List[str], Dict[int, float]]
```

Automatically select optimal K via cross-validation or holdout.

**Returns:**

- `best_k` (int): Optimal number of features
- `selected` (List[str]): Selected feature names
- `scores` (Dict[int, float]): Scores for each K tried
select_k_auto’s return type is documented as Tuple[int, List[str], Dict[int, float]], but the implementation returns a diagnostic pd.DataFrame as the third element (see sift/selection/auto_k.py:96). The doc and the “Returns” section should be updated to reflect the DataFrame output.
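To make the corrected contract concrete: a minimal, self-contained sketch of the auto-K pattern whose third return value is a diagnostic `pd.DataFrame` rather than a `Dict[int, float]`. This is illustrative only (the scoring logic and column names here are hypothetical; `sift/selection/auto_k.py` is the source of truth):

```python
from typing import List, Tuple
import numpy as np
import pandas as pd

def select_k_auto_sketch(
    X: pd.DataFrame,
    y: np.ndarray,
    feature_path: List[str],
    k_grid: List[int],
) -> Tuple[int, List[str], pd.DataFrame]:
    """Score each prefix of feature_path on a holdout split; return
    (best_k, selected_names, diagnostics_frame)."""
    n = len(X)
    split = int(n * 0.8)
    rows = []
    for k in k_grid:
        cols = feature_path[:k]
        # Least-squares fit on the train half, R^2 on the holdout half
        A = np.c_[np.ones(split), X[cols].to_numpy()[:split]]
        coef, *_ = np.linalg.lstsq(A, y[:split], rcond=None)
        A_val = np.c_[np.ones(n - split), X[cols].to_numpy()[split:]]
        resid = y[split:] - A_val @ coef
        rows.append({"k": k, "score": 1.0 - resid.var() / y[split:].var()})
    diagnostics = pd.DataFrame(rows)
    best_k = int(diagnostics.loc[diagnostics["score"].idxmax(), "k"])
    return best_k, feature_path[:best_k], diagnostics
```

Under this contract, callers unpack `best_k, selected, diag_df = ...` and read per-K scores out of the DataFrame instead of a dict.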
```python
@dataclass
class sift.SmartSamplerConfig:
    feature_cols: List[str]
    y_col: str
    group_col: Optional[str] = None
    time_col: Optional[str] = None
    sample_frac: float = 0.2
    anchor_strategy: str = "leverage"
    random_state: int = 42
```

Configuration for smart sampling.

---

### smart_sample

```python
sift.smart_sample(
    df: pd.DataFrame,
    feature_cols: List[str],
    y_col: str,
    *,
    group_col: Optional[str] = None,
    time_col: Optional[str] = None,
    sample_frac: float = 0.2,
    anchor_strategy: str = "leverage",
    random_state: int = 42
) -> pd.DataFrame
```
SmartSamplerConfig and smart_sample are documented with fields/kwargs (feature_cols, y_col, anchor_strategy) that don’t exist in the actual dataclass/API. In code, SmartSamplerConfig contains sampling behavior fields (e.g., sample_frac, group_col, time_col, anchor_fn, etc.), and smart_sample takes an optional config plus **kwargs overrides (see sift/sampling/smart.py). Please align this section with the real config fields and accepted kwargs.
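To illustrate the shape being described: a hedged sketch of a config-plus-`**kwargs`-overrides API. Only `sample_frac`, `group_col`, `time_col`, and `anchor_fn` are fields the comment names; everything else here (including the leverage anchor function) is a placeholder, not SIFT's actual code:

```python
from dataclasses import dataclass
from typing import Callable, Optional
import numpy as np
import pandas as pd

def leverage_anchors(df: pd.DataFrame, n: int) -> np.ndarray:
    """Hypothetical anchor_fn: keep the n rows farthest from the
    column-wise mean of the numeric features."""
    num = df.select_dtypes("number").to_numpy()
    dist = np.linalg.norm(num - num.mean(axis=0), axis=1)
    return np.argsort(dist)[-n:]

@dataclass
class SmartSamplerConfigSketch:
    sample_frac: float = 0.2
    group_col: Optional[str] = None
    time_col: Optional[str] = None
    anchor_fn: Optional[Callable[[pd.DataFrame, int], np.ndarray]] = None
    random_state: int = 42

def smart_sample_sketch(
    df: pd.DataFrame,
    config: Optional[SmartSamplerConfigSketch] = None,
    **overrides,
) -> pd.DataFrame:
    """Optional config plus kwargs overrides; unknown kwargs raise,
    which is why passing anchor_strategy=... fails."""
    cfg = config or SmartSamplerConfigSketch()
    for key, value in overrides.items():
        if not hasattr(cfg, key):
            raise TypeError(f"unexpected kwarg: {key!r}")
        setattr(cfg, key, value)
    n_keep = max(1, int(len(df) * cfg.sample_frac))
    if cfg.anchor_fn is not None:
        return df.iloc[np.sort(cfg.anchor_fn(df, n_keep))]
    return df.sample(frac=cfg.sample_frac, random_state=cfg.random_state)
```

Note how `smart_sample_sketch(df, anchor_strategy="leverage")` raises `TypeError` under this design, mirroring the failure mode flagged above.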
```python
importance = permutation_importance(
    model, X_test, y_test,
    permute_method="circular_shift",  # Preserves temporal structure
    groups=entity_ids,
```
permutation_importance(..., permute_method="circular_shift") requires both groups and time (the implementation raises if time is missing). This example passes groups but not time, so it will error; please include the time= argument or switch to a method that doesn’t require time.
```diff
-    groups=entity_ids,
+    groups=entity_ids,
+    time=time_index,
```
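For context on why both arguments are needed, here is a hedged, self-contained sketch of group-wise circular-shift permutation (illustrative, not SIFT's implementation): within each group, values are rolled by a random offset along their time order, which preserves autocorrelation while breaking the feature–target link. Without `time` there is no ordering to roll along, hence the hard requirement:

```python
import numpy as np

def circular_shift_permute(x, groups, time, rng=None):
    """Roll each group's values by a random offset along its time order."""
    if time is None:
        raise ValueError("circular_shift requires time")
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for g in np.unique(groups):
        mask = groups == g
        order = np.argsort(time[mask])          # temporal order within the group
        vals = x[mask][order]
        shift = rng.integers(1, len(vals)) if len(vals) > 1 else 0
        idx = np.flatnonzero(mask)[order]       # original row positions, time-ordered
        out[idx] = np.roll(vals, shift)         # rotate, never shuffle
    return out
```

Each group keeps exactly its own values (only rotated), which is the property that makes this permutation scheme safe for autocorrelated panels.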
```python
# Direct usage
sampled = smart_sample(
    df,
    feature_cols=["f1", "f2", "f3"],
    y_col="target",
    group_col="user_id",
    time_col="timestamp",
    sample_frac=0.15,
    anchor_strategy="leverage"
)
```
smart_sample(...) is shown with an anchor_strategy argument, but the implementation doesn’t accept anchor_strategy (and SmartSamplerConfig doesn’t have that field). Passing it will raise a TypeError when SmartSamplerConfig(**kwargs) is built. Please remove it or document the supported configuration mechanism (anchor_fn in SmartSamplerConfig).
| `time` | ndarray | None | Time values for temporal ordering |
| `auto_k_config` | AutoKConfig | None | Configuration for automatic K selection |
| `sample_weight` | ndarray | None | Sample weights |
| `relevance` | str | "f" | Relevance measure: "f" (F-stat), "ks" (KS-test), "rf" (Random Forest) |
relevance="ks" is documented as an option for select_mrmr generically, but the implementation only supports KS relevance for classification (regression supports only "f" and "rf"; see sift/api.py where rel_funcs differs by task). Please reflect this constraint in the parameter description to prevent runtime ValueErrors for regression.
```diff
-| `relevance` | str | "f" | Relevance measure: "f" (F-stat), "ks" (KS-test), "rf" (Random Forest) |
+| `relevance` | str | "f" | Relevance measure: for regression, "f" (F-stat) or "rf" (Random Forest); for classification, "f", "ks" (KS-test), or "rf" |
```
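The constraint is straightforward to encode. A hedged sketch of the task-dependent dispatch pattern the comment describes (the scorer functions and dict name are placeholders, not SIFT's internals in `sift/api.py`):

```python
from typing import Callable, Dict

# Placeholder scorers standing in for the real relevance measures
def f_stat(X, y): ...
def ks_stat(X, y): ...
def rf_importance(X, y): ...

REL_FUNCS: Dict[str, Dict[str, Callable]] = {
    "regression": {"f": f_stat, "rf": rf_importance},
    "classification": {"f": f_stat, "ks": ks_stat, "rf": rf_importance},
}

def resolve_relevance(task: str, relevance: str) -> Callable:
    funcs = REL_FUNCS[task]  # KeyError here would mean an invalid task
    if relevance not in funcs:
        raise ValueError(
            f"relevance={relevance!r} not supported for task={task!r}; "
            f"choose one of: {sorted(funcs)}"
        )
    return funcs[relevance]
```

Documenting the per-task options (as the suggestion above does) keeps users from ever hitting the `ValueError` branch with `relevance="ks"` on a regression task.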
```python
# Get selected features
features = selector.get_support(indices=False)       # Boolean mask
feature_names = selector.get_support(indices=True)   # Names/indices
```
get_support(indices=True) returns integer indices (selected_features_), not feature names. The docs currently label it as “Names/indices”; for names, users should use selected_feature_names_ (or document a helper that returns names).
```diff
-feature_names = selector.get_support(indices=True)  # Names/indices
+feature_indices = selector.get_support(indices=True)  # Integer indices of selected features
+feature_names = selector.selected_feature_names_      # Selected feature names
```
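For contributors less familiar with the sklearn convention involved, a minimal stand-in class makes the two return modes concrete (attribute names follow the review comment; treat the class itself as illustrative, not SIFT's selector):

```python
import numpy as np

class SupportDemo:
    """Minimal stand-in showing sklearn-style get_support semantics."""

    def __init__(self, feature_names, selected_features_):
        self.feature_names_in_ = list(feature_names)
        self.selected_features_ = np.asarray(selected_features_)  # integer indices
        self.selected_feature_names_ = [
            self.feature_names_in_[i] for i in self.selected_features_
        ]

    def get_support(self, indices=False):
        if indices:
            return self.selected_features_  # integers, never names
        mask = np.zeros(len(self.feature_names_in_), dtype=bool)
        mask[self.selected_features_] = True
        return mask
```

So `get_support(indices=True)` yields positions; names come only from `selected_feature_names_`.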
```python
class BlockedTimeSeriesSplit:
    """Time series split with gap between train and validation."""

    def __init__(self, n_splits=5, gap=0):
        self.n_splits = n_splits
        self.gap = gap

    def split(self, X, y=None, groups=None):
        n = len(X)
        fold_size = n // (self.n_splits + 1)
        for i in range(self.n_splits):
            train_end = fold_size * (i + 1)
            val_start = train_end + self.gap
            val_end = val_start + fold_size
            yield np.arange(train_end), np.arange(val_start, min(val_end, n))
```
The BlockedTimeSeriesSplit example uses np.arange(...) but the snippet doesn’t import NumPy (import numpy as np). If the goal is for examples to be runnable, add the missing import in this code block.
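With the missing import added, the snippet runs as written; for reference, here is the complete runnable version plus a quick sanity check of the gap behavior (the check lines are an addition, not part of the reviewed doc):

```python
import numpy as np

class BlockedTimeSeriesSplit:
    """Time series split with gap between train and validation."""

    def __init__(self, n_splits=5, gap=0):
        self.n_splits = n_splits
        self.gap = gap

    def split(self, X, y=None, groups=None):
        n = len(X)
        fold_size = n // (self.n_splits + 1)
        for i in range(self.n_splits):
            train_end = fold_size * (i + 1)
            val_start = train_end + self.gap
            val_end = val_start + fold_size
            yield np.arange(train_end), np.arange(val_start, min(val_end, n))

# Sanity check: `gap` indices are excluded between train and validation
X = np.arange(120)
for train_idx, val_idx in BlockedTimeSeriesSplit(n_splits=5, gap=3).split(X):
    assert val_idx.min() - train_idx.max() == 4  # gap of 3 indices skipped
```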
```python
selector = StabilitySelector(n_bootstrap=50, threshold=0.6)
selector.fit(
    X, y,
    groups=player_ids,
    time=game_dates,
    block_size="auto",
    block_method="moving"
```
Same issue as earlier: block_size / block_method are passed to selector.fit(...), but fit() doesn’t accept them (they’re StabilitySelector constructor parameters). This example will raise TypeError unless the arguments are moved to StabilitySelector(..., block_size=..., block_method=...).
```diff
-selector = StabilitySelector(n_bootstrap=50, threshold=0.6)
-selector.fit(
-    X, y,
-    groups=player_ids,
-    time=game_dates,
-    block_size="auto",
-    block_method="moving"
+selector = StabilitySelector(
+    n_bootstrap=50,
+    threshold=0.6,
+    block_size="auto",
+    block_method="moving"
+)
+selector.fit(
+    X, y,
+    groups=player_ids,
+    time=game_dates,
```
```bash
# Individual extras
pip install -e ".[catboost]"     # CatBoost-based selection
pip install -e ".[categorical]"  # Advanced categorical encoding (category_encoders)
pip install -e ".[polars]"       # Polars DataFrame support
```
The installation section lists an extra polars ("pip install -e ".[polars]"") but setup.py defines extras only for all, catboost, categorical, and test (no polars). Please either add a polars extra to setup.py or remove/replace this instruction so users don’t hit an install error.
```diff
-pip install -e ".[polars]"       # Polars DataFrame support
```
```python
sift.select_cached(
    cache: FeatureCache,
    y: np.ndarray,
    k: int,
    *,
    method: Literal["mrmr_quot", "mrmr_diff", "jmi", "jmim", "cefsplus"] = "mrmr_quot",
    top_m: Optional[int] = None,
```
select_cached is documented with method=... = "mrmr_quot", but the actual default is "cefsplus" (see sift/selection/cefsplus.py:448). Please update the default (and ideally the full signature) so the API reference matches runtime behavior.
Summary

This PR adds extensive documentation to the SIFT project, including a new `CONTRIBUTING.md` file with detailed contributor guidelines and a significantly expanded `README.md` with comprehensive API documentation, usage examples, and algorithm guides.

Key Changes

New Files

Documentation Improvements

Notable Implementation Details

Benefits
https://claude.ai/code/session_017CBJSg558QSEUJ3W6JtLSc