Skip to content

Add comprehensive documentation and contributing guidelines#55

Open
kmedved wants to merge 1 commit intomainfrom
claude/write-repo-documentation-MedHn
Open

Add comprehensive documentation and contributing guidelines#55
kmedved wants to merge 1 commit intomainfrom
claude/write-repo-documentation-MedHn

Conversation

@kmedved
Copy link
Owner

@kmedved kmedved commented Jan 27, 2026

Summary

This PR adds extensive documentation to the SIFT project, including a new CONTRIBUTING.md file with detailed contributor guidelines and a significantly expanded README.md with comprehensive API documentation, usage examples, and algorithm guides.

Key Changes

New Files

  • CONTRIBUTING.md (416 lines): Complete contributor guide including:
    • Getting started and development setup instructions
    • Code style guidelines (PEP 8, type hints, docstring format)
    • Testing guidelines with examples
    • Pull request process and commit message conventions
    • Issue reporting and feature request templates
    • Code of conduct

Documentation Improvements

  • README.md: Expanded from ~486 to ~659 lines with:
    • Better structure: Added table of contents and clear section organization
    • Algorithm documentation: Detailed sections for each selector (mRMR, JMI/JMIM, CEFS+, Stability Selection, Boruta, CatBoost)
    • Advanced features: New sections covering:
      • Automatic K selection with AutoKConfig
      • Time series support with block bootstrap and temporal validation
      • Grouped/panel data handling with group-aware operations
      • Smart sampling for large datasets
      • Categorical feature encoding strategies
      • Sample weight support
    • Algorithm selection guide: Comparison table to help users choose the right algorithm
    • Comprehensive API reference: Tables documenting all main functions, classes, and utilities
    • Project structure: Visual overview of the codebase organization
    • Quick start examples: Practical code examples for common use cases
    • Badges: Added Python version and license badges

Notable Implementation Details

  • Documentation follows NumPy docstring conventions as specified in CONTRIBUTING.md
  • Examples are practical and runnable, covering both function and class-based APIs
  • Algorithm comparison tables help users make informed decisions
  • Time series and grouped data examples show real-world use cases (e.g., NBA player data)
  • Contributing guidelines emphasize code quality, testing, and clear communication
  • All examples use consistent naming conventions and best practices

Benefits

  • For new contributors: Clear guidelines on code style, testing, and submission process
  • For new users: Comprehensive examples and algorithm selection guidance
  • For maintainers: Standardized expectations for contributions and documentation
  • For the project: Professional documentation that improves discoverability and adoption

https://claude.ai/code/session_017CBJSg558QSEUJ3W6JtLSc

- Rewrite README.md with complete feature overview, installation guide,
  quick start examples, detailed algorithm documentation, API reference,
  and algorithm selection guide

- Add CONTRIBUTING.md with development setup, code style guidelines,
  testing instructions, and contribution workflow

- Add docs/ directory with detailed documentation:
  - API.md: Complete API reference for all public functions and classes
  - ALGORITHMS.md: Mathematical foundations and comparisons of all algorithms
  - ADVANCED.md: Guide for time series, panel data, auto-K selection,
    smart sampling, and other advanced features
Copilot AI review requested due to automatic review settings January 27, 2026 00:59
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds substantial end-user and contributor documentation for SIFT, including an expanded README and new deep-dive docs for API, algorithms, and advanced usage.

Changes:

  • Introduces new documentation pages: API reference, algorithm explanations, and advanced features guide.
  • Rewrites/expands README.md with installation instructions, usage examples, and an API/feature overview.
  • Adds a comprehensive CONTRIBUTING.md with development, testing, and PR process guidance.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 25 comments.

Show a summary per file
File Description
docs/API.md New API reference document; currently contains multiple signature/default/return-type mismatches vs the implemented public API.
docs/ALGORITHMS.md New algorithm overview and theory guide for included selectors and MI estimators.
docs/ADVANCED.md New advanced usage guide (time series, grouped data, auto-k, smart sampling); several examples currently don’t match actual callable signatures/requirements.
README.md Major rewrite/expansion of project README; includes some inaccurate installation/extras info and a few example snippets that will error as written.
CONTRIBUTING.md New contributor guide; includes a few dependency/instructions mismatches vs declared packaging extras.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +763 to +785
### select_k_auto

```python
sift.select_k_auto(
X: pd.DataFrame,
y: np.ndarray,
feature_path: List[str],
config: AutoKConfig,
*,
groups: Optional[np.ndarray] = None,
time: Optional[np.ndarray] = None,
task: Literal["regression", "classification"] = "regression"
) -> Tuple[int, List[str], Dict[int, float]]
```

Automatically select optimal K via cross-validation or holdout.

**Returns:**

- `best_k` (int): Optimal number of features
- `selected` (List[str]): Selected feature names
- `scores` (Dict[int, float]): Scores for each K tried

Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select_k_auto’s return type is documented as Tuple[int, List[str], Dict[int, float]], but the implementation returns a diagnostic pd.DataFrame as the third element (see sift/selection/auto_k.py:96). The doc and the “Returns” section should be updated to reflect the DataFrame output.

Copilot uses AI. Check for mistakes.
Comment on lines +814 to +843
```python
@dataclass
class sift.SmartSamplerConfig:
feature_cols: List[str]
y_col: str
group_col: Optional[str] = None
time_col: Optional[str] = None
sample_frac: float = 0.2
anchor_strategy: str = "leverage"
random_state: int = 42
```

Configuration for smart sampling.

---

### smart_sample

```python
sift.smart_sample(
df: pd.DataFrame,
feature_cols: List[str],
y_col: str,
*,
group_col: Optional[str] = None,
time_col: Optional[str] = None,
sample_frac: float = 0.2,
anchor_strategy: str = "leverage",
random_state: int = 42
) -> pd.DataFrame
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SmartSamplerConfig and smart_sample are documented with fields/kwargs (feature_cols, y_col, anchor_strategy) that don’t exist in the actual dataclass/API. In code, SmartSamplerConfig contains sampling behavior fields (e.g., sample_frac, group_col, time_col, anchor_fn, etc.), and smart_sample takes an optional config plus **kwargs overrides (see sift/sampling/smart.py). Please align this section with the real config fields and accepted kwargs.

Copilot uses AI. Check for mistakes.
importance = permutation_importance(
model, X_test, y_test,
permute_method="circular_shift", # Preserves temporal structure
groups=entity_ids,
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

permutation_importance(..., permute_method="circular_shift") requires both groups and time (the implementation raises if time is missing). This example passes groups but not time, so it will error; please include the time= argument or switch to a method that doesn’t require time.

Suggested change
groups=entity_ids,
groups=entity_ids,
time=time_index,

Copilot uses AI. Check for mistakes.
Comment on lines +329 to +338
# Direct usage
sampled = smart_sample(
df,
feature_cols=["f1", "f2", "f3"],
y_col="target",
group_col="user_id",
time_col="timestamp",
sample_frac=0.15,
anchor_strategy="leverage"
)
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smart_sample(...) is shown with an anchor_strategy argument, but the implementation doesn’t accept anchor_strategy (and SmartSamplerConfig doesn’t have that field). Passing it will raise a TypeError when SmartSamplerConfig(**kwargs) is built. Please remove it or document the supported configuration mechanism (anchor_fn in SmartSamplerConfig).

Copilot uses AI. Check for mistakes.
| `time` | ndarray | None | Time values for temporal ordering |
| `auto_k_config` | AutoKConfig | None | Configuration for automatic K selection |
| `sample_weight` | ndarray | None | Sample weights |
| `relevance` | str | "f" | Relevance measure: "f" (F-stat), "ks" (KS-test), "rf" (Random Forest) |
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relevance="ks" is documented as an option for select_mrmr generically, but the implementation only supports KS relevance for classification (regression supports only "f" and "rf"; see sift/api.py where rel_funcs differs by task). Please reflect this constraint in the parameter description to prevent runtime ValueErrors for regression.

Suggested change
| `relevance` | str | "f" | Relevance measure: "f" (F-stat), "ks" (KS-test), "rf" (Random Forest) |
| `relevance` | str | "f" | Relevance measure: for regression, "f" (F-stat) or "rf" (Random Forest); for classification, "f", "ks" (KS-test), or "rf" |

Copilot uses AI. Check for mistakes.

# Get selected features
features = selector.get_support(indices=False) # Boolean mask
feature_names = selector.get_support(indices=True) # Names/indices
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_support(indices=True) returns integer indices (selected_features_), not feature names. The docs currently label it as “Names/indices”; for names, users should use selected_feature_names_ (or document a helper that returns names).

Suggested change
feature_names = selector.get_support(indices=True) # Names/indices
feature_indices = selector.get_support(indices=True) # Integer indices of selected features
feature_names = selector.selected_feature_names_ # Selected feature names

Copilot uses AI. Check for mistakes.
Comment on lines +78 to +94
```python
class BlockedTimeSeriesSplit:
"""Time series split with gap between train and validation."""

def __init__(self, n_splits=5, gap=0):
self.n_splits = n_splits
self.gap = gap

def split(self, X, y=None, groups=None):
n = len(X)
fold_size = n // (self.n_splits + 1)
for i in range(self.n_splits):
train_end = fold_size * (i + 1)
val_start = train_end + self.gap
val_end = val_start + fold_size
yield np.arange(train_end), np.arange(val_start, min(val_end, n))

Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BlockedTimeSeriesSplit example uses np.arange(...) but the snippet doesn’t import NumPy (import numpy as np). If the goal is for examples to be runnable, add the missing import in this code block.

Copilot uses AI. Check for mistakes.
Comment on lines +219 to +225
selector = StabilitySelector(n_bootstrap=50, threshold=0.6)
selector.fit(
X, y,
groups=player_ids,
time=game_dates,
block_size="auto",
block_method="moving"
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as earlier: block_size / block_method are passed to selector.fit(...), but fit() doesn’t accept them (they’re StabilitySelector constructor parameters). This example will raise TypeError unless the arguments are moved to StabilitySelector(..., block_size=..., block_method=...).

Suggested change
selector = StabilitySelector(n_bootstrap=50, threshold=0.6)
selector.fit(
X, y,
groups=player_ids,
time=game_dates,
block_size="auto",
block_method="moving"
selector = StabilitySelector(
n_bootstrap=50,
threshold=0.6,
block_size="auto",
block_method="moving"
)
selector.fit(
X, y,
groups=player_ids,
time=game_dates,

Copilot uses AI. Check for mistakes.
# Individual extras
pip install -e ".[catboost]" # CatBoost-based selection
pip install -e ".[categorical]" # Advanced categorical encoding (category_encoders)
pip install -e ".[polars]" # Polars DataFrame support
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The installation section lists an extra polars ("pip install -e ".[polars]"") but setup.py defines extras only for all, catboost, categorical, and test (no polars). Please either add a polars extra to setup.py or remove/replace this instruction so users don’t hit an install error.

Suggested change
pip install -e ".[polars]" # Polars DataFrame support

Copilot uses AI. Check for mistakes.
Comment on lines +710 to +716
sift.select_cached(
cache: FeatureCache,
y: np.ndarray,
k: int,
*,
method: Literal["mrmr_quot", "mrmr_diff", "jmi", "jmim", "cefsplus"] = "mrmr_quot",
top_m: Optional[int] = None,
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select_cached is documented with method=... = "mrmr_quot", but the actual default is "cefsplus" (see sift/selection/cefsplus.py:448). Please update the default (and ideally the full signature) so the API reference matches runtime behavior.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants