Use uniform sampled softmax and add <START> token. #4
Conversation
Pull Request Overview
This pull request implements uniform sampled softmax and adds a <START> token to playlist sequences. The changes introduce an alternative loss function that uses uniform sampling for the softmax denominator and modifies the data processing pipeline to prepend a special start token to all playlists.
- Implements uniform sampled softmax as an alternative to the existing label-in-denominator approach
- Adds <START> token injection to playlist sequences during data loading
- Updates experiment configurations to support the new loss function option
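The uniform sampled softmax idea can be sketched as follows. This is a minimal illustration with made-up names, not the actual `src/trecs/util.py` implementation, and it omits any importance-weighting correction for the sampled negatives:

```python
import numpy as np

def uniform_sampled_softmax_loss(logits, target, num_negatives, rng):
    """Sketch only: approximate -log softmax(logits)[target] by summing the
    denominator over the target logit plus `num_negatives` classes drawn
    uniformly at random, instead of the full vocabulary."""
    negatives = rng.integers(0, logits.shape[-1], size=num_negatives)
    pool = np.concatenate(([logits[target]], logits[negatives]))
    # Numerically stable log-sum-exp over the reduced candidate set.
    m = pool.max()
    log_denom = m + np.log(np.exp(pool - m).sum())
    return log_denom - logits[target]
```

Because the target logit is always in the pool, the sketched loss is non-negative, mirroring the full-softmax cross entropy it approximates.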
Reviewed Changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/trecs/util.py | Adds new uniform sampling function and renames existing function for clarity |
| src/trecs/experiments/decoder_only/_base.py | Integrates start token injection, loss function selection, and updates token encoding order |
| tests/test_util.py | Updates tests to cover both loss function variants using parametrization |
| src/trecs/experiments/decoder_only/decoder_only_uniform_softmax_loss.py | New experiment configuration using uniform loss function |
| tests/test_experiments.py | Adds comprehensive test for all experiment setups |
| Other files | Minor configuration updates and dependency changes |
```python
        LIMIT :context_length
    """,
    {"split": split},
    {"context_length": self.context_length},
```
The SQL query limit was changed from context_length + 1 to context_length, but since a <START> token is now prepended, the total sequence length becomes context_length + 1. This could cause issues if the downstream processing expects sequences of exactly context_length + 1 tokens.
Suggested change:

```diff
-    {"context_length": self.context_length},
+        LIMIT :context_length_plus1
+    """,
+    {"split": split},
+    {"context_length_plus1": self.context_length + 1},
```
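The length arithmetic behind this comment can be checked with a toy example (illustrative values, not the PR's code):

```python
context_length = 3
rows = ["t1", "t2", "t3", "t4", "t5"]  # illustrative track ids

# LIMIT :context_length fetches context_length rows from the playlist...
fetched = rows[:context_length]
# ...and prepending <START> makes the sequence context_length + 1 tokens.
sequence = ["<START>", *fetched]
assert len(sequence) == context_length + 1
```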
```python
    "<EOP>",
    "<UNK>",
    *(track_id for (track_id,) in cursor),
],
```
The token order in the encoder has been changed to put <START> first, but the special token assignments later in the code assume this order. Consider documenting this token order dependency or making the token retrieval more explicit to avoid future ordering issues.
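One way to make the lookup explicit, sketched with hypothetical names (not the PR's actual code): register the special tokens in one place and resolve their ids by name, so reordering the vocabulary list cannot silently change which id `<START>` or `<UNK>` maps to.

```python
# Illustrative sketch: special tokens declared once, looked up by name.
SPECIAL_TOKENS = ["<START>", "<EOP>", "<UNK>"]

def build_vocab(track_ids):
    tokens = [*SPECIAL_TOKENS, *track_ids]
    return {tok: i for i, tok in enumerate(tokens)}

token_to_id = build_vocab(["trackA", "trackB"])
start_id = token_to_id["<START>"]  # explicit lookup, not a hard-coded index
```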
<START> token: without this token, the first element of each playlist is only ever included as context, never as a target. Now the <START> token provides context for predicting the first track. This is probably not very informative beyond reproducing the marginal distribution of the first track in playlists, but it gives us a bit of extra signal early in the playlist.
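The effect described above can be seen in the standard next-token training setup (a toy sketch, not the repo's code):

```python
def training_pairs(tokens):
    # Standard next-token setup: input at position i, target at position i+1.
    return list(zip(tokens[:-1], tokens[1:]))

playlist = ["t1", "t2", "t3"]  # illustrative track ids

# Without <START>, "t1" is never a prediction target...
targets = [tgt for (_, tgt) in training_pairs(playlist)]
assert "t1" not in targets
# ...but with <START> prepended, the model also learns to predict the
# first track (roughly, its marginal distribution).
pairs = training_pairs(["<START>", *playlist])
assert pairs[0] == ("<START>", "t1")
```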