This repository contains code for language modeling using LSTM and AWD-LSTM (ASGD Weight-Dropped LSTM) architectures, trained on the WikiText-2 dataset. The goal of this project is to explore the performance improvements gained by applying advanced regularization and optimization techniques to a basic LSTM model, as introduced in the AWD-LSTM paper.
Language modeling is a critical task in natural language processing (NLP) that involves predicting the next word in a sequence given its preceding context. This project builds and evaluates two models for this task:
- Base Model: A traditional LSTM-based language model.
- Improved Model: An AWD-LSTM model that incorporates advanced regularization techniques like weight dropping, variational dropout, and optimization strategies such as ASGD to enhance performance and generalization.
The project uses the WikiText-2 dataset, a collection of Wikipedia articles with a vocabulary of roughly 33,000 unique tokens. It is widely used for language modeling because of its rich linguistic structure and long-range dependencies. The dataset ships pre-tokenized and split into training, validation, and test sets (a minimal loading sketch follows the split sizes below):
- Training set: ~2.1 million tokens
- Validation set: ~218,000 tokens
- Test set: ~246,000 tokens
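As a minimal loading sketch (the `wiki.*.tokens` file names and directory layout are assumptions for this example, not necessarily the repository's paths), the pre-tokenized corpus can be numericalized like this:

```python
from collections import Counter
from pathlib import Path

import torch


def tokenize(path: Path) -> list[str]:
    """Read a pre-tokenized WikiText file; tokens are whitespace-separated."""
    text = path.read_text(encoding="utf-8")
    # Append <eos> at line ends so document boundaries survive flattening.
    return [tok for line in text.splitlines() for tok in line.split() + ["<eos>"]]


def build_corpus(data_dir: str):
    root = Path(data_dir)
    splits = {name: tokenize(root / f"wiki.{name}.tokens")
              for name in ("train", "valid", "test")}
    # Build the vocabulary from the training split only; WikiText-2's text
    # already contains <unk> tokens, so <unk> ends up in the vocabulary too.
    counts = Counter(splits["train"])
    vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common())}
    ids = {name: torch.tensor([vocab.get(t, vocab["<unk>"]) for t in toks])
           for name, toks in splits.items()}
    return vocab, ids
```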
The base model is a two-layer LSTM network. LSTM (Long Short-Term Memory) networks are a type of recurrent neural network (RNN) that can capture long-term dependencies in sequential data, making them well suited to language modeling; a minimal PyTorch sketch of the architecture follows the feature list.
Key Features:
- Two-layer LSTM
- Fully connected layer for word prediction
- Trained using cross-entropy loss and evaluated with perplexity
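A minimal sketch of this architecture (the 650-unit sizes and 0.5 dropout rate are illustrative assumptions, not necessarily the repository's settings):

```python
import torch
import torch.nn as nn


class LSTMLanguageModel(nn.Module):
    """Two-layer LSTM LM: embedding -> stacked LSTM -> linear decoder."""

    def __init__(self, vocab_size: int, embed_dim: int = 650,
                 hidden_dim: int = 650, num_layers: int = 2, dropout: float = 0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, hidden=None):
        output, hidden = self.lstm(self.embed(tokens), hidden)
        return self.decoder(output), hidden  # logits: (batch, seq_len, vocab)
```

Training minimizes `nn.CrossEntropyLoss` over the flattened logits; perplexity is then just the exponential of that loss (see the evaluation sketch below).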
The AWD-LSTM model improves upon the base LSTM by introducing several key techniques from the 2017 paper "Regularizing and Optimizing LSTM Language Models" by Merity et al. These include (see the condensed sketch after this list):
- Weight Dropping: DropConnect applied to the LSTM's hidden-to-hidden (recurrent) weight matrices, regularizing the recurrent connections without modifying the optimized LSTM kernels.
- ASGD: Averaged Stochastic Gradient Descent, which returns the average of recent weight iterates rather than the last one; the paper triggers the averaging non-monotonically (NT-ASGD), switching from plain SGD once validation perplexity stops improving.
- Variational Dropout: a single dropout mask sampled once per sequence and reused across all time steps, rather than resampled at each step, ensuring temporal consistency.
- Weight Tying: sharing the weight matrix between the input embedding and the output softmax layer, reducing parameter count and improving generalization.
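A condensed sketch of these techniques, in the spirit of the original Salesforce implementation (simplified for illustration; the dropout rates are assumptions, and the `*_raw` re-registration trick relies on PyTorch internals that can shift between versions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightDrop(nn.Module):
    """Weight dropping: DropConnect on an RNN's hidden-to-hidden weights."""

    def __init__(self, rnn: nn.Module, weight_names: list[str], p: float = 0.5):
        super().__init__()
        self.rnn, self.weight_names, self.p = rnn, weight_names, p
        for name in weight_names:
            raw = getattr(rnn, name)
            del rnn._parameters[name]  # re-register under a *_raw alias
            rnn.register_parameter(name + "_raw", nn.Parameter(raw.data))
        rnn.flatten_parameters = lambda: None  # skip cuDNN weight re-flattening

    def forward(self, *args):
        for name in self.weight_names:
            raw = getattr(self.rnn, name + "_raw")
            # Drop recurrent connections with a fresh mask each forward pass.
            setattr(self.rnn, name, F.dropout(raw, p=self.p, training=self.training))
        return self.rnn(*args)


class LockedDropout(nn.Module):
    """Variational dropout: one mask per sequence, reused at every time step."""

    def forward(self, x: torch.Tensor, p: float = 0.4) -> torch.Tensor:
        if not self.training or p == 0.0:
            return x  # x: (batch, seq_len, features)
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
        return x * mask  # broadcast the same mask across all time steps


# Usage: wrap the stacked LSTM so both recurrent weight matrices are dropped.
lstm = WeightDrop(nn.LSTM(650, 650, num_layers=2, batch_first=True),
                  ["weight_hh_l0", "weight_hh_l1"], p=0.5)

# Weight tying: point the decoder at the embedding matrix (shapes must match):
#   model.decoder.weight = model.embed.weight
#
# NT-ASGD (simplified): train with SGD, then switch to torch.optim.ASGD once
# validation perplexity stops improving for several consecutive evaluations.
```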
The models were evaluated using perplexity, which measures how uncertain the model is when predicting the next word in the sequence. Lower perplexity indicates better performance.
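Concretely, perplexity is the exponential of the average per-token cross-entropy, so it can be computed directly from the evaluation loss:

```python
import math

import torch
import torch.nn.functional as F


def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: (num_tokens, vocab_size); targets: (num_tokens,) of word ids."""
    loss = F.cross_entropy(logits, targets)  # mean negative log-likelihood
    return math.exp(loss.item())
```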
| Model | Validation Perplexity | Test Perplexity |
|---|---|---|
| LSTM (Base) | 128.3 | 121.7 |
| AWD-LSTM | 80.27 | 77.11 |
The AWD-LSTM model substantially outperformed the base LSTM, cutting test perplexity from 121.7 to 77.11 (a roughly 37% relative reduction) and demonstrating the effectiveness of these regularization and optimization techniques.