This repository contains code for language modeling using LSTM and AWD-LSTM (ASGD Weight-Dropped LSTM) architectures, trained on the WikiText-2 dataset. The goal of this project is to explore the performance improvements gained by applying advanced regularization and optimization techniques to a basic LSTM model, as introduced in the AWD-LSTM paper.
Language modeling is a critical task in natural language processing (NLP) that involves predicting the next word in a sequence given its preceding context. This project builds and evaluates two models for this task:
- Base Model: A traditional LSTM-based language model.
- Improved Model: An AWD-LSTM model that incorporates advanced regularization techniques like weight dropping, variational dropout, and optimization strategies such as ASGD to enhance performance and generalization.
The project uses the WikiText-2 dataset, a collection of Wikipedia articles with a vocabulary of roughly 33,000 unique tokens. It is widely used for language modeling because of its rich linguistic structure and long-range dependencies. The dataset ships pre-tokenized and split into training, validation, and test sets (a minimal loading sketch follows the split sizes below):
- Training set: ~2.1 million tokens
- Validation set: ~218,000 tokens
- Test set: ~246,000 tokens
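As a minimal loading sketch (the `wiki.*.tokens` file names and directory layout are assumptions for this example, not necessarily the repository's paths), the pre-tokenized corpus can be numericalized like this:

```python
from collections import Counter
from pathlib import Path

import torch


def tokenize(path: Path) -> list[str]:
    """Read a pre-tokenized WikiText file; tokens are whitespace-separated."""
    text = path.read_text(encoding="utf-8")
    # Append <eos> at line ends so document boundaries survive flattening.
    return [tok for line in text.splitlines() for tok in line.split() + ["<eos>"]]


def build_corpus(data_dir: str):
    root = Path(data_dir)
    splits = {name: tokenize(root / f"wiki.{name}.tokens")
              for name in ("train", "valid", "test")}
    # Build the vocabulary from the training split only; WikiText-2's text
    # already contains <unk> tokens, so <unk> ends up in the vocabulary too.
    counts = Counter(splits["train"])
    vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common())}
    ids = {name: torch.tensor([vocab.get(t, vocab["<unk>"]) for t in toks])
           for name, toks in splits.items()}
    return vocab, ids
```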
The base model is a two-layer LSTM network. LSTM (Long Short-Term Memory) networks are a type of recurrent neural network (RNN) that can capture long-term dependencies in sequential data, making them well suited to language modeling; a minimal PyTorch sketch of the architecture follows the feature list.
Key Features:
- Two-layer LSTM
- Fully connected layer for word prediction
- Trained using cross-entropy loss and evaluated with perplexity
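A minimal sketch of this architecture (the 650-unit sizes and 0.5 dropout rate are illustrative assumptions, not necessarily the repository's settings):

```python
import torch
import torch.nn as nn


class LSTMLanguageModel(nn.Module):
    """Two-layer LSTM LM: embedding -> stacked LSTM -> linear decoder."""

    def __init__(self, vocab_size: int, embed_dim: int = 650,
                 hidden_dim: int = 650, num_layers: int = 2, dropout: float = 0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, hidden=None):
        output, hidden = self.lstm(self.embed(tokens), hidden)
        return self.decoder(output), hidden  # logits: (batch, seq_len, vocab)
```

Training minimizes `nn.CrossEntropyLoss` over the flattened logits; perplexity is then just the exponential of that loss (see the evaluation sketch below).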
The AWD-LSTM model improves upon the base LSTM by introducing several key techniques from the 2017 paper "Regularizing and Optimizing LSTM Language Models" by Merity et al. These include (see the condensed sketch after this list):
- Weight Dropping: DropConnect applied to the LSTM's hidden-to-hidden (recurrent) weight matrices, regularizing the recurrent connections without modifying the optimized LSTM kernels.
- ASGD: Averaged Stochastic Gradient Descent, which returns the average of recent weight iterates rather than the last one; the paper triggers the averaging non-monotonically (NT-ASGD), switching from plain SGD once validation perplexity stops improving.
- Variational Dropout: a single dropout mask sampled once per sequence and reused across all time steps, rather than resampled at each step, ensuring temporal consistency.
- Weight Tying: sharing the weight matrix between the input embedding and the output softmax layer, reducing parameter count and improving generalization.
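A condensed sketch of these techniques, in the spirit of the original Salesforce implementation (simplified for illustration; the dropout rates are assumptions, and the `*_raw` re-registration trick relies on PyTorch internals that can shift between versions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightDrop(nn.Module):
    """Weight dropping: DropConnect on an RNN's hidden-to-hidden weights."""

    def __init__(self, rnn: nn.Module, weight_names: list[str], p: float = 0.5):
        super().__init__()
        self.rnn, self.weight_names, self.p = rnn, weight_names, p
        for name in weight_names:
            raw = getattr(rnn, name)
            del rnn._parameters[name]  # re-register under a *_raw alias
            rnn.register_parameter(name + "_raw", nn.Parameter(raw.data))
        rnn.flatten_parameters = lambda: None  # skip cuDNN weight re-flattening

    def forward(self, *args):
        for name in self.weight_names:
            raw = getattr(self.rnn, name + "_raw")
            # Drop recurrent connections with a fresh mask each forward pass.
            setattr(self.rnn, name, F.dropout(raw, p=self.p, training=self.training))
        return self.rnn(*args)


class LockedDropout(nn.Module):
    """Variational dropout: one mask per sequence, reused at every time step."""

    def forward(self, x: torch.Tensor, p: float = 0.4) -> torch.Tensor:
        if not self.training or p == 0.0:
            return x  # x: (batch, seq_len, features)
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
        return x * mask  # broadcast the same mask across all time steps


# Usage: wrap the stacked LSTM so both recurrent weight matrices are dropped.
lstm = WeightDrop(nn.LSTM(650, 650, num_layers=2, batch_first=True),
                  ["weight_hh_l0", "weight_hh_l1"], p=0.5)

# Weight tying: point the decoder at the embedding matrix (shapes must match):
#   model.decoder.weight = model.embed.weight
#
# NT-ASGD (simplified): train with SGD, then switch to torch.optim.ASGD once
# validation perplexity stops improving for several consecutive evaluations.
```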
The models were evaluated using perplexity, which measures how uncertain the model is when predicting the next word in the sequence. Lower perplexity indicates better performance.
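Concretely, perplexity is the exponential of the average per-token cross-entropy, so it can be computed directly from the evaluation loss:

```python
import math

import torch
import torch.nn.functional as F


def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: (num_tokens, vocab_size); targets: (num_tokens,) of word ids."""
    loss = F.cross_entropy(logits, targets)  # mean negative log-likelihood
    return math.exp(loss.item())
```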
| Model | Validation Perplexity | Test Perplexity |
|---|---|---|
| LSTM (Base) | 128.3 | 121.7 |
| AWD-LSTM | 80.27 | 77.11 |
The AWD-LSTM model substantially outperformed the base LSTM, cutting test perplexity from 121.7 to 77.11 (a roughly 37% relative reduction) and demonstrating the effectiveness of these regularization and optimization techniques.