NLP pipeline built with PyTorch. Includes scraping, cleaning, and training stages.

This project was archived by the owner on May 6, 2021 and is now read-only.

Learnings from this project about torchtext and Fourier transform usage have been captured.

NLP pipeline

This is a natural language processing (NLP) pipeline built for research during MSAI 337, Introduction to Natural Language Processing.

Overview

NLP pipeline for Spring 2021.

  1. Get pre-trained GloVe embeddings.
  2. Get benchmark corpora such as WikiText-2, Penn Treebank, and IMDB.
  3. Train a neural network with variable parameters (a minimal model sketch follows this list).
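
As a rough illustration of step 3, below is a minimal sketch of an LSTM language model whose size is driven by parameters like those in the run_lm_experiment_pipeline.yaml shown later (number_of_layers, dropout_probability). It assumes PyTorch; the class name and defaults are hypothetical, not the repository's actual code.

import torch
import torch.nn as nn

# Hypothetical sketch: an LSTM language model with variable parameters.
class LstmLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=50, hidden_size=256,
                 number_of_layers=2, dropout_probability=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size,
                            num_layers=number_of_layers,
                            dropout=dropout_probability,
                            batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, state=None):
        embedded = self.embedding(tokens)           # (batch, seq, emb)
        output, state = self.lstm(embedded, state)  # (batch, seq, hidden)
        return self.decoder(output), state          # (batch, seq, vocab)

# Example: a batch of 4 sequences of length 30 (the sequence_length below).
model = LstmLanguageModel(vocab_size=10_000)
logits, _ = model(torch.randint(0, 10_000, (4, 30)))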

Getting Started

  1. To run from Google Colab, make a copy of the following Colab notebook: https://colab.research.google.com/drive/1WI6atgi6TW8wikipHfj5adOAJ_oZrwy6?usp=sharing
  • The Colab notebook will clone and run the lm-experiment.
  2. To run from your machine, clone with the code below.
git clone https://github.com/Variable-Embedding/nlp-ft
  • Optional but recommended: create a new Anaconda environment.
conda create -n nlp-ft
conda activate nlp-ft
  • Run prep scripts:
make prep
  • Install dependencies:
make install
  3. Run experiments.
  • Run the experiment pipeline for language models ("lm"):
make lm-experiment
  • Run the Fourier transform experiment ("ft"):
make ft-experiment
  4. Options to get and prep necessary data for the NLP pipeline.
  • Get pre-trained GloVe embeddings. By default, the get_pre_trained_embeddings.yaml file is set to "everything", which returns all available embedding types. Options: "everything" to get all embedding files, or one of glove.6B, glove.42B, glove.840B, glove.twitter, fasttext.en, fasttext.simple, charngram. (A torchtext loading sketch appears after the configuration example below.)
make embeddings
  • Get benchmark corpora, such as the WikiText-2 and IMDB datasets, via the torchtext API. Provide the name of a corpus such as "imdb", specify a task such as "language_modeling", or get "everything". Note: by default, make benchmark gets everything currently available in this script (see the loading sketch below).
make benchmark
  • The LM experiment can be controlled with run_lm_experiment_pipeline.yaml:

# The following configuration runs experiments for two corpora,
# each with two versions of GloVe embeddings,
# each with the given training and model params,
# each with the LSTM architecture,
# and with two variations of the LSTM.

# In effect, the experiment runs n times for each corpus with m configurations.

stages:
- name: run_lm_experiment
  corpus_type:
    - wikitext2
    - penntreebank
  embedding_type:
    - glove.6B.50d
    - glove.840B.300d
  model_type:
    - lstm
    - lstm
  lstm_configs:
    - default
    - res-ff-emb
  batch_size: 1024
  max_init_param: 0.05
  max_norm: 5
  number_of_layers: 2
  sequence_length: 30
  sequence_step_size: 10
  dropout_probability: 0.1
  device: gpu
  learning_rate_decay: 0.85
  learning_rate: 1
  number_of_epochs: 2
  min_freq: 5
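
As a hedged illustration of the comments above, here is how such a stage configuration could be expanded into individual runs, one per corpus/embedding/model combination. This sketch assumes PyYAML and is illustrative, not the repository's actual pipeline code.

import itertools
import yaml  # PyYAML

with open("run_lm_experiment_pipeline.yaml") as f:
    stage = yaml.safe_load(f)["stages"][0]

# model_type and lstm_configs are parallel lists:
# (lstm, default) and (lstm, res-ff-emb).
models = list(zip(stage["model_type"], stage["lstm_configs"]))

# One run per (corpus, embedding, model variant) combination:
# 2 corpora x 2 embeddings x 2 variants = 8 runs here.
for corpus, embedding, (model, variant) in itertools.product(
        stage["corpus_type"], stage["embedding_type"], models):
    print(f"run_lm_experiment: {corpus} / {embedding} / {model}-{variant}")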
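
For reference, a minimal sketch of the kind of data that make embeddings and make benchmark fetch, assuming a 2021-era torchtext (0.9+); the repository's actual loaders may differ, and the GloVe download is large (~800 MB for glove.6B).

from torchtext.vocab import GloVe
from torchtext.datasets import WikiText2

# Download pre-trained GloVe vectors (here glove.6B, 50-dimensional).
glove = GloVe(name="6B", dim=50)
print(glove.vectors.shape)                     # e.g. torch.Size([400000, 50])
print(glove.get_vecs_by_tokens(["language"]))  # 1 x 50 embedding lookup

# Download the WikiText-2 benchmark corpus as an iterator of raw lines.
train_iter = WikiText2(split="train")
print(next(iter(train_iter))[:80])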

  5. Optional: set up and configure pypy3 (experimental; not working right now).

  • Create a new virtual environment or conda environment, then activate it.
  • For macOS users, install pypy3 with Homebrew:
brew install pypy3
  • Configure pypy3 and install dependencies:
make install-pypy

Logging

This project uses the Python logging library. The workflow generates log files in the logs folder. When creating new stages, use logger.info / debug / error / warning instead of print for proper logging.
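
A minimal sketch of a logger configured along those lines, using only the standard library; the file name and format here are assumptions, not the project's actual setup.

import logging
import os

os.makedirs("logs", exist_ok=True)

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Send records to a file in the logs folder, with a timestamped format.
handler = logging.FileHandler("logs/pipeline.log")
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
logger.addHandler(handler)

logger.info("starting stage")  # instead of print("starting stage")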
