This project classifies emails into different types (Type 2, Type 3, and Type 4). I tried two different approaches:
- Chained approach - where each model feeds into the next one
- Hierarchical approach - where models are organized in a tree structure
The main folders are:
- config - has some settings
- data - for loading emails and cleaning them
- models - contains my classifier code
- utils - helper functions I wrote
The main file to run is main.py
You need these packages:
- scikit-learn (for ML algorithms)
- numpy
- pandas (not using it much yet)
- nltk (for text processing)
I think Python 3.7 or newer should work fine.
Just run this command in the terminal:
python main.py
In the chained approach:
- First I predict if it's Type 2 or not
- Then I use that result to help predict if it's Type 3
- Finally I use both previous results to predict Type 4
This can help because the Types might be related to each other, so an earlier prediction gives the later models extra information.
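Here is a minimal sketch of the chaining idea. It is not the code from the models folder: the names `ChainedClassifier` and `NearestNeighbor` and the toy data are made up for illustration. `NearestNeighbor` is a tiny pure-Python stand-in, and any object with `fit` and `predict` methods (for example a scikit-learn estimator) should be usable in its place. Labels are assumed to be 0/1 so they can be appended to the features.

```python
class NearestNeighbor:
    """Hypothetical stand-in model: predicts the label of the closest training row."""

    def fit(self, X, y):
        self.X = [list(row) for row in X]
        self.y = list(y)
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [
            self.y[min(range(len(self.X)), key=lambda i: dist(self.X[i], row))]
            for row in X
        ]


class ChainedClassifier:
    """One model per Type; each stage sees the features plus all earlier labels."""

    def __init__(self, make_model, n_stages=3):
        self.models = [make_model() for _ in range(n_stages)]

    def fit(self, X, Y):
        feats = [list(row) for row in X]
        for stage, model in enumerate(self.models):
            targets = [labels[stage] for labels in Y]
            model.fit(feats, targets)
            # Assumption: later stages are trained on the TRUE earlier labels.
            feats = [f + [t] for f, t in zip(feats, targets)]
        return self

    def predict(self, X):
        feats = [list(row) for row in X]
        stage_preds = []
        for model in self.models:
            preds = list(model.predict(feats))
            stage_preds.append(preds)
            # At prediction time the PREDICTED labels are appended instead.
            feats = [f + [p] for f, p in zip(feats, preds)]
        return list(zip(*stage_preds))


# Toy data (invented): two numeric features, labels are (Type 2, Type 3, Type 4).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

chain = ChainedClassifier(NearestNeighbor).fit(X, Y)
print(chain.predict(X))
```

Training on the true earlier labels is one possible choice. Training on predicted labels instead would make the training and prediction inputs more alike, at the cost of passing early mistakes down the chain.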
In the hierarchical approach:
- First figure out Type 2
- Depending on Type 2 result, use a specific model for Type 3
- Then use both results to pick the right model for Type 4
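The steps above could be sketched roughly like this. Again, this is an illustration under my own assumptions and not the actual classifier code: `HierarchicalClassifier`, `NearestNeighbor`, and the toy data are hypothetical. The idea is to keep one model per path through the tree, keyed by the labels decided so far.

```python
from collections import defaultdict


class NearestNeighbor:
    """Hypothetical stand-in model: predicts the label of the closest training row."""

    def fit(self, X, y):
        self.X = [list(row) for row in X]
        self.y = list(y)
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [
            self.y[min(range(len(self.X)), key=lambda i: dist(self.X[i], row))]
            for row in X
        ]


class HierarchicalClassifier:
    """A tree of models: the earlier labels select which model predicts the next Type."""

    def __init__(self, make_model):
        self.make_model = make_model
        self.models = {}  # key: tuple of earlier labels, value: model for the next Type

    def fit(self, X, Y):
        # Group training rows by the labels that come before each Type.
        groups = defaultdict(list)
        for row, labels in zip(X, Y):
            for depth in range(len(labels)):
                groups[tuple(labels[:depth])].append((row, labels[depth]))
        for prefix, samples in groups.items():
            rows, targets = zip(*samples)
            self.models[prefix] = self.make_model().fit(list(rows), list(targets))
        return self

    def predict_one(self, row):
        prefix = ()  # the empty prefix selects the root (Type 2) model
        while prefix in self.models:
            label = self.models[prefix].predict([row])[0]
            prefix += (label,)
        return prefix


# Toy data (invented): two numeric features, labels are (Type 2, Type 3, Type 4).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

hier = HierarchicalClassifier(NearestNeighbor).fit(X, Y)
print([hier.predict_one(row) for row in X])
```

Each model deeper in the tree is trained on fewer emails, since it only sees the rows that share its earlier labels. With real estimators, a branch that contains only one class may need special handling.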
I'm still working on improving the accuracy. The hierarchical one is more complex but might work better for some email types.