A Jupyter-based agent that leverages LangChain and OpenAI to perform common data-cleaning tasks (e.g., imputing missing values, detecting outliers) via generated Python code.
- Anaconda or Miniconda installed
- Git (to clone this repo)
- An OpenAI API key set in your environment (`OPENAI_API_KEY`)
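Before launching a notebook, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch, not part of the repository; the helper name `openai_key_is_set` is hypothetical, and the only thing taken from this README is the variable name `OPENAI_API_KEY`.

```python
import os


def openai_key_is_set(env=os.environ):
    """Return True if OPENAI_API_KEY is present and non-empty in `env`.

    Hypothetical helper for illustration; `env` defaults to the real
    process environment but accepts any mapping, which makes it easy to test.
    """
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not openai_key_is_set():
    print("OPENAI_API_KEY is not set; export it before launching the notebook.")
```

If the message appears, set the variable in the shell that starts Jupyter or VS Code, since a notebook kernel inherits its environment from the process that launched it.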
- Clone the repository and enter its folder:

      git clone https://github.com/Tchanwangsa/data-cleaning-agent.git
      cd data-cleaning-agent

- Create and activate a new conda environment (replace `3.xx` with the Python 3 version you want to use):

      conda create -n data-cleaning-agent python=3.xx -y
      conda activate data-cleaning-agent

- (Optional) Install Jupyter if it is not already available (or skip if you prefer VS Code notebooks):

      conda install jupyter -y     # only if using Jupyter
      conda install ipykernel -y   # for VS Code notebook support

- Launch Jupyter Notebook (or VS Code):

      jupyter notebook

- In the Notebook UI, open `main.ipynb` and run the first cell to install all required Python packages:

      %pip install import-ipynb pandas numpy scikit-learn \
          langchain langchain-community openai

  If the cell lists `sklearn`, replace it with `scikit-learn`; the `sklearn` name on PyPI is deprecated and no longer installs.
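After the install cell finishes, a quick check can confirm that every package is importable from the active kernel. This is a sketch rather than something shipped with the repository: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list assumes the usual import names for the packages above (for example, scikit-learn is imported as `sklearn` and import-ipynb as `import_ipynb`).

```python
from importlib.util import find_spec

# Import names (not pip names) for the packages installed by the first cell.
# Assumption: these are the standard top-level module names for each package.
REQUIRED_MODULES = [
    "import_ipynb",
    "pandas",
    "numpy",
    "sklearn",
    "langchain",
    "langchain_community",
    "openai",
]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```

A non-empty list usually means the notebook kernel is not the `data-cleaning-agent` environment, or that the install cell needs to be rerun.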
- Copy `scaffold.ipynb` to `features/your_feature.ipynb` and open it in VS Code or Jupyter.
- In the copied notebook, implement the function `def your_feature(user_query, df):` by replacing the placeholder code.
- Save your notebook and import your feature in `main.ipynb`, then add it to the features list.
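To make the steps above concrete, here is one possible shape for a feature function. Only the signature `your_feature(user_query, df)` comes from this README; everything else is an assumption for illustration, in particular that `df` is a pandas DataFrame, that a feature returns a cleaned copy of it, and that the query can be inspected as plain text. The scaffold's real contract may differ, so check `scaffold.ipynb` before relying on this.

```python
import pandas as pd


def your_feature(user_query, df):
    """Hypothetical feature: fill missing values in numeric columns.

    Assumptions (not confirmed by the repository): `user_query` is a string,
    `df` is a pandas DataFrame, and the feature returns a cleaned copy.
    """
    cleaned = df.copy()  # leave the caller's DataFrame untouched
    numeric_cols = cleaned.select_dtypes(include="number").columns
    # Use the mean only when the query asks for it; default to the median.
    use_mean = "mean" in user_query.lower()
    for col in numeric_cols:
        fill = cleaned[col].mean() if use_mean else cleaned[col].median()
        cleaned[col] = cleaned[col].fillna(fill)
    return cleaned


# Small usage example with made-up data.
example = pd.DataFrame(
    {"age": [20.0, None, 40.0], "city": ["Oslo", None, "Lima"]}
)
result = your_feature("impute missing values with the median", example)
print(result)
```

Working on a copy keeps the original data available for comparison, which is convenient when several features run on the same DataFrame in one session.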