A Python program that extracts images from a PDF file and then extracts the text from them using Optical Character Recognition (OCR). Afterwards, you can feed the extracted data to an LLM running locally via Ollama.
- The project was made for Windows. If you use another OS such as macOS or Linux, you need to adjust some paths because they use `\\` as the separator.
- Python version 3.11
- Install the dependencies: `pip install -r requirements.txt`
- Pull the model you want to use: `ollama pull <YourModel>`

This is where the paths are saved; change them according to your system:
- `tesseract_path`: the path to your `tesseract.exe` file (Windows default: `C:\\Program Files\\Tesseract-OCR\\tesseract.exe`)
- `input_path`: the path to your PDF file (example: `"example_folder\\example.pdf"`)
- `output_path`: the extracted images and text will go here (example: `".\\out\\"`)
- `lang`: the language of the PDF (example: `"deu+eng"` for German and English)
- `model`: the Ollama model you want to use (example: `llama3.2` for English, `cas/discolm-mfto-german` for German)
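
Because the example values above use Windows-style `\\` separators, one way to keep the settings portable is to build them with `pathlib`. The following is a minimal sketch, not the project's actual configuration code: the variable names mirror the settings listed above, while the helper `default_tesseract_path` and its non-Windows fallback (a `tesseract` binary on the `PATH`) are assumptions made for illustration.

```python
import sys
from pathlib import Path


def default_tesseract_path(platform: str = sys.platform) -> Path:
    """Guess where the Tesseract binary lives (hypothetical helper)."""
    if platform.startswith("win"):
        # Windows default named in this README.
        return Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
    # Assumption: on macOS/Linux the binary is called "tesseract" and is on PATH.
    return Path("tesseract")


# Settings named as in the list above; the values are the README's examples.
tesseract_path = default_tesseract_path()
input_path = Path("example_folder") / "example.pdf"  # joined with the OS separator
output_path = Path("out")
lang = "deu+eng"      # German and English
model = "llama3.2"    # any model you pulled with `ollama pull`
```

If the script concatenates paths as strings (for example `output_path + folder`), these values would need to be converted with `str(...)` first, or the concatenation replaced with the `/` operator.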
- If you already have images:
  - place them in your `output_path` folder
  - comment out the method call `image_extraction()` (to: `#image_extraction()`)
  - delete the `+ folder` in the line saying `input_folder = output_path + folder`
- If you only want to extract the images and do not want to use OCR/AI:
  - comment out the method call `ocr(input_folder, out_txt, lang)` (to: `#ocr(input_folder, out_txt, lang)`)
  - comment out the method call `feedllm()` (to: `#feedllm()`)
- If you only want to use the AI with a txt file you already have:
  - comment out the method call `image_extraction()` (to: `#image_extraction()`)
  - delete the `+ folder` in the line saying `input_folder = output_path + folder`
  - comment out the method call `ocr(input_folder, out_txt, lang)` (to: `#ocr(input_folder, out_txt, lang)`)
  - place the `.txt` file in your `output_path` folder and rename it to `out.txt`
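
All three scenarios above amount to skipping one or more of the three steps. As an alternative to commenting lines in and out, the same choice could be expressed with boolean switches. This is only a hypothetical sketch: the function names `image_extraction`, `ocr` and `feedllm` come from this README, but their bodies here are placeholders, and `run_pipeline`, its flags, the `folder` value and the `out.txt` location are assumptions for illustration.

```python
def image_extraction():
    """Placeholder for the project's image extraction step."""
    return "image_extraction"


def ocr(input_folder, out_txt, lang):
    """Placeholder for the project's OCR step."""
    return "ocr"


def feedllm():
    """Placeholder for the step that sends the text to the Ollama model."""
    return "feedllm"


def run_pipeline(extract=True, run_ocr=True, ask_llm=True,
                 output_path="out/", folder="images/", lang="deu+eng"):
    """Run only the selected steps and return the names of the steps that ran."""
    steps = []
    if extract:
        steps.append(image_extraction())
        # Assumption: extracted images end up in a subfolder of output_path.
        input_folder = output_path + folder
    else:
        # Existing images are expected directly in output_path.
        input_folder = output_path
    # Assumption: the OCR result is written to out.txt inside output_path.
    out_txt = output_path + "out.txt"
    if run_ocr:
        steps.append(ocr(input_folder, out_txt, lang))
    if ask_llm:
        steps.append(feedllm())
    return steps
```

With this sketch, `run_pipeline(extract=False)` corresponds to the first scenario, `run_pipeline(run_ocr=False, ask_llm=False)` to the second, and `run_pipeline(extract=False, run_ocr=False)` to the third.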