PDF Image OCR

A Python program that extracts images from a PDF file and reads the text in them using optical character recognition (OCR). Afterwards, you can pass the extracted text to an LLM running locally via Ollama.

Requirements

The project was written for Windows. On another OS such as macOS or Linux, you need to adjust some paths, because they use \\ as the path separator.
Python version: 3.11
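
If you need to carry the Windows-style example paths over to macOS or Linux, one option is to let pathlib do the conversion. This is a minimal sketch and not part of the project; the helper name to_posix is an assumption made for illustration.

```python
from pathlib import PureWindowsPath


def to_posix(raw: str) -> str:
    """Convert a Windows-style path string to forward-slash form.

    to_posix is a hypothetical helper, not a function from main.py.
    PureWindowsPath parses backslash separators on any OS, so this
    behaves the same on Windows, macOS, and Linux.
    """
    return PureWindowsPath(raw).as_posix()


if __name__ == "__main__":
    # The example input path from this README, with backslashes.
    print(to_posix("example_folder\\example.pdf"))
```

Forward-slash paths are also accepted by Python on Windows, so the converted form can be used on all three systems.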

pip install -r requirements.txt

Install Tesseract and copy the path to its executable.
Install Ollama.
Then pull a model in the console (example model: llama3.2):

ollama pull <YourModel>
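
To confirm that both tools can be found before running the program, a quick check like the following may help. It is a sketch under the assumption that the executables are named tesseract and ollama and are on your PATH; on Windows, Tesseract is often not on the PATH, which is why its location is set explicitly in the configuration. The function name missing_tools is hypothetical.

```python
import shutil


def missing_tools(tools=("tesseract", "ollama")) -> list[str]:
    """Return the names of the given executables that are not on PATH.

    missing_tools is a hypothetical helper; the default tool names are
    assumptions about how the executables are called on your system.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Ollama were both found.")
```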

var.py

This file stores the paths and settings. Change them to match your system.

tesseract_path: path to your tesseract.exe file (Windows default: C:\\Program Files\\Tesseract-OCR\\tesseract.exe)
input_path: path to your PDF file (example: "example_folder\\example.pdf")
output_path: folder where the extracted images and text are written (example: ".\\out\\")
lang: language(s) of the PDF, joined with + (example: "deu+eng" for German and English)
model: the Ollama model you want to use (example: llama3.2 for English, cas/discolm-mfto-german for German)
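
Put together, the settings could look roughly like this. The variable names and example values are the ones listed above; the layout of the actual file may differ, so treat this as an illustration and not as the shipped configuration.

```python
# Sketch of the settings file, using only the example values from this README.
# The real file in the repository may be laid out differently.

# Location of the Tesseract executable (Windows default install path).
tesseract_path = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

# The PDF to read images from.
input_path = "example_folder\\example.pdf"

# Folder that receives the extracted images and the OCR text.
output_path = ".\\out\\"

# OCR language(s); several languages are joined with "+".
lang = "deu+eng"

# Ollama model used for the final LLM step.
model = "llama3.2"
```

The double backslashes are Python string escapes: each \\ in the source is a single backslash in the resulting path.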

Extras

If you already have the images:
- place them in your output_path folder
- comment out the call to image_extraction() (so it reads #image_extraction())
- remove the + folder part from the line input_folder = output_path + folder
If you only want to extract the images and do not want to use OCR or the LLM:
- comment out the call to ocr(input_folder, out_txt, lang) (so it reads #ocr(input_folder, out_txt, lang))
- comment out the call to feedllm() (so it reads #feedllm())
If you only want to run the LLM on a .txt file you already have:
- comment out the call to image_extraction() (so it reads #image_extraction())
- remove the + folder part from the line input_folder = output_path + folder
- comment out the call to ocr(input_folder, out_txt, lang) (so it reads #ocr(input_folder, out_txt, lang))
- place the .txt file in your output_path folder and rename it to out.txt
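
The three cases above come down to which of the three steps still run and which folder the later steps read from. The sketch below only summarizes that mapping; it does not call the real functions. The scenario names, the helpers steps_for and input_folder_for, and the example value for folder are assumptions made for illustration.

```python
# Step names as they appear in this README; the scenario keys are hypothetical.
SCENARIOS = {
    "full": ("image_extraction", "ocr", "feedllm"),
    "have_images": ("ocr", "feedllm"),
    "images_only": ("image_extraction",),
    "have_text": ("feedllm",),
}


def steps_for(scenario: str) -> tuple[str, ...]:
    """Return the steps that stay active (not commented out) for a scenario."""
    return SCENARIOS[scenario]


def input_folder_for(scenario: str, output_path: str, folder: str) -> str:
    """Return the folder later steps read from.

    When image extraction is skipped, the README says to drop "+ folder",
    so the files are read directly from output_path.
    """
    if "image_extraction" in steps_for(scenario):
        return output_path + folder
    return output_path


if __name__ == "__main__":
    # "images\\" is a made-up value for folder, used only for this demo.
    for name in SCENARIOS:
        print(name, steps_for(name), input_folder_for(name, ".\\out\\", "images\\"))
```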
