
lamlion/pdf2pdfocr

 
 


pdf2pdfocr

A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") to the original file, making it a searchable PDF. The script uses only open source tools.

donations

This software is free, but if you like it, please donate to support new features.

paypal

flattr

Bitcoin (BTC) address: 173D1zQQyzvCCCek9b1SpDvh7JikBEdtRJ

Ethereum (ETH) address: 0x94a0e2e4eac8406e81806a152593e492824adb95

Litecoin (LTC) address: LT63cQRUZ8YgZZB5nVogEqQR91oUjHv9hN

Dogecoin (DOGE) address: DBNdvUptuZYMt7gb9HavCQovdsoxQzP6i6

Niobio Cash (NBR - http://niobio.money) address: N918uWiGba4ZcCBsc8nZrqhRaucjAZvhnMQ6WA7ubKoNhgNmWS1xn1pThP9HJG6rWqVEEWSPRkJff6dQjCEtbgtMP2Eudcr

installation

On Linux, installation is straightforward: install the required packages. You can then use the "install_command" script to copy the required files to "/usr/local/bin".

On macOS, you will need MacPorts. Install MacPorts, then run:

xcode-select --install
sudo xcodebuild -license
# install correct macports from https://www.macports.org/install.php
sudo port selfupdate
# install tesseract (Portuguese included - adjust the language packages to your preferred languages)
sudo port install git tesseract tesseract-por tesseract-osd tesseract-eng
# install python 3 and other dependencies
sudo port install python34 py34-pip poppler poppler-data ImageMagick wget 
# configure default python3 installer
sudo port select --set python3 python34
sudo port select --set pip pip34
# install libs (ignore warning messages)
sudo pip install reportlab
sudo pip install pypdf2
# install pdftk (may fail on newer macOS)
sudo port install pdftk
# if that fails, download the pdftk installer package and install it manually
# for versions <  macOS 10.11
  wget https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/pdftk_server-2.02-mac_osx-10.6-setup.pkg
# for versions >= macOS 10.11 (http://stackoverflow.com/questions/32505951/pdftk-server-on-os-x-10-11)
  wget https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/pdftk_server-2.02-mac_osx-10.11-setup.pkg

Note that wget and pdftk are optional. The MacPorts version of pdftk won't build on macOS 10.11 and above, so you have to install it manually with the commands above.

On Windows, you will need to install the required software manually. Please read the document "Installing Windows tools for pdf2pdfocr" for a simple tutorial. It's also possible to use the "Send To" menu via the "pdf2pdfocr.vbs" script.
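
Whatever the platform, it can help to confirm that the dependencies are visible before the first run. The following is a minimal sketch, not part of pdf2pdfocr itself. The executable names ("pdftoppm" for poppler, "convert" for ImageMagick) and the module import names are assumptions based on the packages listed above and may differ on your system.

```python
import importlib.util
import shutil

# Assumed executable names for the packages named in this README.
REQUIRED_TOOLS = ["tesseract", "pdftoppm", "convert"]
OPTIONAL_TOOLS = ["pdftk", "wget"]
# Assumed import names for the pip packages "reportlab" and "pypdf2".
REQUIRED_MODULES = ["reportlab", "PyPDF2"]


def find_missing_tools(tools, which=shutil.which):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def find_missing_modules(modules):
    """Return the top-level Python modules from `modules` that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing required tools:  ", find_missing_tools(REQUIRED_TOOLS))
    print("missing optional tools:  ", find_missing_tools(OPTIONAL_TOOLS))
    print("missing python modules:  ", find_missing_modules(REQUIRED_MODULES))
```

An empty list on each line suggests the corresponding dependencies are installed. A missing optional tool is not necessarily a problem.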

docker

The Dockerfile can be used to build a docker image that runs pdf2pdfocr inside a container. To build the image, download all the sources and run:

docker build -t leofcardoso/pdf2pdfocr:latest .

It's also possible to pull the docker image from Docker Hub:

docker pull leofcardoso/pdf2pdfocr

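You can run the application with "docker run". The command below mounts the current directory at "/home/docker" in the container and processes "sample_file.pdf" from it: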

docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./sample_file.pdf
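
If the file is not in the current directory, the same command can be assembled for any path. This is a hedged sketch: the helper name `docker_command` is hypothetical, and it assumes, as the command above implies, that the container resolves "./<name>" relative to the "/home/docker" mount point.

```python
from pathlib import Path

IMAGE = "leofcardoso/pdf2pdfocr"


def docker_command(pdf_path, image=IMAGE):
    """Build the argument list for `docker run`, mounting the file's folder.

    Assumption: the container treats /home/docker as its working directory,
    so the file can be addressed as ./<name> once its folder is mounted there.
    """
    pdf = Path(pdf_path).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{pdf.parent}:/home/docker",
        image,
        "-v", "-i", f"./{pdf.name}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(docker_command("sample_file.pdf")))
```

The list can be passed to `subprocess.run` unchanged. Note that the first "-v" is the docker volume flag, while the second "-v" is passed through to pdf2pdfocr, exactly as in the command above.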

basic usage

This will create a searchable (OCR) PDF file in the same directory as "input_file".

pdf2pdfocr.py -i <input_file>  

In some cases, you will need additional option flags. Run:

pdf2pdfocr.py --help 

to view all the options.
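
To process several files, the basic command can be repeated once per PDF. The sketch below is an illustration under assumptions: it uses only the "-i" flag shown above, the helper name `build_commands` is hypothetical, and it expects "pdf2pdfocr.py" to be on PATH (for example after running "install_command").

```python
import subprocess
import sys
from pathlib import Path


def build_commands(paths, script="pdf2pdfocr.py"):
    """Return one `script -i <file>` command per PDF in `paths`, in sorted order."""
    pdfs = sorted(str(p) for p in paths if Path(p).suffix.lower() == ".pdf")
    return [[script, "-i", pdf] for pdf in pdfs]


def run_all(folder):
    """Run pdf2pdfocr on every PDF directly inside `folder`; report failures."""
    for cmd in build_commands(Path(folder).iterdir()):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print("failed:", cmd[-1], file=sys.stderr)


if __name__ == "__main__" and len(sys.argv) > 1:
    run_all(sys.argv[1])
```

Because the searchable PDF is written next to its input, running this a second time over the same folder may also pick up previously generated files. Moving the results elsewhere between runs avoids that.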

