Converts a set of table images into a single CSV file, powered by Tesseract.
Expect heavy CPU usage, because the work is parallelized.
- Install the Python dependencies:
    pip install -r requirements.txt
- Install Tesseract as per https://github.com/tesseract-ocr/tesseract/wiki
- Put the images in page.d, with filenames that reflect the intended order.
- If necessary, preprocess the images so that they contain only the tables.
- Start the conversion:
    ./run.sh
- The CSV will be stored in result.csv.
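
Because the images in page.d need ordered filenames, it can help to check how a set of names sorts before starting a run. The sketch below is only an illustration: it assumes the files are processed in plain lexicographic order, which this README does not confirm, and the helper names `natural_key` and `misordered` are hypothetical.

```python
import re


def natural_key(name):
    """Split digit runs from text so that 'page10' sorts after 'page2'."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def misordered(names):
    """Return True when plain sorting disagrees with numeric-aware sorting.

    Assumption: the conversion reads files in plain lexicographic order.
    """
    return sorted(names) != sorted(names, key=natural_key)
```

If `misordered(os.listdir("page.d"))` returns True, zero-padding the numbers (page01.png rather than page1.png) makes both orders agree.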
The conversion works in three steps:
- Chops each table into cells.
- OCRs the cells with Tesseract.
- Puts the OCRed cell texts back in their original positions and generates the CSV.
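
The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the table image is assumed to be already binarized into a list of rows of 0/1 values (1 = dark pixel), ruling lines are assumed to be fully dark rows and columns, and `ocr_cell` is a hypothetical callable standing in for the Tesseract call. Threads are used only to keep the sketch short.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def gaps_between_lines(profile, full):
    """Return (start, end) index ranges that lie between ruling lines.

    profile[i] is the number of dark pixels in row/column i; an index is
    treated as part of a ruling line when that count equals `full`.
    """
    gaps, start = [], None
    for i, count in enumerate(profile):
        is_line = count == full
        if not is_line and start is None:
            start = i
        elif is_line and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(profile)))
    return gaps


def chop(image):
    """Step 1: split a binarized table into (row, col, crop) cells."""
    height, width = len(image), len(image[0])
    row_profile = [sum(row) for row in image]
    col_profile = [sum(row[x] for row in image) for x in range(width)]
    row_gaps = gaps_between_lines(row_profile, width)
    col_gaps = gaps_between_lines(col_profile, height)
    return [
        (r, c, [row[x0:x1] for row in image[y0:y1]])
        for r, (y0, y1) in enumerate(row_gaps)
        for c, (x0, x1) in enumerate(col_gaps)
    ]


def ocr_cells(cells, ocr_cell, workers=4):
    """Step 2: run the (hypothetical) ocr_cell callable on every crop in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr_cell, (crop for _, _, crop in cells)))
    return [(r, c, text) for (r, c, _), text in zip(cells, texts)]


def to_csv(cell_texts):
    """Step 3: place each text at its (row, col) position and emit CSV."""
    if not cell_texts:
        return ""
    n_rows = max(r for r, _, _ in cell_texts) + 1
    n_cols = max(c for _, c, _ in cell_texts) + 1
    grid = [[""] * n_cols for _ in range(n_rows)]
    for r, c, text in cell_texts:
        grid[r][c] = text.strip()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(grid)
    return out.getvalue()
```

A call such as `to_csv(ocr_cells(chop(image), ocr_cell))` chains the three steps. In practice `ocr_cell` would wrap Tesseract, for instance through the pytesseract package's `image_to_string`; that choice of wrapper is an assumption, not something this README states.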