Stefan Weil edited this page Jul 27, 2021 · 69 revisions

Training Fraktur from GT4HistOCR

This is an intermediate report on training for Fraktur which was done at Mannheim University Library (@UB-Mannheim). See the latest results from 2020-02-15. More training with additional ground truth data was done in 2021, see frak2021.

What is GT4HistOCR?

GT4HistOCR is ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. See this publication for details:

Springmann, Uwe, Reul, Christian, Dipper, Stefanie, & Baiter, Johannes. (2018). GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1344132 (see fulltext)

Requirements

Use a recent Linux distribution. Newer distributions like Debian Buster provide Tesseract, so it is not necessary to build your own Leptonica or Tesseract.

Training requires a lot of disk space, so use a working directory with at least 24 GiB of free space.

Training is also CPU intensive. At least 4 CPU cores should be available, and a fast CPU with AVX support is highly recommended.
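These requirements can be checked quickly before starting; a small sketch, assuming a Linux system with GNU coreutils:

```shell
# Quick environment check (a sketch; assumes Linux and GNU coreutils).
df -BG --output=avail .    # free space in the working directory (GiB)
nproc                      # number of available CPU cores
grep -qm1 avx /proc/cpuinfo && echo "AVX supported" || echo "no AVX"
```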

Preparing the data for training

The training data is organized in subdirectories and uses PNG images, while tesstrain expected a flat directory with TIFF images, so some preparation is needed. Note: this has been fixed in the latest tesstrain version.

# Clone tesstrain.
git clone https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain

# Get unicharset files for some scripts (needed for character properties).
cd data
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/Cyrillic.unicharset
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/Greek.unicharset
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/Latin.unicharset

# Get the GT4HistOCR data.
mkdir GT4HistOCR
cd GT4HistOCR
wget https://zenodo.org/record/1344132/files/GT4HistOCR.tar

# Unpack the data.
tar xf GT4HistOCR.tar
for f in corpus/*.tar.bz2; do echo $f && tar xjf $f; done

# Optionally remove the tar archives which are no longer needed.
rm -r GT4HistOCR.tar corpus

# Remove wrong execute flag from files.
find dta19 EarlyModernLatin Kallimachos RefCorpus-ENHG-Incunabula RIDGES-Fraktur models -type f | xargs chmod -x

# Remove broken image.
rm dta19/1882-keller_sinngedicht/04970.nrm.png

# Remove BOM from some GT texts.
perl -pi -e s/$'\xEF\xBB\xBF'//g $(find Kallimachos/149* -name "*.gt.txt"|xargs grep -rl $'\xEF\xBB\xBF')

# Normalize unicode in GT texts.
../../normalize.py dta19/{1853-rosenkranz_aesthetik/03221.gt.txt,1879-vischer_auch02/03739.gt.txt}
../../normalize.py Kallimachos/1488-Heiligenleben-GWM11407/*

# Add missing GT text.
echo "Welt und Leben zu beherrschen wissen ; nicht Feldher⸗" >dta19/1819-goerres_revolution/01305.gt.txt 

The remaining commands are only needed for older versions of tesstrain which could only handle flat directories with *.tif files.

# Link all ground truth texts into one directory (can take more than 60 min).
cd ../ground-truth
for t in $(find ../GT4HistOCR -name "*.txt"); do ln -sf $t $(echo $t|perl -pe "s:.*GT4HistOCR/[^/]*/::; s:/:_:g"); done

# Convert png images to tiff images into one directory (can take more than 120 min).
for i in $(find ../GT4HistOCR -name "*.bin.png"); do echo convert $i $(echo $i|perl -pe "s:.*GT4HistOCR/[^/]*/::; s:/:_:g; s:bin.png:tif:"); done|sh -x
for i in $(find ../GT4HistOCR -name "*.nrm.png"); do echo convert $i $(echo $i|perl -pe "s:.*GT4HistOCR/[^/]*/::; s:/:_:g; s:nrm.png:tif:"); done|sh -x

# Now go back to the base directory. Training can start.
cd ../..

Training starts with generating box files which also takes a lot of time.

Training from scratch

Training from scratch was run in August 2019 with the latest Tesseract (Git master) and 10000, 100000, 300000 and 900000 iterations. The generated data is available online.

Result 10000 iterations

At iteration 9725/10000/10000, Mean rms=1.058%, delta=4.77%, char train=17.929%, word train=34.821%, skip ratio=0%,  New worst char error = 17.929 wrote checkpoint.

Finished! Error rate = 17.396
[...]
real    83m31.333s
user    100m45.497s
sys     8m41.462s

Result 100000 iterations

At iteration 49622/100000/100001, Mean rms=0.395%, delta=1.033%, char train=3.344%, word train=9.289%, skip ratio=0%,  wrote checkpoint.

Finished! Error rate = 3.041
[...]
real    366m7.096s
user    646m14.635s
sys     3m19.165s

Result 300000 iterations

At iteration 102549/300000/300002, Mean rms=0.311%, delta=0.799%, char train=2.364%, word train=6.189%, skip ratio=0%,  wrote checkpoint.

Finished! Error rate = 1.747
[...]
real    780m7.610s
user    3746m24.974s
sys     68m52.158s

Result 900000 iterations

At iteration 237920/900000/900006, Mean rms=0.187%, delta=0.316%, char train=0.946%, word train=3.094%, skip ratio=0%,  wrote checkpoint.                                           

Finished! Error rate = 0.898
[...]
real    3200m2.210s
user    11396m12.393s
sys     164m28.908s

Result 2000000 iterations

This training was run on two different machines, one using Tesseract 4.0.0 from Debian Buster, the other using Tesseract built from Git master. Even with the latest modifications to the tesstrain code, the two trainings showed very different behavior: Tesseract 4.0.0 reported lots of encoding failures (caused by unnormalized ground truth texts). The achieved CER values differ, but are in a similar range.

Intermediate values:

  • CER 0.561 at 368648 iterations with Tesseract Git master (433 h CPU time)
  • CER 0.606 at 254680 iterations with Tesseract 4.0.0 (350 h CPU time).

Git master, training running on server

$ time make -r training MAX_ITERATIONS=2000000 MODEL_NAME=GT4HistOCR_2000000 RATIO_TRAIN=0.99
[...]
At iteration 395372/2000000/2000012, Mean rms=0.167%, delta=0.323%, char train=0.901%, word train=2.741%, skip ratio=0%,  wrote checkpoint.                                         

Finished! Error rate = 0.528
[...]
real    7697m12.388s
user    26882m51.379s
sys     353m35.842s

4.0.0, training running on VM

time make -r training MAX_ITERATIONS=2000000 MODEL_NAME=GT4HistOCR_2000000 RATIO_TRAIN=0.99
[...]
At iteration 357313/2000000/2004959, Mean rms=0.167%, delta=0.298%, char train=0.888%, word train=3.022%, skip ratio=0.3%,  wrote checkpoint.                                       

Finished! Error rate = 0.58
[...]
real    10690m43,457s
user    34225m45,698s
sys     396m18,729s

Other training

More training is currently (since September 2019) running on a virtual machine in the bwCloud. It is based on Git master (7a7704bc94e1942ee10047970b6c93e4871b2cd8) which can directly handle the images from GT4HistOCR, so no conversion to TIFF is needed.

For this training, all known problems in the GT4HistOCR ground truth texts were fixed. In addition, upper case "J" in Roman numerals and before lower case consonants was replaced by "I".

Training from scratch

This training was running from 2019-09-04 until 2020-01-02.

make -r training MAX_ITERATIONS=5000000 MODEL_NAME=GT4HistOCR_5000000 RATIO_TRAIN=0.99
[...]
At iteration 819826/5000000/5000032, Mean rms=0.198%, delta=0.477%, char train=1.296%, word train=3.322%, skip ratio=0%,  wrote checkpoint.

Finished! Error rate = 0.568

Training from scratch (alternate network specification)

make -r training NET_SPEC=[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c###] MAX_ITERATIONS=5000000 MODEL_NAME=GT4HistOCR_5000000-2 RATIO_TRAIN=0.99

Current best CER: 0.521 % at iteration 503468 (12000 hours CPU time)

Finetuning based on script/Fraktur

make -r training START_MODEL=Fraktur TESSDATA=/usr/local/share/tessdata/tessdata_best/script MAX_ITERATIONS=5000000 MODEL_NAME=Fraktur_5000000 RATIO_TRAIN=0.99

Current best CER: 0.312 % at iteration 450937 (13000 hours CPU time).

This is the best CER achieved so far.

frak2021

A new model frak2021 was trained in 2021 based on ground truth from Fibeln and updated versions of GT4HistOCR and AustrianNewspapers.

That new model also uses a smaller neural network, so OCR is faster. Trained model files are currently available from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/.

OCR results

Some OCR results for a double page from a historic newspaper (Deutscher Reichsanzeiger) which were produced with existing models from Google and new models from the training above are also available online.

The current best models from fine-tuning Tesseract's script/Fraktur model show a significantly better CER for selected samples: using the new best model reduces the CER from 3.8 % to 2.4 %, and the combination of all three best models achieves a CER of 2.2 %.

Training for fine tuning

Starting with an existing traineddata model requires far fewer iterations to achieve low error rates. The following models are candidates as a starting point:

  • eng.traineddata (English, mainly Antiqua but also some Fraktur)
  • frk.traineddata (German, Fraktur and some Antiqua)
  • script/Latin.traineddata (Western Europe, mainly Antiqua but also some Fraktur)
  • script/Fraktur.traineddata (Western Europe, Fraktur and some Antiqua)

Problems

Images for ground truth

  • There exist *.bin.png and *.nrm.png images. Why are there these two variants? All nrm.png files are 8-bit grayscale, as are most bin.png files; only a few of them are 1-bit grayscale.
  • Broken image: dta19/1882-keller_sinngedicht/04970.nrm.png.
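The bit depth and colour type of a PNG can be inspected directly: in the IHDR chunk they are stored in bytes 25 and 26 of the file, right after width and height. A sketch with standard tools (hypothetical file name):

```shell
# Print bit depth and colour type of a PNG (bytes 25-26 of the file).
# "8 0" means 8-bit grayscale, "1 0" means 1-bit grayscale.
head -c 26 example.bin.png | tail -c 2 | od -An -tu1
```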

Bad box file

unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 1 "data/foo/all-boxes"
Bad box coordinates in boxfile string!  0 0 1044 71 0

It looks like the box data belongs to Kallimachos/1497-StultiferaNauis-GW5056/00401.gt.txt (see image). That text file looks strange in the vi editor (S<feff>ãcta dei ſpernãt poſita decreta pauoꝛe:). Removing the <feff> (U+FEFF, zero width no-break space) fixes it, but then other similar bad box coordinates are found, so there are several ground truth text files which cause similar errors. Here is a list of all files with <feff> (git status output from a German locale; geändert = modified):

    geändert:       Kallimachos/1495-DasNeuNarrenschiff-GW5049/00470.gt.txt
    geändert:       Kallimachos/1495-DasNeuNarrenschiff-GW5049/01255.gt.txt
    geändert:       Kallimachos/1497-StultiferaNauis-GW5056/00134.gt.txt
    geändert:       Kallimachos/1497-StultiferaNauis-GW5056/00401.gt.txt
    geändert:       Kallimachos/1497-StultiferaNauis-GW5056/00475.gt.txt
    geändert:       Kallimachos/1497-StultiferaNauis-GW5056/00593.gt.txt
    geändert:       Kallimachos/1497-StultiferaNauis-GW5056/00808.gt.txt
    geändert:       Kallimachos/1497-StultiferaNauis-GW5056/01311.gt.txt
    geändert:       Kallimachos/1497-StultiferaNauis-GW5056/01323.gt.txt

Encoding failure

Tesseract fails to encode some of the ground truth texts.

This is caused by a mismatch between the Unicode characters in those ground truth texts and the generated unicharset. The Unicode characters in the ground truth are not normalized, and neither are those in the derived box and lstmf files, while those in the unicharset are normalized.

There are 771 unnormalized ground truth texts in the GT4HistOCR data set:

  • dta19/1853-rosenkranz_aesthetik/03221.gt.txt
  • dta19/1879-vischer_auch02/03739.gt.txt
  • Kallimachos/1488-Heiligenleben-GWM11407/ (many)
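The normalize.py script used earlier is not reproduced here; its core operation is presumably Unicode NFC normalization, which can be sketched as follows (hypothetical file names; the real script may apply additional replacements):

```shell
# Normalize GT texts to Unicode NFC in place (a sketch; the normalize.py
# used above may behave differently).
python3 -c '
import sys, unicodedata
for path in sys.argv[1:]:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(unicodedata.normalize("NFC", text))
' file1.gt.txt file2.gt.txt
```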

Missing ground truth

There is only an empty ground truth text file for dta19/1819-goerres_revolution/01305. Either remove that data or enter the missing text Welt und Leben zu beherrschen wissen ; nicht Feldher⸗.

Tesseract fails to create lstmf files

Tesseract 4.0.0 does not create an lstmf file for 363 images when training runs with the default settings, for example EarlyModernLatin/1668-Leviathan-Hobbes/00045. There is no error message; the training silently ignores those ground truth files. Setting PSM=13 allows building the missing files.

Inherited characters

Inherited characters (Unicode script "Inherited", used for combining characters) occur in some ground truth texts. Tesseract may not handle them correctly; at least it complains about a missing file Inherited.unicharset:

$ combine_lang_model --input_unicharset data/foo/unicharset --script_dir data --output_dir data --lang foo
Loaded unicharset of size 305 from file data/foo/unicharset
Setting unichar properties
Other case T᷑ of t᷑ is not in unicharset
Other case Õ of õ is not in unicharset
Other case Ẽ of ẽ is not in unicharset
[...]
Other case H̃ of h̃ is not in unicharset
Setting script properties
Failed to load script unicharset from:data/Inherited.unicharset
Warning: properties incomplete for index 14 = ꝙ
Warning: properties incomplete for index 23 = t᷑
Warning: properties incomplete for index 39 = qᷓ
Warning: properties incomplete for index 51 = ꝗ
Warning: properties incomplete for index 54 = ꝗᷓ
Warning: properties incomplete for index 57 = ꝶ
[...]

The transcription for EarlyModernLatin/1476-SpeculumNaturale-Beauvais/01737.bin.png contains a lone combining tilde (U+0303), which can cause a wrong unicharset for Tesseract. It also looks like a transcription error, so removing that character is suggested.

Mismatch between images and transcriptions

It might be possible to evaluate the existing ground truth by manually inspecting ground truth text lines which differ strongly from the corresponding OCR result.

Harmonized transcriptions

dta19 uses the double oblique hyphen ⸗ extensively as a separator, for example at line endings. Other subcorpora use - or =. So the same glyph currently gets at least three different transcriptions.

The ground truth texts of GT4HistOCR are partially harmonized; for example, all I characters were replaced by J. Is that really required for OCR of Fraktur texts? It might be bad for those parts which use Antiqua letters, where I and J are clearly distinct, and it is also unwanted for Roman numerals. Maybe this harmonization should be reverted.

Missing characters

The following characters are used by AustrianNewspapers, but missing in the GT4HistOCR dataset:

# % + ~ ± ´ ˙ • … ∆ ═ □ ● ⅓ ⅕ ⅛ ⅜ ⅚ ⅝ ⅞ ʞ ɔ Š ž Ž

At least the first few are relevant for a good general model.

Questions

Examples

https://dfg-viewer.de/show?tx_dlf[id]=https%3A%2F%2Fdigi.bib.uni-mannheim.de%2Ffileadmin%2Fdigi%2F1688775706%2Ftess.xml

License

GT4HistOCR was published under CC-BY 4.0 (see paper) or CC-BY-SA 4.0 (see README). It is unclear what this implies for OCR models which were trained using that data set, or for text which was recognized using such models.