Legacy model does not work for indic and arabic scripts due to Legacy data being removed #931

Balearica · 2024-06-17T17:54:20Z

Tesseract.js uses two sets of language data by default. When the oem is set to the default (LSTM only), integerized versions of tessdata_best (LSTM only data) are used. When oem is set to Legacy or LSTM with Legacy fallback, files from tessdata are used, which generally contain both the integerized version of tessdata_best and data for the Legacy model.
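To make the selection rule above concrete, here is a minimal sketch of the logic as described in this issue. It is not the actual Tesseract.js source. The enum values mirror Tesseract's engine mode numbering, but the function names and the `"tessdata_best_int"` / `"tessdata"` labels are hypothetical and used only for illustration.

```typescript
// Numeric values mirror Tesseract's OCR engine modes.
enum OEM {
  TESSERACT_ONLY = 0,          // Legacy engine only
  LSTM_ONLY = 1,               // LSTM only (the default described above)
  TESSERACT_LSTM_COMBINED = 2, // LSTM with Legacy fallback
}

// Hypothetical labels for the two language data sets (assumption, not library names).
type LangDataSet = "tessdata_best_int" | "tessdata";

// True when the requested engine mode needs Legacy model data to be present.
function needsLegacyData(oem: OEM): boolean {
  return oem === OEM.TESSERACT_ONLY || oem === OEM.TESSERACT_LSTM_COMBINED;
}

// Pick the data set: tessdata only when Legacy was explicitly requested,
// otherwise the integerized tessdata_best (LSTM-only) files.
function selectLangDataSet(oem: OEM): LangDataSet {
  return needsLegacyData(oem) ? "tessdata" : "tessdata_best_int";
}

console.log(selectLangDataSet(OEM.LSTM_ONLY));      // tessdata_best_int
console.log(selectLangDataSet(OEM.TESSERACT_ONLY)); // tessdata
```

Under this rule, a request for the Legacy model always resolves to the tessdata files, so those files must actually contain Legacy data for the request to succeed.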

While this generally works, it looks like the Legacy model data was removed from the files for several languages in the tessdata repo. This appears to have been motivated purely by the fact that the Legacy data is large and does not perform well compared to the LSTM models.

tesseract-ocr/tessdata#90

These justifications do not make sense for our case, so the Tesseract.js data should be modified to add the Legacy data back. Based on the PR linked above, it looks like most users should not be using the Legacy model for these languages, as the LSTM model is both much smaller and performs much better. However, within Tesseract.js, we only load the files from tessdata if the user specifically requested the Legacy model. If the user sets the oem to Legacy, we need to load Legacy data.

Note that this issue is specific to cases where both Legacy and LSTM language data existed but the Legacy data was later removed. There are other languages where data for one model never existed in the first place, and those will remain broken.


Balearica mentioned this issue Jun 17, 2024

The issue encountered when add the Thai language. scribeocr/scribeocr#42

Open
