Legacy model does not work for indic and arabic scripts due to Legacy data being removed #931

Open
Balearica opened this issue Jun 17, 2024 · 0 comments

Comments

@Balearica
Member

Tesseract.js uses two sets of language data by default. When the oem is set to the default (LSTM only), integerized versions of tessdata_best (LSTM only data) are used. When oem is set to Legacy or LSTM with Legacy fallback, files from tessdata are used, which generally contain both the integerized version of tessdata_best and data for the Legacy model.
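For illustration, the selection described above can be sketched roughly as follows. This is a minimal sketch, not the actual Tesseract.js loading code: the helper `dataSetForOem` and the strings it returns are hypothetical names made up for this example. The numeric values follow Tesseract's `OcrEngineMode` enum.

```javascript
// OEM values as defined by Tesseract's OcrEngineMode enum.
const OEM = {
  TESSERACT_ONLY: 0,          // Legacy model only
  LSTM_ONLY: 1,               // LSTM model only (the Tesseract.js default)
  TESSERACT_LSTM_COMBINED: 2, // LSTM with Legacy fallback
};

// Hypothetical helper: reports which set of language data a given oem needs.
// The returned strings are labels for this sketch, not real URLs or paths.
function dataSetForOem(oem) {
  const needsLegacy =
    oem === OEM.TESSERACT_ONLY || oem === OEM.TESSERACT_LSTM_COMBINED;
  // Files from tessdata are only used when the Legacy model is involved;
  // otherwise the smaller integerized tessdata_best files are enough.
  return needsLegacy ? 'tessdata' : 'tessdata_best-int';
}
```

The point of the sketch is that the tessdata files are only ever requested on the Legacy paths, so a tessdata file without Legacy data serves no purpose for Tesseract.js.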

While this generally works, it looks like the Legacy model was removed from several languages in the files in the tessdata repo. This appears to have been motivated purely by the fact that these files are large and do not perform well compared to the LSTM models.

tesseract-ocr/tessdata#90

These justifications do not make sense for our case, so the Tesseract.js data should be modified to add the Legacy data back. Based on the PR linked above, it looks like most users should not be using the Legacy model for these languages, as the LSTM model is both much smaller and performs much better. However, within Tesseract.js, we only load the files from tessdata if the user specifically requested the Legacy model. If the user sets the oem to Legacy, we need to load Legacy data.

Note that this issue is specific to languages where both Legacy and LSTM language data existed but the Legacy data was later removed. There are other languages where data for one model never existed in the first place; those will remain broken.
