Tesseract.js uses two sets of language data by default. When `oem` is set to the default (LSTM only), integerized versions of `tessdata_best` (LSTM-only data) are used. When `oem` is set to Legacy or LSTM with Legacy fallback, files from `tessdata` are used, which generally contain both the integerized version of `tessdata_best` and data for the Legacy model.
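The selection rule described above can be sketched as follows. This is only an illustration, not the actual Tesseract.js code: `Oem`, `DataSource`, `dataSourceForOem`, and the `"tessdata_best_int"` label are hypothetical names introduced here. The numeric values follow Tesseract's engine-mode numbering for the three modes discussed.

```typescript
// Local stand-in for the three engine modes discussed in this issue.
// The numbers follow Tesseract's engine-mode numbering; the object itself is hypothetical.
const Oem = {
  LegacyOnly: 0,
  LstmOnly: 1,
  LstmWithLegacyFallback: 2,
} as const;

type OemValue = (typeof Oem)[keyof typeof Oem];

// Hypothetical labels for the two sets of language data.
type DataSource = "tessdata_best_int" | "tessdata";

// Assumed rule, taken from the description above: LSTM only (the default)
// uses the integerized tessdata_best files; any mode that involves the
// Legacy model uses the files from tessdata.
function dataSourceForOem(oem: OemValue = Oem.LstmOnly): DataSource {
  return oem === Oem.LstmOnly ? "tessdata_best_int" : "tessdata";
}
```

Under this rule, the `tessdata` files are only ever fetched when the Legacy model is explicitly involved, which is why a `tessdata` file without Legacy data defeats the purpose of requesting it.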
While this generally works, it looks like the Legacy model was removed from the files for several languages in the `tessdata` repo. This appears to have been motivated purely by the fact that the Legacy models are large and do not perform well compared to the LSTM models.
These justifications do not apply to our case, so the Tesseract.js data should be modified to add the Legacy models back. Based on the linked PR, it looks like most users should not be using the Legacy model for these languages, as the LSTM model is both much smaller and performs much better. However, within Tesseract.js, we only load the files from `tessdata` if the user specifically requested the Legacy model. If the user sets `oem` to Legacy, we need to load Legacy data.
Note that this issue is specific to languages where both Legacy and LSTM data existed but the Legacy data was later removed. For other languages, data for one of the models never existed in the first place, and those will remain broken.
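To make the scope concrete, here is a minimal sketch that separates the fixable case from the unfixable one, shown for the Legacy model only. `LegacyStatus`, `LegacyCase`, and `classifyLegacy` are hypothetical names, and the flags are assumed inputs rather than anything Tesseract.js reports today.

```typescript
// Hypothetical per-language record; both flags are assumed inputs.
interface LegacyStatus {
  legacyEverExisted: boolean; // Legacy data was published at some point
  legacyPresentNow: boolean;  // Legacy data is in the current tessdata file
}

type LegacyCase = "ok" | "legacy-removed" | "legacy-never-existed";

// "legacy-removed" is the case this issue covers: the data can be restored.
// "legacy-never-existed" stays broken, since there is nothing to add back.
function classifyLegacy(status: LegacyStatus): LegacyCase {
  if (status.legacyPresentNow) return "ok";
  return status.legacyEverExisted ? "legacy-removed" : "legacy-never-existed";
}
```

Only languages classified as `"legacy-removed"` would be addressed by restoring the Legacy models to the Tesseract.js data.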
tesseract-ocr/tessdata#90