Reduce file sizes #806
Upon investigation, the primary change that would reduce file size is removing the Legacy engine and the corresponding language data:
English - 54% Total Reduction
Chinese (Simplified) - 73% Total Reduction
Compiling Tesseract with different optimization settings would also significantly reduce size (by ~1.1 MB); however, this makes recognition significantly slower, so it is not worth it. See these benchmarks.
A benchmark shows this change leads to a ~50% reduction in runtime for first-time users. Numbers are shown below.
Details:
Closing as completed. As of v5, only the LSTM code and data are loaded by default.
How do I use the new language data? Currently I have the
@lmk123 If you use Tesseract.js v5 and do not set
Some insight as to why the language data files are different sizes:
Thanks for the detailed explanation, I figured it out. Can you please put
This is because
@lmk123 Unfortunately, the fact that https://tessdata.projectnaptha.com/ is not updating to include these files is not trivial to fix. That is a GitHub Pages site configured to host that entire repo. At some point the repo passed GitHub's file size limit, so the site stopped updating. This was part of the reason why the default was changed to use jsDelivr. What country are you in? I was not aware that jsDelivr was blocked in certain countries. We should probably implement some sort of fallback mechanism if there are regional issues I was not aware of.
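For illustration, the fallback mechanism mentioned above could look roughly like the sketch below. This is not existing Tesseract.js code: `fetchWithFallback` and its parameters are hypothetical names, and the list of base URLs is whatever mirrors a project chooses to trust. The fetch function is passed in so the logic can be exercised without network access.

```javascript
// Hypothetical helper (not part of Tesseract.js): try each base URL in order
// and return the first successful download.
// `fetchFn` is injected; in a browser or Node 18+ it could be the global fetch.
async function fetchWithFallback(fileName, baseUrls, fetchFn) {
  const errors = [];
  for (const base of baseUrls) {
    // Drop trailing slashes so the joined URL has exactly one separator.
    const url = `${base.replace(/\/+$/, '')}/${fileName}`;
    try {
      const response = await fetchFn(url);
      if (response.ok) {
        return { url, response };
      }
      // The server answered, but with an error status: try the next source.
      errors.push(`${url}: HTTP ${response.status}`);
    } catch (err) {
      // Network-level failure, e.g. the host is blocked or unreachable.
      errors.push(`${url}: ${err.message}`);
    }
  }
  throw new Error(`All sources failed:\n${errors.join('\n')}`);
}
```

The trade-off of trying sources one after another is that users in a blocked region pay the timeout of the first source before the second is attempted, so the order of `baseUrls` matters.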
China. The following picture (source) shows the connection of
Maybe put 4.0.0_best_int in a separate repository?
We can download the data manually, but where should we put the file on the PC? By the way, could you convert the data to pure JS so we can store it anywhere?
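On the question of where to put a manually downloaded file: as far as I know, Tesseract.js does not require a fixed location, and the `langPath` option tells the worker which directory or URL to load `.traineddata` files from. The sketch below assumes the v5 `createWorker(langs, oem, options)` signature and a self-hosted `eng.traineddata.gz`; the wrapper `recognizeWithLocalData` is a hypothetical name, and `createWorker` is passed in so the snippet does not depend on how the library is imported.

```javascript
// Sketch only. In real use, `createWorker` would come from
// `const { createWorker } = require('tesseract.js');`.
async function recognizeWithLocalData(createWorker, image, langPath) {
  // Second argument 1 selects the LSTM-only engine mode.
  const worker = await createWorker('eng', 1, {
    langPath,   // assumed: directory or URL that contains eng.traineddata.gz
    gzip: true, // set to false if the file is stored uncompressed
  });
  try {
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    // Always release the worker, even if recognition fails.
    await worker.terminate();
  }
}
```

A call might look like `recognizeWithLocalData(createWorker, 'scan.png', 'https://example.com/tessdata')`, where both the image name and the URL are placeholders.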
The fact that Tesseract.js uses the jsDelivr CDN by default is unrelated to the reduction of file sizes, which is the topic of this issue. I created issue #899 to discuss jsDelivr not working in China, so discussion about the CDN should move there.
The amount of data loaded by Tesseract.js is quite large. For example, with default settings, a new user ends up downloading 15.34 MB of JavaScript and language data before recognition runs (not taking compression into account). While this is largely mitigated by caching language data after it is first downloaded (and should not be an issue for Node users at all), this amount of data likely causes annoyance for first-time browser users.
We should investigate whether this can be reduced without significant tradeoffs (e.g. runtime increase, dropping support for file formats, etc.).