Reduce file sizes #806

Balearica · 2023-08-21T07:33:03Z

The amount of data loaded by Tesseract.js is quite large. For example, with default settings, a new user ends up downloading 15.34 MB of JavaScript and language data before recognition can run (not taking compression into account). While this is largely mitigated by caching language data after it is first downloaded (and should not be an issue for Node users at all), this amount of data likely annoys first-time browser users.

File	Size
tesseract.min.js	0.07 MB
worker.min.js	0.13 MB
tesseract-core-simd.wasm.js	4.74 MB
eng.traineddata.gz	10.4 MB
total	15.34 MB

We should investigate whether this can be reduced without significant tradeoffs (e.g. runtime increase, dropping support for file formats, etc.).

Balearica · 2023-08-21T15:00:17Z

Upon investigation, the primary change that would reduce file size is removing the Legacy engine and corresponding .traineddata (by default). As the vast majority of users do not use the Legacy model, and it takes up a significant amount of space, this should be opt-in rather than opt-out.

English - 54% Total Reduction

File	Size [LSTM + Legacy]	Size [LSTM Only]
tesseract.min.js	0.07 MB	0.07 MB
worker.min.js	0.13 MB	0.13 MB
tesseract-core-simd.wasm.js	4.74 MB	3.95 MB
eng.traineddata.gz	10.4 MB	2.95 MB
total	15.34 MB	7.1 MB

Chinese (Simplified) - 77% Total Reduction

File	Size [LSTM + Legacy]	Size [LSTM Only]
tesseract.min.js	0.07 MB	0.07 MB
worker.min.js	0.13 MB	0.13 MB
tesseract-core-simd.wasm.js	4.74 MB	3.95 MB
chi_sim.traineddata.gz	20.2 MB	1.72 MB
total	25.14 MB	5.87 MB
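
As a quick check on the totals, the reduction can be recomputed from the per-file sizes in the two tables. This is only a sketch: the `sizes` object and `totalReduction` helper are names made up for illustration, and the numbers are copied from the tables above. Computed this way, the totals come out to roughly 54% for English and 77% for Chinese (Simplified).

```javascript
// Sizes in MB, copied from the tables above: [LSTM + Legacy, LSTM only].
const sizes = {
  eng: {
    'tesseract.min.js': [0.07, 0.07],
    'worker.min.js': [0.13, 0.13],
    'tesseract-core-simd.wasm.js': [4.74, 3.95],
    'eng.traineddata.gz': [10.4, 2.95],
  },
  chi_sim: {
    'tesseract.min.js': [0.07, 0.07],
    'worker.min.js': [0.13, 0.13],
    'tesseract-core-simd.wasm.js': [4.74, 3.95],
    'chi_sim.traineddata.gz': [20.2, 1.72],
  },
};

// Sum both columns and express the saving as a percentage of the original total.
function totalReduction(files) {
  let before = 0;
  let after = 0;
  for (const [withLegacy, lstmOnly] of Object.values(files)) {
    before += withLegacy;
    after += lstmOnly;
  }
  return { before, after, percent: (1 - after / before) * 100 };
}

for (const [lang, files] of Object.entries(sizes)) {
  const { before, after, percent } = totalReduction(files);
  console.log(
    `${lang}: ${before.toFixed(2)} MB -> ${after.toFixed(2)} MB (${percent.toFixed(1)}% smaller)`
  );
}
```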

Compiling Tesseract with different optimization settings would also reduce size by ~1.1 MB, but this makes recognition significantly slower, so it is not worth it. See these benchmarks.

Balearica · 2023-09-25T08:06:40Z

A benchmark shows this change leads to a ~50% reduction in runtime for first-time users. Numbers are shown below.

Network Speed	Before	After	% Reduction
Slow	13.9s	6.3s	55%
Med	5.6s	2.6s	54%
Fast	2.7s	1.4s	48%

Details:

- This test was conducted using the "network throttling" feature in Chrome.
  - "Slow" corresponds to 10 Mb/s + 20ms latency, "medium" corresponds to 30 Mb/s + 15ms latency, and "fast" corresponds to 100 Mb/s + 20ms latency.
- Cache was disabled and local storage was cleared.
  - This forces code and language data to be re-downloaded, emulating the experience of a first-time user (the performance impact this change will have on repeat users is marginal, as the files will already be cached).
- This file was recognized.
  - A more complex input would lead to a smaller change in percentage (although not absolute) terms, as a larger proportion of runtime would be spent on recognition.
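
For a rough sense of how much of the measured runtime is plain download time, the transfer time can be estimated from the throttling bandwidths and the size totals. This is a back-of-the-envelope sketch, not part of the benchmark: it assumes the English defaults were loaded (15.34 MB before, 7.1 MB after), and it ignores latency, compression, WebAssembly compilation, and recognition itself. The names `profiles`, `payloadMB`, and `transferSeconds` are made up for illustration.

```javascript
// Bandwidth of each throttling profile in megabits per second (from the details above).
const profiles = { slow: 10, medium: 30, fast: 100 };

// Total download size in megabytes, assuming the English defaults.
const payloadMB = { before: 15.34, after: 7.1 };

// Convert megabytes to megabits (x8) and divide by the link speed.
function transferSeconds(megabytes, megabitsPerSecond) {
  return (megabytes * 8) / megabitsPerSecond;
}

for (const [name, mbps] of Object.entries(profiles)) {
  const before = transferSeconds(payloadMB.before, mbps);
  const after = transferSeconds(payloadMB.after, mbps);
  console.log(`${name}: ~${before.toFixed(2)}s before, ~${after.toFixed(2)}s after (transfer only)`);
}
```

On the slow profile this gives about 12.3s and 5.7s, close to the measured 13.9s and 6.3s, which is consistent with download time dominating there. On the fast profile the estimate is a much smaller share of the measured time, so fixed costs matter more.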

Balearica · 2023-09-28T06:38:40Z

Closing as completed. As of v5, by default only the LSTM code and data are loaded.
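
For anyone who still needs the Legacy model, a minimal sketch of opting back in is below. It assumes the v5 `createWorker(langs, oem, options)` signature and the `legacyCore` / `legacyLang` options as described in the Tesseract.js API docs; check those docs for your exact version. The helper name `createLegacyCapableWorker` is hypothetical, and `createWorker` is passed in as an argument so the sketch does not assume how the library is loaded.

```javascript
// Options that ask Tesseract.js v5 to load Legacy-capable code and language data.
// These are larger downloads than the LSTM-only defaults.
const legacyOptions = { legacyCore: true, legacyLang: true };

// OCR engine mode 0 selects the Legacy engine only (1 would be LSTM only).
const OEM_LEGACY_ONLY = 0;

// Hypothetical helper: builds a worker that can run the Legacy model.
async function createLegacyCapableWorker(createWorker, lang = 'eng') {
  return createWorker(lang, OEM_LEGACY_ONLY, legacyOptions);
}
```

Usage would look like `const worker = await createLegacyCapableWorker(require('tesseract.js').createWorker);`.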

lmk123 · 2023-09-28T09:06:28Z

How do I use the new language data?

Currently I have the langPath set to "https://tessdata.projectnaptha.com/4.0.0_best", but I noticed that https://tessdata.projectnaptha.com/4.0.0_best/chi_sim.traineddata.gz is an 11.4 MB download, not the 1.72 MB you listed.

Balearica · 2023-09-28T18:56:54Z

@lmk123 If you use Tesseract.js v5 and do not set langPath, the new language data will be loaded automatically. If you wish to self-host the language data, then you should create a directory on your site with the language data files found here, and set langPath to that directory.

Some insight as to why the language data files are different sizes:

- The default data in Tesseract.js v5 comes from the 4.0.0_best_int directory.
  - This contains an integerized version of the tessdata_best data for LSTM, and no data for Legacy.
  - English is ~2.8 MB, Chinese is ~1.6 MB.
- The default data in Tesseract.js v4 comes from the 4.0.0 directory.
  - This contains an integerized version of the tessdata_best data for LSTM, as well as data for Legacy.
  - English is ~10 MB, Chinese is ~19 MB.
- The data you are currently using is from the 4.0.0_best directory.
  - This contains a non-integerized version of the tessdata_best data for LSTM, and no data for Legacy.
  - English is ~12 MB, Chinese is ~11 MB.
    - The non-integerized versions of the data are significantly larger, despite (purportedly) having minimal impact on recognition accuracy.
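
To make the self-hosting option concrete, here is a minimal sketch that points a v5 worker at a self-hosted copy of the 4.0.0_best_int files. The URL is a placeholder, and `recognizeWithSelfHostedData` is a hypothetical helper; only `createWorker(langs, oem, options)`, the `langPath` option, `worker.recognize`, and `worker.terminate` are Tesseract.js calls. `createWorker` is passed in so the sketch does not assume how the library is loaded.

```javascript
// Hypothetical location of the self-hosted files; replace with your own directory.
// Judging by the URLs earlier in this thread, the worker should request
// `<langPath>/<lang>.traineddata.gz`.
const SELF_HOSTED_LANG_PATH = 'https://example.com/tessdata/4.0.0_best_int';

async function recognizeWithSelfHostedData(createWorker, image, lang = 'chi_sim') {
  // Arguments: language(s), OCR engine mode (1 = LSTM only), worker options.
  const worker = await createWorker(lang, 1, { langPath: SELF_HOSTED_LANG_PATH });
  try {
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    // Always release the worker, even if recognition throws.
    await worker.terminate();
  }
}
```

Usage would look like `const text = await recognizeWithSelfHostedData(require('tesseract.js').createWorker, 'page.png');`. Leaving `langPath` unset keeps the v5 default data instead.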

lmk123 · 2023-09-30T05:50:51Z

Thanks for the detailed explanation, I figured it out.

Can you please put 4.0.0_best_int on "https://tessdata.projectnaptha.com/" as well? I found that "https://tessdata.projectnaptha.com/4.0.0_best_int/eng.traineddata.gz" returns a 404.

I ask because cdn.jsdelivr.net doesn't work in my country, but I tested https://tessdata.projectnaptha.com and it works.

Balearica · 2023-09-30T06:06:59Z

@lmk123 Unfortunately, the fact that https://tessdata.projectnaptha.com/ is not updating to include these files is not trivial to fix. That is a GitHub Pages site configured to host that entire repo. At some point that repo passed GitHub's file size limit, so it stopped updating. This was part of the reason why the default was changed to use jsDelivr.

What country are you in? I was not aware that jsDelivr was blocked in certain countries. Should probably implement some sort of fallback mechanism if there are regional issues I was not aware of.

lmk123 · 2023-09-30T06:22:03Z

> What country are you in?

China. The following picture (source) shows connection results for cdn.jsdelivr.net in China; most of them are red, which means the connection failed.

> at some point that repo passed GitHub's file size limit

Maybe put 4.0.0_best_int in a separate repository?

ivysrono · 2024-03-03T03:42:11Z

> What country are you in? I was not aware that jsDelivr was blocked in certain countries. Should probably implement some sort of fallback mechanism if there are regional issues I was not aware of.

We can download the data manually, but where should we put the files on a PC?

By the way, could you convert the data to pure JS so we can store it anywhere?

Balearica · 2024-03-03T04:06:50Z

The fact that Tesseract.js uses the jsDelivr CDN by default is unrelated to the reduction of file sizes, which is the topic of this issue. I created issue #899 for discussing the topic of jsDelivr not working in China, so discussion about the CDN should move there.

This was referenced Aug 24, 2023

Improve Progress Logs #598

Closed

Use smaller .traineddata files by default #750

Closed

Balearica added this to the v5.0 milestone Aug 30, 2023

Balearica mentioned this issue Sep 1, 2023

Version 5 Changes #820

Open

Balearica closed this as completed Sep 28, 2023

naptha locked as off-topic and limited conversation to collaborators Mar 3, 2024
