Large images cause excessive memory usage #900

Balearica · 2024-03-11T19:23:48Z

Overview

Tesseract.js currently accepts any valid image and does not downsize large images. Additionally, while the memory allocated for the WebAssembly "heap" can grow when needed, it cannot shrink. Taken together, these behaviors can cause issues for applications that run recognition on arbitrary user input. A single excessively large image can cause the allocated memory to expand, and the worker will keep using that larger amount of memory for the rest of its lifespan. This is especially problematic in cases where schedulers are used with 4+ workers.
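The grow-only behavior is a property of WebAssembly memory in general, and can be seen without Tesseract.js at all. The sketch below uses the standard `WebAssembly.Memory` API directly; the page counts are arbitrary, and it assumes the Tesseract.js heap behaves like any other WebAssembly memory in this respect.

```javascript
// WebAssembly linear memory is sized in pages of 64 KiB.
const PAGE_BYTES = 64 * 1024;

// Start with 1 page and allow growth up to 100 pages (arbitrary numbers).
const memory = new WebAssembly.Memory({ initial: 1, maximum: 100 });
const before = memory.buffer.byteLength;

// Simulate one large allocation by growing 50 pages.
// grow() returns the previous size, in pages.
const previousPages = memory.grow(50);
const after = memory.buffer.byteLength;

console.log({ before, after, previousPages });

// The API offers grow() but nothing to shrink the memory again.
// The only way to give the memory back is to discard the whole instance,
// which for Tesseract.js means terminating the worker.
console.log(typeof memory.shrink); // "undefined"
```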

Solutions

Individual Projects

Individual projects can mitigate this by checking the size of images before sending them to Tesseract.js. An excessively large image can then be rejected or downsized.
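A minimal sketch of such a check follows. It only decides what to do with an image of known dimensions. Reading the dimensions and doing the actual resize are left to whatever image library the project already uses. The function name `planImage`, the option names, and the 12-megapixel default are all hypothetical and not part of Tesseract.js.

```javascript
// Decide whether an image should be accepted, rejected, or downsized
// before it is handed to Tesseract.js. Hypothetical helper, not a library API.
function planImage(width, height, { maxPixels = 12000000, onOversize = 'downsize' } = {}) {
  if (!Number.isInteger(width) || !Number.isInteger(height) || width <= 0 || height <= 0) {
    throw new RangeError('width and height must be positive integers');
  }

  const pixels = width * height;
  if (pixels <= maxPixels) {
    return { action: 'accept', width, height };
  }
  if (onOversize === 'reject') {
    return { action: 'reject', width, height };
  }

  // Scale both sides by the same factor so the pixel count lands at or
  // below maxPixels while the aspect ratio is preserved.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    action: 'downsize',
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}

console.log(planImage(2000, 1500));
// { action: 'accept', width: 2000, height: 1500 }
console.log(planImage(8000, 6000));
// { action: 'downsize', width: 4000, height: 3000 }
console.log(planImage(8000, 6000, { onOversize: 'reject' }));
// { action: 'reject', width: 8000, height: 6000 }
```

Limiting the total pixel count rather than a single side keeps long, narrow scans usable while still bounding the memory one image can claim.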

Additionally, if Tesseract.js is being run on Node.js for hours on end within server code, the workers should be killed and recreated every so often. While workers are reusable, and should not be created and killed for every image recognized, there are disadvantages to using them forever. As noted above, memory use can only expand over time, so a single large image will permanently increase the memory footprint of a worker. Additionally, workers "learn" over time by default, editing their internal dictionaries based on words recognized in documents. This is useful within the context of a single document or a group of similar documents, but it is not necessarily desirable when recognizing hundreds of unrelated documents. Re-creating the worker resets the dictionary.
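One way to sketch this recycling is a thin wrapper that counts jobs and replaces the worker after a fixed number of them. The class name `RecyclingWorker`, the `maxJobs` parameter, and its default of 100 are hypothetical. The worker factory is passed in, so the only assumption about the worker itself is that it exposes `recognize()` and `terminate()`, as Tesseract.js workers do.

```javascript
// Wraps a worker and replaces it after `maxJobs` recognitions.
// Hypothetical helper, not part of Tesseract.js.
class RecyclingWorker {
  constructor(createWorkerFn, maxJobs = 100) {
    this.createWorkerFn = createWorkerFn; // async function returning a worker
    this.maxJobs = maxJobs;
    this.worker = null;
    this.jobsDone = 0;
    this.queue = Promise.resolve(); // runs jobs one at a time
  }

  recognize(image) {
    const run = this.queue.then(() => this._recognize(image));
    // Keep the queue usable even if this job fails.
    this.queue = run.catch(() => {});
    return run;
  }

  async _recognize(image) {
    if (this.worker === null) {
      this.worker = await this.createWorkerFn();
      this.jobsDone = 0;
    }
    try {
      return await this.worker.recognize(image);
    } finally {
      this.jobsDone += 1;
      if (this.jobsDone >= this.maxJobs) {
        // Terminating frees the grown heap and resets the learned dictionary;
        // the next call creates a fresh worker.
        const old = this.worker;
        this.worker = null;
        await old.terminate();
      }
    }
  }

  async terminate() {
    await this.queue; // let pending jobs finish first
    if (this.worker !== null) {
      const old = this.worker;
      this.worker = null;
      await old.terminate();
    }
  }
}

// Usage with Tesseract.js might look like this (assuming a version where
// createWorker accepts the language directly):
//   const { createWorker } = require('tesseract.js');
//   const ocr = new RecyclingWorker(() => createWorker('eng'), 500);
//   const { data } = await ocr.recognize('page.png');
```

Counting jobs is the simplest trigger. A time-based limit, or recycling right after an unusually large image, would fit the same structure.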

Tesseract.js

Eventually, Tesseract.js should automatically downsize images that are over a certain size. This size should be configurable by the user.

rohitsahu-bstack · 2024-06-27T06:24:15Z

"if Tesseract.js is being run on Node.js for hours on end within server code, the workers should be killed and recreated every so often."

@Balearica
I feel that this should be included in the documentation. It would save a lot of time for developers who are trying to reuse workers on a server and would otherwise run into memory consumption issues.

Balearica · 2024-06-28T22:55:01Z

@rohitsahu-bstack Good suggestion, I added a new section explaining this case. https://github.com/naptha/tesseract.js/blob/master/docs/workers_vs_schedulers.md#reusing-workers-in-nodejs-server-code

Balearica mentioned this issue Mar 11, 2024

possibility to capture stderr #898

Closed

Balearica added a commit that referenced this issue Jun 28, 2024

Update workers_vs_schedulers.md per #900

e8919a7
