Large images cause excessive memory usage #900

Open
Balearica opened this issue Mar 11, 2024 · 2 comments
@Balearica
Member

Overview

Tesseract.js currently accepts any valid image and does not downsize large images. Additionally, while the memory allocated for the WebAssembly "heap" can grow if needed, it cannot shrink. Taken together, these behaviors can cause issues for applications that run recognition on arbitrary user inputs. A single excessively large image can cause the allocated memory to expand, and the worker will keep using that larger amount of memory for the rest of its lifespan. This is especially problematic when schedulers are used with 4+ workers.

Solutions

Individual Projects

Individual projects can mitigate this by checking the size of images before sending them to Tesseract.js. If an image is excessively large, it can be rejected or downsized.
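A size check along these lines might look like the following minimal sketch for Node.js. It only handles PNG input (it reads the width and height from the IHDR chunk), and the names `MAX_PIXELS`, `readPngSize`, and `planResize` are hypothetical helpers for illustration, not part of Tesseract.js. The pixel limit is an arbitrary assumption and should be tuned to the application's memory budget.

```javascript
// Hypothetical limit (an assumption, not a Tesseract.js setting).
const MAX_PIXELS = 4000 * 4000;

const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Read width and height from the IHDR chunk of a PNG held in a Buffer.
// Other formats (JPEG, WebP, ...) need their own parsing or an image library.
function readPngSize(buf) {
  if (buf.length < 24 || !buf.subarray(0, 8).equals(PNG_SIGNATURE)) {
    throw new Error('Not a PNG file');
  }
  if (buf.toString('ascii', 12, 16) !== 'IHDR') {
    throw new Error('PNG is missing its IHDR chunk');
  }
  return { width: buf.readUInt32BE(16), height: buf.readUInt32BE(20) };
}

// Decide whether an image can be used as-is or should be downsized,
// and compute target dimensions that keep the aspect ratio.
function planResize(width, height, maxPixels = MAX_PIXELS) {
  const pixels = width * height;
  if (pixels <= maxPixels) {
    return { action: 'accept', width, height };
  }
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    action: 'downsize',
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}
```

The actual downsizing step is not shown here, since it depends on whichever image library the project already uses. An application that prefers to reject oversized images can simply return an error whenever `action` is `'downsize'`.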

Additionally, if Tesseract.js is being run on Node.js for hours on end within server code, the workers should be killed and recreated every so often. Workers are reusable and should not be created and killed for every image recognized, but keeping them alive forever has disadvantages. As noted above, memory use can only grow over time, so a single large image permanently increases a worker's memory footprint. Workers also "learn" over time by default, editing their internal dictionaries based on words recognized in documents. This is useful within a single document or a group of similar documents, but it is not necessarily desirable when recognizing hundreds of unrelated documents. Re-creating the worker resets the dictionary.
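One way to recycle workers is a small wrapper that replaces the underlying worker after a fixed number of jobs. The sketch below is an illustration rather than anything provided by Tesseract.js: `createRecyclingWorker` and `maxJobs` are hypothetical names, and the default of 50 jobs is an arbitrary assumption. It expects a factory that resolves to an object with `recognize()` and `terminate()` methods, which matches the worker returned by `createWorker`.

```javascript
// Wrap a worker factory so the underlying worker is terminated and
// re-created after `maxJobs` recognitions (hypothetical helper).
function createRecyclingWorker(factory, maxJobs = 50) {
  let worker = null;
  let jobs = 0;
  let queue = Promise.resolve();

  async function run(image) {
    if (worker === null) {
      worker = await factory(); // lazily create a fresh worker
      jobs = 0;
    }
    try {
      return await worker.recognize(image);
    } finally {
      jobs += 1;
      if (jobs >= maxJobs) {
        // Drop the old worker; its memory and dictionary go with it.
        const old = worker;
        worker = null;
        await old.terminate();
      }
    }
  }

  return {
    // Jobs are serialized so a worker is never terminated mid-recognition.
    recognize(image) {
      const result = queue.then(() => run(image));
      queue = result.catch(() => {});
      return result;
    },
    // Wait for pending jobs, then shut down whatever worker is left.
    async terminate() {
      await queue;
      if (worker !== null) {
        const old = worker;
        worker = null;
        await old.terminate();
      }
    },
  };
}

// Usage with Tesseract.js, assuming the v5-style createWorker('eng'):
// const { createWorker } = require('tesseract.js');
// const ocr = createRecyclingWorker(() => createWorker('eng'), 50);
// const { data } = await ocr.recognize('page.png');
```

Recycling by job count is only one possible policy. Recycling on a timer, or after any image above a size threshold, would follow the same pattern.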

Tesseract.js

Eventually, Tesseract.js should automatically downsize images that are over a certain size. This size should be configurable by the user.

@rohitsahu-bstack

if Tesseract.js is being run on Node.js for hours on end within server code, the workers should be killed and recreated every so often.

@Balearica
I feel that this should be included in the documentation. It would save a lot of time for developers who try to reuse workers on a server and would otherwise run into memory consumption issues.
