worker.recognize memory leak #678

Closed
Balearica opened this issue Oct 8, 2022 · 4 comments

Comments

@Balearica
Member

A memory leak exists in worker.recognize. When the same worker is used to recognize many files, memory usage slowly increases until the program eventually crashes. To reproduce the issue, run the benchmark.js example with i < 10 changed to i < 100. On my system this fails on the 84th recognition job.
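A rough sketch of how the growth could be observed while reproducing this, assuming Node.js and a worker object that exposes an async `recognize(image)` method. The function name `recognizeRepeatedly` and the `jobs` parameter are illustrative and not part of benchmark.js:

```javascript
// Runs the same recognition job many times on one worker and records the
// process's resident set size (in MiB) after each job.
// `worker` is assumed to expose an async recognize(image) method.
async function recognizeRepeatedly(worker, image, jobs = 100) {
  const samples = [];
  for (let i = 0; i < jobs; i += 1) {
    await worker.recognize(image);
    // A value that keeps climbing from job to job is consistent with a leak.
    samples.push(process.memoryUsage().rss / (1024 * 1024));
  }
  return samples;
}
```

Printing the returned samples should make a steady upward trend visible well before the crash occurs.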

@Balearica
Member Author

I investigated the memory leak and confirmed that it is caused by the setImage function. Repeatedly running image recognition without setImage does not leak memory, whereas repeatedly running setImage without recognition does.
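The isolation approach described here can be sketched as a small harness that runs each suspect step on its own and compares memory growth. This assumes Node.js. The name `compareMemoryGrowth` and the step callbacks are hypothetical, and how setImage or recognition is invoked in isolation depends on the internals being tested:

```javascript
// Runs each named step `iterations` times in isolation and reports how much
// the resident set size changed (in bytes) while that step was running.
// `steps` maps a label to a sync or async function; the labels are arbitrary.
async function compareMemoryGrowth(steps, iterations = 50) {
  const growth = {};
  for (const [name, step] of Object.entries(steps)) {
    const before = process.memoryUsage().rss;
    for (let i = 0; i < iterations; i += 1) {
      await step();
    }
    growth[name] = process.memoryUsage().rss - before;
  }
  return growth;
}
```

Resident set size is noisy, so only a large and repeatable difference between steps should be read as evidence.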

I was unable to figure out how to adapt the existing code (which creates the image object on the JavaScript side) to avoid this issue. However, writing the image to the filesystem and then reading it into a pix object on the C++/WebAssembly side resolved the issue entirely. Therefore, unless another user figures out how to patch the current code, I plan to use this solution.
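A minimal sketch of the shape of that workaround, not the actual patch. It assumes `fs` follows the Emscripten file system API (`writeFile`, `unlink`). `readPixFromPath` is a hypothetical stand-in for whatever C++/WebAssembly routine reads the file into a pix object, and the `/input` path and the cleanup step are also assumptions:

```javascript
// Hands the image to the WebAssembly side through the virtual filesystem
// instead of building the image object in JavaScript.
// `fs`: assumed Emscripten-style FS object (writeFile, unlink).
// `readPixFromPath`: hypothetical synchronous callback that reads the file
// on the C++/WebAssembly side and returns a handle to the resulting pix.
function setImageViaFile(fs, readPixFromPath, imageBytes, path = '/input') {
  fs.writeFile(path, imageBytes);
  try {
    return readPixFromPath(path);
  } finally {
    // Remove the temporary file so repeated calls do not accumulate data.
    fs.unlink(path);
  }
}
```

With this approach the image data lives in the virtual filesystem only for the duration of the call, and allocation of the pix object happens entirely on the native side.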

@Balearica
Member Author

@engageaffli The upcoming changes in version 4 (which includes this fix) are explained in #662. You should review them, as that version contains additional changes.

To use version 4 while it is in development, you would just need to (1) replace the contents of your current Tesseract.js folder with the dev/v4 branch of this repo and (2) replace the contents of your current Tesseract.js-core folder with the dev/v4 branch of the Tesseract.js-core repo.

Balearica added a commit that referenced this issue Nov 25, 2022
See #662 for an explanation of the Tesseract.js Version 4 changes. The list below is auto-generated from commits.

* Added image preprocessing functions (rotate + save images)

* Updated createWorker to be async

* Reworked createWorker to be async and throw errors per #654

* Edited detect to return null when detection fails rather than throwing error per #526

* Updated types per #606 and #580 (#663) (#664)

* Removed unused files

* Added savePDF option to recognize per #488; cleaned up code for linter

* Updated download-pdf example for node to use new savePDF option

* Added OutputFormats option/interface for setting output

* Allowed for Tesseract parameters to be set through recognition options per #665

* Updated docs

* Edited loadLanguage to no longer overwrite cache with data from cache per #666

* Added interface for setting 'init only' options per #613

* Wrapped caching in try block per #609

* Fixed unit tests

* Updated setImage to resolve memory leak per #678

* Added debug output option per #681

* Fixed bug with saving images per #588

* Updated examples

* Updated readme and Tesseract.js-core version

@Balearica
Member Author

Closing as this should be resolved in Version 4.
