Tesseract.js Bug on IBM i Server #930

Open
gregorysababady opened this issue Jun 12, 2024 · 9 comments

@gregorysababady

gregorysababady commented Jun 12, 2024

Environment
I am using tesseract.js@5.1.0 running on nodejs (v20.11.1)

My issue
I get the error TypeError [Error]: Arguments to path.resolve must be string...Emitted 'error' event on Worker instance at... whenever my application reaches the createWorker call inside the performOCR function:

const fs = require("fs");
const { createWorker } = require("tesseract.js");

performOCR(fs.readFileSync(filePath));

async function performOCR(imageFile) {
  const worker = await createWorker("eng"); // <--- the error occurs here
  const ret = await worker.recognize(imageFile);
  await worker.terminate();
  console.log(ret.data.text);
}

Package.json
{
  "name": "file-reader",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "start": "node backend/index.js",
    "dev": "nodemon backend/index.js",
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "description": "",
  "dependencies": {
    "cors": "^2.8.5",
    "dotenv": "^16.4.5",
    "express": "^4.19.2",
    "ibm_db": "^3.2.4",
    "idb-pconnector": "^1.1.1",
    "multer": "^1.4.5-lts.1",
    "mysql2": "^3.10.0",
    "pdf-parse": "^1.1.1",
    "tesseract.js": "^5.1.0",
    "tesseract.js-core": "^5.1.0"
  }
}

Specs

  • IBM i OS

I tried removing and reinstalling tesseract.js and tesseract.js-core, and using an image path instead of a stream, but I still get the exact same error. Any help would be much appreciated!

@Balearica
Member

Do you have reason to believe that this issue is specific to IBM i, or is that simply extra context? Can you provide an example repo (or standalone file) that is sufficient to reproduce this error? Since await createWorker("eng") should run on its own, the snippet above is not enough to go on.

On a completely unrelated note, it is generally inadvisable to create new workers within the function that runs recognition within real applications. The reason is that, in addition to creating overhead every time the function is run, there is no limit to the number of workers that can end up being created. As a result, you either end up running 1 recognition job at a time to be safe (which is slow), or allowing for an unlimited number of jobs to run at the same time, which can crash your application.

The recommended approach is to create a scheduler. This allows you to define a fixed number of workers (say, 4) that persist between jobs, and use them to run recognition in parallel. See this guide for an explanation.
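As a rough illustration of the scheduler approach described above, here is a minimal sketch. It assumes the createScheduler, createWorker, addWorker, addJob and terminate calls from the Tesseract.js API; the createOcrPool helper, its parameters, and the choice to pass the module in as an argument are hypothetical and only serve to keep the example self-contained.

```javascript
// Sketch of a fixed-size OCR pool (createOcrPool is a hypothetical helper).
// The tesseract module is passed in as an argument; in a real application
// you would call createOcrPool(require('tesseract.js'), 4).
async function createOcrPool(tesseract, size = 4, lang = 'eng') {
  const scheduler = tesseract.createScheduler();

  // Create the workers once, up front, so they persist between jobs.
  for (let i = 0; i < size; i++) {
    scheduler.addWorker(await tesseract.createWorker(lang));
  }

  return {
    // Queue a recognition job; the scheduler hands it to an idle worker.
    async recognizeText(image) {
      const { data } = await scheduler.addJob('recognize', image);
      return data.text;
    },
    // Terminate every worker when the application shuts down.
    terminate: () => scheduler.terminate(),
  };
}
```

With this shape, the pool is created once at application startup and each request only calls recognizeText, so the number of workers stays bounded no matter how many jobs are queued.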

@gregorysababady
Author

Repository
https://github.com/gregorysababady/filereader

I already tried it on a Windows server and it works just fine.

Thanks for the advice. Yes, I definitely need to correct this!

@Balearica
Member

Several of the dependencies in your repo's package.json don't install on Linux. Regardless, a more minimal example would be more useful for determining if there is a platform-specific issue with IBM i machines. Can you try running the basic example from the README on your system?

import { createWorker } from 'tesseract.js';

(async () => {
  const worker = await createWorker('eng');
  const ret = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(ret.data.text);
  await worker.terminate();
})();

If the issue is with createWorker not being able to run on this platform, then this basic example code should fail. If it runs properly, then I think the issue is something else in your codebase. Note that the function path.resolve is not actually used in this repo (outside of example scripts and build code), so it is unclear to me how your code could be failing at the createWorker step with this error message.

@gregorysababady
Author

This is not working either; I get the same path.resolve error.

On Windows it works just fine.

The issue is coming from the tesseract.js-core package:
[Screenshot: Issue_capture]

@Balearica
Member

Thanks for confirming. Unless there is something particular about your settings, it sounds like there is indeed some platform-specific issue with IBM i.

It looks like the code in question is not originally from either the Tesseract.js or Tesseract.js-core repos, but rather is code added by Emscripten, the compiler used to go from C/C++ to WebAssembly.

https://github.com/emscripten-core/emscripten/blob/0c504193efb3d0b51d30c07895544b29cbad1950/src/library_path.js#L96-L100

I am currently not sure what is happening here. It appears to be something filesystem-related. Upon a brief search of the emscripten issues I did not see any references to IBM i.

@gregorysababady
Author

OK, so there's no solution to it?

@gregorysababady
Author

So is there going to be a new patch soon?

@Balearica
Member

It is likely that this can be fixed; however, that would require troubleshooting by you or another IBM i user to figure out what the root cause is. I am not able to troubleshoot a platform-specific issue on a proprietary platform that I do not have access to.
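One possible starting point for that troubleshooting, sketched under assumptions: since the error message says an argument to path.resolve was not a string, the script below prints a few process-level values that commonly end up as path inputs in Node.js and flags any that are not strings. Which of these values the Emscripten-generated code actually reads is an assumption, not something confirmed in this thread, and the collectPathInputs and findNonStrings helpers are hypothetical.

```javascript
const os = require('os');

// Collect values that commonly feed path handling in a Node.js process.
// Whether Emscripten's generated code uses any of them is an assumption.
function collectPathInputs() {
  return {
    platform: process.platform,
    cwd: process.cwd(),
    scriptPath: process.argv[1],
    execPath: process.execPath,
    tmpdir: os.tmpdir(),
  };
}

// Return the names of entries whose value is not a string.
function findNonStrings(inputs) {
  return Object.keys(inputs).filter((key) => typeof inputs[key] !== 'string');
}

const inputs = collectPathInputs();
console.log(inputs);
console.log('non-string entries:', findNonStrings(inputs));
```

Running this on both the IBM i machine and a machine where Tesseract.js works, then comparing the output, may help narrow down whether one of these values differs in type between platforms.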

@gregorysababady
Author

gregorysababady commented Jun 20, 2024

Thank you. By the way, there is a public server available to anyone: https://pub400.com

You can create credentials and test the code directly there!

It's the same working environment as on my machine.

Thank you for your help, it's much appreciated!
