-
What is the license for your project? I have code that does this, but it relies on dual-licensed (GPL and proprietary) projects.
I am not aware of any way to accomplish this using only permissively licensed software. Specifically, the "merg[ing] with the input PDF" operation is fairly involved, and I do not know of any permissively licensed PDF library that supports it.
-
Hi there! We've just implemented an "addOcrLayerToExistingPDF" function that converts an arbitrary input PDF into a series of TIFF images and runs those through tesseract.js, which was fairly straightforward. However, I would like to apply the OCR layer to the original PDF rather than to the images, so that the output PDF stays faithful to the input PDF. That would also guarantee that the image quality in the output PDF is not degraded.
I understand this may be outside the scope of tesseract.js, but knowing how to create an "OCR-only PDF" with tesseract.js would already help a lot. That PDF could then be merged with the input PDF (I assume solutions to that problem exist). Thanks for any ideas or pointers!
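For the "OCR-only PDF" part, a minimal sketch might look like the following. It assumes the installed tesseract.js version accepts `pdfTitle` and `pdfTextOnly` recognize options together with a `{ pdf: true }` output flag, and that `data.pdf` comes back as an array of bytes; please check these against the API docs for your version. `buildOcrOnlyPdf` and `buildOcrLayers` are hypothetical helper names, and the worker is passed in so that the sketch does not depend on a particular way of creating it.

```javascript
// Sketch only: `worker` is expected to behave like a tesseract.js worker.
// ASSUMPTION: the `pdfTitle` / `pdfTextOnly` options and the `{ pdf: true }`
// output flag exist in the installed tesseract.js version; verify before use.
async function buildOcrOnlyPdf(worker, image, title = 'OCR layer') {
  const { data } = await worker.recognize(
    image,
    { pdfTitle: title, pdfTextOnly: true }, // request the text layer without the page image
    { pdf: true }                           // ask for PDF output alongside the text
  );
  if (!data || !data.pdf) {
    throw new Error('Worker returned no PDF data; check that PDF output is supported');
  }
  // Normalize to a Uint8Array so the caller can write it to disk or pass it on.
  return Uint8Array.from(data.pdf);
}

// Hypothetical helper: one OCR-only PDF per page image, in page order.
async function buildOcrLayers(worker, pageImages) {
  const layers = [];
  for (const image of pageImages) {
    layers.push(await buildOcrOnlyPdf(worker, image));
  }
  return layers;
}
```

With a real worker, each returned byte array could be written out with something like `fs.writeFileSync('page-1-ocr.pdf', Buffer.from(layer))`. This only covers producing the text-only pages; overlaying them on the original PDF is a separate step that this sketch does not attempt.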