Apply OCR-Layer to the input PDF #928

mbaer3000 · 2024-05-29T14:02:41Z


Hi there! We've just implemented an "addOcrLayerToExistingPDF" function by turning an arbitrary input PDF into a series of TIFF images and running those through tesseract.js. Pretty straightforward. However, I would like to apply the OCR layer to the original PDF rather than to the images, so that the output PDF stays more faithful to the input PDF. This would also guarantee that the image quality in the output PDF is not degraded.
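
One detail that comes up when OCR results from rendered page images are placed back onto the original pages is the coordinate conversion. OCR bounding boxes are in pixels of the rendered image with a top-left origin, while PDF page space is measured in points (1/72 inch) with a bottom-left origin by default. The following is a minimal sketch, assuming each page was rendered at a known DPI and has no rotation or crop-box offset. The names `PixelBox`, `PdfRect`, and `pixelBoxToPdfRect` are hypothetical; only the `x0`/`y0`/`x1`/`y1` field names mirror the bounding boxes tesseract.js reports.

```typescript
// Bounding box in rendered-image pixels, origin at the top-left corner.
interface PixelBox { x0: number; y0: number; x1: number; y1: number }

// Rectangle in default PDF user space: points (1/72 inch), origin bottom-left.
interface PdfRect { x: number; y: number; width: number; height: number }

const POINTS_PER_INCH = 72;

// Converts an OCR box from image pixels to PDF points.
// Assumes an unrotated page whose visible area starts at (0, 0).
function pixelBoxToPdfRect(
  box: PixelBox,
  renderDpi: number,
  pageHeightPt: number,
): PdfRect {
  const scale = POINTS_PER_INCH / renderDpi; // points per pixel
  return {
    x: box.x0 * scale,
    // Flip the y-axis: the bottom edge of the box (y1) becomes the rect origin.
    y: pageHeightPt - box.y1 * scale,
    width: (box.x1 - box.x0) * scale,
    height: (box.y1 - box.y0) * scale,
  };
}
```

For example, a page that is 792 points tall and rendered at 144 DPI has a scale of 0.5 points per pixel, so a word box spanning pixels (100, 200) to (300, 260) would map to a rectangle 100 points wide and 30 points high.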

I understand this may be outside the scope of what tesseract.js can solve, but understanding how to create an "OCR-only PDF" with tesseract.js would already help a lot. That PDF could then be merged with the input PDF (I would think there are existing solutions to that problem). Thanks for any ideas or pointers!
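
For the "OCR-only PDF" part, here is a minimal sketch of how this might look. It assumes that the tesseract.js worker's `recognize` call accepts a `pdfTextOnly` option and returns the PDF bytes in `data.pdf` when `pdf: true` is requested in the output formats; that matches the library's type definitions as far as I know, but it should be checked against the installed version. The `PdfCapableWorker` interface and the `ocrOnlyPdf` function are hypothetical names introduced so the sketch does not depend on the library at compile time.

```typescript
// Minimal structural type for the part of a tesseract.js worker used here.
// Assumption: recognize(image, options, output) honors `pdfTextOnly` and
// returns the PDF bytes in `data.pdf` when `output.pdf` is true.
interface PdfCapableWorker {
  recognize(
    image: unknown,
    options: { pdfTitle?: string; pdfTextOnly?: boolean },
    output: { pdf: boolean },
  ): Promise<{ data: { pdf: number[] | null } }>;
}

// Runs OCR on one rendered page image and returns a single-page PDF that is
// expected to contain only the invisible text layer (no embedded image).
async function ocrOnlyPdf(
  worker: PdfCapableWorker,
  pageImage: unknown,
): Promise<Uint8Array> {
  const { data } = await worker.recognize(
    pageImage,
    { pdfTitle: "OCR layer", pdfTextOnly: true },
    { pdf: true },
  );
  if (data.pdf === null) {
    throw new Error("worker returned no PDF output");
  }
  return Uint8Array.from(data.pdf);
}
```

In a real setup the worker would come from tesseract.js's `createWorker`, and `ocrOnlyPdf` would be called once per rendered page. Combining the resulting text-only pages with the original PDF is a separate step that this sketch does not cover.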

Balearica · 2024-05-29T18:25:51Z

Maintainer

What is the license for your project? I have code that does this; however, it relies on dual-licensed (GPL and proprietary) projects.

If your project is open source and GPL licensed, you could use this code without any additional limitations.
If your project is proprietary/closed source, you could use this code but would need to obtain a paid license.
If your project is MIT licensed, using this code would not be possible without creating a separate GPL licensed version with these features.

I am not aware of any way to accomplish this using only permissively licensed software. Specifically, the "merg[ing] with the input PDF" operation is fairly involved, and I do not know of any permissively licensed PDF libraries that support this.

2 replies

mbaer3000 May 29, 2024
Author

Thanks for your perspective. The project is a closed source proprietary one. What are the options to obtain a paid license for the code you refer to?

Balearica Jun 5, 2024
Maintainer

The code I am referring to is from Scribe OCR, which I also maintain. Scribe OCR uses Tesseract.js but provides additional features such as PDF support. Among other things, this includes the ability to run OCR on .pdf files and to create .pdf output that combines the OCR overlay with the source file (rather than using a series of rendered images). Some of the PDF-related features rely on code from mupdf, which is a dual-licensed AGPL/commercial library, so our use of that code requires the free version of Scribe OCR to be AGPL, with proprietary use requiring a commercial license.

While the PDF functionality can be used in an ad-hoc manner (including embedding in other applications), the easiest way to test is to use the web interface at scribeocr.com. Using this interface, you can upload a .pdf document, run OCR, and download a .pdf file where the OCR text is combined with the input document (rather than rendering everything to images). This results in files that are significantly smaller than creating a new PDF from images, and generally maintains the appearance of the input .pdf much better.

Please try with some test documents, and let me know if this is what you had in mind. If that looks promising for your use case, we can discuss licensing.
