Disable non-text output formats by default #916

Balearica · 2024-04-16T07:14:09Z

By default, 4 different output formats are produced: text, blocks, hocr, and tsv. It's safe to say that few if any users make use of more than one format. However, producing all 4 formats can significantly inflate runtime. This is especially true for blocks, which requires iterating individually over every symbol (and symbol choice) in the data and retrieving information about each one.

I recently encountered an image where creating the blocks output took 12 seconds, whereas running recognition took just 10 seconds. While this is uncharacteristically long, it is unacceptable for a default option that few users benefit from to more than double runtime for any image. Even outside of this fringe case, testing on other documents shows that creating blocks often adds 0.25 to 0.50 seconds to runtime when scanning documents, which is a non-trivial increase.

I think it makes sense to leave text on by default, as presumably this is the most used and quickest to render, and some output format needs to be enabled by default. However, other formats should not be enabled unless the user actually wants them.

This is a breaking change, so it would need to wait until Tesseract.js v6. Restoring the previous behavior would simply be a matter of manually specifying the formats in the output argument to worker.recognize.
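As a rough sketch of what restoring the old behavior could look like: the helper below passes an explicit output object as the third argument to worker.recognize, so the result does not depend on the library's defaults. The names `ALL_FORMATS` and `recognizeAllFormats` are hypothetical and only for illustration; the worker is assumed to be one created with Tesseract.js (e.g. via `createWorker`).

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Requests all four formats explicitly instead of relying on defaults.
const ALL_FORMATS = { text: true, blocks: true, hocr: true, tsv: true };

async function recognizeAllFormats(worker, image) {
  // 2nd argument: recognition options (left empty here).
  // 3rd argument: the output object selecting which formats to produce.
  const { data } = await worker.recognize(image, {}, ALL_FORMATS);
  return data; // expected to carry data.text, data.blocks, data.hocr, data.tsv
}
```

Users who only need one extra format could pass a smaller object, such as `{ text: true, hocr: true }`, and avoid paying for the rest.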

Balearica added this to the v6.0 milestone Apr 16, 2024

Balearica mentioned this issue Apr 16, 2024

Font attributes incorrect even when font is properly identified (is_italic, is_serif, etc.) #907

Closed

Balearica mentioned this issue Jun 13, 2024

Investigate and fix runaway recognition times scribeocr/scribeocr#30

Closed
