By default, four output formats are produced: `text`, `blocks`, `hocr`, and `tsv`. It's safe to say that few if any users make use of more than one format. However, producing all four can significantly inflate runtime. This is especially true for `blocks`, which iterates individually over every symbol (and symbol choice) in the data and retrieves information about each one.
I recently encountered an image where creating the `blocks` output took 12 seconds, whereas running recognition took just 10 seconds. While this is uncharacteristically long, it is unacceptable for a default option that few users benefit from to more than double runtime for any image. Even outside of this fringe case, testing on other documents shows that creating `blocks` often adds 0.25 to 0.50 seconds to runtime when scanning documents, which is a non-trivial increase.
I think it makes sense to leave `text` on by default, as presumably this is the most used and quickest to render, and some output format needs to be enabled by default. However, other formats should not be enabled unless the user actually wants them.
This is a breaking change, so it would need to wait until Tesseract.js v6. Restoring the previous behavior would simply be a matter of manually specifying formats in the `output` argument to `worker.recognize`.
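
To illustrate what manually specifying formats could look like, here is a minimal sketch. It assumes `output` is the third argument to `worker.recognize` and is an object of boolean flags keyed by the format names listed above. `buildOutput` and `recognizeWithFormats` are hypothetical helpers written for this example and are not part of Tesseract.js.

```javascript
// The four formats named in this issue. Assumption: these are also the
// keys accepted by the `output` argument of `worker.recognize`.
const ALL_FORMATS = ['text', 'blocks', 'hocr', 'tsv'];

// Hypothetical helper: build an explicit `output` object in which every
// format is set to true or false, so the result does not depend on
// whatever the library's defaults happen to be.
function buildOutput(wanted) {
  const output = {};
  for (const format of ALL_FORMATS) {
    output[format] = wanted.includes(format);
  }
  return output;
}

// Hypothetical wrapper: forwards the explicit `output` object as the third
// argument (assumed position), with an empty options object as the second.
// Returns whatever `worker.recognize` returns.
function recognizeWithFormats(worker, image, wanted) {
  return worker.recognize(image, {}, buildOutput(wanted));
}
```

With a worker obtained from `createWorker`, calling `recognizeWithFormats(worker, image, ALL_FORMATS)` would request all four formats as before, while `recognizeWithFormats(worker, image, ['text'])` would request only `text`.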