Multiple issues: Discussion #915
Replies: 1 comment 4 replies
Regarding the first issue, about bounding boxes, it is unclear to me whether you mean that (1) the text is identified correctly but the bounding box misses the entire word, or (2) incorrect text is being identified in a region where no word exists. Regarding the second question, while I'm not sure why certain words are assigned [...] Metrics from OCR engines can be useful on a less granular level: a page with average confidence [...]
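As a rough illustration of using confidence at the page level rather than per word, here is a minimal sketch. It assumes the recognized words are available as an array of objects with a numeric (or numeric-string) `confidence` field; the function name `averageConfidence` is hypothetical and not part of any library.

```javascript
// Hypothetical helper: average the per-word confidence for one page.
// Assumes each word has a `confidence` field holding a number or a numeric string.
function averageConfidence(words) {
  // An empty page has no meaningful average.
  if (words.length === 0) return null;
  let sum = 0;
  for (const word of words) {
    sum += Number(word.confidence);
  }
  return sum / words.length;
}
```

A single word with confidence 0 then only pulls the page average down a little, so it does not dominate the page-level picture.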
Hi, I have been working with Tesseract OCR for months in my project, and along the way I've noticed many weird things that have perplexed me. I want to understand why they are happening.
To provide context for the errors, I'll have to link 2 resources,
In the above repository, the code crops out the individual word images detected in the input image file, using the Canvas API.
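For reference, cropping a word from its bounding box could look roughly like the sketch below. This is not the code from the repository. It assumes each box has the shape `{ x0, y0, x1, y1 }` in pixel coordinates of the image that was passed to OCR, and the helper name `bboxToCropRect` and its `padding` parameter are hypothetical.

```javascript
// Hypothetical helper: turn a word bounding box into drawImage() source arguments.
// Assumes bbox is { x0, y0, x1, y1 } in pixel coordinates of the source image.
function bboxToCropRect(bbox, imageWidth, imageHeight, padding = 0) {
  // Clamp every edge to the image so the crop never reads outside it.
  const left = Math.max(0, Math.min(bbox.x0, bbox.x1) - padding);
  const top = Math.max(0, Math.min(bbox.y0, bbox.y1) - padding);
  const right = Math.min(imageWidth, Math.max(bbox.x0, bbox.x1) + padding);
  const bottom = Math.min(imageHeight, Math.max(bbox.y0, bbox.y1) + padding);
  const width = right - left;
  const height = bottom - top;
  // A non-positive size means the box lies entirely outside the image.
  if (width <= 0 || height <= 0) return null;
  return { sx: left, sy: top, sw: width, sh: height };
}

// Usage in the browser, where `image` is the same image that was given to OCR:
// const rect = bboxToCropRect(word.bbox, image.naturalWidth, image.naturalHeight, 2);
// if (rect) ctx.drawImage(image, rect.sx, rect.sy, rect.sw, rect.sh, 0, 0, rect.sw, rect.sh);
```

The clamping matters because the box coordinates only make sense against the pixel dimensions of the image that was actually recognized. If the crop is taken from a resized copy, the box and the crop will not line up.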
These are the errors I want to focus on:
Even when text exists in the input image and gets recognized, the bbox data does not contain it: the box is out of bounds, so there is no relation between the detected text and the cropped image.
Sometimes the recognized text matches the actual handwritten text exactly, yet the confidence is reported as 0.
You can check the number of "0.00"s in this JSON file, though that won't give an exact measure of how often this error happens, since sometimes the confidence can be 0 because of an actual mismatch.
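To get a rough count of both symptoms at once, something like the sketch below could be run over the recognized words. It assumes each word looks like `{ text, confidence, bbox: { x0, y0, x1, y1 } }`, with `confidence` stored either as a number or as a string such as "0.00". The function name `auditWords` is hypothetical.

```javascript
// Hypothetical diagnostic over a list of recognized words.
// Assumes each word is { text, confidence, bbox: { x0, y0, x1, y1 } }.
function auditWords(words, imageWidth, imageHeight) {
  const zeroConfidence = [];
  const outOfBounds = [];
  for (const word of words) {
    // Non-empty text with confidence 0 is the case worth checking by hand.
    if (Number(word.confidence) === 0 && word.text.trim() !== "") {
      zeroConfidence.push(word);
    }
    const { x0, y0, x1, y1 } = word.bbox;
    // A usable box has positive size and lies fully inside the image.
    const inside =
      x0 >= 0 && y0 >= 0 && x1 <= imageWidth && y1 <= imageHeight && x0 < x1 && y0 < y1;
    if (!inside) outOfBounds.push(word);
  }
  return { total: words.length, zeroConfidence, outOfBounds };
}
```

This only narrows things down. Whether a zero-confidence word is actually correct still has to be checked against the handwritten text by eye.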
These 2 issues have made me scratch my brain from the inside. What could be causing them, and can they be fixed with any tweaks?