Font attributes incorrect even when font is properly identified (`is_italic`, `is_serif`, etc.) #907

Balearica · 2024-03-31T02:11:53Z

The blocks output format includes several word-level font attributes, including `is_italic` and `is_serif`. These do not appear to work: they seem to always return false, even when the Legacy model is used and font identification succeeds.

For example, when running recognition with the Legacy engine on the image below, the font is correctly recognized as an italic/serif font (`Times_New_Roman_Italic`). Despite this, the `is_italic` and `is_serif` attributes are both false.
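A minimal sketch of how this mismatch could be detected programmatically. The helper `findFontAttributeMismatches` and the sample data are hypothetical, not part of Tesseract.js. The word objects are assumed to carry `text`, `font_name`, `is_italic`, and `is_serif` fields, and the check assumes that italic faces have "Italic" in their font name, as `Times_New_Roman_Italic` does.

```javascript
// Hypothetical helper (not part of Tesseract.js): flags words whose font name
// implies italics while the is_italic attribute says otherwise.
function findFontAttributeMismatches(words) {
  const mismatches = [];
  for (const word of words) {
    const fontName = word.font_name || '';
    // Assumption: italic faces contain "Italic" in the font name,
    // as in Times_New_Roman_Italic from the report above.
    const nameSaysItalic = /italic/i.test(fontName);
    if (nameSaysItalic && word.is_italic === false) {
      mismatches.push({ text: word.text, font_name: fontName, attribute: 'is_italic' });
    }
  }
  return mismatches;
}

// Made-up sample data shaped like the word-level output described above.
const sampleWords = [
  { text: 'Hello', font_name: 'Times_New_Roman_Italic', is_italic: false, is_serif: false },
  { text: 'world', font_name: 'Arial', is_italic: false, is_serif: false },
];

const mismatches = findFontAttributeMismatches(sampleWords);
console.log(mismatches);
```

In practice the `words` array would come from a recognition result produced with the Legacy engine rather than from hard-coded sample data.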

If this is an issue with Tesseract.js/Tesseract.js-core, we should fix it. If it is an issue on the Tesseract side, where the information is always incorrect, these attributes should be removed from our output (in the next major version) to avoid confusion.

Note that this is distinct from general accuracy issues with Tesseract font recognition, or the fact that it only runs on Legacy, which are outside of the scope of this repo. This issue is specific to cases where Tesseract correctly identifies the font but is still returning the wrong font attributes.

Balearica · 2024-04-16T07:20:41Z

Given that the amount of information retrieved for blocks was found to inflate runtime (see #916), I am now planning to simply remove these properties in v6.

Even if this feature worked correctly (in that it accurately reported what Tesseract finds), it would still only be useful in extremely niche circumstances, if ever. Tesseract LSTM (the default) does not perform font detection at all, and Tesseract Legacy performs poorly at font identification, so this information would be either missing or inaccurate even if this bug were fixed.

Balearica · 2024-08-24T02:23:41Z

I looked into this, and it is caused by a typo in the function that gets boolean pointers. No booleans will be reported correctly using the current function. This should be fixed as it is a broader issue, regardless of whether or not this data is cut from a future version.

https://github.com/naptha/tesseract.js-core/blob/5ed64d41d0549f4f46fa235d597e643ddd748cbb/javascript/anterior.js#L1
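To illustrate the class of bug, here is a sketch of how a typo in a helper that reads booleans through pointers could make every value come back false. This is an illustration only: the actual function and its typo are not reproduced in this thread, and `readBool`, `readBoolWithTypo`, and the `Uint8Array` standing in for WASM memory are all hypothetical.

```javascript
// Illustration only: NOT the actual tesseract.js-core code.
// A Uint8Array stands in for the WASM linear memory.
const wasmHeap = new Uint8Array(16);
const boolPtr = 4;
wasmHeap[boolPtr] = 1; // pretend the native side wrote `true` here

// Correct read: dereference the pointer against the heap.
function readBool(memory, pointer) {
  return memory[pointer] !== 0;
}

// Buggy read: the misspelled property (`haep`) is undefined, so the
// result is false no matter what the native code wrote.
function readBoolWithTypo(memory, pointer) {
  const view = { heap: memory };
  return Boolean(view.haep && view.haep[pointer]);
}

console.log(readBool(wasmHeap, boolPtr));         // true
console.log(readBoolWithTypo(wasmHeap, boolPtr)); // false
```

Because JavaScript does not throw on a misspelled property read, a bug of this kind fails silently, which would be consistent with the attributes always returning false.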

Balearica · 2024-08-24T07:00:59Z

Closing as this was resolved in v5.1.1.

Balearica added this to the v6.0 milestone Apr 16, 2024

Balearica added a commit to naptha/tesseract.js-core that referenced this issue Aug 24, 2024

Fixed bug getting boolean values per naptha/tesseract.js#907

d19b54d

Balearica closed this as completed Aug 24, 2024
