There is currently no easy way to retrieve accurate line size metrics using Tesseract.js. Several sub-optimal ways of accomplishing this are listed below.
1. Use the `word` object within the `blocks` output format.
   - These metrics are rarely accurate, and unhelpful without additional information.
   - There is no absolute mapping between font size and pixels.
2. Use the `ascender`/`descender`/`x_size` metrics in the `hocr` output.
   - These are useful and accurate (at least using Tesseract Legacy).
   - Unfortunately, getting the values is a hassle.
   - Extracting them requires using an XML parser or regular expressions.
3. Calculate font size manually using character-level bounding boxes.
   - This works, but is even more of a hassle than parsing from hOCR.
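To illustrate the second workaround, here is a minimal sketch of extracting line metrics from hOCR with regular expressions. `parseLineMetrics` is a hypothetical helper, not part of Tesseract.js. It assumes line elements are `span` tags with class `ocr_line`, that the `class` attribute precedes `title`, and that the `title` attribute carries `x_size`, `x_descenders`, and `x_ascenders` fields; other line-level classes are ignored.

```javascript
// Hypothetical helper (not part of Tesseract.js): read per-line size
// metrics from an hOCR string using regular expressions.
function parseLineMetrics(hocr) {
  const lines = [];
  // Capture the title attribute of every <span class="ocr_line" ...>.
  // Assumes the class attribute appears before the title attribute.
  const lineRe = /<span[^>]*class=['"]ocr_line['"][^>]*title=['"]([^'"]*)['"]/g;
  let match;
  while ((match = lineRe.exec(hocr)) !== null) {
    const title = match[1];
    // Read a single numeric field such as "x_size 30" from the title.
    const num = (name) => {
      const m = new RegExp(`(?:^|;)\\s*${name}\\s+(-?[\\d.]+)`).exec(title);
      return m ? parseFloat(m[1]) : null;
    };
    const bbox = /bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.exec(title);
    lines.push({
      bbox: bbox
        ? { x0: +bbox[1], y0: +bbox[2], x1: +bbox[3], y1: +bbox[4] }
        : null,
      xSize: num("x_size"),
      descenders: num("x_descenders"),
      ascenders: num("x_ascenders"),
    });
  }
  return lines;
}
```

The function would be called on the hOCR string returned by recognition. A field that is missing from a line's `title` comes back as `null` rather than throwing, so callers should check for that.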
The `ascender`/`descender`/`row_height` metrics from the `hocr` output should be added to the `blocks` output format. This would make it easy to retrieve accurate data about line size.
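Until such fields exist, the character-level workaround mentioned above can be approximated along these lines. This is a rough sketch, not an established method: `estimateLineMetrics` and `median` are hypothetical helpers, and the code assumes each symbol has the shape `{ text, bbox: { x0, y0, x1, y1 } }` with y increasing downward, and that the line is horizontal and unskewed.

```javascript
// Hypothetical helpers: estimate line metrics from character bounding boxes.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function estimateLineMetrics(symbols) {
  // Rough character classes; real fonts will not always follow them.
  const xHeightChars = /^[acemnorsuvwxz]$/;
  const ascenderChars = /^[bdfhklA-Z0-9]$/;
  const descenderChars = /^[gjpqy]$/;

  // Characters whose bottom edge is expected to sit on the baseline.
  const onBaseline = symbols.filter(
    (s) => /^[A-Za-z0-9]$/.test(s.text) && !descenderChars.test(s.text)
  );
  if (onBaseline.length === 0) return null;
  const baseline = median(onBaseline.map((s) => s.bbox.y1));

  // Median distance from the baseline for one character class.
  const measure = (re, distance) => {
    const picked = symbols.filter((s) => re.test(s.text));
    return picked.length ? median(picked.map(distance)) : null;
  };

  return {
    baseline,
    xHeight: measure(xHeightChars, (s) => baseline - s.bbox.y0),
    ascenderHeight: measure(ascenderChars, (s) => baseline - s.bbox.y0),
    descenderDepth: measure(descenderChars, (s) => s.bbox.y1 - baseline),
  };
}
```

Medians are used instead of minima and maxima so that a single badly segmented character does not skew the estimate. Any metric with no matching characters on the line is returned as `null`.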
There is a `RowAttributes` getter in Tesseract; however, it was added too recently to be accessible through Tesseract.js-core. Implementing this will therefore require a new minor version of Tesseract.js-core. The implementation must not break code for users on older versions of Tesseract.js-core.