Version 4 Development and Changes #662
Comments
See #662 for explanation of Tesseract.js Version 4 changes. List below is auto-generated from commits.
* Added image preprocessing functions (rotate + save images)
* Updated createWorker to be async
* Reworked createWorker to be async and throw errors per #654
* Reworked createWorker to be async and throw errors per #654
* Edited detect to return null when detection fails rather than throwing error per #526
* Updated types per #606 and #580 (#663) (#664)
* Removed unused files
* Added savePDF option to recognize per #488; cleaned up code for linter
* Updated download-pdf example for node to use new savePDF option
* Added OutputFormats option/interface for setting output
* Allowed for Tesseract parameters to be set through recognition options per #665
* Updated docs
* Edited loadLanguage to no longer overwrite cache with data from cache per #666
* Added interface for setting 'init only' options per #613
* Wrapped caching in try block per #609
* Fixed unit tests
* Updated setImage to resolve memory leak per #678
* Added debug output option per #681
* Fixed bug with saving images per #588
* Updated examples
* Updated readme and Tesseract.js-core version
Is Arabic available in loadLanguage?
@alaaeid1993 Yes, the code for Arabic is
Yes, the Arabic letters are no problem, but the Arabic numbers do not appear correctly.
@alaaeid1993 This repo is for the JavaScript/WebAssembly port of Tesseract. We do not make changes to the Tesseract OCR engine or language data. To confirm, you can install the Tesseract CLI (the main project) with an equivalent version (v5.3.0 as of this writing) and run it with equivalent settings. If you find that accuracy in Tesseract CLI is also unacceptable, then the issue is with Tesseract (not Tesseract.js), and you should look for a fix in the main Tesseract repo. If you find that Tesseract CLI produces correct results (with equivalent version/settings) but Tesseract.js does not, then we can discuss further here.
Overview
While bug fixes continue to be released for Version 3, all breaking changes will be released in Version 4, which is currently under development in the branch named dev/v4. This branch should be usable at present by users eager to use any new features; however, there is no guarantee that additional breaking changes will not be implemented. Note that using this branch also requires using the Tesseract.js-core branch dev/v4.
Summary
Breaking Changes
* createWorker is now async
  * worker = Tesseract.createWorker() should be replaced with worker = await Tesseract.createWorker()
  * An invalid workerPath or corePath now produces an error/rejected promise (Rework error reporting from worker threads so all promises resolve #654)
* worker.load is no longer needed (createWorker now returns worker pre-loaded)
* getPDF function replaced by pdf recognize option (GetPDF() with Scheduler returns the same PDF file #488)
Major New Features
* Added imageColor, imageGrey, and imageBinary options for retrieving processed images (Is it possible to obtain the Thresholded Image from tesseract? #588)
* Rotation options rotateAuto and rotateRadians have been added, which significantly improve accuracy on certain documents
  * [...] rotateAuto option
* Tesseract parameters (normally set with worker.setParameters) can now be set for single jobs using worker.recognize options (Allow for setting parameters for single recognize job when using scheduler #665)
  * For example: worker.recognize(image, {tessedit_char_whitelist: "0123456789"})
* "Init only" parameters (e.g. load_system_dawg, load_number_dawg, and load_punc_dawg) can now be set (Add a way to set "Init Only" parameters (user_word_suffix, etc.) #613)
  * The third argument to worker.initialize now accepts either (1) an object with key/value pairs or (2) a string containing contents to write to a config file
  * For example, both of the following set load_number_dawg to 0:
    * worker.initialize('eng', "0", {load_number_dawg: "0"});
    * worker.initialize('eng', "0", "load_number_dawg 0");
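The breaking changes and new options summarized above can be combined into one short flow. The following is a minimal sketch rather than an official example: `recognizeDigits` is a hypothetical helper name, and the Tesseract.js module is passed in as an argument (an assumption made so the helper can be exercised with a stub). The `initialize` and `recognize` calls mirror the ones quoted in the summary.

```javascript
// Sketch of a v4-style job. `Tesseract` is the imported tesseract.js module
// (e.g. const Tesseract = require('tesseract.js')), passed in as a parameter.
// `recognizeDigits` is a hypothetical helper, not part of the library.
async function recognizeDigits(Tesseract, image) {
  // v4: createWorker is async and returns a pre-loaded worker,
  // so worker.load() is no longer called.
  const worker = await Tesseract.createWorker();
  try {
    await worker.loadLanguage('eng');
    // Third argument sets an "init only" parameter (object form).
    await worker.initialize('eng', '0', { load_number_dawg: '0' });
    // Parameter passed as a recognize option, so it is set for this job only.
    const { data } = await worker.recognize(image, {
      tessedit_char_whitelist: '0123456789',
    });
    return data.text;
  } finally {
    // Release the worker even if recognition throws.
    await worker.terminate();
  }
}
```

Because an invalid workerPath or corePath now rejects the promise returned by createWorker, callers can wrap this helper in try/catch (or attach `.catch`) instead of relying on errors raised inside the worker thread.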
Other Changes
* loadLanguage now resolves without error when language is loaded but writing to cache fails
* detect returns null values when OS detection fails rather than throwing error (Failed to dectet OS #526)
Detail
New Output Format Interface
A single, unified interface has been added for specifying all output formats.
output is now the 3rd argument to recognize (see example below). This replaces the separate getPDF function, as well as various setParameters options (tessjs_create_box, tessjs_create_hocr, tessjs_create_osd, tessjs_create_tsv, and tessjs_create_unlv).
Note: the default output formats (text, blocks, hocr, and tsv) are not changing between v3 and v4, so this change only impacts users who want non-default options. This also means that users who want text and pdf outputs only need to specify {pdf: true}, as text is already a default.
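A minimal sketch of the output interface described above, requesting a PDF in addition to the default text. `recognizeToPdf` is a hypothetical helper name, and reading the results from `data.text` and `data.pdf` is an assumption about the shape of the returned object; check the browser and node examples in the repo for the exact usage.

```javascript
// `recognizeToPdf` is a hypothetical helper; `worker` is an initialized
// Tesseract.js v4 worker.
async function recognizeToPdf(worker, image) {
  // 2nd argument: recognition options (none here).
  // 3rd argument: output formats. Only pdf needs to be requested,
  // since text is already a default output.
  const { data } = await worker.recognize(image, {}, { pdf: true });
  // Assumption: outputs are returned as properties of `data`.
  return { text: data.text, pdf: data.pdf };
}
```

Since the PDF is requested per recognize call rather than fetched afterwards with getPDF, the same call shape should also apply to jobs run through a scheduler.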