
Introducing Tesseract 4 to Xojo

We added the TesseractMBS class to the MBS Xojo OCR Plugin back in 2012. We decided to go with the Tesseract engine, which was available as a C++ library with an open source license. We integrated Tesseract in version 3.02 and stayed with that version for a long time. Since each version requires compatible data files, we could not easily change the library without requiring you to change your data files as well.

That brings us to version 4.1.1 of Tesseract. We had plans to use the new version and looked for a way to make the transition easy for our plugin users. Since the newer tesseract library exports a C interface, we can load it dynamically at runtime. So we built the new TessEngineMBS class for you, and it has a LoadLibrary() function to load the leptonica and tesseract libraries.

macOS with Homebrew

One way on macOS is to use the Homebrew project to install the tesseract library together with its data files.
After installing the Homebrew package manager via Terminal, you install the packages with a command like this:

brew install tesseract-lang

And then you would do two TessEngineMBS.LoadLibrary function calls to load the libraries, first the leptonica image library and then the actual OCR library on top:

Dim r1 as Boolean = TessEngineMBS.LoadLibrary( "/opt/homebrew/lib/liblept.5.dylib" )
Dim r2 as Boolean = TessEngineMBS.LoadLibrary( "/opt/homebrew/lib/libtesseract.4.dylib" )

If both return true, you are good to go.
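
The paths above use /opt/homebrew, which is Homebrew's default prefix on Apple silicon Macs. On Intel Macs Homebrew installs under /usr/local by default. Here is a minimal sketch that tries both prefixes. It assumes the library file names are the same in both locations, and LoadHomebrewTesseract is a hypothetical helper name, not part of the plugin:

```xojo
// Sketch only: try both common Homebrew prefixes.
// LoadHomebrewTesseract is a hypothetical helper, not a plugin function.
Function LoadHomebrewTesseract() As Boolean
  // /opt/homebrew is the default on Apple silicon, /usr/local on Intel Macs
  Dim prefixes() As String = Array("/opt/homebrew/lib/", "/usr/local/lib/")
  
  For Each prefix As String In prefixes
    // Load leptonica first, then the OCR library on top of it
    If TessEngineMBS.LoadLibrary(prefix + "liblept.5.dylib") Then
      If TessEngineMBS.LoadLibrary(prefix + "libtesseract.4.dylib") Then
        Return True
      End If
    End If
  Next
  
  // Neither prefix worked
  Return False
End Function
```

If this returns false, check with "brew --prefix" where your Homebrew installation actually lives.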

macOS with our download

Or you go to our website, where we have a disk image for you. This one is a bit special, as we provide a single dylib containing both libraries (leptonica and tesseract) for both architectures: Intel and ARM. Put this dylib somewhere together with the data files, and then load the library like this:

Dim r as Boolean = TessEngineMBS.LoadLibrary( "/Users/Tesseract/tesseract.dylib")

Once loaded, you are ready to go.

Linux

On Linux with Ubuntu you can install the tesseract files via Terminal using the apt-get command:

sudo apt-get install libtesseract4

This should install all the dependencies and the tesseract package. Once that is done, you can simply load it:

Dim r1 as Boolean = TessEngineMBS.LoadLibrary( "liblept.so.5" )
Dim r2 as Boolean = TessEngineMBS.LoadLibrary( "libtesseract.so.4" )

Please note that we don't pass a path here. The libraries are installed in the default location for Linux, so the loader will find them automatically.
If both return true, you are good to go.

Windows

On Windows you may use an installer for tesseract to get the data files and the DLLs into place. We got an installer for you from the University of Mannheim on our Download Libs folder.

Once installed you can load it:

Dim r1 as Boolean = TessEngineMBS.SetCurrentWorkingDirectory( "C:\Program Files\Tesseract-OCR" )
Dim r2 as Boolean = TessEngineMBS.LoadLibrary( "liblept-5.dll" )
Dim r3 as Boolean = TessEngineMBS.LoadLibrary( "libtesseract-4.dll" )

As you see, we first have to switch the current working directory to the right folder. Then we load the leptonica library, followed by the tesseract library. Since the tesseract library depends on leptonica, we load leptonica first so the DLL loader can find it. If all three functions return true, you are good to go.

Initialize it

Once the library is loaded, you can use the Initialize function to initialize the engine. It's not part of the constructor, so you can create an object, set a few variables, and then initialize. Pass the path to the language files, except on Linux or with Homebrew, where the default location may work instead. Once initialized, you can call the other functions. Basically, you can move all of this to the app start code or to the first time you want to use OCR functions.

New tricks

With version 4, we added a few new functions. First, SetImageData and SetImageFile allow you to pass in-memory image data and image files directly, without going through a picture object.

And the newer engine can be initialized for multiple languages, e.g. "eng+deu" for loading both English and Deutsch (German).
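
As a rough sketch of how initialization with two languages might look: this assumes Initialize takes the data path followed by the language string and returns a Boolean, so please check the plugin documentation for the exact signature. The data path shown is a hypothetical location.

```xojo
// Sketch under assumptions: Initialize is taken to accept the data path
// and a language string and to return a Boolean. Verify against the
// plugin documentation before relying on this.
Dim engine As New TessEngineMBS

// Hypothetical folder containing the .traineddata files
Dim dataPath As String = "/Users/Tesseract/tessdata"

If engine.Initialize(dataPath, "eng+deu") Then
  // The engine is now set up for English and German
  System.DebugLog "Tesseract initialized for eng+deu"
Else
  System.DebugLog "Tesseract could not be initialized"
End If
```

Both language data files need to be present in the data folder for this to work.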

Before loading the library, you may call the TessEngineMBS.LibraryLoaded function to check whether it is already loaded. That way you only load and initialize on the first try.
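
Putting the pieces together, a load-once helper could look roughly like the sketch below. It assumes LibraryLoaded is a shared function returning a Boolean, uses the Homebrew paths on macOS, and EnsureTesseractLoaded is a hypothetical name of our own choosing:

```xojo
// Hypothetical helper: load the libraries only once per app run.
Function EnsureTesseractLoaded() As Boolean
  // Assumption: LibraryLoaded is a shared function returning a Boolean
  If TessEngineMBS.LibraryLoaded Then Return True
  
  #If TargetMacOS Then
    // Homebrew install on Apple silicon; adjust the paths for other setups
    If Not TessEngineMBS.LoadLibrary("/opt/homebrew/lib/liblept.5.dylib") Then Return False
    Return TessEngineMBS.LoadLibrary("/opt/homebrew/lib/libtesseract.4.dylib")
  #ElseIf TargetLinux Then
    // Default library location, so no path is needed
    If Not TessEngineMBS.LoadLibrary("liblept.so.5") Then Return False
    Return TessEngineMBS.LoadLibrary("libtesseract.so.4")
  #ElseIf TargetWindows Then
    // Switch to the install folder first so the DLL loader finds dependencies
    If Not TessEngineMBS.SetCurrentWorkingDirectory("C:\Program Files\Tesseract-OCR") Then Return False
    If Not TessEngineMBS.LoadLibrary("liblept-5.dll") Then Return False
    Return TessEngineMBS.LoadLibrary("libtesseract-4.dll")
  #Else
    // Unsupported target
    Return False
  #EndIf
End Function
```

You could call this from the app's Opening event or right before the first OCR call, and only create and initialize a TessEngineMBS object when it returns true.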
30 06 21 - 09:40