How to Extract Text From PDFs and Images on Linux Using gImageReader

If you're a student or your work involves lots of images and PDFs, you've probably needed, at some point, to extract text from an image or a document.

Luckily, optical character recognition (OCR) makes this possible, and several tools can do it for you. gImageReader is one of them. It's free to use and works with both image files and PDF documents.

Let's dive in to check out gImageReader in detail and see how you can use it to extract text from images and PDFs.

What Is gImageReader?

gImageReader is an app that lets you extract text from images and PDFs on Linux. It's essentially a GUI, or front-end, to the Tesseract OCR engine, an open-source engine originally developed by Hewlett-Packard that's considered one of the best OCR engines available.

With gImageReader, you can easily and quite accurately extract text from images or PDF documents with a few simple clicks. You can then export the extracted text to a text or PDF file for further use.

Features of gImageReader

gImageReader packs the following features:

Import PDF documents and images from different sources (disk, scanning devices, clipboard, and screenshot)
Batch process images or documents, i.e., extract text from multiple images or documents at once
Recognize text snippets as plain text or hOCR documents
Built-in spell checker
Automatic text area detection
Basic image/document editing
Save output as a text file
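
If you prefer the terminal, the batch feature above can be roughly approximated by calling the Tesseract command-line tool directly. The following is only a sketch, not a description of how gImageReader works internally. It assumes the tesseract binary and English language data are installed, and the helper names output_base and ocr_batch are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: derive Tesseract's output base name by dropping
# the file extension (assumes the file name has one).
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: run OCR on each image passed as an argument.
# Tesseract appends ".txt" to the output base on its own.
ocr_batch() {
    for img in "$@"; do
        tesseract "$img" "$(output_base "$img")" -l eng
    done
}

# Example (commented out so that pasting the block only defines the helpers):
# ocr_batch scans/*.png
```

With these helpers, ocr_batch page1.png page2.png would write page1.txt and page2.txt next to the images. Adding hocr as a trailing argument to the tesseract call should produce hOCR output instead of plain text.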

How to Install gImageReader on Linux

gImageReader is available on most major Linux distros. But before you proceed with its installation, you need to install the Tesseract OCR engine on your system.

To do this, open the Software Manager on your system and search for tesseract. When it returns a list of results, install the tesseract-ocr and tesseract-ocr-eng packages. You can also use a command-line package manager to install these packages if you're more comfortable with the terminal.
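
For the terminal route, the sketch below prints (but does not run) a plausible install command for the package manager it finds. The Debian/Ubuntu package names come from the paragraph above; the Fedora and Arch names are assumptions that you should verify against your distro's repositories, and the function name tesseract_install_cmd is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print, without running, the Tesseract install
# command for a given package manager. The dnf and pacman package names
# are assumptions; check them on your distro before use.
tesseract_install_cmd() {
    case "$1" in
        apt)    echo "sudo apt install tesseract-ocr tesseract-ocr-eng" ;;
        dnf)    echo "sudo dnf install tesseract tesseract-langpack-eng" ;;
        pacman) echo "sudo pacman -S tesseract tesseract-data-eng" ;;
        *)      echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}

# Print the command for the first package manager found on this system.
for pm in apt dnf pacman; do
    if command -v "$pm" >/dev/null 2>&1; then
        tesseract_install_cmd "$pm"
        break
    fi
done
```

Review the printed command, then copy and run it yourself. Keeping the script print-only avoids running sudo on a guess.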

After this, check out the installation instructions in the following sections to install gImageReader on your computer.

If you're on Debian or Ubuntu, open the terminal and run the following commands to install gImageReader:

        sudo add-apt-repository ppa:sandromani/gimagereader
        sudo apt-get update
        sudo apt install gimagereader

On Fedora, CentOS, or Red Hat Enterprise Linux (RHEL):

        sudo dnf install gimagereader-qt

On Arch Linux or Manjaro:

        sudo pacman -S gimagereader

openSUSE users can install gImageReader using:

        sudo zypper install gimagereader

If you're using any other Linux distro, you can build gImageReader from source by following the instructions on gImageReader's GitHub page.

How to Use gImageReader on Linux

gImageReader is pretty easy to use and works with all kinds of image files as well as PDF documents. Follow the instructions below to extract text from images or PDFs on Linux.

Open the applications menu, search for gImageReader, and launch the app. Hit the Maximize button in the gImageReader window so that it fills the screen.

Now, click the Add images button on the left pane under the toolbar and use the file browser to select the image(s) or PDF(s) from which you want to extract text.

Click Ok to import the image(s) or PDF(s) to gImageReader. Or, if you want to extract text from what's displayed on the screen, click on the dropdown beside the Add images button and select Take Screenshot. gImageReader will take a screenshot of the screen's content.

Once you've added the image to gImageReader, click the Toggle output pane button (the one with the notepad icon) to bring up the output pane. This is where the text you extract from images or PDFs appears.

Depending on how you want to proceed, you now have the option to identify the text in the image or PDF automatically or manually. To do this automatically, click on the Autodetect layout button, and it will highlight all the text blocks in the selected image or PDF document.

After this, click Recognize selection > Current Page to begin the text extraction process.

Alternatively, to select the text manually, hover over the text you want to extract and use the cross-hair to draw a box around the area you want to extract text from. Then, hit the Recognize selection button to proceed.

If it's a PDF document and you want to extract text from different pages, click the Plus (+) button to flip pages over. To go back, hit the Minus (-) button. Then, select the text you want to extract and hit the Recognize selection button to extract it.

Although rare, there may be times when gImageReader returns the extracted text in a language other than English. When this happens, simply click the dropdown button beside the Recognize selection button and select one of the English options.
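
If no English option shows up, the English language data for Tesseract may not be installed. The sketch below is one way to check from the terminal. It assumes the languages gImageReader offers correspond to the Tesseract language data on your system, and the helper name has_lang is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if language code $1 appears as a whole
# line in a `tesseract --list-langs` listing read from stdin.
has_lang() {
    grep -qx "$1"
}

if command -v tesseract >/dev/null 2>&1; then
    # 2>&1 because some Tesseract versions print the list to stderr.
    if tesseract --list-langs 2>&1 | has_lang eng; then
        echo "English data is installed"
    else
        echo "English data is missing; install your distro's English language package"
    fi
else
    echo "tesseract not found in PATH"
fi
```

On Debian and Ubuntu, the English data is in the tesseract-ocr-eng package mentioned earlier.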

Finally, to save the extracted text, click on the Save output button. This will bring up the Save window. Here, give a name to the file and hit Ok.

What Else Can You Do With gImageReader?

As mentioned earlier, gImageReader also lets you modify certain aspects of the imported images or documents, like their brightness, contrast, and resolution. You can also invert colors or rotate the images or documents, if required.

Most of these options are useful when the text in an image or document isn't legible enough for gImageReader to recognize.

To access any of these editing options, click the Image Controls button, and it will reveal a mini toolbar below the main toolbar. From here, select the appropriate buttons to perform your desired editing operation on the image or document.

Text Extraction on Linux Made Easy With gImageReader

Text extraction requires the right tool: one built on a reliable, accurate OCR engine that can identify the text in an image or document, so you can extract it without any hassle.

gImageReader accomplishes this nicely, thanks to the Tesseract OCR engine it uses in the background. Considering its ease of use, gImageReader is undoubtedly one of the best text extraction tools available for Linux.

Alternatively, if you're looking for a simpler solution, you can check out TextSnatcher, which is fast and pretty easy to use.