Scening Text Detection And Recognition Using RabbitMQ & Tesseract

Tesseract is an open-source OCR engine maintained by Google. It is capable of recognizing text within images, PDFs, and other scanned documents. RabbitMQ is a message broker that enables different parts of a system to communicate with each other by passing messages independently and asynchronously.

· 3 min read
Photo by Max Chen / Unsplash
Photo by Max Chen / Unsplash

Prerequisites.

  1. Public domain registered in Cloudflare.
  2. 2 PCs for publishing and consuming simulation of converting PDF files into text files.
  3. For use cases make the PCs' IP address and subnet different.

Tesseract OCR

Tesseract is a free and open-source OCR engine that complies with the Apache 2.0 license. For programmers, it can be applied directly or through the use of an API to extract written text from images. Many different languages are supported. Tesseract doesn't come with a built-in graphical user interface, however, there are plenty on the third-party website. Through the use of wrappers, Tesseract is compatible with a wide range of programming languages and frameworks. It can be used in conjunction with the current layout analysis to identify text within a huge document or with an outside text detector to identify text from a picture of a single text line.

Here's how the Tesseract and RabbitMQ are used to identify and recognize text coming from a PDF file in real-time streaming.

Initial display of the RabbitMQ web page.

Preparation of converting PDF to text and publishing to the consumer.

Preparation of consuming text submitted from the publisher.

The above display is the output of a PDF file conversion to a text file by scanning per page.

The above display is the text sent and consumed by the publisher.

Display the progress of publishing and consuming the converted PDF-based text stream.

Tesseract's Drawbacks

Tesseract functions best when the foreground and background texts are well separated. Guaranteeing this kind of setup can be very difficult in practice. Tesseract might not produce high-quality results for several different reasons, such as if the image has background noise. The results of recognition improve with greater image quality (size, contrast, and lighting). To improve the OCR results, a little preprocessing is needed. Images must be scaled correctly, have as much picture contrast as possible, and the text must be horizontally aligned. The following restrictions apply to Tesseract OCR, despite its considerable strength.

The list summarizes the constraints of tesseracts.

  1. The OCR is not as precise as some of the commercial options.
  2. Doesn't work well with pictures that have artifacts, such as partial occlusion, warped perspective, and complex backgrounds.
  3. It can't read handwriting, unfortunately.
  4. Gibberish could be discovered and reported as OCR output.
  5. Results could be subpar if a document comprises languages other than those listed with the -l LANG parameters.
  6. Analysis of the normal reading sequence of documents is not always accurate. For instance, it might attempt to connect text across columns if it doesn't realize a document has two columns.
  7. Poor-quality scans could result in subpar OCR.
  8. It conceals the identity of the font family to which the text belongs.

Complimentary Source Code With Non-Tesseract Plugin

Before carrying out to executing the following source code, the RabbitMQ server app must already be installed. On macOS, the command is as follows.

brew install rabbitmq

consumer.py

import pika

producer.py

import pdftotext

The sequence of execution is first running python3 consumer.py. Next, running producer.py

Conclusion

With some effort, the streaming of text detection and recognition can be done quite easily.