Optical Character Recognition using Tesseract

12/08/2022 Abhishek Sai

Introduction

In this blog, we will discuss how to recognize text in images using Tesseract. Optical Character Recognition (OCR) is a technology that identifies the text characters contained in a digital image. Tesseract is an open source OCR engine, originally created by HP, that runs as a command-line program. It supports more than 100 languages, including right-to-left and ideographic languages, and it can be trained to recognize additional ones. It has two OCR engines for image processing: one that uses long short-term memory (LSTM) and a legacy one that relies on character patterns.
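The choice between the two engines can be made from Python through Tesseract's --oem (OCR engine mode) option. The sketch below is only an illustration and not part of the original workflow: build_tesseract_config and ocr_with_engine are hypothetical helper names, and the legacy engine works only if the installed language data includes the legacy model.

```python
# OCR engine modes accepted by Tesseract's --oem option.
ENGINE_MODES = {
    "legacy": 0,   # character-pattern engine
    "lstm": 1,     # LSTM neural network engine
    "both": 2,     # legacy and LSTM combined
    "default": 3,  # whatever is available
}


def build_tesseract_config(engine="default"):
    """Return a config string (hypothetical helper) that selects an engine."""
    if engine not in ENGINE_MODES:
        raise ValueError(f"unknown engine: {engine!r}")
    return f"--oem {ENGINE_MODES[engine]}"


def ocr_with_engine(image, engine="lstm", lang="eng"):
    """Run OCR on a PIL image or NumPy array with the chosen engine."""
    import pytesseract  # imported here so the helper above works without it

    return pytesseract.image_to_string(
        image, lang=lang, config=build_tesseract_config(engine)
    )


print(build_tesseract_config("lstm"))  # prints: --oem 1
```

Once Tesseract and pytesseract are installed, a call such as ocr_with_engine(Image.open('demo.jpg'), engine='lstm') would run the LSTM engine on an image.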

Libraries Required

pytesseract
NumPy
cv2 (OpenCV, installed as opencv-python)
PIL (installed as Pillow)
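Before going further, it can help to check which of these libraries are already installed. The following is a minimal sketch using only the standard library; missing_packages is a hypothetical helper name, and the mapping assumes the usual pip package names.

```python
import importlib.util

# Import names used in this post, mapped to the pip package that provides them.
REQUIRED = {
    "pytesseract": "pytesseract",
    "numpy": "numpy",
    "cv2": "opencv-python",
    "PIL": "Pillow",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```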

Optical Character Recognition for an Image

Step 1 – Install Tesseract-OCR

For Windows

First, install Tesseract-OCR from the Tesseract Link. At the time of writing, the default installation path is C:\Program Files\Tesseract-OCR\tesseract.exe. It may change, so please check the installation path on your machine. In a Colab notebook, use the pip install pytesseract command. Finally, set the Tesseract path in the script:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
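Since the installation path may differ between machines, one option is to look the executable up instead of hard-coding it. This is a rough sketch under that assumption: find_tesseract and configure_pytesseract are hypothetical helpers, and the fallback path is simply the default mentioned above.

```python
import os
import shutil

# Default Windows location mentioned above; it may differ on your machine.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallbacks=(DEFAULT_WINDOWS_PATH,)):
    """Return the path to the tesseract executable, or None if not found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the executable found by find_tesseract."""
    import pytesseract  # third-party wrapper, installed with pip

    path = find_tesseract()
    if path is None:
        raise FileNotFoundError("Tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract() once at the top of the script would then replace the hard-coded assignment.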

For Linux

First, add the repository with the !sudo add-apt-repository ppa:alex-p/tesseract-ocr command and update the package index with !sudo apt-get update. Then install Tesseract with !sudo apt install tesseract-ocr. The leading ! runs a shell command from a notebook cell; omit it in a regular terminal. In a Colab notebook, use the pip install pytesseract command.
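After installing, a quick way to confirm that Tesseract is reachable is to ask it for its version. The snippet below is a small sketch that shells out to the tesseract command; tesseract_version is a hypothetical helper name.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not installed."""
    if shutil.which("tesseract") is None:
        return None
    completed = subprocess.run(
        ["tesseract", "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some releases print the version to stderr rather than stdout.
    output = (completed.stdout or completed.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "Tesseract is not installed or not on PATH")
```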

Step 2 – Import the Libraries

Import the required libraries

import cv2
import numpy as np
import pytesseract
from PIL import Image

Step 3 – Extract the text from the image

Now, we extract the text from the image given as input. First, we convert the image to grayscale and apply pre-processing steps to remove noise from it. Then we use Tesseract for character extraction.

def get_string(img_path):
    # Read the image with OpenCV (imread returns None if the file cannot be read)
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")

    # Convert to grayscale
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise.
    # Note: a 1x1 kernel leaves the image unchanged; try a larger
    # kernel if the image has visible specks.
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Save the pre-processed image to a separate file so that the
    # original input is not overwritten
    cv2.imwrite("removed_noise.png", img)

    # Recognize text with Tesseract through pytesseract
    result = pytesseract.image_to_string(Image.open("removed_noise.png"))

    return result

Step 4 – Result

Finally, pass the image path to the function above and print the returned text.

print('--- Start recognize text from image ---')
print(get_string('/content/demo.jpg'))
print("------ Done -------")
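The raw string returned by Tesseract often contains blank lines, uneven spacing, and sometimes a trailing form feed character. A light clean-up step before printing can make the result easier to read. This is a minimal sketch; clean_ocr_text is a hypothetical helper, and the sample string is made up for illustration.

```python
def clean_ocr_text(text):
    """Collapse runs of whitespace and drop empty lines from OCR output."""
    lines = []
    for raw_line in text.replace("\x0c", "").splitlines():
        line = " ".join(raw_line.split())  # squeeze spaces and tabs
        if line:
            lines.append(line)
    return "\n".join(lines)


sample = "Hello   world \n\n\n second line\n\x0c"
print(clean_ocr_text(sample))
```

With the function from Step 3, this could be used as print(clean_ocr_text(get_string('/content/demo.jpg'))).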

Input Image

The input image is a quote chosen at random from the internet.

Output image
