Optical Character Recognition using Tesseract

Introduction

 

In this blog, we will discuss how to recognize text in images using Tesseract. Optical Character Recognition (OCR) is a technology that identifies the text characters contained in a digital image. Tesseract is an open-source OCR engine with a command-line interface. It was originally developed at HP and supports more than 100 languages, including right-to-left and ideographic scripts. Tesseract can also be trained to recognize additional languages. It has two OCR engines for image processing: one that uses long short-term memory (LSTM) neural networks and a legacy one that relies on character patterns.
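The choice between the two engines can be made through Tesseract's --oem (OCR engine mode) option, which pytesseract passes along in its config string. The following is a minimal sketch rather than part of the original walkthrough: build_config and ocr_with_engine are hypothetical helper names, and the legacy engine (mode 0) only works if the installed language data includes the legacy model.

```python
# Tesseract OCR engine modes (--oem):
#   0 = legacy engine only, 1 = LSTM engine only,
#   2 = legacy + LSTM,      3 = default (whatever is available)
VALID_OEM = (0, 1, 2, 3)


def build_config(oem=3, psm=3):
    """Build a Tesseract config string (hypothetical helper)."""
    if oem not in VALID_OEM:
        raise ValueError(f"oem must be one of {VALID_OEM}, got {oem}")
    # --psm selects the page segmentation mode; 3 is Tesseract's default.
    return f"--oem {oem} --psm {psm}"


def ocr_with_engine(image_path, oem=1):
    """Run OCR on an image file with the chosen engine mode."""
    # Imported lazily so that build_config can be used without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), config=build_config(oem=oem)
    )


print(build_config(oem=1))
```

Calling ocr_with_engine('demo.jpg', oem=1) would request the LSTM engine only; the file name here is a placeholder.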

 

Libraries Required

 

  1. pytesseract
  2. NumPy
  3. cv2 (OpenCV)
  4. PIL (Pillow)
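Before going further, it can help to check which of these libraries are already installed. The sketch below is an illustration under stated assumptions: missing_modules is a hypothetical helper, and the pip package names in REQUIRED (opencv-python for cv2, Pillow for PIL) are the commonly used distributions, not something specified in this post.

```python
import importlib.util

# Import name -> pip package name (the two differ for cv2 and PIL).
REQUIRED = {
    "pytesseract": "pytesseract",
    "numpy": "numpy",
    "cv2": "opencv-python",
    "PIL": "Pillow",
}


def missing_modules(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(REQUIRED[m] for m in missing))
else:
    print("All required libraries are available.")
```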

 

Optical Character Recognition for an Image

 

Step 1 – Install Tesseract-OCR

 

For Windows

First, install Tesseract-OCR from the Tesseract link. The default installation path at the time of writing is C:\Program Files\Tesseract-OCR\tesseract.exe. This may change, so please check your installation path. In a Colab notebook, install the Python wrapper using the pip install pytesseract command. Finally, set the Tesseract path in the script:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
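Because the installation path can differ between machines, the hard-coded path can be replaced by a small lookup. This is a minimal sketch and not part of the original steps: find_binary is a hypothetical helper that first searches PATH with shutil.which and then falls back to a list of candidate paths.

```python
import os
import shutil

# Default Windows install location mentioned above; adjust if yours differs.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_binary(name="tesseract", candidates=(DEFAULT_WINDOWS_PATH,)):
    """Return the path to the executable, or None if it cannot be found."""
    found = shutil.which(name)
    if found:
        return found
    # Fall back to known install locations that are not on PATH.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


path = find_binary()
if path is None:
    print("Tesseract was not found; check the installation path.")
else:
    print("Using Tesseract at:", path)
    # To use it with pytesseract:
    #   import pytesseract
    #   pytesseract.pytesseract.tesseract_cmd = path
```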

 

For Linux

 

First, add the repository using the !sudo add-apt-repository ppa:alex-p/tesseract-ocr command and update the package list using the !sudo apt-get update command. Now install Tesseract using the !sudo apt install tesseract-ocr command. In a Colab notebook, install the Python wrapper using the pip install pytesseract command.
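After installing, it is worth confirming that the tesseract executable can actually be run. The following sketch is an illustration only: tesseract_version is a hypothetical helper that calls tesseract --version through subprocess and returns None when the executable is missing.

```python
import subprocess


def tesseract_version(cmd="tesseract"):
    """Return the first line of `<cmd> --version`, or None if unavailable."""
    try:
        proc = subprocess.run(
            [cmd, "--version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # The executable is missing or cannot be run.
        return None
    # Some builds print the version to stderr rather than stdout.
    output = (proc.stdout or proc.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "Tesseract is not installed or not on PATH")
```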

 

Step 2 – Import the Libraries

 

Import the required libraries

 

import cv2
import numpy as np
import pytesseract
from PIL import Image

 

Step 3 – Extract the text from the image

 

Now we extract the text from the input image. First, we convert the image to grayscale and apply pre-processing steps to remove noise. Then we use Tesseract for character extraction.

 

def get_string(img_path):
    # Read the image with OpenCV (imread returns None if the file cannot be read)
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")

    # Convert to grayscale
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise.
    # Note: a 1x1 kernel leaves the image unchanged; a larger kernel
    # such as (2, 2) or (3, 3) is needed for a visible effect.
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Write the pre-processed image to a separate file so that
    # the original input image is not overwritten
    cv2.imwrite("removed_noise.png", img)

    # Recognize text with Tesseract through pytesseract
    result = pytesseract.image_to_string(Image.open("removed_noise.png"))

    return result
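To see what the dilation and erosion steps do, and why the kernel size matters, the idea can be reproduced on a tiny grayscale grid. This is a pure-Python illustration under simplifying assumptions (square odd-sized kernels, pixels outside the image ignored); it is not OpenCV's implementation, and the dilate and erode functions below are stand-ins for cv2.dilate and cv2.erode.

```python
def _morph(img, k, pick):
    """Apply a k x k square max/min filter; pixels outside the image are ignored."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def dilate(img, k):
    """Grayscale dilation: each pixel becomes the brightest value in its window."""
    return _morph(img, k, max)


def erode(img, k):
    """Grayscale erosion: each pixel becomes the darkest value in its window."""
    return _morph(img, k, min)


# A white 5x5 image with one dark noise pixel in the centre.
image = [[255] * 5 for _ in range(5)]
image[2][2] = 0

# A 1x1 kernel changes nothing, so the speck survives.
print(erode(dilate(image, 1), 1)[2][2])
# A 3x3 kernel removes the isolated dark pixel.
print(erode(dilate(image, 3), 3)[2][2])
```

Note that a kernel large enough to remove specks can also erase thin character strokes, so the kernel size is a trade-off that depends on the image.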

 

Step 4 – Result

 

Finally, pass the image path to the above function and print the result.

 

print('--- Start recognize text from image ---')

print(get_string('/content/demo.jpg'))

print("------ Done -------")
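The raw string returned by OCR often contains blank lines, stray spaces, and sometimes a trailing form-feed character used as a page separator. A small clean-up step before printing can make the result easier to read. The sketch below is an optional addition, and clean_ocr_text is a hypothetical helper name.

```python
def clean_ocr_text(text):
    """Strip each line and drop empty ones.

    str.splitlines() also treats the form-feed character as a line
    boundary, so a trailing page separator is removed as well.
    """
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Example with made-up OCR output:
raw = "  Hello  \n\n world \n\f"
print(clean_ocr_text(raw))
```

In the steps above, this would be used as print(clean_ocr_text(get_string('/content/demo.jpg'))).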

 

Input Image 

 

This is a quote that is randomly chosen from the internet

 

Output image

 

 
