Optical Character Recognition using Tesseract
Introduction
In this blog, we will discuss how to recognize the text in images using Tesseract. Optical Character Recognition (OCR) is a technology that identifies the text characters contained in a digital image. Tesseract is an open-source OCR engine with a command-line interface. It was originally developed at HP and supports more than 100 languages, including right-to-left and ideographic scripts. It can also be trained to recognize additional languages. Tesseract ships with two OCR engines: one based on a long short-term memory (LSTM) neural network, and a legacy engine that relies on character patterns.
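To make the two-engine point concrete: the Tesseract command line exposes an --oem (OCR engine mode) flag and a --psm (page segmentation mode) flag, and pytesseract forwards such flags through the config argument of image_to_string. The sketch below only builds that config string. The helper name build_config is hypothetical and not part of either library, and mode 0 only works if the installed language data includes the legacy model.

```python
# OCR engine modes accepted by Tesseract's --oem flag.
OEM_MODES = {
    0: "legacy engine only",
    1: "LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}


def build_config(oem=3, psm=3):
    """Return a Tesseract config string (hypothetical helper for illustration)."""
    if oem not in OEM_MODES:
        raise ValueError(f"oem must be one of {sorted(OEM_MODES)}, got {oem}")
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


# Possible usage with pytesseract (assumes Tesseract and an image file exist):
# text = pytesseract.image_to_string(
#     Image.open("demo.jpg"), lang="eng", config=build_config(oem=1, psm=6)
# )
```

Here oem=1 selects the LSTM engine and psm=6 treats the image as a single uniform block of text; both values are examples, not recommendations.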
Libraries Required
- pytesseract
- NumPy
- cv2 (OpenCV, installed as opencv-python)
- PIL (installed as Pillow)
Optical Character Recognition for an Image
Step 1 – Install Tesseract-OCR
For Windows
First, install Tesseract-OCR from Tesseract Link. The default installation path at the time of writing is C:\Program Files\Tesseract-OCR\tesseract.exe. It may change, so check the installation path on your machine. In a Colab notebook, install the Python wrapper with the command
pip install pytesseract
Finally, set the Tesseract path in the script:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
For Linux
First, add the repository:
!sudo add-apt-repository ppa:alex-p/tesseract-ocr
Then update the package list:
!sudo apt-get update
Now install Tesseract:
!sudo apt install tesseract-ocr
In a Colab notebook, install the Python wrapper with:
pip install pytesseract
The leading ! runs a shell command from a notebook cell; omit it when typing the commands in a regular terminal.
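Since the installation path can differ between machines, it may help to look the executable up rather than hard-code it. The following is a minimal sketch, assuming the executable is named tesseract. The helpers find_tesseract and configure_pytesseract are hypothetical names introduced here, not part of pytesseract.

```python
import shutil
from pathlib import Path

# Default Windows location mentioned above; it may differ on your machine.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(extra_candidates=(DEFAULT_WINDOWS_PATH,)):
    """Return a path to the tesseract executable, or None if none is found."""
    # Prefer whatever is on PATH (typical after an apt install on Linux).
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise fall back to explicitly listed locations.
    for candidate in extra_candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the located executable (hypothetical helper)."""
    import pytesseract  # imported lazily so the lookup above has no dependency

    path = find_tesseract()
    if path is None:
        raise FileNotFoundError("tesseract executable not found; install it first")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract() once at the top of the script would replace the hard-coded assignment, at the cost of failing early with an error if Tesseract is not installed.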
Step 2 – Import the Libraries
Import the required libraries.
import cv2
import numpy as np
import pytesseract
from PIL import Image
Step 3 – Extract the text from the image
Now we extract the text from the input image. First, we convert the image to grayscale and apply pre-processing steps to reduce noise. Then we run Tesseract to extract the characters.
def get_string(img_path):
    # Read the image with OpenCV (imread returns None if the file cannot be read)
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")
    # Convert to grayscale
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise.
    # Note: a 1x1 kernel leaves the image unchanged; use a larger kernel for a visible effect.
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Write the pre-processed image to a separate file so the input is not overwritten
    cv2.imwrite("removed_noise.png", img)
    # Recognize text with pytesseract
    result = pytesseract.image_to_string(Image.open("removed_noise.png"))
    return result
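To see what the dilation and erosion steps do, and why the kernel size matters, here is a small NumPy-only sketch. It is not OpenCV's implementation: the dilate and erode functions below are simplified stand-ins (square kernel of odd size, one iteration) written for illustration. The toy image has a dark stroke and a one-pixel dark speck on a white background.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _morph(img, ksize, reducer, pad_value):
    """Apply a min/max filter over ksize x ksize windows (ksize must be odd)."""
    pad = ksize // 2
    padded = np.pad(img, pad, mode="constant", constant_values=pad_value)
    windows = sliding_window_view(padded, (ksize, ksize))
    return reducer(windows, axis=(-2, -1))


def dilate(img, ksize):
    # Each pixel becomes the brightest value in its neighbourhood.
    return _morph(img, ksize, np.max, pad_value=0)


def erode(img, ksize):
    # Each pixel becomes the darkest value in its neighbourhood.
    return _morph(img, ksize, np.min, pad_value=255)


# White 9x9 image with a dark 5x3 "stroke" and a single dark noise pixel.
img = np.full((9, 9), 255, dtype=np.uint8)
img[2:7, 4:7] = 0   # stroke
img[1, 1] = 0       # speck

same = erode(dilate(img, 1), 1)      # 1x1 kernel: nothing changes
cleaned = erode(dilate(img, 3), 3)   # 3x3 kernel: speck removed, stroke kept

print(np.array_equal(same, img))
print(cleaned[1, 1], cleaned[4, 5])
```

With a 1x1 kernel, as in the function above, every neighbourhood is a single pixel, so the image passes through unchanged. With a 3x3 kernel the dilation wipes out the isolated speck and thins the stroke, and the erosion then restores the stroke to its original width. Larger kernels remove more noise but can also erase thin character strokes, so the size is worth tuning per image.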
Step 4 – Result
Finally, pass the image path to the function above and print the result.
print('--- Start recognize text from image ---')
print(get_string('/content/demo.jpg'))
print("------ Done -------")
Input Image
This is a quote that is randomly chosen from the internet
Output Image