# OCR for PDFs

## Introduction

Since the widespread adoption of computers in the 1970s, followed by the invention of the PDF in 1993, storing text in digital formats has gradually but steadily overtaken traditional paper. This edge was created, and further enhanced, by the convenience of the internet, which allows digital text files to be sent across the world in a matter of seconds. Today, not only is machine-encoded text saved and shared via PDFs; even handwritten documents are scanned into such formats for further processing and distribution.

This emerging trend, however, has shed light on a new and ongoing domain of research: PDF Optical Character Recognition (OCR). OCR is the process of converting scanned or handwritten text into machine-encoded text, so that it can be used by programs for further processing and analysis. While the applications of OCR are broad (from images of road signs to formal text documents), this article dives specifically into the domain of PDF OCR, particularly PDFs of scanned and handwritten pages, and discusses the technology and programs in various languages that perform the task. A detailed overview and comparison of several PDF OCR products on the market is also presented for reference.

Looking for an OCR solution to extract information from PDFs? Give Nanonets a spin for higher accuracy, greater flexibility, post-processing, and a broad set of integrations!

Before discussing the code, details, and benefits of OCR, we first explain how OCR works by introducing the advancements in the technology involved.

The electronic conversion of scanned documents, before deep learning met the required accuracy for such tasks, was usually performed in four simple steps:

1. Collect a database of known characters.
2. Use photosensors to gather and separate individual letters from scanned documents.
3. Compare the set of attributes retrieved from the photosensors with physical attributes from the database.
4. Convert each set of attributes accordingly into the known character with the highest similarity.
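The matching step above can be sketched in a few lines of Python. In this toy stand-in for real photosensor attributes, each "character" is a tiny binary bitmap, and an unknown glyph is assigned the database label with the fewest differing pixels (the bitmaps and labels are purely illustrative):

```python
# Toy character database: 3x3 binary bitmaps (illustrative only)
TEMPLATES = {
    "I": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "L": (1, 0, 0,
          1, 0, 0,
          1, 1, 1),
    "T": (1, 1, 1,
          0, 1, 0,
          0, 1, 0),
}

def match_character(glyph):
    """Return the template label with the fewest differing pixels."""
    def distance(template):
        return sum(a != b for a, b in zip(glyph, template))
    return min(TEMPLATES, key=lambda label: distance(TEMPLATES[label]))

# A slightly noisy "L" (one pixel flipped) still matches correctly
noisy_l = (1, 0, 0,
           1, 0, 1,
           1, 1, 1)
print(match_character(noisy_l))  # L
```

Real engines compare far richer attributes than raw pixels, but the weakness discussed next is already visible here: the database must anticipate every shape it will ever see.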

While the traditional approach appears effective the majority of the time, it is vulnerable to its inherent rule-based restrictions. One crucial intermediate step of OCR is to extract single letters or symbols from a block of text. This extraction requires templates or rules (e.g., preset font sizes/styles) in order to be highly accurate. Imposing more and more rules to increase accuracy creates a dilemma: the OCR overfits and is correct only on specific styles of writing. Any inconsistency in lighting during the scanning process also leads to errors when the OCR is completely rule-based.

In addition, rule-based attribute comparison also falls short when dealing with handwriting. Computer-generated fonts are mostly fixed, with attributes that are obvious and easy to cross-compare; handwritten characters are the exact opposite, with unlimited variations, and are therefore much harder to classify. Since a handwritten character comes out slightly different every time, it is also impossible to include every variation in the database. OCR therefore requires more sophisticated algorithms than naive attribute matching.

Finally, the barrier of multiple languages also exists in the traditional approach. Numerous languages adopt similar or even identical symbols; if we stored all the symbols in one database, attribute matching alone could not tell two of them apart, which ultimately limits the traditional approach to one language per model.

Thankfully, in the recent deep learning era, brought about by rapidly growing hardware computation capabilities, newer OCRs have incorporated learning models both in the process of extracting text and in the phase of interpreting it.

### Deep-learning Based OCR Engines

Deep learning, a major branch of the machine learning realm, has gained great popularity with the help of numerous renowned scientists pushing it to the forefront. In traditional engineering, our goal is to design a system/function that generates an output from a given input; deep learning, on the other hand, relies on the inputs and outputs to find an intermediate relationship that can be extended to new, unseen data through the so-called neural network.

A neural network, or multi-layer perceptron, mimics the way human brains learn. Each node in the network, called a neuron, behaves like a biological neuron in that it receives information to "activate". Sets of neurons form layers, and multiple layers stack up into a network, which uses the information to generate a prediction. The prediction can take many forms, from a class label in classification problems to the bounding boxes of items in object detection tasks, and such networks have achieved state-of-the-art results compared to previous literature. In the task of OCR, two types of output, along with two families of networks, are heavily applied.

• Convolutional Neural Networks (CNNs) – CNNs are one of the most dominant families of networks used today, particularly in the realm of computer vision. They comprise multiple convolutional kernels that slide over the image to extract features. Accompanied by traditional network layers at the end, CNNs are very successful at retrieving features from a given image to perform predictions. This process transfers directly to finding bounding boxes and detecting attributes of characters for further classification in the OCR process.
• Long Short-Term Memories (LSTMs) – LSTMs are a family of networks applied mainly to sequence inputs. The intuition is simple: for any sequential data (e.g., weather, stocks), new results may depend heavily on previous ones, so it is beneficial to constantly feed previous results forward as part of the input features when making new predictions. In the case of OCR, previously detected letters can be of great assistance in predicting the next, as a set of characters should usually make sense when put together (e.g., the English letter "g" is more likely to come after "do" than the number "9", despite their similar attributes).
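As a rough illustration of the CNN half of this picture, the sketch below slides a single 2×2 kernel over a tiny binary "image" and records one feature value per position — a minimal, dependency-free stand-in for what a convolutional layer does at scale (the image and kernel values are made up for illustration):

```python
def convolve2d(image, kernel):
    """Slide a kernel over an image; return the feature map (valid padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the window by the kernel and sum
            acc = sum(image[i + di][j + dj] * kernel[di][dj]
                      for di in range(kh) for dj in range(kw))
            row.append(acc)
        feature_map.append(row)
    return feature_map

# A 3x3 "image" containing a vertical stroke, probed by a 2x2 kernel
image = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]
kernel = [[1, 1],
          [1, 1]]
print(convolve2d(image, kernel))  # [[2, 2], [2, 2]]
```

A real CNN learns the kernel values from data and stacks many such layers, but the sliding-window mechanism is exactly this.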

Besides the main OCR tasks that incorporate deep learning, many pre-processing stages that replace rule-based approaches have also benefited from thriving neural network technologies:

• Denoising – When a document is scanned improperly, rule-based methods easily fall short. A recent approach adopted by OCR technologies is to apply a Generative Adversarial Network (GAN) to "denoise" the input. A GAN comprises two networks, a generator and a discriminator. The generator constantly generates new inputs for the discriminator to distinguish from actual inputs, which forces the generator to keep improving at creating realistic outputs. In this case, the GAN is trained on pairs of noised and denoised documents, and the goal of the generator is to produce a denoised document as close to the ground truth as possible. During the application phase, a well-trained GAN can then be run on every input to clean up any poorly scanned document.
• Document Identification – OCR tasks, particularly on PDFs, are often used to extract data from forms and documents. Knowing the type of document the OCR engine is currently processing can therefore significantly increase the accuracy of data extraction. Recent work has incorporated a Siamese network, or comparison network, to compare documents with pre-existing document formats, allowing the OCR engine to perform document classification beforehand. This extra step has been empirically shown to improve accuracy in text retrieval.

In summary, the progression of OCR has benefited greatly from the exponential growth of hardware capabilities and deep learning. PDF OCR now achieves astonishing accuracy across numerous applications.


## Applications of PDF OCR Software

The main goal of OCR is to retrieve data from unstructured formats, whether that be numerical figures or actual text. If the retrieval is successful and highly accurate, programs can use OCR to take over labour-intensive tasks of recognizing and interpreting text, specifically for numerical and contextual analysis.

### Numerical Data Analysis

When PDFs contain numerical data, OCR helps extract it for statistical analysis. Specifically, OCR combined with table or key-value pair (KVP) extraction can be applied to find meaningful numbers in different regions of a given text. We can then adopt statistical or even machine learning methods (e.g., KNN, K-Means, Linear/Logistic Regression) to build models for various applications.
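As a minimal illustration, suppose the OCR output of a batch of invoices contains key-value pairs like `Total: 120.50`. The sketch below pulls the numbers out with a regular expression and summarizes them (the field name and amounts are invented for illustration):

```python
import re
from statistics import mean

# Hypothetical OCR output from three scanned invoices
ocr_text = """
Invoice 001  Total: 120.50
Invoice 002  Total: 89.99
Invoice 003  Total: 210.00
"""

def extract_totals(text):
    """Extract every 'Total: <number>' key-value pair as a float."""
    return [float(value) for value in re.findall(r'Total:\s*([\d.]+)', text)]

totals = extract_totals(ocr_text)
print(round(sum(totals), 2), round(mean(totals), 2))  # 420.49 140.16
```

Once the values are structured like this, any of the statistical or machine learning methods mentioned above can be applied to them.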

### Text Data Interpretation

On the other hand, text data processing may require more stages of computation, with the ultimate goal for programs to understand the “meanings” behind words. Such a process of interpreting text data into its semantic meanings is referred to as Natural Language Processing (NLP).
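A first, very crude step in that direction can be as simple as tokenizing the OCR output and counting word frequencies — far from real NLP, but it shows the shape of the pipeline (the sample sentence is made up):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, tokenize on letter runs, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

freqs = word_frequencies("The invoice total matches the purchase order total.")
print(freqs["the"], freqs["total"])  # 2 2
```

Production NLP systems go much further (embeddings, transformers, named-entity recognition), but they all begin with machine-encoded text of exactly the kind OCR provides.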

## Benefits of PDF OCR

PDF OCR serves numerous purposes at an application level. The following sections describe some example use cases, from personal use all the way up to corporate deployments.

### Personal Use Cases

PDF OCR brings immense convenience to tedious tasks such as scanning IDs and personal financing.

Personal IDs often need to be converted into PDF format and submitted with various applications. These identification documents contain information such as date of birth and ID numbers that must be typed in repeatedly for different purposes; a highly accurate PDF OCR that finds the matching fields and corresponding values on the ID would therefore be of great help with these trivial manual tasks. The only labour required would be double-checking for inconsistencies.
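As a sketch of what "finding matching fields and values" means in practice, the snippet below scans hypothetical OCR output from an ID card for a few labelled fields (the field labels, formats, and values are all invented for illustration):

```python
import re

# Hypothetical OCR output from a scanned ID card
ocr_text = "Name: Jane Doe\nDate of Birth: 1990-04-12\nID Number: A1234567"

# One pattern per field we want to capture (labels/formats are assumptions)
FIELD_PATTERNS = {
    "name": r"Name:\s*(.+)",
    "date_of_birth": r"Date of Birth:\s*([\d-]+)",
    "id_number": r"ID Number:\s*(\w+)",
}

def extract_fields(text):
    """Return a dict of field name -> value for every pattern that matches."""
    fields = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            fields[field] = match.group(1).strip()
    return fields

print(extract_fields(ocr_text))
```

Commercial engines locate fields by layout and learned models rather than fixed patterns, but the output — structured fields a human only has to double-check — is the same.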

Personal financing is another process that requires plenty of manual labour. Although Excel and spreadsheet tools have already eased tasks like personal budgeting, OCR and data extraction on PDF invoices can expedite the process further. The extracted data can be fed into spreadsheets automatically so the analysis described in previous sections can be carried out, and the time previously spent keying in data can go towards thinking of better financial plans.

### Corporate Use Cases

Both big corporations and smaller organizations have to deal with thousands of documents following similar formats, which is highly labour-intensive yet unproductive (i.e., the labour is spent on work that requires little thought). Automated document classification and survey collection/analysis are where OCR comes in handy.

OCR enables computers to convert scanned text into machine-encoded text. The contents of the converted text can then be used to classify documents, whether applications for different roles or forms waiting to be approved. A well-trained OCR makes minimal errors, whereas humans make them frequently due to inevitable fatigue. From a business perspective, labour expenditure may also be vastly reduced.

In terms of surveys and feedback, which organizations often rely on to improve their current products or plans, OCR also plays a vital role. Data can be quickly extracted and extensively evaluated for statistical analysis. With a well-designed pipeline, even handwritten responses can be extracted and analysed automatically.


## A Simple Tutorial

A PDF OCR can actually be programmed quite easily yourself. The following is a simple pipeline for performing OCR on PDFs.

### Conversion of PDF to Images

There are numerous libraries and APIs in multiple languages that support pretrained OCR models. However, most of them operate on images rather than directly on PDFs. Hence, to simplify the following steps, we can preprocess the PDFs into image formats before performing character recognition.

One of the most commonly used libraries for this is the pdf2image library for Python, which can be installed via the following command:

```
pip install pdf2image
```

Afterwards, one can import the library and use either of the following two lines of code to get the pages as images in PIL format:

```python
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

# Each call returns a list of PIL images, one per page of the PDF
images = convert_from_path('/home/belval/example.pdf')
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
```

For more information on the code, you can refer to the official documentation at https://pypi.org/project/pdf2image/.

### Image OCR

There are numerous APIs from big tech companies with highly accurate OCR engines. Since PDFs are usually densely packed with text, a very suitable choice is the Google Vision API, particularly its DOCUMENT_TEXT_DETECTION feature, as it is specifically designed for such purposes. It sends the image to an OCR engine that Google designed for dense text, including handwriting in various languages.

The Google Vision API is simple to set up; one may refer to the official quickstart at https://cloud.google.com/vision/docs/quickstart-client-libraries for the detailed procedure.

Afterwards, we can use the following code to perform the OCR:

```python
from google.cloud import vision


def detect_document(path):
    """Detects document features in an image."""
    import io
    client = vision.ImageAnnotatorClient()

    with io.open(path, 'rb') as image_file:
        content = image_file.read()

    image = vision.Image(content=content)

    response = client.document_text_detection(image=image)

    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            print('\nBlock confidence: {}\n'.format(block.confidence))

            for paragraph in block.paragraphs:
                print('Paragraph confidence: {}'.format(
                    paragraph.confidence))

                for word in paragraph.words:
                    word_text = ''.join([
                        symbol.text for symbol in word.symbols
                    ])
                    print('Word text: {} (confidence: {})'.format(
                        word_text, word.confidence))

                    for symbol in word.symbols:
                        print('\tSymbol: {} (confidence: {})'.format(
                            symbol.text, symbol.confidence))

    if response.error.message:
        raise Exception(response.error.message)
```

The Google Vision API also provides client libraries in multiple languages, such as Java and Go. More sample code for the API can be found here: https://cloud.google.com/vision

There are also OCR services/APIs from Amazon and Microsoft, and you can always use the PyTesseract library with a Tesseract model trained for your own specific purposes.


## Comparison

There are numerous PDF OCR products currently on the market. While some are free, fast, and can be used instantly online, others are more accurate, better-designed products for professional use. Here we describe a few options, along with their pros and cons.

### Online PDF OCRs

When using PDF OCR for quick personal conversions, free and fast may matter more than accuracy. Numerous online PDF OCR services serve these needs: one can simply upload a PDF document and have it turned into written text in a fast and convenient manner.

The main problem here, however, is quality control. While these free online OCR tools work well most of the time, they do not guarantee the best-quality output every time, unlike offline software that undergoes constant maintenance.

### Offline Software

Currently, several companies provide highly accurate PDF OCR services. Here we look at several PDF OCR options that specialize in different aspects, as well as some recent research prototypes that show promising results.

Note that multiple OCR services target tasks such as images in the wild; we skip those, as we are focusing on PDF document reading only.

• ABBYY – ABBYY FineReader PDF is an OCR product developed by ABBYY. The software has a friendly UI for PDF reading and text conversion. However, given its non-engineering nature (its target customers are non-tech specialists in other fields who need PDF OCR), it is harder to incorporate into other programs for further processing.
• Kofax – Similar to ABBYY, Kofax is a friendly PDF reader that requires a purchase. The price is fixed for individual usage, with discounts for large corporations. 24/7 assistance is also available in case of technical difficulties.
• Deep Reader – Deep Reader is a research work published at the ACCV Conference 2019. It incorporates multiple state-of-the-art network architectures to perform tasks such as document matching, text retrieval, and image denoising. Additional features such as table and key-value-pair extraction allow data to be retrieved and saved in an organized manner.
• Nanonets™ – Nanonets™ PDF OCR uses deep learning and is therefore completely template- and rule-independent. Not only can Nanonets work on specific types of PDFs, it can also be applied to any document type for text retrieval.

## Conclusion

In this article we walked through the basics of how OCR works and the timeline of its development, followed by a simple tutorial and use cases. We also presented a set of viable PDF OCR options, with their advantages and disadvantages, for further reference.