 Improve newspaper digitalization efficacy with a generic document segmentation tool using Amazon Textract

We are living in a digital age. Information that used to be spread by printouts is now disseminated at unprecedented speeds through digital formats. As new types of media emerge, a growing number of archives and libraries are building digital repositories with new technologies. Digitization preserves materials by creating an accessible surrogate, while also enabling easier storage, indexing, and faster search.

In this post, we demonstrate how to efficiently digitize newspaper articles using Amazon Textract with a document segmentation module. Amazon Textract is a fully managed machine learning (ML) service that automatically extracts text and data from scanned documents. For this use case, we show how a customized segmentation tool further augments Amazon Textract so that it can recognize small, old German fonts despite low image quality. Our proposed solution expands the capabilities of Amazon Textract in the following ways:

  • Provides additional support for Amazon Textract to handle documents with complex structures and style (such as columnar texts with varying width, text blocks floating around images, texts nested within images and tables, and fonts with varying size and style)
  • Overcomes the 10 MB image (such as JPEG and PNG format) size limit of Amazon Textract for large documents

Our generic document segmentation module intelligently segments the document with awareness of its layout: we can crop a large image file into smaller pieces that are consistent with the layout. Each smaller image is then under the 10 MB limit, while the original resolution is maintained for optimal OCR results. Another benefit of this segmentation tool is that the texts extracted from the detected segments are correctly ordered and grouped following human reading habits. Raw Amazon Textract OCR results for a newspaper image can’t be automatically grouped into meaningful sentences without knowing which segment (or article) each word belongs to. In fact, segmenting a page into regions with different contexts is a common practice in existing pipelines for historical document digitization.
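To make the grouping problem concrete, the following is a minimal sketch (not the implementation used in this post) of how words could be assigned to layout segments and then read out in order. It assumes word entries shaped like the WORD blocks in an Amazon Textract response, with a normalized Geometry.BoundingBox. The Segment class, the make_word and group_words helpers, and the column-then-top ordering heuristic are hypothetical names and choices introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Segment:
    """Hypothetical layout region in normalized page coordinates (0..1)."""
    name: str
    left: float
    top: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


def make_word(text: str, left: float, top: float,
              width: float = 0.05, height: float = 0.02) -> dict:
    """Build a dict shaped like a Textract WORD block (illustration only)."""
    return {
        "BlockType": "WORD",
        "Text": text,
        "Geometry": {"BoundingBox": {"Left": left, "Top": top,
                                     "Width": width, "Height": height}},
    }


def group_words(blocks: List[dict], segments: List[Segment]) -> Dict[str, str]:
    """Assign each WORD block to the segment containing its center.

    Assumption: segments are read column by column (left to right), then
    top to bottom. Within a segment, words are sorted by a crudely rounded
    line position, then by horizontal position.
    """
    ordered = sorted(segments, key=lambda s: (round(s.left, 1), s.top))
    buckets = {seg.name: [] for seg in ordered}
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        x = box["Left"] + box["Width"] / 2
        y = box["Top"] + box["Height"] / 2
        for seg in ordered:
            if seg.contains(x, y):
                buckets[seg.name].append((round(y, 2), x, block["Text"]))
                break
    return {name: " ".join(text for _, _, text in sorted(words))
            for name, words in buckets.items()}


# Two side-by-side articles whose words arrive interleaved row by row.
segments = [
    Segment("right_article", 0.5, 0.0, 0.5, 1.0),
    Segment("left_article", 0.0, 0.0, 0.5, 1.0),
]
blocks = [
    make_word("Der", 0.05, 0.10), make_word("Die", 0.55, 0.10),
    make_word("Zug", 0.15, 0.10), make_word("Stadt", 0.65, 0.10),
    make_word("kommt", 0.05, 0.20), make_word("ruht", 0.55, 0.20),
]
result = group_words(blocks, segments)
print(result)
# {'left_article': 'Der Zug kommt', 'right_article': 'Die Stadt ruht'}
```

Without the segment boundaries, joining the same words in their raw order would yield "Der Die Zug Stadt kommt ruht", which mixes the two articles. A real newspaper page would need a more robust ordering rule than this two-column heuristic.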

In the following sections, we show the process of developing a Fully Convolutional Network (FCN) based document segmentation engine with Amazon SageMaker and Amazon SageMaker Ground Truth. When applied to test newspaper images, the segmentation model was able to distinguish between background, pictures, headlines, and different articles. We compared the resulting word count with that of out-of-the-box Amazon Textract as a proxy for word recall. For our specific use case (old, low-quality images of newspapers with small, old-styled fonts in a complex layout), our solution consistently picked up more words after cropping the image and s


