Building a document processing pipeline (Part 2): OCR and working with HEIF files
This post is part of a series of articles where I write about the technicalities of building a document processing pipeline. Read the Preface for context.
The document processing workflow starts with my girlfriend uploading images of the letters to the website. I limit each batch to 10 images. The first time she used the website, the backend failed. When I SSH’ed into my server to check the logs, I found that the images were .heif files (she uses an iPhone, and I’m fairly ignorant of the Apple ecosystem and photography). I had encountered .heif files before but never cared what they were for. Now I had to deal with them, and they turned out to be kind of interesting.
Images go through OCR
To work with the text content on the images, Optical Character Recognition (OCR) must be performed. I decided to go with Azure AI Document Intelligence for this task. Why?
First, it’s quick to set up and performs OCR fast. I was thinking of doing the OCR locally, either with Vision Language Models (VLMs) or with a battle-tested OCR system. But VLMs are not reliable enough for my use case, and the VLMs that do perform OCR well are big models my server can’t handle since it only has a CPU. Another option was PaddleOCR, but I dropped it when I learned of Azure’s service.
Second, it has a free tier! It allows processing up to 500 images per month, which I don’t expect my girlfriend to exhaust.
Now, Document Intelligence has a 4 MB maximum document size, which I only learned when my girlfriend started using the website and the backend failed because the images exceeded the limit.
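That limit is easy to guard against before calling the service. A minimal sketch (the constant and helper names here are my own, not from the actual pipeline):

```python
# Document Intelligence rejects documents larger than 4 MB at this tier
MAX_DOC_INT_BYTES = 4 * 1024 * 1024


def needs_compression(file_content: bytes, limit: int = MAX_DOC_INT_BYTES) -> bool:
    """Return True when an upload exceeds the service's document size limit."""
    return len(file_content) > limit
```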
So the images need to be compressed
Document Intelligence supports processing HEIF images, but for some reason I can no longer remember, I had to convert the HEIF images to JPEG. I check whether file.content_type == "image/heic" and do the conversion:
```python
import io

import pyheif
from PIL import Image


def convert_heic_to_jpg(file_content: bytes) -> bytes:
    # Decode the HEIC payload into raw pixels
    heif_file = pyheif.read(io.BytesIO(file_content))
    image = Image.frombytes(
        heif_file.mode,
        heif_file.size,
        heif_file.data,
        "raw",
        heif_file.mode,
        heif_file.stride,
    )
    # Re-encode the pixels as JPEG
    output_buffer = io.BytesIO()
    image.save(output_buffer, format="JPEG")
    return output_buffer.getvalue()
```
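The content_type header comes from the client, so it can lie. As a defensive fallback (my own addition, not part of the original pipeline), HEIF can also be sniffed from the file’s leading bytes: ISO-BMFF files carry a `ftyp` box at offset 4, followed by a brand such as `heic`. The brand list below is illustrative, not exhaustive:

```python
def looks_like_heif(data: bytes) -> bool:
    """Sniff the ISO-BMFF 'ftyp' box instead of trusting the Content-Type header."""
    if len(data) < 12 or data[4:8] != b"ftyp":
        return False
    # Common HEIF brands; a partial list for illustration
    return data[8:12] in {b"heic", b"heix", b"mif1", b"msf1", b"heif"}
```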
After the conversion, the compression follows:
```python
def compress_image(file_content: bytes, target_file_size_kb: int = 3500) -> bytes:
    with Image.open(io.BytesIO(file_content)) as image:
        quality = 95
        output_buffer = io.BytesIO()
        image.save(output_buffer, format="JPEG", optimize=True, quality=quality)
        file_size_kb = len(output_buffer.getvalue()) // 1024
        # Re-encode at progressively lower quality until the target size is met
        while file_size_kb > target_file_size_kb and quality > 10:
            quality -= 5
            output_buffer = io.BytesIO()
            image.save(output_buffer, format="JPEG", optimize=True, quality=quality)
            file_size_kb = len(output_buffer.getvalue()) // 1024
    return output_buffer.getvalue()
```
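The two helpers then chain together in the upload handler, roughly like this (a sketch; `prepare_for_ocr` is my own name, and the converter and compressor are injected as callables so the policy is easy to test in isolation):

```python
from typing import Callable


def prepare_for_ocr(
    data: bytes,
    content_type: str,
    convert: Callable[[bytes], bytes],
    compress: Callable[[bytes], bytes],
    limit_kb: int = 3500,
) -> bytes:
    # HEIC uploads are converted to JPEG first
    if content_type == "image/heic":
        data = convert(data)
    # Only compress when the payload is over the target size
    if len(data) // 1024 > limit_kb:
        data = compress(data)
    return data
```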
Now the image is ready for Document Intelligence.
Do the OCR-ing
As of this writing, Document Intelligence SDK v4.0 (GA) is the latest SDK version. When I wrote the code, SDK v3.1 was the latest, so the client library is azure-ai-formrecognizer.
```python
import os
from pathlib import Path

from azure.ai.formrecognizer import AnalyzeResult, DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv

load_dotenv()

azure_doc_int_endpoint = "https://<some-app-name>.cognitiveservices.azure.com/"
azure_doc_int_key = os.getenv("AZURE_DOC_INT_KEY")

document_analysis_client = DocumentAnalysisClient(
    endpoint=azure_doc_int_endpoint, credential=AzureKeyCredential(azure_doc_int_key)
)


def extract_text_from_image(img_path: Path | str) -> str:
    with open(img_path, "rb") as f:
        # "prebuilt-read" does plain OCR, without key-value extraction
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-read",
            document=f,
        )
    result: AnalyzeResult = poller.result()
    return result.content
```
The processing usually takes under 5 seconds.
The extracted text is not well formatted. The letters are in English, but the output contains non-English words and misspelled English words. Because of these issues, the next step in the document processing pipeline is OCR text correction.
About HEIF
According to the standard[1], High Efficiency Image File Format (HEIF) enables encapsulation of images and image sequences, as well as their associated metadata, into a container file. Think of it as the image counterpart of the MKV multimedia container (audio, video, and subtitles!). HEIF is regarded as better than the more familiar JPEG. Apple even presented the technical advantages of HEIF in this 2017 video, and reading the technical document[1] was difficult but educational.
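The container structure is easy to see in code. Here is a minimal sketch of my own (it only handles 32-bit box sizes) that walks the top-level ISO-BMFF boxes of a HEIF-family file:

```python
import struct


def top_level_boxes(data: bytes) -> list[str]:
    """List the top-level ISO-BMFF box types (e.g. ftyp, meta, mdat)."""
    boxes, offset = [], 0
    while offset + 8 <= len(data):
        # Each box starts with a 4-byte big-endian size and a 4-byte type
        size, box_type = struct.unpack_from(">I4s", data, offset)
        if size < 8:
            break  # 64-bit or malformed sizes are out of scope for this sketch
        boxes.append(box_type.decode("ascii", "replace"))
        offset += size
    return boxes
```

A real HEIC photo typically shows `ftyp`, `meta`, and `mdat` at the top level, with the image items and their metadata living inside `meta` and `mdat`.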
Another thing I learned is that among web browsers, only Safari and Safari on iOS support HEIF, because of licensing. When I searched for “compress heic python”, this is the first result from Stack Overflow, and the top comment discusses a bit of the licensing and HEIF support in Pillow.
[1] Hannuksela, M., Aksu, E., Malamal Vadakital, V., Lainema, J. (2015). Overview of the High Efficiency Image File Format. JCT-VC. Download document