Is there a way to OCR incoming faxed PDFs in order to make them searchable?

  • Question
  • Updated 2 weeks ago
  • Answered

Aaron


Posted 2 weeks ago


John Wang, Official Rep

You can do this by retrieving the PDF and running it through an OCR API or the open source Tesseract package.

One API that can be used is the Google Vision API:

https://cloud.google.com/vision/docs/pdf

The Tesseract open source OCR engine is generally considered one of the best open source solutions, if not the best:

https://github.com/tesseract-ocr/tesseract
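A minimal sketch of the Tesseract route in Python, assuming the fax PDF has already been downloaded to disk. The libraries used here (pdf2image, pytesseract, pypdf) are my own choice and are not named in this thread; pdf2image needs Poppler installed and pytesseract needs the tesseract binary. The function names, the 300 dpi setting, and the ".searchable.pdf" naming are illustrative assumptions.

```python
import io
from pathlib import Path


def searchable_path(fax_pdf: Path) -> Path:
    """Name the OCR'd copy next to the original: fax.pdf -> fax.searchable.pdf."""
    return fax_pdf.with_suffix(".searchable.pdf")


def make_searchable(fax_pdf: Path, dpi: int = 300, lang: str = "eng") -> Path:
    """Rasterize each page, OCR it, and write a PDF that has a text layer."""
    # Third-party imports are kept local so searchable_path() works without them.
    from pdf2image import convert_from_path  # requires Poppler
    import pytesseract                        # requires the tesseract binary
    from pypdf import PdfWriter

    writer = PdfWriter()
    for page_image in convert_from_path(str(fax_pdf), dpi=dpi):
        # Tesseract returns a one-page PDF (as bytes) with the recognized text
        # placed invisibly over the page image.
        page_pdf = pytesseract.image_to_pdf_or_hocr(
            page_image, extension="pdf", lang=lang
        )
        writer.append(io.BytesIO(page_pdf))

    out = searchable_path(fax_pdf)
    with open(out, "wb") as fh:
        writer.write(fh)
    return out
```

Calling make_searchable(Path("inbox/fax.pdf")) would write inbox/fax.searchable.pdf, leaving the original file untouched. OCR accuracy on faxes depends heavily on scan quality, so treat the output as a search aid rather than an exact transcript.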

Tyler Long, Official Rep

Just want to mention that OCR is not the only way to extract text from a PDF. If the PDF's content is text rather than an image, you can use a library to extract the text directly. Search GitHub for "pdf to text".
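As one hedged example of the library route, here is a short Python sketch using pypdf. That library is my own pick and is not named in this thread; any "pdf to text" library would do. The helper names are hypothetical.

```python
import re
from pathlib import Path


def collapse_whitespace(text: str) -> str:
    """Squash the ragged spacing that PDF text extraction tends to produce."""
    return re.sub(r"\s+", " ", text).strip()


def extract_text_layer(pdf_path: Path) -> list[str]:
    """Return one cleaned string per page; image-only pages come back as ''."""
    # Third-party import kept local so collapse_whitespace() works without it.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    return [collapse_whitespace(page.extract_text() or "") for page in reader.pages]
```

If every page comes back empty, the PDF most likely contains only images and this approach will not help.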


John Wang, Official Rep

The PDF content depends on what generates the PDF. If you use a program like MS Word to generate a PDF, then the PDF can contain text content. A fax transmission, however, will typically result in a PDF that contains only an image and therefore requires OCR, because the fax process transmits a scanned picture of the page rather than its text.
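If you receive a mix of both kinds of PDF, one way to tell them apart is to look at how much text each page yields before deciding to OCR it. This is a rough heuristic sketch in Python; the function name and the 20-character threshold are assumptions of mine, and the page texts are expected to come from whatever PDF text extraction library you use.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return zero-based indices of pages whose text layer is empty or nearly empty."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


# Example: page 0 came from a word processor, page 1 is a scanned fax image.
texts = [
    "Dear customer, your order has shipped and should arrive Friday.",
    "  \n",
]
print(pages_needing_ocr(texts))  # [1]
```

The threshold is there because scanned pages sometimes carry a few stray characters of metadata or stamp text, so checking for a strictly empty string can miss them.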