
OCR on PDFs in OS X with free, open source tools

17 votes
2 answers
7816 views
After reading these blog posts:

* Linux, OCR and PDF - Problem solved
* Creating a searchable PDF with opensource tools ghostscript, hocr2pdf and tesseract-ocr
* Using Tesseract OCR with PDF scans

and going through the snippet below (from this gist) for Linux, I think I found a method to OCR a multi-page PDF and get a PDF as output that could also work in OS X. Most of the dependencies are available in Homebrew (brew install tesseract and brew install imagemagick), except one: hocr2pdf.

I haven't been able to find a port of it for OS X. Is there one available? If not, how can one OCR a multi-page PDF and get the results back as a multi-page PDF in OS X, using free, open source tools?

    #!/bin/bash
    # This is a script to transform a PDF containing a scanned book into a searchable PDF.
    # Based on previous script and many good tips by Konrad Voelkel:
    # http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/
    # http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
    # Depends on convert (ImageMagick), pdftk and hocr2pdf (ExactImage).
    # $ sudo apt-get install imagemagick pdftk exactimage
    # You also need at least one OCR software which can be either tesseract or cuneiform.
    # $ sudo apt-get install tesseract-ocr
    # $ sudo apt-get install cuneiform
    # To install languages into tesseract do (e.g. for Portuguese):
    # $ sudo apt-get install tesseract-ocr-por

    echo "usage: ./pdfocr.sh document.pdf ocr-sfw split lang author title"
    # where ocr-sfw is either tesseract or cuneiform
    # split is either 0 (already single-paged) or 1 (2 book-pages per pdf-page)
    # lang is a language as in "tesseract --list-langs" or "cuneiform -l".
    # and author, title are used for the PDF metadata.
    #
    # usage example:
    # ./pdfocr.sh SomeFile.pdf tesseract 1 por "Some Author" "Some Title"

    # Split the input into one PDF per page (pg_0001.pdf, ...).
    pdftk "$1" burst dont_ask

    # Render each page to a 300 dpi PNG, optionally cropping to half a page.
    for f in pg_*.pdf
    do
      if [ "1" == "$3" ]; then
        convert -normalize -density 300 -depth 8 -crop 50%x100% +repage $f "$f.png"
      else
        convert -normalize -density 300 -depth 8 $f "$f.png"
      fi
    done
    rm pg_*.pdf

    # OCR each page image to hOCR, then build a searchable one-page PDF from it.
    for f in pg_*.png
    do
      if [ "tesseract" == "$2" ]; then
        tesseract -l $4 -psm 1 $f $f hocr
      elif [ "cuneiform" == "$2" ]; then
        cuneiform -l $4 -f hocr -o "$f.html" $f
      else
        echo "$2 is not a valid OCR software."
      fi
      # ASSUMED: the text between "$f.pdf" and in.info is missing from this copy
      # of the script. The redirect, "done", the two pdftk lines and the first
      # InfoBegin line below are reconstructed from context, not from the source.
      hocr2pdf -i $f -r 300 -s -o "$f.pdf" < "$f.html"
    done

    pdftk pg_*.pdf cat output merged.pdf
    pdftk merged.pdf update_info_utf8 doc_data.txt output merged+data.pdf

    # Write the PDF metadata and apply it to the merged file.
    echo "InfoBegin" > in.info
    echo "InfoKey: Author" >> in.info
    echo "InfoValue: $5" >> in.info
    echo "InfoBegin" >> in.info
    echo "InfoKey: Title" >> in.info
    echo "InfoValue: $6" >> in.info
    echo "InfoBegin" >> in.info
    echo "InfoKey: Creator" >> in.info
    echo "InfoValue: PDF OCR scan script" >> in.info

    in_filename="${1%.*}"
    pdftk merged+data.pdf update_info_utf8 in.info output "$in_filename-ocr.pdf"

    # Clean up intermediate files.
    rm -r doc_data.txt in.info merged* pg_*
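To see which of the tools the script relies on are actually missing on a given machine, something like the following might help. It is only a minimal sketch: the tool list is taken from the script's "Depends on" comments, and missing_deps is a hypothetical helper name that is not part of the original script.

```shell
#!/bin/bash
# Print the name of every command in the argument list that is not on PATH.
# missing_deps is a hypothetical helper, not part of the original script.
missing_deps() {
  for tool in "$@"; do
    # command -v exits non-zero when the command cannot be found
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
  return 0
}

# Tool list assumed from the script's comments (tesseract chosen as the OCR engine).
missing=$(missing_deps convert pdftk tesseract hocr2pdf)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line
  echo "Missing tools:" $missing
else
  echo "All tools found."
fi
```

On a setup like the one described above, with tesseract and imagemagick installed through Homebrew, this would be expected to report at least hocr2pdf as missing.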
Asked by Josh (365 rep)
Apr 22, 2014, 03:17 PM
Last activity: May 28, 2019, 05:15 AM