OCR on PDFs in OS X with free, open source tools

17 votes
2 answers
7816 views
                          After reading these blog posts:

* Linux, OCR and PDF - Problem solved 
* Creating a searchable PDF with opensource tools ghostscript, hocr2pdf and tesseract-ocr 
* Using Tesseract OCR with PDF scans 

and going through the snippet below (from this  gist) for Linux, I think I found a method to OCR a multi-page PDF and get a PDF in the output that could also work in OS X. Most of the dependencies are available in homebrew (brew install tesseract and brew install imagemagick), except one, hocr2pdf.

I haven't been able to find a port of it for OS X. Is there one available? If not, how can one OCR a multi-page PDF and get the results back again in a multi-page PDF in OS X, using free, open source tools?

    #!/bin/bash
     
    # This is a script to transform a PDF containing a scanned book into a searchable PDF.
    # Based on previous script and many good tips by Konrad Voelkel:
    # http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/ 
    # http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/ 
    # Depends on convert (ImageMagick), pdftk and hocr2pdf (ExactImage).
    # $ sudo apt-get install imagemagick pdftk exactimage
    # You also need at least one OCR software which can be either tesseract or cuneiform.
    # $ sudo apt-get install tesseract-ocr
    # $ sudo apt-get install cuneiform
    # To install languages into tesseract do (e.g. for Portuguese):
    # $ sudo apt-get install tesseract-ocr-por
     
    echo "usage: ./pdfocr.sh document.pdf ocr-sfw split lang author title"
    # where ocr-sfw is either tesseract or cuneiform
    # split is either 0 (already single-paged) or 1 (2 book-pages per pdf-page)
    # lang is a language as in "tesseract --list-langs" or "cuneiform -l".
    # and author, title are used for the PDF metadata.
    #
    # usage example:
    # ./pdfocr.sh SomeFile.pdf tesseract 1 por "Some Author" "Some Title"
    pdftk "$1" burst dont_ask
    for f in pg_*.pdf
    do
    if [ "1" == "$3" ]; then
    convert -normalize -density 300 -depth 8 -crop 50%x100% +repage $f "$f.png"
    else
    convert -normalize -density 300 -depth 8 $f "$f.png"
    fi
    done
    rm pg_*.pdf
     
    for f in pg_*.png
    do
    if [ "tesseract" == "$2" ]; then
    tesseract -l $4 -psm 1 $f $f hocr
    elif [ "cuneiform" == "$2" ]; then
    cuneiform -l $4 -f hocr -o "$f.html" $f
    else
    echo "$2 is not a valid OCR software."
    fi
    hocr2pdf -i $f -r 300 -s -o "$f.pdf"  in.info
    echo "InfoKey: Author" >> in.info
    echo "InfoValue: $5" >> in.info
    echo "InfoBegin" >> in.info
    echo "InfoKey: Title" >> in.info
    echo "InfoValue: $6" >> in.info
    echo "InfoBegin" >> in.info
    echo "InfoKey: Creator" >> in.info
    echo "InfoValue: PDF OCR scan script" >> in.info
    in_filename="${1%.*}"
    pdftk merged+data.pdf update_info_utf8 in.info output "$in_filename-ocr.pdf"
     
    rm -r doc_data.txt in.info merged* pg_*



                        
Asked by Josh (365 rep)
Apr 22, 2014, 03:17 PM
Last activity: May 28, 2019, 05:15 AM
OCR on PDFs in OS X with free, open source tools

Related Questions