
Make existing PDF searchable (OCR) via command line / script

36 votes
13 answers
53618 views
I am looking for an offline, scriptable tool that makes an existing PDF file searchable by running OCR on it, replaces the original non-searchable file with the searchable version, and can run unattended.

E.g., www.pdfscannerapp.com does exactly what I need, but it is GUI-only and not scriptable. I am aware that Evernote makes PDF files searchable, but they remain searchable only within Evernote.

I am not looking for perfect OCR; even moderately acceptable OCR is fine. I would prefer a small utility rather than a bulky software package.

(I am aware of a similar, but different, question on AD: https://apple.stackexchange.com/questions/72676/looking-for-software-to-scan-or-convert-to-searchable-and-signable-pdf - however, I don't need to sign or fill PDFs, and my requirement is that the solution is scriptable.)

EDIT:

1) Several utilities allow structured text extraction; however, for text to be extracted, it must already be there. I am mainly referring to PDFs that are wrapped bitmaps, as is the case with plain PDFs generated by scanners.

2) I am not necessarily looking for a free solution, and I would be more than happy to pay for a good utility that does just what I need. I am not looking for bulky applications with a million features that happen to include OCR but whose cost is not justified by the OCR functionality alone.

3) As stated above, I am not looking for perfect OCR, just moderately acceptable OCR. Unfortunately, in my experience, tesseract is well below that threshold. I define "moderately acceptable" as OCR that can, say, process a utility bill so that at least the account number (customer number) is recognized correctly.

EDIT: By "scriptable" or "automatable" I mean able to be triggered automatically and to run unattended, with no human input whatsoever.
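
For concreteness, the unattended replace-in-place behavior described above could be sketched roughly as follows. This only illustrates the desired workflow and is not a recommendation of any tool: `ocr_in_place`, `ocr_folder` and the `ocr_command` argument are hypothetical names, and `ocr_command` stands in for whichever command-line OCR tool is used, assumed here to accept an input path and an output path as its last two arguments.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def ocr_in_place(pdf_path, ocr_command):
    """Run an OCR tool on pdf_path and replace the file with the result.

    ocr_command is a hypothetical placeholder: an argument list for any
    command-line OCR tool that is assumed to take an input path and an
    output path as its last two arguments.
    """
    pdf_path = Path(pdf_path)
    # Write to a temporary file in the same directory so that the final
    # os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf", dir=pdf_path.parent)
    os.close(fd)
    try:
        subprocess.run(
            [*ocr_command, str(pdf_path), tmp_name],
            check=True,                # raise if the tool reports failure
            stdin=subprocess.DEVNULL,  # never wait for human input
        )
        if os.path.getsize(tmp_name) == 0:
            raise RuntimeError(f"OCR tool produced no output for {pdf_path}")
        # The original is overwritten only after the tool has succeeded.
        os.replace(tmp_name, pdf_path)
    except BaseException:
        # Leave the original untouched and clean up the partial output.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def ocr_folder(folder, ocr_command):
    """Process every PDF in folder; return the paths that failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            ocr_in_place(pdf, ocr_command)
        except (subprocess.CalledProcessError, OSError, RuntimeError):
            failed.append(pdf)
    return failed
```

A function like `ocr_folder` could then be triggered by a scheduler or a folder watcher; a file is replaced only when the OCR command exits successfully, so a failed run leaves the original scan in place.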
Asked by magma (958 rep)
Jan 1, 2013, 05:20 PM
Last activity: Feb 25, 2022, 09:48 AM