Make existing PDF searchable (OCR) via command line / script
36
votes
13
answers
53618
views
I am looking for an offline, scriptable tool that makes an existing PDF file searchable by running OCR on it, replaces the original non-searchable file with the searchable version, and can run unattended.
For example, www.pdfscannerapp.com does exactly what I need, but it is GUI-only and not scriptable.
I am aware that Evernote makes PDF files searchable, but they remain searchable only when within Evernote.
I am not looking for perfect OCR; even moderately acceptable OCR is fine. However, I would prefer a small utility to a bulky software package.
(I am aware of a similar but different question on Ask Different: https://apple.stackexchange.com/questions/72676/looking-for-software-to-scan-or-convert-to-searchable-and-signable-pdf. However, I don't need to sign or fill PDFs, and my requirement is that the solution be scriptable.)
EDIT:
1) Several utilities allow structured text extraction; however, for text to be extracted, it must already be there. I am mainly referring to PDFs that are wrapped bitmaps, as is the case with plain PDFs generated by scanners.
2) I am not necessarily looking for a free solution; I would be more than happy to pay for a good utility that does just what I need. What I do not want is a bulky application with a million features, OCR among them, whose cost is not justified if I buy it for the OCR functionality alone.
3) As stated above, I am not looking for perfect OCR, just moderately acceptable OCR. Unfortunately, in my experience, tesseract is well below that threshold. By "moderately acceptable" I mean OCR that can, say, process a utility bill so that at least the account number (customer number) is recognized correctly.
EDIT: By "scriptable" or "automatable" I mean able to be triggered automatically and to run unattended, without any human input.
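To illustrate what I mean by unattended operation, here is a minimal sketch of the wrapper I would put around such a tool. It assumes the OCR utility is a command that takes an input path and an output path; `make_searchable`, `process_folder`, and the `{in}`/`{out}` placeholder convention are hypothetical names of my own, not part of any existing product.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def make_searchable(pdf_path, ocr_cmd):
    """Run ocr_cmd on pdf_path and replace the original only on success.

    ocr_cmd is a list of arguments in which the placeholders "{in}" and
    "{out}" are substituted with the input and output paths (a hypothetical
    convention, not tied to any specific OCR tool).
    """
    pdf_path = Path(pdf_path)
    # Temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf", dir=pdf_path.parent)
    os.close(fd)
    try:
        cmd = [
            arg.replace("{in}", str(pdf_path)).replace("{out}", tmp_name)
            for arg in ocr_cmd
        ]
        # check=True raises on a non-zero exit; DEVNULL prevents any prompt for input.
        subprocess.run(cmd, check=True, stdin=subprocess.DEVNULL)
        if os.path.getsize(tmp_name) == 0:
            raise RuntimeError("OCR command produced an empty file")
        # Swap the searchable version in place of the original.
        os.replace(tmp_name, pdf_path)
    finally:
        # On failure the temp file is still present; remove it and keep the original.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)


def process_folder(folder, ocr_cmd):
    """Process every PDF in folder; return the list of files that failed."""
    failed = []
    # sorted() snapshots the listing before any temp files are created.
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            make_searchable(pdf, ocr_cmd)
        except (subprocess.CalledProcessError, OSError, RuntimeError):
            failed.append(pdf)
    return failed
```

A scheduled job or folder action could then call `process_folder` on the scanner's output folder, passing whatever OCR command turns out to be good enough. The original file is only overwritten when the command exits successfully and produces a non-empty output.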
Asked by magma
(958 rep)
Jan 1, 2013, 05:20 PM
Last activity: Feb 25, 2022, 09:48 AM