tesseract: is it possible to change font output in OCRed pdf?
5 votes, 1 answer, 2030 views
Following up on https://unix.stackexchange.com/questions/301318/how-to-ocr-a-pdf-file-and-get-the-text-stored-within-pdf/301319#301319, I have successfully produced OCRed PDF pages.
In Evince, however, the letters are not shown: I cannot see the characters, but I can select them, copy them, and paste them elsewhere successfully. This does not seem to be a bug in Evince: https://bugzilla.redhat.com/show_bug.cgi?id=1364201
When initiating an OCR of a pdf page with pdfsandwich, tesseract produces a page that
> contains a font which doesn't have any
> usable glyphs (they named it GlyphLessFont). It has only .notdef and
> .null replacements (the squares). Evince uses the .notdef glyph if there
> is no glyph for the character. The reason that Okular highlight the text
> is because it does it in the image not as a regular text as evince does.
pdftotext recognises the characters.
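To confirm that a given PDF really carries this font, one option is to list its fonts with `pdffonts` (from the same Poppler tools as `pdftotext`) and look for `GlyphLessFont`. The sketch below is only an illustration under some assumptions: `pdffonts` is on the PATH, its output starts with a header line and a dashed separator line, and the helper names `fonts_from_pdffonts` and `uses_glyphless_font` are made up for this example.

```python
import subprocess


def fonts_from_pdffonts(output):
    """Return the font names found in text printed by `pdffonts`.

    Assumes the first two lines are the column header and a dashed
    separator, and that the font name is the first column of each row.
    """
    names = []
    for line in output.splitlines()[2:]:
        if line.strip():
            names.append(line.split()[0])
    return names


def uses_glyphless_font(pdf_path):
    """Run `pdffonts` on pdf_path and report whether GlyphLessFont is listed."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    # A substring match also catches names carrying a subset prefix.
    return any("GlyphLessFont" in name
               for name in fonts_from_pdffonts(result.stdout))
```

Calling `uses_glyphless_font("page_ocr.pdf")` (a hypothetical file name) would then tell whether the invisible text layer comes from this font rather than from something else in the file.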
Now, the question is: can tesseract be told to use a different font?
Asked by ingli
(2029 rep)
Aug 27, 2016, 08:14 AM
Last activity: Mar 22, 2017, 09:39 PM