Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

0 votes

0 answers

33 views

Relinking OCR data to downscaled images

I have a PDF consisting of scanned pages with OCR done by tesseract. I want to downscale the images (by around 4x) and retain the OCR. What would be an automatic way to relink the OCR data to the new coordinates on the pages?
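
One possible approach, sketched under the assumption that the OCR layer is available as hOCR (for example from `tesseract page.tif page hocr`): hOCR stores each element's position as `bbox x0 y0 x1 y1` in pixel units inside the `title` attribute, so downscaling the image by some factor means dividing those numbers by the same factor. The name `scale_hocr_bboxes` is hypothetical, rounding to integers is a simplification, and other pixel-valued properties in the markup would need the same treatment.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property found in hOCR title attributes.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr, factor):
    """Return hOCR markup with every bbox coordinate divided by `factor`."""

    def rescale(match):
        coords = [round(int(value) / factor) for value in match.groups()]
        return "bbox " + " ".join(str(c) for c in coords)

    return BBOX_RE.sub(rescale, hocr)


sample = "<span class='ocrx_word' title='bbox 400 800 1200 1000; x_wconf 95'>text</span>"
print(scale_hocr_bboxes(sample, 4))
# -> <span class='ocrx_word' title='bbox 100 200 300 250; x_wconf 95'>text</span>
```

The rescaled hOCR would then have to be combined with the downscaled images by a separate tool; that step is not shown here.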

---
Original question (*Sudden surge in the size of PDF generated by Ghostscript 9.55.0*):

I have a large PDF and am unsuccessfully trying to reduce its size. (The file is copyrighted material, so I cannot post it; I am doing this for personal use. It must be a common problem, though: I had the same experience earlier with other files from different sources.)
The file is a scan of a book with 1500 B&W A4 pages of text, no pictures at all. The individual pages were mogrify-ed into PNG images of equal height (around 1000px) and cleaned up via scantailor-advanced. Then each of the pages (now in TIFF) was tesseracted. The results were pdfunited into a 200MB file.

This is way too large for this kind of book. I would like to be able to shrink it to around 30MB, perhaps 50. (The total text size extracted by pdftotext is 9MB.)

Most of the PDF compression methods I found on StackExchange and other sites boil down to gs with varying parameters. On my machine they all behave in a very similar way. I start gs in the terminal and switch to a GUI file manager. The size of the output file grows slowly and steadily from 0 to around 15MB (no matter the settings), and then it gallops in the last split second, as if gs gives up and just dumps the input into the output verbatim. (I attributed this to memory shortage, but the program also exhibits similar behavior on a relatively small, 100-page, part of this file.)
If gs is not told to change the DPI (300), the output file becomes as large as the input was. If the DPI is changed to 72, the file becomes 70MB; this is still too much for such a loss in image quality.

Is there an explanation for this surge? Should I perhaps use some other toolchain on the raw scans, or a different optimization tool? pdfsizeopt is very slow and seems to give only a 10% reduction. tiff2pdf -j 50 saves 5% (which will be re-added during OCR).
                                

Dilettante (101 rep)

Jul 25, 2025, 07:17 PM • Last activity: Jul 26, 2025, 07:19 PM

1 votes

0 answers

74 views

Pdfsandwich does not work

arch-linux tesseract

I use Arch Linux. When running pdfsandwich, I get the following error:

    ERROR: Command "OMP_THREAD_LIMIT=1 tesseract /tmp/pdfsandwich_tmpae6a1b/pdfsandwich76f2d2.tif>/dev/null 2>&1 /tmp/pdfsandwich_tmpae6a1b/pdfsandwich6ff0e9  -l eng pdf " failed.
    Terminating pdfsandwich. All temporary files are kept.

after receiving the warning

    WARNING: The convert command is deprecated in IMv7, use "magick" instead of "convert" or "magick convert"

Does anyone know what the problem is? I do not know where to begin fixing it.

EDIT: When running

    tesseract /tmp/pdfsandwich_tmpae6a1b/pdfsandwich76f2d2.tif /tmp/pdfsandwich_tmpae6a1b/pdfsandwich6ff0e9  -l eng pdf

I get

    Error opening data file /usr/share/tessdata/eng.traineddata
    Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
    Failed loading language 'eng'
    Tesseract couldn't load any languages!
    Could not initialize tesseract.
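
The last error suggests that Tesseract looked in `/usr/share/tessdata` and found no `eng.traineddata` there. A minimal sketch for checking which directory, if any, holds the file. Only the first candidate directory comes from the error message; the others are assumptions about common layouts, and `find_tessdata_dir` is a hypothetical helper.

```python
from pathlib import Path

# /usr/share/tessdata comes from the error message; the others are guesses.
CANDIDATE_DIRS = [
    "/usr/share/tessdata",
    "/usr/share/tesseract-ocr/tessdata",
    "/usr/local/share/tessdata",
]


def find_tessdata_dir(lang, candidates=CANDIDATE_DIRS):
    """Return the first directory containing <lang>.traineddata, or None."""
    for directory in candidates:
        if (Path(directory) / f"{lang}.traineddata").is_file():
            return Path(directory)
    return None


found = find_tessdata_dir("eng")
if found is None:
    print("eng.traineddata not found; the language data may not be installed")
else:
    print(f"export TESSDATA_PREFIX={found}")
```

If nothing is found, the language data may simply be missing; on Arch the language files are usually shipped in separate packages (for example `tesseract-data-eng`), though that should be verified against the current repositories.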

fish_monster (111 rep)

Sep 18, 2024, 07:24 PM • Last activity: Sep 20, 2024, 05:46 PM

0 votes

0 answers

300 views

What happened to Tesseract's "Math / equation detection module"?

debian ocr tesseract

I was able to get Tesseract to run via a Python script on my Windows machine to turn non-searchable PDFs into searchable ones. When installing Tesseract on Windows, the installer asked me which languages I wanted, and I selected them; that was when I first learned about the math module. I am not sure how effective the math module was, but I could see that it was downloaded when I checked the languages.

Now I am trying to install Tesseract on Debian.

To install Tesseract I used the command:

    sudo apt install -y tesseract-ocr

Then, to ensure I had the math module, I would always follow that up with:

    sudo apt install tesseract-ocr-equ

And, I am pretty sure that would install the math module. I remember using that command successfully several times, including earlier this morning. However, now, when I use that code, I get the following messages: 

    Reading package lists... Done
    Building dependency tree... Done
    Reading state information... Done
    E: Unable to locate package tesseract-ocr-equ

Just to make sure I wasn't crazy, I looked up the language codes used by Tesseract [according to Debian.org](https://manpages.debian.org/testing/tesseract-ocr/tesseract.1.en.html#LANGUAGES_AND_SCRIPTS:~:text=Math%20/%20equation%20detection%20module), which says that "equ" belongs to the "Math / equation detection module"; admittedly, that page describes a different version. So, I tried the following command:

    sudo apt-get install -y tesseract-ocr-equ

Among the several lines of output that I got in response were the following:

    Note, selecting 'tesseract-ocr-uzb-cyrl' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-ell' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-eng' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-enm' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-epo' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-est' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-eus' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-que' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-uig' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-ukr' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-urd' for regex 'tesseract-ocr-[equ]'
    Note, selecting 'tesseract-ocr-uzb' for regex 'tesseract-ocr-[equ]'
    tesseract-ocr-eng is already the newest version (1:4.1.0-2).
    tesseract-ocr-eng set to manually installed.
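
The `for regex 'tesseract-ocr-[equ]'` notes suggest that apt treated the argument as a regular expression, in which `[equ]` is a character class matching a single `e`, `q`, or `u` rather than the literal string `equ`. A small sketch of that matching behaviour, using Python's `re` module as a stand-in for apt's matcher (an assumption; apt's own regex handling may differ in details). The name `tesseract-ocr-deu` is added here only as a non-matching example.

```python
import re

pattern = re.compile(r"tesseract-ocr-[equ]")

packages = [
    "tesseract-ocr-eng",
    "tesseract-ocr-que",
    "tesseract-ocr-uzb-cyrl",
    "tesseract-ocr-deu",  # not from the output above; shown as a non-match
]

# An unanchored search succeeds for any name with e, q, or u right after the prefix.
selected = [name for name in packages if pattern.search(name)]
print(selected)
# -> ['tesseract-ocr-eng', 'tesseract-ocr-que', 'tesseract-ocr-uzb-cyrl']
```

This would explain why every language package starting with `e`, `q`, or `u` was selected, while no package named exactly `tesseract-ocr-equ` was found.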

So, this made me wonder if there was a different math module for different languages, and the math module is automatically downloaded with the language you download. I just really remember using the command initially without any problem. That being said, I have had several head injuries, so my memory is not entirely reliable. It's just that if I turn out to have been mistaken here and I have not been using that code as I remember, this will be one of those deeply troubling times due to how vividly I remember this working. 

So, the primary question is how do I download the "Math / equation detection module" for Tesseract onto my Linux Beta on my Chromebook. Secondarily, could someone tell me if the functionality of the "sudo apt install tesseract-ocr-equ" command changed recently. This is frustrating me quite a bit. I am hoping that someone just changed the functionality this morning and math modules are now built into the languages. 
                                

Curious Layman (101 rep)

May 16, 2024, 04:17 PM • Last activity: May 21, 2024, 09:06 AM

2 votes

0 answers

50 views

OCR high res images & combine OCR data later, after image compression?

pdf compression ocr tesseract

I have a large number of .tif's coming out of ScanTailor. Is there a way that I might OCR those .tif's with tesseract, holding the OCR data separate from the images; then compress the images, and finally combine the OCR data with the compressed images?

The point is that I don't want to compress before I OCR, and the tools for compressing the pdf's later, preserving the OCR, are not great.

Diagon (740 rep)

Jul 7, 2023, 10:50 PM

1 votes

1 answers

594 views

Best command-line OCR software for recognizing typed text over colorful background

image-manipulation ocr tesseract

I need to extract text from images like the one below:

As you can see, the text is typed not handwritten. Moreover, the background is colorful.

I've tried Tesseract OCR, and while it works some of the time, it fails miserably on some inputs. For the above example, it produces "Due CoN  aicomrBi em Cela RTL".

Which command-line OCR software would you recommend? If Tesseract is my best bet, can I transform these images to make it easier for Tesseract to recognize the characters?

**EDIT**: Based on @MarcusMüller's suggestion, I used convert -threshold 55% to better separate the foreground text from the background. The resulting images are much better!
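
For illustration, the effect of a 55% threshold can be sketched in plain Python on 8-bit grayscale values: pixels brighter than the cutoff become white and all others black. This is a simplified model of the thresholding step, not ImageMagick's implementation, and `threshold_image` is a hypothetical name.

```python
def threshold_image(pixels, percent=55, max_value=255):
    """Binarize rows of grayscale values: above the cutoff -> white, else black."""
    cutoff = max_value * percent / 100
    return [[max_value if value > cutoff else 0 for value in row] for row in pixels]


image = [
    [10, 120, 200],
    [141, 140, 255],
]
print(threshold_image(image))
# -> [[0, 0, 255], [255, 0, 255]]
```

With a colorful background, a single global cutoff only separates text from background when the two differ clearly in brightness, which may be one reason the result still varies between images.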

Alas, Tesseract still is useless. On this new image, it produces: "Bim KM ioes Bm Meme e Cera".

As such, the question remains open.

user549392

Nov 15, 2022, 07:35 PM • Last activity: Nov 15, 2022, 09:26 PM

0 votes

1 answers

161 views

Tesseract doesn't accept process substitution

shell-script process-substitution tesseract

I'm making a quick script that is supposed to use an OCR tool (tesseract) on the image in the clipboard to convert it to text and output it. It looks like this:

    #!/bin/sh

    temp="$(mktemp tmpXXX.png)"
    xclip -selection clipboard -t image/png -o > "$temp"
    tesseract "$temp" stdout 2>/dev/null
    rm "$temp"

What I'm wondering is why this one-liner doesn't work: `tesseract <(xclip -selection clipboard -t image/png -o) stdout`. From what I know, process substitution is supposed to make a temporary file (similar to my full script) that tesseract uses as its input file. Alas, this leads to an error:

    Error in pixReadStream: Unknown format: no pix returned
    Error in pixRead: pix not read
    Error during processing.

Does anybody have an idea why this happens? Thanks in advance.
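
One likely explanation, offered as an assumption rather than a confirmed diagnosis: in bash, `<(...)` expands to a path such as `/dev/fd/63` that is backed by a pipe, not by a temporary regular file, and a pipe cannot be rewound. A library that needs to seek in its input while detecting the image format would then fail. The sketch below only demonstrates the difference in seekability between a pipe and a regular file; `is_seekable` is a hypothetical helper.

```python
import os
import tempfile


def is_seekable(fd):
    """Return True if the file descriptor supports lseek."""
    try:
        os.lseek(fd, 0, os.SEEK_SET)
        return True
    except OSError:
        return False


read_end, write_end = os.pipe()
with tempfile.TemporaryFile() as regular_file:
    print("pipe seekable:", is_seekable(read_end))
    print("regular file seekable:", is_seekable(regular_file.fileno()))
os.close(read_end)
os.close(write_end)
```

Note also that the script's shebang is `#!/bin/sh`; process substitution is a feature of shells such as bash, zsh, and ksh, and is not available in a strictly POSIX `sh`.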

Fedja (125 rep)

Apr 4, 2022, 05:08 PM • Last activity: Apr 4, 2022, 06:10 PM

0 votes

1 answers

129 views

Scripting tesseract for file manager context menu

shell-script file-manager dolphin tesseract

File manager context menu scripts sometimes do the job far quicker than using a GUI utility. So I've been using dozens of simple and more complex scripts for a long time in file managers Dolphin, Nautilus and Nemo, although I have elementary level scripting skills.
However, this time I'm stuck with a very simple loop to OCR selected image file(s) using **tesseract** in **Dolphin**, which works in many other scripts:

    for filename in "${@}"; do
    	tesseract -l eng "$filename" "${filename%.*}"
    done

This should normally be executed for a selected image (or each and every one of the selected images) like this, a command which works in Terminal, giving me a text file named "image.txt":

    tesseract -l eng "image.png" "image"

Any ideas please???
                                

Sadi (515 rep)

Feb 25, 2022, 02:20 PM • Last activity: Feb 27, 2022, 06:05 PM

1 votes

0 answers

568 views

Using tesseract for character recognition, result is not as expected (much worse). How to get better results?

ocr tesseract

I wanted to add the output of a Linux boot to my question and decided to try optical character recognition, thinking that now, in 2022, there should surely be decent open source options (I have not tried OCR for a long time). Links found via Web search "praise" tesseract. https://www.linuxlinks.com/ocrtools/ lists it as second best on its chart. https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution

> Tesseract is probably the most accurate open source OCR engine
> available.  

I installed it from the distro via apt-get and ran it. The out-of-the-box result is, IMO, awful. Why? Maybe it can be easily fixed? Or please advise another package that does the job. The page I tried to recognize has no pictures, so as I see it this is a rather easy task. See the result below:

Edit: in fact, the result when only a small part is processed was much better, but when the whole image is processed the results are not OK. I understand that making the lines more horizontal and less skewed might help a lot; still, I was hoping software had become good at recognizing imperfectly aligned text.



 

    oon usb 1-@: |
    “3792661 usb 1-8: New USB device found, idVendor=1343, idProduct:
    
    7.983163] usb 1-8: New USB dev bs P luct=5662, bedDevice=16.6?
    
    re eh peeled haibbetaia a
    
    : new high-speed USB device number 5 PhS |
    i
    
    Per Samm SCR Can)
    t pela ee rcpt PP cay
    : 2.998668) usb 1-8: er
    t
    Ct


When only small part is processed:  

    2.837811) usb 1-8: new high-speed USB device number 5 using xhei_hed
    
    2.979266] usb 1-8: New USB device ECU CREME Cnt ttc cain Tt teen Td
    7.983163] usb 1-8: New USB device strings: Mfr=1, Product=2, SerialNumbers@
    
    ?.9869291 usb 1-8: Product: Integrated Camera

Added 1:  

I tried again with a smaller and less skewed picture. I guess the software treats the time stamps as a separate column; I have not seen options on the man page to tweak that:




    f a eg
    | 7.849264]
    Device= 6.44
    f 7 .6492961
    | 7.849355]
    f 7.849415]
    [ 7.849492]
    | Van eos
    fl 7.861846]
    if Va ACB
    | 7.864776]
    if eel Be
    Ha Bs) bs 4
    if be A be ge
    C ie BD LB
    ce B)
    te] Bs]
    rage
    lb eae
    8.962076)
    ie Ke Lb
    9.600567)
    9.696957)
    9 .6970371
    
    YS SF SS Se
    
    usb 1-8: new high-speed USB device number 4 using xhci_hcd
    usb 1-8: New USB device found, idVendor=04f2, idProduct=b449, bed
    
    usb 1-8: New USB device strings: Mfr=3, Product=1, SerialNumber=2
    usb 1-8: Product: Integrated Camera
    
    usb 1-8: Manufacturer: Chicony Electronics Co.,Ltd.
    usb 1-8: SerialNumber: 6x0001
    
    usb-storage 1-1:1.6: USB Mass Storage device detected
    
    scsi host3:
    
    usb-storage 1-1:1.6
    
    usbcore: registered new interface driver usb-storage
    usbcore: registered new interface driver uas
    
    scsi 3:0:6:@: Direct-fAccess General UDisk eg
    sd 3:0:0:0: Attached scsi generic sgi type @
    
    eM Pee PM eA PA ed) te) ae
    Py Me ee dd
    
    Py ee ee eee dm
    
    sd 3:0:0:0: [sdb] Assuming drive cache: write through
    
    sdb: sdbi sdb2 sdb3
    
    sd 3:0:0:0: [sdb] Attached SCSI removable disk
    
    squashfs: version 4.6 (2609/01/31) Phillip Lougher
    
    Copying live image to RAM...
    Ca ewe te Mae

                                

Martian2020 (1443 rep)

Jan 10, 2022, 06:35 AM • Last activity: Jan 10, 2022, 07:13 AM

2 votes

0 answers

105 views

Is there software to manually OCR / teach OCR for handwritten (non-English) texts?

pdf software-rec ocr tesseract

I have a problem that Tesseract, ABBYY FineReader, etc. cannot solve: they cannot recognize handwriting, for example in Russian.
So I am looking for

1. OCR software for such things
2. or a way to manually OCR my PDFs (create layers, draw boxes, fill them with text by hand)
3. or maybe a way to train an OCR engine locally, so that it can take over after some manual work
                                

PDD (21 rep)

Oct 15, 2021, 04:19 AM

0 votes

1 answers

232 views

How do you save the text in the terminal to various text formats?

terminal tesseract

I'm playing around a bit with OCR software, in particular I'm spending a bit of time with tesseract. I got it to where I can load an image and get tesseract to rip the text from the image, in Linux terminal. I'm now trying to figure out how I can automatically save that ripped text to pdf, odf, txt and word formats, from the terminal.
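
A minimal sketch of how such a command line could be assembled: Tesseract accepts output configs such as `txt`, `pdf`, and `hocr` after the output base name, so a single run can write several formats. ODF and Word are not Tesseract output formats, so those would need a separate conversion step (not shown). `build_tesseract_command` is a hypothetical helper, and the file names are placeholders.

```python
def build_tesseract_command(image, output_base, formats=("txt", "pdf")):
    """Build an argv list such as: tesseract scan.png scan txt pdf."""
    return ["tesseract", image, output_base, *formats]


command = build_tesseract_command("scan.png", "scan", ("txt", "pdf", "hocr"))
print(" ".join(command))
# -> tesseract scan.png scan txt pdf hocr
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where `tesseract` is on the `PATH`.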
                                

Neil Meyer (149 rep)

Mar 8, 2021, 09:21 AM • Last activity: Mar 9, 2021, 10:50 AM

10 votes

2 answers

13787 views

Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel

ocr tesseract

# Problem

pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in a shell (on the same server and simultaneously with the supervisor scripts). Apart from taking too much time, the processes also show high CPU usage.

Time taken by pytesseract.image_to_string() when run via Supervisord: ~30s
Time taken by pytesseract.image_to_string() when run via Bash: 0.1s

This problem only occurs if a lot of processes executing pytesseract.image_to_string() are run via supervisord (around 22 instances). If I reduce the number of instances (to around 10), the scripts executed via supervisord also run smoothly.

### System Information

OS: Ubuntu 18.04.2 LTS (bionic)
Supervisord: Version 3.3.1
Tesseract: Version 4.0.0-beta.1
Python: Version 3.6
PyTesseract: Version 0.2.5

ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127357
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127357
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Let me know if you need any more information.

## Edit 1 (or I know what's NOT the source of this problem)

I am fairly certain that it is not an issue with Supervisord. When I run one instance from an ssh shell, the function (pytesseract.image_to_string()) executes smoothly (i.e. takes only 0.1s) while 10 instances are being run via Supervisord. When I start another instance from a new ssh shell, both instances started from ssh run smoothly most of the time. When I start yet another instance from a new ssh shell, all three instances start choking, taking around 10s to execute the function. This time keeps increasing as I add more instances via shell. So the problem can be replicated even with a shell.

### More Information

I ran the program with strace -T -f but I could not figure out what exactly is causing the spike in time.

For a function call that takes 1s

Top 10 system calls sorted by time taken
1.504530    [pid 29921]  [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30166
0.503915    [pid 29932]  )      = 0 (Timeout)
0.503472    [pid 29932]  )      = 0 (Timeout)
0.500524    [pid 29933]  )      = 0 (Timeout)
0.500515    [pid 29933]  )      = 0 (Timeout)
0.500514    [pid 29932]  )      = 0 (Timeout)
0.500512    [pid 29933]  )      = 0 (Timeout)
0.069869    [pid 30169]  )       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
0.035989    [pid 30167]  )       = 0
0.016002    [pid 30168]  )       = 0

For a function call that takes 9s

Top 10 system calls sorted by time taken
9.795787    [pid 29921]  [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 30106
0.515960    [pid 29933]  )      = 0 (Timeout)
0.511955    [pid 29933]  )      = 0 (Timeout)
0.507979    [pid 29932]  )      = 0 (Timeout)
0.507968    [pid 29932]  )      = 0 (Timeout)
0.505257    [pid 29932]  )      = 0 (Timeout)
0.503988    [pid 29932]  )      = 0 (Timeout)
0.503978    [pid 29932]  )      = 0 (Timeout)
0.503975    [pid 29932]  )      = 0 (Timeout)
0.503974    [pid 29932]  )      = 0 (Timeout)
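
One thing that may be worth trying, as an assumption to test rather than a confirmed cause: Tesseract 4 can use several OpenMP threads per process, so many parallel instances may oversubscribe the available CPU cores. Setting `OMP_THREAD_LIMIT=1` in the environment before the `tesseract` child process starts limits each instance to one thread. The sketch assumes that `pytesseract` launches `tesseract` as a child process that inherits the parent's environment; `single_thread_env` is a hypothetical helper.

```python
import os


def single_thread_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


# Apply to the current process so child processes inherit the setting.
os.environ.update(single_thread_env())
print(os.environ["OMP_THREAD_LIMIT"])
# -> 1
```

When the scripts are started by supervisord, the same variable could instead be set through the `environment=` option of the relevant `[program:...]` section.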

Ashish (270 rep)

Jul 18, 2019, 08:29 AM • Last activity: Jan 28, 2021, 05:56 PM

0 votes

1 answers

1583 views

Install tesseract offline in RHEL

rhel yum tesseract

I have a RHEL-based server that does not connect to the internet. I need to install Tesseract >4.0 on this server. Therefore, my option was to download the RPM packages on another machine, move them to the server, and install them using the rpm command. I have used (https://build.opensuse.org/project/show/home:Alexander_Pozdnyakov ) from the official tesseract documentation to download the RPM.

The issue is that when I try to install those RPMs, they have a lot of other dependencies, which are very difficult to get one by one. Is there any alternative way to install tesseract without connecting to the internet? Or any other source from which to download all the RPMs at once?

Sathindu (101 rep)

Aug 19, 2020, 10:21 AM • Last activity: Aug 19, 2020, 11:40 AM

3 votes

0 answers

345 views

Debian Buster: Tesseract not supporting URL as argument

debian tesseract

I'm trying to parse text from a hosted image, but it looks like I've misconfigured Tesseract. I'm using Debian Buster; tesseract-ocr, libtesseract-dev and a Ruby wrapper are installed.

    $ tesseract -v
    tesseract 4.0.0
     leptonica-1.76.0
      libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
     Found AVX2
     Found AVX
     Found SSE

Inside a terminal, tesseract returns "Error, cannot read input : No such file or directory". The same error message is raised using the Ruby gem.

Did I miss something after installing the packages? The doc talks about manually placing the traineddata directory on Ubuntu; should it also be done on Debian?

> The traineddata is currently not shipped with the snap package and must be placed manually to ~/snap/tesseract/current.

I can get it working by using curl and a local path as argument, but it should support a URL as argument.

Thanks

**EDIT**

I've tested both v4.1.1 and v5.0.0 by following these instructions and setting up the tessdata directory. They both explicitly report that they don't support URLs:

    Tesseract Open Source OCR Engine v5.0.0-alpha-647-g4a00 with Leptonica
    Error, this tesseract has no URL support
    Error during processing.

I'm obviously missing something, because the release notes say it has supported URLs since 4.1.1.
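
The wording "this tesseract has no URL support" suggests that URL handling is a build-time option of the `tesseract` binary (an assumption drawn from the message itself). Until a build with that feature is available, the curl workaround mentioned above can be scripted. A minimal sketch using only the standard library; `fetch_to_tempfile` is a hypothetical helper.

```python
import shutil
import tempfile
import urllib.request


def fetch_to_tempfile(url, suffix=".png"):
    """Download `url` to a temporary file and return the local path."""
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as target:
            shutil.copyfileobj(response, target)
            return target.name


# Example use (the URL is a placeholder):
# local_path = fetch_to_tempfile("https://example.com/image.png")
# then run: tesseract <local_path> stdout
```

The temporary file is not deleted automatically, so the caller should remove it once Tesseract has finished.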

Sumak (273 rep)

May 6, 2020, 05:05 PM • Last activity: May 6, 2020, 06:40 PM

1 votes

0 answers

51 views

script run via keyboard binding does not write to file

linux-mint keyboard-shortcuts tesseract

The following bash script interprets text in an image file and writes it to a .txt file.

    #!/usr/bin/env bash
    
    LD_LIBRARY_PATH="/usr/local/lib"
    export LD_LIBRARY_PATH

    /usr/local/bin/tesseract /home/martin/work/textpic.png /home/martin/work/tesseract-out

When I run it from the terminal the tesseract-out.txt is created, but when I run it via custom keyboard shortcut nothing is written. I have ensured that the correct script is run by putting echo "test" > /home/martin/work/test.txt in it, which creates the file.

I have run sudo chmod 777 on tesseract in case it was some permission issue.

I have an inkling that tesseract needs some lib files which are not in paths when the script is run by shortcut, so I put these lines at the top of my script file (I know that some of the lib files it needs are in /usr/local/lib):

    LD_LIBRARY_PATH="/usr/local/lib"
    export LD_LIBRARY_PATH

But it did not do the trick. How can I debug what is going wrong? If I could obtain some kind of error message somehow, that would go a long way.
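
One way to obtain an error message, sketched under the assumption that the shortcut manager discards the script's output: run the command through a wrapper that appends stdout, stderr, and the exit code to a log file. `run_and_log` and the log path are hypothetical.

```python
import subprocess


def run_and_log(command, log_path):
    """Run `command`, append its stdout/stderr and exit code to `log_path`."""
    with open(log_path, "a", encoding="utf-8") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
        log.write(f"exit code: {result.returncode}\n")
    return result.returncode


# Example use with the paths from the script above:
# run_and_log(
#     ["/usr/local/bin/tesseract", "/home/martin/work/textpic.png",
#      "/home/martin/work/tesseract-out"],
#     "/home/martin/work/tesseract.log",
# )
```

In the bash script itself, appending `>> /home/martin/work/tesseract.log 2>&1` to the tesseract line would have a similar effect.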

My linux version:


    DISTRIB_RELEASE=18.3
    DISTRIB_CODENAME=sylvia
    DISTRIB_DESCRIPTION="Linux Mint 18.3 Sylvia"
    NAME="Linux Mint"
    VERSION="18.3 (Sylvia)"
    ID=linuxmint
    ID_LIKE=ubuntu
    PRETTY_NAME="Linux Mint 18.3"
    VERSION_ID="18.3"
    HOME_URL="http://www.linuxmint.com/ "
    SUPPORT_URL="http://forums.linuxmint.com/ "
    BUG_REPORT_URL="http://bugs.launchpad.net/linuxmint/ "
    VERSION_CODENAME=sylvia
    UBUNTU_CODENAME=xenial

The keyboard-shortcuts manager I use is the standard GUI one in Mint 18. Might be I could use something else to get a better error message?



EDIT:

I verified that all needed libs are in path by putting /sbin/ldconfig -N -v $(sed 's/:/ /g'  /home/martin/work/libs-in-path.txt at the bottom of my script and crosschecking the output against readelf -d /usr/local/bin/tesseract | grep NEEDED. 
                                

MyrionSC2 (111 rep)

Dec 24, 2019, 09:41 AM • Last activity: Dec 26, 2019, 10:51 AM

0 votes

1 answers

296 views

Leptonica compilation error

make tesseract

Trying to install leptonica v1.78 on Ubuntu 16, but it's not working for some reason. After running `./configure` and `make`, I keep getting this error:

make: Entering directory '/home/user/Documents/leptonica/leptonica-1.78.0/prog'
  CC       convertfilestopdf.o
  CCLD     convertfilestopdf
../src/.libs/liblept.so: undefined reference to `lzham_z_version'
../src/.libs/liblept.so: undefined reference to `lzham_z_deflateInit'
../src/.libs/liblept.so: undefined reference to `lzham_z_inflate'
../src/.libs/liblept.so: undefined reference to `lzham_z_deflate'
../src/.libs/liblept.so: undefined reference to `lzham_z_deflateEnd'
../src/.libs/liblept.so: undefined reference to `lzham_z_inflateInit'
../src/.libs/liblept.so: undefined reference to `lzham_z_inflateEnd'
collect2: error: ld returned 1 exit status
Makefile:2603: recipe for target 'convertfilestopdf' failed
make: *** [convertfilestopdf] Error 1
make: Leaving directory '/home/user/Documents/leptonica/leptonica-1.78.0/prog'
Makefile:476: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1
make: Leaving directory '/home/user/Documents/leptonica/leptonica-1.78.0'
Makefile:385: recipe for target 'all' failed
make: *** [all] Error 2

I think I installed all the dependencies needed, am I missing something?

Gyakenji (101 rep)

Jun 10, 2019, 07:37 AM • Last activity: Jun 11, 2019, 02:11 AM

2 votes

1 answers

700 views

Where can I get Tesseract binaries for Debian 6 64bit?

debian ocr tesseract

I used apt-get to install Tesseract but it's not really working. Maybe I could just download binaries somewhere, put them in a directory, and use them that way?

What's wrong with my Tesseract now:

    tesseract --help
    tesseract:Error:Usage:tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]

and

    tesseract test.tif out2.txt -l pol
    Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/pol.unicharset

I have downloaded and unpacked the Polish language data into the directory above, but the only pol.* file is pol.traineddata.

buikoto (21 rep)

Jan 23, 2015, 10:05 PM • Last activity: Mar 7, 2018, 09:47 PM

5 votes

1 answers

2029 views

tesseract: is it possible to change font output in OCRed pdf?

fonts pdf evince ocr tesseract

Following up on https://unix.stackexchange.com/questions/301318/how-to-ocr-a-pdf-file-and-get-the-text-stored-within-pdf/301319#301319 I have successfully produced OCRed pdf pages.

In Evince, however, the letters are not shown; by this I mean that I cannot see the characters, but I can select them, copy them and paste them elsewhere successfully. This does not seem to be a bug of Evince: https://bugzilla.redhat.com/show_bug.cgi?id=1364201 

When initiating an OCR of a pdf page with pdfsandwich, tesseract produces a page that 

> contains a font which doesn't have any
usable glyphs (they named it GlyphLessFont). It has only .notdef and
.null replacements (the squares). Evince uses the .notdef glyph if there
is no glyph for the character. The reason that Okular highlight the text
is because it does it in the image not as a regular text as evince does.

pdftotext recognises the characters.

Now, the question is: can tesseract be told to use a different font?
                                

ingli (2029 rep)

Aug 27, 2016, 08:14 AM • Last activity: Mar 22, 2017, 09:39 PM

Showing page 1 of 17 total questions