Is there any ligature-aware alternative for "pdfgrep" in command line?

2 votes

1 answer

859 views

text-processing command-line pdf character-encoding pdfgrep

I always use "pdfgrep" to search inside multiple PDF files from the command line, but I ran into a problem with the ligature character "ﬁ" (U+FB01, see https://www.compart.com/en/unicode/U+FB01). In my PDF, the word "fixed" is stored with "ﬁ", so pdfgrep -iR 'fixed point operator' does not find the term "fixed point operator". However, when I open the file in PDF readers such as Foxit Reader and Evince, "ﬁ" is split into "f" and "i", so the term is searchable there. Is there a more reliable alternative to "pdfgrep"? Or does "pdfgrep" have an option that expands ligatures like this one?

The PDF file is at: http://direct.mit.edu/books/chapter-pdf/238450/9780262321037_can.pdf

My system is Ubuntu 20.04, amd64, kernel Linux 5.6.0-1018-oem. pdfgrep has a --unac option, but when pdfgrep is installed with sudo apt-get install pdfgrep, using --unac reports "pdfgrep: UNAC support disabled at compile time!" The installed package is:

    pdfgrep:
      Installed: 2.1.2-1build1
      Candidate: 2.1.2-1build1
      Version table:
     *** 2.1.2-1build1 500
            500 http://mirrors.huaweicloud.com/ubuntu  focal/universe amd64 Packages
            100 /var/lib/dpkg/status
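To illustrate the kind of ligature expansion being asked about, here is a minimal sketch of a workaround outside pdfgrep: extract the text and replace the common "f" ligatures with plain letters before searching. The function name expand_ligatures and the file name paper.pdf are made up for illustration, and the sketch assumes that pdftotext (from poppler-utils) emits the ligature code points unchanged for this PDF, which has not been verified here.

```shell
#!/bin/sh
# Replace the Latin "f" ligatures U+FB00..U+FB04 with their plain-letter
# equivalents. Reads stdin, writes stdout. (Hypothetical helper name.)
expand_ligatures() {
    sed -e 's/ﬃ/ffi/g' \
        -e 's/ﬄ/ffl/g' \
        -e 's/ﬀ/ff/g' \
        -e 's/ﬁ/fi/g' \
        -e 's/ﬂ/fl/g'
}

# Assumed usage (paper.pdf is a placeholder file name):
#   pdftotext paper.pdf - | expand_ligatures | grep -i 'fixed point operator'
```

Unlike pdfgrep, this pipeline reports no page numbers, and a phrase that wraps across two lines of the extracted text will not match a single-line grep pattern.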

                        

Asked by la la (21 rep)

Aug 29, 2020, 04:05 AM
Last activity: Dec 28, 2020, 11:56 PM
