Extracting table of contents from PDFs

text-processing pdf text-formatting djvu

I have a reasonably large personal library of books in various formats, and I have been organizing their metadata, including a text field that holds each book's table of contents. At the moment I use the 'Area Text Selection' feature of my document reader to copy the text. Doing this for DjVu files with djview yields nicely formatted tables of contents, like this:

CONTENTS
1. EXPERIMENTS
1.1. The definition of an experiment ..... 1
1.2. Algebras of events as Boolean algebras .... 6
1.3. Operations with experiments ...... 9
1.4. Canonical representation of polynomials of events . . 12
....

I emphasize that all I did was drag my mouse across the page and click "Copy Text". If I try this with a PDF, the structure is completely lost, and I have to spend time cleaning up the selection and moving the page and section numbers back into place. I might get something like this:

Table of Contents
I
 Introduction
1
 Introduction
1.1
 Table of Contents
1.2
 Acknowledgments
1
3
3
6
II
....

I am looking for a PDF reader that can copy the selected text in the same way, but with the 'structure' preserved. The fact that DjVu readers have this capability tells me it ought to be possible. Note that I am not talking about extracting tables of contents from the bookmarks, since many of my PDFs don't have any. I'd also like to avoid a CLI tool that has to process the entire file. I just want it to pick up the text I select, with the newlines and overall structure intact.
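
As a minimal sketch of what "structure preserved" could mean here: assuming the reader can report a position for each selected word, the selection can be regrouped by baseline and then ordered left to right. The coordinates below are made up for illustration, and `rebuild_lines` is a hypothetical helper, not a function of any existing PDF reader or library.

```python
from typing import List, Tuple

# (x, y, text): hypothetical word positions, with y growing down the page.
Word = Tuple[float, float, str]


def rebuild_lines(words: List[Word], y_tolerance: float = 2.0) -> List[str]:
    """Group words that share a baseline, then order each group left to right."""
    lines: List[List[Word]] = []
    # Walk the words from top to bottom of the page.
    for word in sorted(words, key=lambda w: w[1]):
        # Same line if the baseline is close to the first word of the current line.
        if lines and abs(word[1] - lines[-1][0][1]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, restore reading order by horizontal position.
    return [" ".join(w[2] for w in sorted(line, key=lambda w: w[0])) for line in lines]


# Invented selection, listed in the scrambled order a PDF copy might produce:
# section numbers and titles first, page numbers last.
selection: List[Word] = [
    (72.0, 100.0, "I"),
    (100.0, 100.4, "Introduction"),
    (72.0, 115.0, "1"),
    (100.0, 115.3, "Introduction"),
    (72.0, 130.0, "1.1"),
    (100.0, 130.2, "Table"),
    (130.0, 130.2, "of"),
    (145.0, 130.2, "Contents"),
    (72.0, 145.0, "1.2"),
    (100.0, 145.1, "Acknowledgments"),
    (500.0, 100.2, "1"),
    (500.0, 115.1, "3"),
    (500.0, 130.1, "3"),
    (500.0, 145.2, "6"),
]

for line in rebuild_lines(selection):
    print(line)
```

The tolerance of 2.0 is an arbitrary choice for this example. A real page would need a value tuned to its font size, and multi-column layouts would need more than a single top-to-bottom pass.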

Asked by Luke (13 rep)

Dec 16, 2024, 03:42 PM
Last activity: Dec 16, 2024, 05:00 PM
