Extracting table of contents from PDFs
I have a reasonably large personal library with books in various formats. I am organizing their metadata, including a text field that holds each book's table of contents. At the moment I use the 'Area Text Selection' feature of my document reader to copy the text. Doing this for DjVu files with djview yields nicely formatted tables of contents, like this:
CONTENTS
1. EXPERIMENTS
1.1. The definition of an experiment ..... 1
1.2. Algebras of events as Boolean algebras .... 6
1.3. Operations with experiments ...... 9
1.4. Canonical representation of polynomials of events . . 12
....
I emphasize that all I did was drag my mouse across the page and click "Copy Text". If I try this with a PDF, the structure is completely lost, and I have to spend time cleaning up the selection by moving the page and section numbers back into place. I might get something like this:
Table of Contents
I
Introduction
1
Introduction
1.1
Table of Contents
1.2
Acknowledgments
1
3
3
6
II
....
I am looking for a PDF reader that can copy selected text in the same way, with the line structure preserved. The fact that DjVu readers can do this tells me it ought to be possible for PDFs too.
Note: I am not talking about extracting ToCs from the bookmarks: many of my PDFs don't have any. I'd also like to avoid a CLI tool that has to process the entire file: I just want it to pick the text I select, but with the newlines and overall structure intact.
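To make "structure preserved" concrete, here is a minimal sketch of the behavior I have in mind: words inside the selected rectangle are grouped into lines by their vertical position and then ordered left to right. It assumes the reader can already supply each word's position; the `(x, y, text)` tuples, the `rebuild_lines` name, the `y_tol` tolerance, and the sample words are all hypothetical illustrations, not any reader's actual implementation.

```python
from typing import List, Tuple

# Hypothetical word record: (x, y, text), where x is the left edge and
# y is the baseline, with y increasing down the page.
Word = Tuple[float, float, str]


def rebuild_lines(words: List[Word], y_tol: float = 2.0) -> List[str]:
    """Group words with nearly equal baselines into lines, left to right."""
    lines: List[List[Word]] = []
    # Visit words from the top of the page to the bottom.
    for word in sorted(words, key=lambda w: w[1]):
        # Same line if the baseline is within y_tol of the line's first word.
        if lines and abs(word[1] - lines[-1][0][1]) <= y_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words by horizontal position.
    return [
        " ".join(w[2] for w in sorted(line, key=lambda w: w[0]))
        for line in lines
    ]


# Made-up sample stored column by column: titles first, page numbers last.
sample = [
    (10, 100, "1"), (40, 100, "Introduction"),
    (10, 120, "1.1"), (40, 120.5, "Overview"),
    (300, 100.4, "3"), (300, 120, "6"),
]

for line in rebuild_lines(sample):
    print(line)
```

The point of the sketch is only that the page numbers end up on the same line as their entries even when they come last in the underlying text order, which is the result I get from djview and not from my PDF reader.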
Asked by Luke
(13 rep)
Dec 16, 2024, 03:42 PM
Last activity: Dec 16, 2024, 05:00 PM