
---
title: "How I convert PDFs to EPUB semi-automatically"
slug: how-i-convert-pdfs-to-epub
description: "A step by step guide to clean EPUBs from PDFs using Calibre, Emacs and time."
date: 2021-03-15T04:12:00+01:00
type: posts
draft: false
tags:
  - E-books
  - Calibre
  - Emacs
toc: true
---

Sometimes e-books come only in PDF format, and PDFs are almost always a pain to read on e-book readers. You can use Calibre to convert them automatically, but the results are okay-ish at best. If the PDF has footnotes, forget it. Unfortunately, the books that most often come only as PDFs are science books, and these usually have a lot of footnotes.

One option is to use Calibre to convert and then fix the result, but I have found that I get better results in less time when I create a new EPUB, copy the PDF’s content into Emacs, clean it up there and then copy it over to Calibre. This process is what I want to share with you here. To follow this guide you will need Calibre, Emacs (or another editor with keyboard macros) and some knowledge of XHTML and CSS. It takes a long time and is boring, but the result is a clean and enjoyable book.

Note
I will use the Emacs key notation throughout this guide.

Create a new book in Calibre

Click on Add books → Add empty book. Then fill in the metadata and select “EPUB” as format. You can add more metadata and a cover image by right-clicking the book and then selecting Edit metadata. Open Calibre’s editor by right-clicking on the book and selecting Edit book.

Calibre creates EPUB 2 books by default. Convert the book to EPUB 3[1] by clicking Tools → Upgrade book internals. This will, among other things, convert toc.ncx to nav.xhtml. To support e-book readers which can’t handle EPUB 3 yet, re-create toc.ncx as an empty file. It is filled automatically when you create the table of contents. Open metadata.opf and replace <spine> with <spine toc="id1"> (id1 is the ID of toc.ncx, defined a few lines above).
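If you are unsure which ID belongs to toc.ncx, you can look it up in the manifest. The following Python snippet is only a minimal sketch and not part of Calibre: the function name `find_ncx_id` and the shortened sample manifest are made up for illustration, and it assumes that the manifest entry’s href ends in toc.ncx.

```python
import xml.etree.ElementTree as ET

# Namespace used by the <package>, <manifest> and <item> elements in an OPF file.
OPF_NS = "{http://www.idpf.org/2007/opf}"


def find_ncx_id(opf_xml):
    """Return the manifest ID of the item pointing at toc.ncx, or None."""
    root = ET.fromstring(opf_xml)
    for item in root.iter(OPF_NS + "item"):
        if item.get("href", "").endswith("toc.ncx"):
            return item.get("id")
    return None


# Hypothetical, shortened metadata.opf for demonstration only.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="start" href="start.xhtml" media-type="application/xhtml+xml"/>
    <item id="id1" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
  </manifest>
  <spine>
    <itemref idref="start"/>
  </spine>
</package>"""

print(find_ncx_id(SAMPLE_OPF))
```

The returned value is what goes into the toc attribute of <spine>.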

You start with a single XHTML file, start.xhtml. I always use that for the title page, the copyright notice and so on. You can force a page break to separate the title and the copyright notice with CSS: Add style="page-break-after: always;" to the last element of the virtual “page” or use a CSS class. To add a CSS file click File → New file and enter a filename ending with .css. Add the CSS file to the document by right-clicking on start.xhtml in the file browser and selecting Link stylesheets….

Note
The built-in preview does not show page breaks.

Your files should look similar to this:

start.xhtml
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="de">

<head>
  <title>Meine zwei Jahre in Russland</title>
  <link rel="stylesheet" type="text/css" href="style.css"/>
</head>

<body>

  <section class="pagebreak center" id="title" aria-label="Title">

    <p>Emma Goldman</p>

    <h1>Meine zwei Jahre in Russland</h1>

  </section>

  <section aria-label="Meta data">
    <p>1. Auflage<br/>
    München, Januar 2020</p>

    <p id="copyright">Anti-Copyright (siehe S. 362)</p>

    <p>Die englische Originalausgabe erschien 1921 und 1925 in den
    USA aufgrund eines Versehens in zwei Teilen unter den Titeln
    <em>My Disillusionment in Russia</em> und <em>My Further Disillusionment in
    Russia.</em></p>
  </section>

</body>

</html>

style.css
.pagebreak {
  page-break-after: always;
}
.center {
  text-align: center;
}

I added the IDs “title” and “copyright” so that I can link to them later when setting the semantics.

Some styling advice

Please refrain from using CSS too much. Most people have configured their e-book readers the way they like, with the right font, font-size, margins and so on. If you override their settings, they will be annoyed. I usually only style the title page and headlines.

Do not use <i> or <b> tags to emphasize, and do not use font-style: italic or font-weight: bold. Use <em> for emphasis and <strong> for importance so screen readers will be able to pronounce the text differently.

Make your books as accessible as possible. Using the right tags and not just <div> and <span> for everything is a good start, using epub:type and/or role as well as aria-label or aria-labelledby is even better. Read more at AccessiblePublishing: EPUB Semantics, ARIA Roles, & Metadata.

Add text to the book

Add a new .xhtml file in Calibre and type in the heading of the first chapter. Then switch to Emacs and copy the first paragraph from the PDF into a text-mode buffer. The emphasis will not be copied over, so you’ll have to re-add it. We ignore (but keep) the footnote numbers for now. Repeat with the rest of the paragraphs of the chapter, leaving a blank line between paragraphs. The pasted paragraphs will contain hard line breaks and will likely be full of hyphens at the ends of the lines.

elisp function to add HTML tags easily
(defun my/html-surround-with-tag (beg end)
  "Surround region with HTML tag."
  (interactive "*r")
  (if (region-active-p)
      (let ((tag (completing-read "Tag: "
                                  '("blockquote" "em" "strong"))))
        (insert (concat "<" tag ">"
                        (delete-and-extract-region beg end)
                        "</" tag ">")))
    (message "No active region")))

Make sure that auto-fill-mode is disabled. Position the cursor at the start of the buffer and press <f3> to start recording a macro. Press <end> <deletechar> SPC (space bar) and then <f4> to stop recording. This joins the next line onto the current one, separated by a single space. If the line you just joined ended with a hyphen, press <backspace> 2 times to remove the space and the hyphen. Press <f4> to call the macro again and repeat until you are at the end of the paragraph. Then move the cursor to the first line of the next paragraph and repeat…
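If your editor has no keyboard macros, the same cleanup can be roughly approximated with a script. This Python snippet is a sketch, not the method I use: the function name `join_wrapped_lines` is made up, and it assumes that paragraphs are separated by blank lines. Unlike the manual macro, it removes every hyphen at the end of a line blindly, so compound words that really contain a hyphen at a line break will be fused and need to be checked by hand.

```python
import re


def join_wrapped_lines(text):
    """Collapse each blank-line-separated paragraph into a single line."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-"):
                # Hyphen at the end of the previous line: drop it and glue
                # the two halves of the word together.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        cleaned.append(joined)
    # One paragraph per line, with a blank line between paragraphs.
    return "\n\n".join(cleaned)


sample = (
    "Sometimes e-books come only in PDF for-\n"
    "mat. Almost always\n"
    "PDFs are a pain.\n"
    "\n"
    "Second para-\n"
    "graph."
)
print(join_wrapped_lines(sample))
```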

Now you should have a text file with 1 paragraph per line. We need to wrap all lines in <p> tags, except block quotes and sub-headlines. Either use another macro (<p> <end> </p> <down> <down> <home>) or this elisp function:

(defun my/html-paragraphify-buffer ()
  "Wrap every line not beginning with < or a newline in <p> tags."
  (interactive)
  (goto-char (point-min))
  (while (re-search-forward "^\\([^<\n].+\\)$" nil t)
    (replace-match "<p>\\1</p>")))

Once you are done, copy the result into Calibre.

Add footnotes

Use the method from above to copy the footnotes into the now empty Emacs buffer and clean them up until you have 1 paragraph per line. Footnotes need to be hyperlink-able, so we can’t just wrap them in plain <p> tags, they need IDs. I like to use <li epub:type="endnote" id="fn1">[…]</li> if the footnote numbers are increasing throughout the book or <li epub:type="endnote" id="fn1_1">[…]</li> if they start with 1 in each chapter. We are going to use a macro with a counter to generate consecutively numbered IDs. First, set the counter to 1 with C-x C-k C-c 1. Then, record this macro:

<li epub:type="endnote" id="fn1_ C-x C-k <tab> "> <end> </li> <down> <down> <home>

Tip
Use M-x describe-key (mapped to C-h k by default) to find out what a key combination does.

Call the macro until every footnote is wrapped, then copy the footnotes to the end of the chapter or the end of the book in Calibre. Wrap them in <section epub:type="endnotes" role="doc-endnotes"><ol>[…]</ol></section>.
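For comparison, here is what the counter macro and the wrapping section produce, written as a small Python sketch. The function name `wrap_endnotes` and its parameters are made up for illustration. It assumes one footnote per line (blank lines are skipped) and that the footnote text is already valid XHTML, so nothing is escaped.

```python
def wrap_endnotes(text, chapter=1, start=1):
    """Wrap each non-empty line in a numbered <li> and enclose the list."""
    notes = [line.strip() for line in text.splitlines() if line.strip()]
    items = [
        f'<li epub:type="endnote" id="fn{chapter}_{number}">{note}</li>'
        for number, note in enumerate(notes, start=start)
    ]
    return (
        '<section epub:type="endnotes" role="doc-endnotes"><ol>\n'
        + "\n".join(items)
        + "\n</ol></section>"
    )


print(wrap_endnotes("First note.\n\nSecond note."))
```

The `chapter` argument corresponds to the fixed part of the ID (fn1_ in the macro) and `start` to the initial value of the macro counter.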

Note
If the notes appear at the bottom of each page, they are called “footnotes” (epub:type="footnote"). If they appear at the end of each chapter or the end of the book, they are called “endnotes” (epub:type="endnote"). Some e-book readers display footnotes on the same page as the link to them, even if you put all footnotes at the end of the chapter or book.

Add references to footnotes

The footnotes are probably superscript numbers in the PDF but normal numbers in the EPUB right now. I found that the footnote-numbers are usually preceded by a space and followed by a space or <. I use the find & replace function in Calibre in Regex-mode to convert them to hyperlinks.

Find:  ([0-9]{1,3})([ <]) (note the leading space)
Replace: <sup><a epub:type="noteref" role="doc-noteref" href="#fn1_\1">\1</a></sup>\2

Press <f3> to search through the text and C-r to replace.
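To see what the pattern does before running it on a whole chapter, you can try it on a sample string. The Python snippet below is a sketch using the same find and replace expressions with Python’s re module; the function name `link_note_numbers` and the sample sentences are made up. Note that it replaces every match at once, including false positives such as a short number in the running text, which is exactly why I step through the matches one by one in Calibre.

```python
import re

# Same expressions as in the find & replace dialog (note the leading space).
NOTE_NUMBER = re.compile(r" ([0-9]{1,3})([ <])")
NOTE_LINK = (
    r'<sup><a epub:type="noteref" role="doc-noteref" '
    r'href="#fn1_\1">\1</a></sup>\2'
)


def link_note_numbers(xhtml):
    """Turn every ' NUMBER' followed by a space or '<' into a noteref link."""
    return NOTE_NUMBER.sub(NOTE_LINK, xhtml)


# Four-digit years are left alone, the trailing footnote number is linked.
print(link_note_numbers("<p>He left in 1921 and never returned. 12</p>"))

# False positive: the page number is not a footnote but matches anyway.
print(link_note_numbers("<p>See page 5 of the report.</p>"))
```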

Finishing touches

Click Tools → Table of Contents → Edit Table of Contents, remove the existing entry and click Generate ToC from major headings or Generate ToC from all headings.

Click Tools → Set semantics and set the location of the title page, copyright page, beginning of text and so on.

Select Tools → Check book and fix the errors.

Use EPUBCheck (command-line) and/or The Ace App (GUI) for more thorough checks.

You’re done! Enjoy your cleanly formatted book. 😊


1. EPUB 3 introduces many accessibility features, see AccessiblePublishing: EPUB Semantics, ARIA Roles, & Metadata for details.