---
title: "How I convert PDFs to EPUB semi-automatically"
slug: how-i-convert-pdfs-to-epub
description: "A step by step guide to clean EPUBs from PDFs using Calibre, Emacs and time."
date: 2021-03-15T04:12:00+01:00
type: posts
draft: false
tags:
- E-books
- Calibre
- Emacs
toc: true
---

:source-highlighter: pygments
:experimental: true
:toc:
:toclevels: 2

:url-calibre: https://calibre-ebook.com/
:url-calibre-convert: https://manual.calibre-ebook.com/conversion.html#pdfconversion
:url-emacs: https://www.gnu.org/software/emacs/
:url-emacs-key-notation: https://www.emacswiki.org/emacs/EmacsKeyNotation
:url-helicon-type: https://www.heliconbooks.com/?id=blog&postid=EPUB3Footnotes
:url-epub-spec: https://www.w3.org/publishing/epub/epub-contentdocs.html
:url-ap-semantics: https://www.accessiblepublishing.ca/epub-semantic-aria-roles/
:url-epub3-rendering: http://idpf.org/forum/topic-623
:url-epubcheck: https://github.com/w3c/epubcheck
:url-epubcheck-footnote: {url-epubcheck}/issues/1018#issuecomment-809385963
:url-ace-app: https://daisy.github.io/ace/getting-started/ace-app/
:wp-pdf: https://en.wikipedia.org/wiki/PDF
:wp-epub: https://en.wikipedia.org/wiki/EPUB
:wp-xhtml: https://en.wikipedia.org/wiki/XHTML
:wp-css: https://en.wikipedia.org/wiki/CSS

:abbr-gui: pass:[<abbr title='Graphical User Interface'>GUI</abbr>]

Sometimes e-books come only in link:{wp-pdf}[PDF] format, and PDFs are almost
always a pain to read on e-book readers. You can use link:{url-calibre}[Calibre]
to convert them automatically, but link:{url-calibre-convert}[the results are
okay-ish at best]. If the PDF has footnotes, forget it. Unfortunately, the books
that most often come only as PDFs are science books, and these usually have a
lot of footnotes.

One option is to convert with Calibre and then fix the result, but I have found
that I get better results in less time when I create a new
link:{wp-epub}[EPUB], copy the PDF's content into link:{url-emacs}[Emacs], clean
it up there and then copy it over to Calibre. This process is what I want to
share with you here. To follow this guide you will need Calibre, Emacs or
another editor with keyboard macros, and some knowledge of
link:{wp-xhtml}[XHTML] and link:{wp-css}[CSS]. It takes a long time and is
tedious, but the result is a clean and enjoyable book.

[NOTE]
I will use the link:{url-emacs-key-notation}[Emacs key notation] throughout this
guide.

== Create a new book in Calibre

Click on menu:Add books[Add empty book]. Then fill in the metadata and select
“EPUB” as format. You can add more metadata and a cover image by right-clicking
the book and then selecting menu:Edit metadata[]. Open Calibre's editor by
right-clicking on the book and selecting menu:Edit book[].

Calibre creates EPUB 2 books by default. Convert the book to EPUB
3footnote:[EPUB 3 introduces many accessibility features; see
link:{url-ap-semantics}[AccessiblePublishing: EPUB Semantics, ARIA Roles, &
Metadata] for details.] by clicking menu:Tools[Upgrade book internals]. This
will, among other things, convert `toc.ncx` to `nav.xhtml`. To
link:{url-epub3-rendering}[support e-book readers which can't handle EPUB 3
yet], re-create `toc.ncx` as an empty file. It is filled automatically when you
create the table of contents. Open `metadata.opf` and replace `<spine>` with
`<spine toc="id1">` (`id1` is the ID of `toc.ncx`, defined a few lines above).

You start with a single XHTML file, `start.xhtml`. I always use that for the
title page, the copyright notice and so on. You can force a page break between
the title and the copyright notice with CSS: add
`style="page-break-after: always;"` to the last element of the virtual “page” or
use a CSS class. To add a CSS file, click menu:File[New file] and enter a
filename ending with `.css`. Link the CSS file to the document by right-clicking
on `start.xhtml` in the file browser and selecting menu:Link stylesheets…[].

[NOTE]
The built-in preview does not show page breaks.

Your files should look similar to this:

.`start.xhtml`
[source,html]
--------------------------------------------------------------------------------
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="de">

<head>
  <title>Meine zwei Jahre in Russland</title>
  <link rel="stylesheet" type="text/css" href="style.css"/>
</head>

<body>

<section class="pagebreak center" id="title" aria-label="Title">

  <p>Emma Goldman</p>

  <h1>Meine zwei Jahre in Russland</h1>

</section>

<section aria-label="Meta data">
  <p>1. Auflage<br/>
  München, Januar 2020</p>

  <p id="copyright">Anti-Copyright (siehe S. 362)</p>

  <p>Die englische Originalausgabe erschien 1921 und 1925 in den
  USA aufgrund eines Versehens in zwei Teilen unter den Titeln
  <em>My Disillusionment in Russia</em> und <em>My Further Disillusionment in
  Russia.</em></p>
</section>

</body>

</html>
--------------------------------------------------------------------------------

.`style.css`
[source,css]
--------------------------------------------------------------------------------
.pagebreak {
  page-break-after: always;
}
.center {
  text-align: center;
}
--------------------------------------------------------------------------------

I added the IDs “title” and “copyright” to add semantic links to them later.

=== Some styling advice

Please use CSS sparingly. Most people have configured their e-book readers the
way they like, with the right font, font size, margins and so on. If you
override their settings, they will be annoyed. I usually only style the title
page and headlines.

Do not use `<i>` or `<b>` tags for emphasis, and do not use
`font-style: italic` or `font-weight: bold`. Use `<em>` for emphasis and
`<strong>` for importance so screen readers will be able to pronounce the text
differently.

Make your books as accessible as possible. Using the right tags and not just
`<div>` and `<span>` for everything is a good start; using `epub:type` and/or
`role` as well as `aria-label` or `aria-labelledby` is even better. Read more at
link:{url-ap-semantics}[AccessiblePublishing: EPUB Semantics, ARIA Roles, &
Metadata].

== Add text to the book

Add a new `.xhtml` file in Calibre and write in the heading of the first
chapter. Then switch to Emacs and copy the first paragraph from the PDF into a
`text-mode` buffer. The emphasis will not be copied over, so you'll have to
re-add it. We ignore (but keep) the footnote numbers for now. Repeat with the
rest of the paragraphs of the chapter, leaving 2 blank lines between paragraphs.
The paragraphs will be broken into short lines, many of which will likely end in
a hyphen.

.Elisp function to add HTML tags easily
[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-surround-with-tag (beg end)
  "Surround region with HTML tag."
  (interactive "*r")
  (if (region-active-p)
      ;; Offer the tags needed most often; other tag names can be typed in.
      (let ((tag (completing-read "Tag: "
                                  '("blockquote" "em" "strong"))))
        (insert (concat "<" tag ">"
                        (delete-and-extract-region beg end)
                        "</" tag ">")))
    (message "No active region")))
--------------------------------------------------------------------------------

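Re-adding emphasis means calling this function many times, so a key binding
saves some typing. A minimal sketch: the key kbd:[C-c] kbd:[t] is only an
assumption, pick any key that is free in your configuration.

[source,elisp]
--------------------------------------------------------------------------------
;; Assumption: "C-c t" is unused in your configuration; any free key works.
;; `text-mode-map' is the keymap of the `text-mode' buffer the PDF text is
;; pasted into.
(define-key text-mode-map (kbd "C-c t") #'my/html-surround-with-tag)
--------------------------------------------------------------------------------
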
Make sure that `auto-fill-mode` is disabled. Position the cursor at the start of
the buffer and press kbd:[<f3>] to start recording a macro. Press kbd:[<end>]
kbd:[<deletechar>] kbd:[SPC] (space bar) and then kbd:[<f4>] to stop
recording. This joins the current line with the next one. If the line ended with
a hyphen, press kbd:[<backspace>] twice to remove the space and the hyphen.
Press kbd:[<f4>] to call the macro and repeat until you are at the end of the
paragraph. Then move the cursor to the first line of the next paragraph and
repeat…

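If repeating the macro by hand gets too tedious, the joining step can be
approximated with a small function. This is only a sketch: the function name is
made up, it assumes that paragraphs are separated by at least one blank line and
that lines have no trailing whitespace, and it removes every hyphen at the end
of a line, including genuine ones in compound words, so proofread the result.

[source,elisp]
--------------------------------------------------------------------------------
(defun my/join-paragraph-lines ()
  "Join the lines of every paragraph in the buffer into one line.
Paragraphs are separated by blank lines.  A hyphen at the end of a
line is treated as a line-break hyphen and removed."
  (interactive "*")
  (save-excursion
    (goto-char (point-min))
    (while (search-forward "\n" nil t)
      (let ((prev (char-before (1- (point)))) ; last char of the line we left
            (next (char-after)))              ; first char of the next line
        ;; Only join when neither neighbouring line is empty, so the
        ;; blank lines between paragraphs stay in place.
        (unless (or (null prev) (eq prev ?\n)
                    (null next) (eq next ?\n))
          (delete-char -1)                    ; remove the newline
          (if (eq prev ?-)
              (delete-char -1)                ; remove the line-break hyphen
            (insert " ")))))))
--------------------------------------------------------------------------------

Run it with kbd:[M-x] `my/join-paragraph-lines` in the buffer that holds the
pasted text.
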
Now you should have a text file with 1 paragraph per line. We need to wrap all
lines in `<p>` tags, except block quotes and sub-headlines. Either use another
macro (`<p>` kbd:[<end>] `</p>` kbd:[<down>] kbd:[<down>] kbd:[<home>]) or this
elisp function:

[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-paragraphify-buffer ()
  "Wrap every line not beginning with < or a newline in <p> tags."
  (interactive)
  (goto-char (point-min))
  (while (re-search-forward "^\\([^<\n].+\\)$" nil t)
    ;; FIXEDCASE is t: otherwise a line in capitals or title case would
    ;; turn the inserted tags into <P>, which is invalid in XHTML.
    (replace-match "<p>\\1</p>" t)))
--------------------------------------------------------------------------------

Once you are done, copy the result into Calibre.

== Add footnotes

Use the method from above to copy the footnotes into the now empty Emacs buffer
and clean them up until you have 1 paragraph per line. Footnotes need to be
link targets, so we can't just wrap them in plain `<p>` tags; they need IDs. I
like to use `<li epub:type="endnote" id="fn1">[…]</li>` if the footnote numbers
are increasing throughout the book or
`<li epub:type="endnote" id="fn1_1">[…]</li>` if they start with 1 in each
chapter. We are going to use a macro with a counter to generate consecutively
numbered IDs. First, set the counter to 1 with kbd:[C-x] kbd:[C-k] kbd:[C-c]
`1`. Then, record this macro:

`<li epub:type="endnote" id="fn1_` kbd:[C-x] kbd:[C-k] kbd:[<tab>] `">`
kbd:[<end>] `</li>` kbd:[<down>] kbd:[<down>] kbd:[<home>]

[TIP]
Use kbd:[M-x] `describe-key` (mapped to kbd:[C-h] kbd:[k] by default) to find
out what a key combination does.

Call the macro until every footnote is wrapped and copy them to the end of the
chapter or the end of the book in Calibre. Wrap them in
`<section epub:type="endnotes" role="doc-endnotes"><ol>[…]</ol></section>`.

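The counter macro can also be written as a function. The following is a minimal
sketch under assumptions: the function name and the `chapter` argument are made
up for this example, and every non-empty line in the buffer is treated as one
footnote.

[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-footnotes-to-list-items (chapter)
  "Wrap every non-empty line in a numbered endnote <li> element.
CHAPTER is the number used in the generated IDs, e.g. 1 yields
fn1_1, fn1_2 and so on."
  (interactive "*nChapter number: ")
  (save-excursion
    (goto-char (point-min))
    (let ((counter 1))
      (while (re-search-forward "^.+$" nil t)
        ;; FIXEDCASE and LITERAL are t so the footnote text is inserted
        ;; unchanged, even if it contains backslashes.
        (replace-match
         (format "<li epub:type=\"endnote\" id=\"fn%d_%d\">%s</li>"
                 chapter counter (match-string 0))
         t t)
        (setq counter (1+ counter))))))
--------------------------------------------------------------------------------

Call it with kbd:[M-x] `my/html-footnotes-to-list-items` and enter the chapter
number when asked.
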
[NOTE]
If the notes appear at the bottom of each page, they are called “footnotes”
(`epub:type="footnote"`). If they appear at the end of each chapter or the end
of the book, they are called “endnotes” (`epub:type="endnote"`). Some e-book
readers display notes on the same page as the link to them, even if you put all
notes at the end of the chapter or book.

=== Add references to footnotes

The footnote references are probably superscript numbers in the PDF but normal
numbers in the EPUB right now. I found that the footnote numbers are usually
preceded by a space and followed by a space or `<`. I use the find & replace
function in Calibre in Regex mode to convert them to hyperlinks.

Find: `` ([0-9]{1,3})([ <])`` (note the leading space) +
Replace: `<sup><a epub:type="noteref" role="doc-noteref" href="#fn1_\1">\1</a></sup>\2`

Press kbd:[<f3>] to search through the text and kbd:[C-r] to replace. Check each
match before replacing, since ordinary numbers in the text match the pattern as
well.

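If you would rather link the references in Emacs before copying a chapter to
Calibre, the same search can be expressed as an Emacs regular expression, which
needs backslashes before grouping parentheses and braces. A minimal sketch: the
function name and both arguments are assumptions made for this example. It asks
before each replacement because ordinary numbers in the text match as well.

[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-link-footnote-refs (chapter &optional no-query)
  "Turn footnote numbers in the buffer into noteref links.
CHAPTER is the number used in the target IDs (fnCHAPTER_N).
Ask before each replacement unless NO-QUERY is non-nil."
  (interactive "*nChapter number: ")
  (let ((replacement
         (format (concat "<sup><a epub:type=\"noteref\" role=\"doc-noteref\""
                         " href=\"#fn%d_\\1\">\\1</a></sup>\\2")
                 chapter)))
    (save-excursion
      (goto-char (point-min))
      ;; Same pattern as the Calibre search: a space, 1 to 3 digits,
      ;; then a space or "<".
      (while (re-search-forward " \\([0-9]\\{1,3\\}\\)\\([ <]\\)" nil t)
        (when (or no-query
                  (save-match-data
                    (y-or-n-p (format "Link footnote %s? "
                                      (match-string 1)))))
          ;; FIXEDCASE is t so the tags are inserted exactly as written.
          (replace-match replacement t))))))
--------------------------------------------------------------------------------

Called with kbd:[M-x] it asks for the chapter number and then for each
candidate; from Lisp, a non-nil second argument replaces every match without
asking.
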
== Finishing touches

Click menu:Tools[Table of Contents > Edit Table of Contents], remove the
existing entry and click btn:[Generate ToC from major headings] or btn:[Generate
ToC from all headings].

Click menu:Tools[Set semantics] and set the location of the title page,
copyright page, beginning of text and so on.

Select menu:Tools[Check book] and fix the errors.

Use link:{url-epubcheck}[EPUBCheck] (command-line) and/or link:{url-ace-app}[The
Ace App] ({abbr-gui}) for more thorough checks.

You're done! Enjoy your cleanly formatted book. 😊

== Updates

* [2021-03-21] Added ``epub:type``s to examples. See
  link:{url-helicon-type}[Helicon Books: Footnotes in EPUB3],
  link:{url-epub-spec}#sec-epub-type-attribute[the EPUB specification] and
  link:{url-ap-semantics}[AccessiblePublishing: EPUB Semantics, ARIA Roles, &
  Metadata] for more information.
* [2021-03-25] Clarified difference between footnotes and endnotes, replaced
  `<span><aside>` with `<li>`.
* [2021-03-27] Added conversion to EPUB 3, used HTML 5 and ARIA attributes in
  the example, added accessibility advice.
* [2021-03-29] Removed `role="doc-footnote"` from ``<li>``s because it
  link:{url-epubcheck-footnote}[breaks the list semantics for Assistive
  Technology users].

// LocalWords: Calibre