---
title: "How I convert PDFs to EPUB semi-automatically"
slug: how-i-convert-pdfs-to-epub
description: "A step-by-step guide to clean EPUBs from PDFs using Calibre, Emacs and time."
date: 2021-03-15T04:12:00+01:00
type: posts
draft: false
tags:
- E-books
- Calibre
- Emacs
toc: true
---
:source-highlighter: pygments
:experimental: true
:toc:
:toclevels: 2
:url-calibre: https://calibre-ebook.com/
:url-calibre-convert: https://manual.calibre-ebook.com/conversion.html#pdfconversion
:url-emacs: https://www.gnu.org/software/emacs/
:url-emacs-key-notation: https://www.emacswiki.org/emacs/EmacsKeyNotation
:url-helicon-type: https://www.heliconbooks.com/?id=blog&postid=EPUB3Footnotes
:url-epub-spec: https://www.w3.org/publishing/epub/epub-contentdocs.html
:url-ap-semantics: https://www.accessiblepublishing.ca/epub-semantic-aria-roles/
:url-epub3-rendering: http://idpf.org/forum/topic-623
:url-epubcheck: https://github.com/w3c/epubcheck
:url-epubcheck-footnote: {url-epubcheck}/issues/1018#issuecomment-809385963
:url-ace-app: https://daisy.github.io/ace/getting-started/ace-app/
:wp-pdf: https://en.wikipedia.org/wiki/PDF
:wp-epub: https://en.wikipedia.org/wiki/EPUB
:wp-xhtml: https://en.wikipedia.org/wiki/XHTML
:wp-css: https://en.wikipedia.org/wiki/CSS
:abbr-gui: pass:[<abbr title='Graphical User Interface'>GUI</abbr>]

Sometimes e-books come only in link:{wp-pdf}[PDF] format, and PDFs are almost
always a pain to read on e-book readers. You can use link:{url-calibre}[Calibre]
to convert them automatically, but link:{url-calibre-convert}[the results are
okay-ish at best]. If the PDF has footnotes, forget it. Unfortunately, the books
that most often come only as PDFs are science books, and these usually have a
lot of footnotes.

One option is to convert with Calibre and then fix the result, but I have found
that I get better results in less time when I create a new
link:{wp-epub}[EPUB], copy the PDF's content into link:{url-emacs}[Emacs], clean
it up there and then copy it over to Calibre. This process is what I want to
share with you here. To follow this guide you will need Calibre, Emacs or
another editor with keyboard macros, and some knowledge of
link:{wp-xhtml}[XHTML] and link:{wp-css}[CSS]. It takes a long time and is
tedious, but the result is a clean and enjoyable book.

[NOTE]
I will use the link:{url-emacs-key-notation}[Emacs key notation] throughout this
guide.

== Create a new book in Calibre
Click on menu:Add books[Add empty book]. Then fill in the metadata and select
“EPUB” as format. You can add more metadata and a cover image by right-clicking
the book and then selecting menu:Edit metadata[]. Open Calibre's editor by
right-clicking on the book and selecting menu:Edit book[].

Calibre creates EPUB 2 books by default. Convert the book to EPUB
3footnote:[EPUB 3 introduces many accessibility features, see
link:{url-ap-semantics}[AccessiblePublishing: EPUB Semantics, ARIA Roles, &
Metadata] for details.] by clicking menu:Tools[Upgrade book internals]. This
will, among other things, convert `toc.ncx` to `nav.xhtml`. To
link:{url-epub3-rendering}[support e-book readers which can't handle EPUB 3
yet], re-create `toc.ncx` as an empty file. It is filled automatically when you
create the table of contents. Open `metadata.opf` and replace `<spine>` with
`<spine toc="id1">` (`id1` is the ID of `toc.ncx`, defined in the `<manifest>` a
few lines above).
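If you are unsure which ID your `toc.ncx` has, the lookup described above can be scripted. The following is only a minimal Python sketch and not part of the Calibre workflow itself: the function name `find_ncx_id` and the sample manifest are made up for illustration, and the IDs in your own `metadata.opf` may differ. It searches the manifest for the item with the NCX media type and prints the `<spine>` tag to paste in.

```python
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"
NCX_MEDIA_TYPE = "application/x-dtbncx+xml"


def find_ncx_id(opf_xml: str) -> str:
    """Return the manifest ID of the NCX file (toc.ncx) in an OPF document."""
    root = ET.fromstring(opf_xml)
    for item in root.iter(OPF_NS + "item"):
        if item.get("media-type") == NCX_MEDIA_TYPE:
            return item.get("id")
    raise ValueError("no NCX item in the manifest; re-create toc.ncx first")


# Hypothetical, heavily shortened metadata.opf (assumption, for illustration only).
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="id" href="start.xhtml" media-type="application/xhtml+xml"/>
    <item id="id1" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="id2" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
  </manifest>
  <spine>
    <itemref idref="id"/>
  </spine>
</package>"""

# Print the opening tag that should replace <spine> in metadata.opf.
print(f'<spine toc="{find_ncx_id(SAMPLE_OPF)}">')
```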

You start with a single XHTML file, `start.xhtml`. I always use that for the
title page, the copyright notice and so on. You can force a page break to
separate the title and the copyright notice with CSS: add
`style="page-break-after: always;"` to the last element of the virtual “page” or
use a CSS class. To add a CSS file, click menu:File[New file] and enter a
filename ending with `.css`. Add the CSS file to the document by right-clicking
on `start.xhtml` in the file browser and selecting menu:Link stylesheets…[].

[NOTE]
The built-in preview does not show page breaks.

Your files should look similar to this:

.`start.xhtml`
[source,html]
--------------------------------------------------------------------------------
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" lang="de">
  <head>
    <title>Meine zwei Jahre in Russland</title>
    <link rel="stylesheet" type="text/css" href="style.css"/>
  </head>
  <body>
    <section class="pagebreak center" id="title" aria-label="Title">
      <p>Emma Goldman</p>
      <h1>Meine zwei Jahre in Russland</h1>
    </section>
    <section aria-label="Meta data">
      <p>1. Auflage<br/>
      München, Januar 2020</p>
      <p id="copyright">Anti-Copyright (siehe S. 362)</p>
      <p>Die englische Originalausgabe erschien 1921 und 1925 in den
      USA aufgrund eines Versehens in zwei Teilen unter den Titeln
      <em>My Disillusionment in Russia</em> und <em>My Further Disillusionment in
      Russia.</em></p>
    </section>
  </body>
</html>
--------------------------------------------------------------------------------
.`style.css`
[source,css]
--------------------------------------------------------------------------------
.pagebreak {
    page-break-after: always;
}

.center {
    text-align: center;
}
--------------------------------------------------------------------------------
I added the IDs “title” and “copyright” to add semantic links to them later.

=== Some styling advice
Please refrain from using CSS too much. Most people have configured their e-book
readers the way they like, with the right font, font size, margins and so on. If
you override their settings, they will be annoyed. I usually only style the
title page and headlines.

Do not use `<i>` or `<b>` tags to emphasize, and do not use `font-style: italic`
or `font-weight: bold`. Use `<em>` for emphasis and `<strong>` for
importance so screen readers will be able to pronounce the text differently.

Make your books as accessible as possible. Using the right tags and not just
`<div>` and `<span>` for everything is a good start; using `epub:type` and/or
`role` as well as `aria-label` or `aria-labelledby` is even better. Read more at
link:{url-ap-semantics}[AccessiblePublishing: EPUB Semantics, ARIA Roles, &
Metadata].

== Add text to the book
Add a new `.xhtml` file in Calibre and type in the heading of the first
chapter. Then switch to Emacs and copy the first paragraph from the PDF into a
`text-mode` buffer. The emphasis will not be copied over, so you'll have to
re-add it. Ignore (but keep) the footnote numbers for now. Repeat with the
rest of the paragraphs of the chapter, leaving 2 blank lines between each
paragraph. The paragraphs will be broken across lines, and many of those lines
will end in a hyphen.

.Elisp function to add HTML tags easily
[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-surround-with-tag (beg end)
  "Surround region with HTML tag."
  (interactive "*r")
  (if (region-active-p)
      (let ((tag (completing-read "Tag: "
                                  '("blockquote" "em" "strong"))))
        (insert (concat "<" tag ">"
                        (delete-and-extract-region beg end)
                        "</" tag ">")))
    (message "No active region")))
--------------------------------------------------------------------------------
Make sure that `auto-fill-mode` is disabled. Position the cursor at the start of
the buffer and press kbd:[<f3>] to start recording a macro. Press kbd:[<end>]
kbd:[<deletechar>] kbd:[SPC] (space bar) and then kbd:[<f4>] to stop
recording. This joins the next line onto the current one. If the line ended in
a hyphen, press kbd:[<backspace>] twice to remove the space and the hyphen.
Press kbd:[<f4>] to call the macro and repeat until you are at the end of the
paragraph. Move the cursor to the first line of the next paragraph and repeat…
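If you would rather script this clean-up than replay a macro, the same joining logic can be sketched in Python. This is an illustration under assumptions, not part of the Emacs workflow: the function name `join_paragraphs` is made up, paragraphs are assumed to be separated by at least one blank line, and every hyphen at the end of a line is treated as a line-break hyphen. That last assumption also glues together words that are legitimately hyphenated, so check the output by eye.

```python
import re


def join_paragraphs(raw: str) -> list[str]:
    """Turn text copied from a PDF into one string per paragraph.

    Paragraphs are separated by one or more blank lines. Within a paragraph,
    a line ending in "-" is glued to the next line without the hyphen; every
    other line break becomes a single space.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        text = ""
        for line in (l.strip() for l in block.splitlines()):
            if not line:
                continue
            if text.endswith("-"):
                text = text[:-1] + line   # drop the hyphen, add no space
            elif text:
                text += " " + line        # ordinary line break
            else:
                text = line               # first line of the paragraph
        paragraphs.append(text)
    return paragraphs


sample = "The para-\ngraphs will be\nbroken.\n\n\nSecond para-\ngraph."
for paragraph in join_paragraphs(sample):
    print(paragraph)
```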

Now you should have a text file with 1 paragraph per line. We need to wrap all
lines in `<p>` tags, except block quotes and sub-headlines. Either use another
macro (`<p>` kbd:[<end>] `</p>` kbd:[<down>] kbd:[<down>] kbd:[<home>]) or this
elisp function:

[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-paragraphify-buffer ()
  "Wrap every line not beginning with < or a newline in <p> tags."
  (interactive)
  (goto-char (point-min))
  (while (re-search-forward "^\\([^<\n].+\\)$" nil t)
    (replace-match "<p>\\1</p>")))
--------------------------------------------------------------------------------
Once you are done, copy the result into Calibre.
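For readers working outside Emacs, the wrapping step could look roughly like the Python sketch below. It is an assumption-laden illustration, not the method used above: the name `paragraphify` is made up, and the ampersand escaping is an extra that the elisp function does not do. It is included because XHTML rejects a bare `&`, which text copied from a PDF may well contain.

```python
import re

# An "&" that does not already start an entity such as &amp; or &#160;
BARE_AMPERSAND = re.compile(r"&(?!#?\w+;)")


def paragraphify(text: str) -> str:
    """Wrap every non-empty line that does not start with "<" in <p> tags."""
    out = []
    for line in text.splitlines():
        line = BARE_AMPERSAND.sub("&amp;", line)
        if line.strip() and not line.startswith("<"):
            line = f"<p>{line}</p>"
        out.append(line)
    return "\n".join(out)


print(paragraphify("Tom & Jerry\n\n<blockquote>quoted</blockquote>"))
```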

== Add footnotes
Use the method from above to copy the footnotes into the now empty Emacs buffer
and clean them up until you have 1 paragraph per line. Footnotes need to be
hyperlink-able, so we can't just wrap them in plain `<p>` tags; they need IDs. I
like to use `<li epub:type="endnote" id="fn1">[…]</li>` if the footnote numbers
are increasing throughout the book or `<li epub:type="endnote"
id="fn1_1">[…]</li>` if they start with 1 in each chapter. We are going to use a
macro with a counter to generate consecutively numbered IDs. First, set the
counter to 1 with kbd:[C-x] kbd:[C-k] kbd:[C-c] `1`. Then, record this macro:

`<li epub:type="endnote" id="fn1_` kbd:[C-x] kbd:[C-k] kbd:[<tab>] `">`
kbd:[<end>] `</li>` kbd:[<down>] kbd:[<down>] kbd:[<home>]

[TIP]
Use kbd:[M-x] `describe-key` (mapped to kbd:[C-h] kbd:[k] by default) to find
out what a key combination does.

Call the macro until every footnote is wrapped and copy them to the end of the
chapter or the end of the book in Calibre. Wrap them in `<section
epub:type="endnotes" role="doc-endnotes"><ol>[…]</ol></section>`.
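The counter macro and the wrapping `<section>` can also be approximated in a few lines of Python. Treat this as a sketch under assumptions: `endnotes_section` and its `chapter` parameter are hypothetical names, the input is assumed to be a list with one cleaned-up note per entry, and the ID scheme is the per-chapter `fn1_1` variant from above.

```python
def endnotes_section(notes: list[str], chapter: int = 1) -> str:
    """Number the notes consecutively and wrap them in an endnotes section."""
    items = [
        f'<li epub:type="endnote" id="fn{chapter}_{n}">{note}</li>'
        for n, note in enumerate(notes, start=1)
    ]
    return (
        '<section epub:type="endnotes" role="doc-endnotes"><ol>\n'
        + "\n".join(items)
        + "\n</ol></section>"
    )


print(endnotes_section(["First note.", "Second note."], chapter=1))
```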

[NOTE]
If the notes appear at the bottom of each page, they are called “footnotes”
(`epub:type="footnote"`). If they appear at the end of each chapter or the end
of the book, they are called “endnotes” (`epub:type="endnote"`). Some e-book
readers display footnotes on the same page as the link to them, even if you
put all footnotes at the end of the chapter / book.

=== Add references to footnotes
The footnote references are probably superscript numbers in the PDF but normal
numbers in the EPUB right now. I found that the footnote numbers are usually
preceded by a space and followed by a space or `<`. I use the find & replace
function in Calibre in regex mode to convert them to hyperlinks.

Find: ``&nbsp;([0-9]{1,3})([ <])`` (note the leading space) +
Replace: `<sup><a epub:type="noteref" role="doc-noteref" href="#fn1_\1">\1</a></sup>\2`

Press kbd:[<f3>] to search through the text and kbd:[C-r] to replace.
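To see what the pattern does before running it on a whole chapter, you can try it on a sample string. The Python sketch below uses the same find and replace expressions; the function name `link_note_numbers` and the `chapter` parameter are made up for illustration. Unlike stepping through the matches in Calibre, it replaces every match at once, so ordinary numbers of up to three digits inside a sentence become links too and have to be reverted by hand.

```python
import re

# A space, 1 to 3 digits, then a space or the start of a tag.
NOTE_NUMBER = re.compile(r" ([0-9]{1,3})([ <])")


def link_note_numbers(xhtml: str, chapter: int = 1) -> str:
    """Turn plain note numbers into superscript links to #fn<chapter>_<number>."""
    replacement = (
        r'<sup><a epub:type="noteref" role="doc-noteref" '
        rf'href="#fn{chapter}_\1">\1</a></sup>\2'
    )
    # The leading space is dropped, exactly as in the Calibre replace expression.
    return NOTE_NUMBER.sub(replacement, xhtml)


print(link_note_numbers("<p>Some claim. 12 More text 3</p>"))
```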

== Finishing touches
Click menu:Tools[Table of Contents > Edit Table of Contents], remove the
existing entry and click btn:[Generate ToC from major headings] or btn:[Generate
ToC from all headings].

Click menu:Tools[Set semantics] and set the location of the title page,
copyright page, beginning of text and so on.

Select menu:Tools[Check book] and fix the errors.

Use link:{url-epubcheck}[EPUBCheck] (command-line) and/or link:{url-ace-app}[The
Ace App] ({abbr-gui}) for more thorough checks.

You're done! Enjoy your cleanly formatted book. 😊

== Updates
* [2021-03-21] Added ``epub:type``s to examples. See
link:{url-helicon-type}[Helicon Books: Footnotes in EPUB3],
link:{url-epub-spec}#sec-epub-type-attribute[the EPUB specification] and
link:{url-ap-semantics}[AccessiblePublishing: EPUB Semantics, ARIA Roles, &
Metadata] for more information.
* [2021-03-25] Clarified difference between footnotes and endnotes, replaced
`<span><aside>` with `<li>`.
* [2021-03-27] Added conversion to EPUB 3, use HTML 5 and ARIA attributes in
example, added accessibility advice.
* [2021-03-29] Removed `role="doc-footnote"` from ``<li>``s because it
link:{url-epubcheck-footnote}[breaks the list semantics for Assistive
Technology users].
// LocalWords: Calibre