---
title: "How I convert PDFs to EPUB semi-automatically"
slug: how-i-convert-pdfs-to-epub
description: "A step by step guide to clean EPUBs from PDFs using Calibre, Emacs and time."
date: 2021-03-15T04:12:00+01:00
type: posts
draft: false
tags:
- EPUB
- E-books
- Emacs
toc: true
---
:source-highlighter: pygments
:experimental: true
:toc:
:toclevels: 2
:url-calibre: https://calibre-ebook.com/
:url-calibre-convert: https://manual.calibre-ebook.com/conversion.html#pdfconversion
:url-emacs: https://www.gnu.org/software/emacs/
:url-emacs-key-notation: https://www.emacswiki.org/emacs/EmacsKeyNotation
:wp-pdf: https://en.wikipedia.org/wiki/PDF
:wp-epub: https://en.wikipedia.org/wiki/EPUB
:wp-xhtml: https://en.wikipedia.org/wiki/XHTML
:wp-css: https://en.wikipedia.org/wiki/CSS

Sometimes e-books come only in link:{wp-pdf}[PDF] format, and PDFs are almost
always a pain to read on e-book readers. You can use link:{url-calibre}[Calibre]
to convert them automatically, but link:{url-calibre-convert}[the results are
okay-ish at best]. If the PDF has footnotes, forget it. Unfortunately, the books
that most often come only as PDFs are science books, and these usually have a
lot of footnotes.

One option is to convert with Calibre and then fix the result, but I have found
that I get better results in less time when I create a new link:{wp-epub}[EPUB],
copy the PDF's content into link:{url-emacs}[Emacs], clean it up there and then
copy it over to Calibre. This process is what I want to share with you here.
To follow this guide you will need Calibre, Emacs or another editor with
keyboard macros, and some knowledge of link:{wp-xhtml}[XHTML] and
link:{wp-css}[CSS]. It takes a long time and is tedious, but the result is a
clean and enjoyable book.

[NOTE]
I will use the link:{url-emacs-key-notation}[Emacs key notation] throughout this
guide.
== Create a new book in Calibre
Click on menu:Add books[Add empty book], fill in the metadata and select “EPUB”
as the format. You can add more metadata and a cover image by right-clicking the
book and selecting menu:Edit metadata[]. Open Calibre's editor by right-clicking
the book and selecting menu:Edit book[].

You start with a single XHTML file, `start.xhtml`. I always use it for the title
page, the copyright notice and so on. To separate the title from the copyright
notice, force a page break with CSS: either add
`style="page-break-after: always;"` to the last element of the virtual “page” or
use a CSS class. To add a CSS file, click menu:File[New file] and enter a
filename ending in `.css`. Link the CSS file to the document by right-clicking
`start.xhtml` in the file browser and selecting menu:Link stylesheets…[].

[NOTE]
The built-in preview does not show page breaks.
Your files should look similar to this:
.`start.xhtml`
[source,html]
--------------------------------------------------------------------------------
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="de">
  <head>
    <title>Meine zwei Jahre in Russland</title>
    <link rel="stylesheet" type="text/css" href="style.css"/>
  </head>
  <body>
    <div class="pagebreak center" id="title">
      <p>Emma Goldman</p>
      <h1>Meine zwei Jahre in Russland</h1>
    </div>
    <p>1. Auflage<br/>
    München, Januar 2020</p>
    <p id="copyright">Anti-Copyright (siehe S. 362)</p>
    <p>Die englische Originalausgabe erschien 1921 und 1925 in den
    USA aufgrund eines Versehens in zwei Teilen unter den Titeln
    <em>My Disillusionment in Russia</em> und <em>My Further Disillusionment in
    Russia.</em></p>
  </body>
</html>
--------------------------------------------------------------------------------
.`style.css`
[source,css]
--------------------------------------------------------------------------------
.pagebreak {
    page-break-after: always;
}

.center {
    text-align: center;
}
--------------------------------------------------------------------------------
I added the IDs “title” and “copyright” so that the semantic entries for the
title page and the copyright page can point to them later.

=== Some styling advice
Please use CSS sparingly. Most people have configured their e-book readers the
way they like, with their preferred font, font size, margins and so on. If you
override their settings, they will be annoyed. I usually only style the title
page and headlines.

Do not use `<i>` or `<b>` tags to emphasize, and do not use
`font-style: italic` or `font-weight: bold`. Use `<em>` for emphasis and
`<strong>` for importance so that screen readers can pronounce the text
differently.

== Add text to the book
Add a new `.xhtml` file in Calibre and type in the heading of the first chapter.
Then switch to Emacs and copy the first paragraph from the PDF into a
`text-mode` buffer. The emphasis is not copied over, so you will have to re-add
it by hand. Ignore the footnote numbers for now, but keep them in the text.
Repeat with the remaining paragraphs of the chapter, leaving 2 blank lines
between paragraphs. The pasted paragraphs will be broken across several lines,
and many of those lines will end in a hyphen.

.elisp function to add HTML tags easily
[source,elisp]
--------------------------------------------------------------------------------
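(defun my/html-surround-with-tag (beg end)
  "Surround the region between BEG and END with an HTML tag."
  (interactive "*r")
  (if (region-active-p)
      ;; Offer the most common tags; any other input is accepted as well.
      (let ((tag (completing-read "Tag: "
                                  '("blockquote" "em" "strong"))))
        ;; Cut the region and re-insert it between the opening and closing tag.
        (insert (concat "<" tag ">"
                        (delete-and-extract-region beg end)
                        "</" tag ">")))
    (message "No active region")))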
--------------------------------------------------------------------------------
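
In case it is not obvious how to call the function: one way is to bind it to a
key in `text-mode`. This is only a sketch, and the key kbd:[C-c] kbd:[t] is an
arbitrary choice for illustration, not something the function requires.

[source,elisp]
--------------------------------------------------------------------------------
;; Assumption: "C-c t" is still free in your setup; any other key works too.
;; `text-mode-map' is available by default, so no `require' is needed.
(define-key text-mode-map (kbd "C-c t") #'my/html-surround-with-tag)
--------------------------------------------------------------------------------

Select the text that was emphasized in the PDF, press kbd:[C-c] kbd:[t], choose
`em`, and the region is wrapped in `<em>` tags.
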
Make sure that `auto-fill-mode` is disabled. Position the cursor on the first
line of the buffer and press kbd:[<f3>] to start recording a macro. Press
kbd:[<end>] kbd:[<deletechar>] kbd:[SPC] (space bar) to join the next line to
the current one, then press kbd:[<f4>] to stop recording. If the current line
ended in a hyphen before the join, press kbd:[<backspace>] 2 times to remove the
space and the hyphen. Press kbd:[<f4>] to call the macro again and repeat until
you reach the end of the paragraph. Then move the cursor to the first line of
the next paragraph and repeat…
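
If replaying the macro line by line gets too tedious, the same clean-up could be
sketched as an elisp function that works on the selected paragraph. The name
`my/join-paragraph-lines` is made up for this example, and the function assumes
that every hyphen at the end of a line is a hyphenation artifact. That is not
true for compound words that were broken at their own hyphen, so check the
result.

[source,elisp]
--------------------------------------------------------------------------------
(defun my/join-paragraph-lines (beg end)
  "Join the lines between BEG and END into a single line.
A hyphen directly before a line break is removed together with the
break; every other line break becomes one space."
  (interactive "*r")
  (save-excursion
    (save-restriction
      (narrow-to-region beg end)
      ;; Pass 1: "foot-\nnotes" becomes "footnotes".
      (goto-char (point-min))
      (while (re-search-forward "-\n" nil t)
        (unless (eobp) (replace-match "" t t)))
      ;; Pass 2: the remaining breaks become a space.  A newline at the very
      ;; end of the region is kept so the paragraph stays on its own line.
      (goto-char (point-min))
      (while (re-search-forward "[ \t]*\n[ \t]*" nil t)
        (unless (eobp) (replace-match " " t t))))))
--------------------------------------------------------------------------------

Select one paragraph at a time and call the function with kbd:[M-x]
`my/join-paragraph-lines`. Selecting several paragraphs at once would join them
into one line.
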
Now you should have a text file with 1 paragraph per line. We need to wrap all
lines in `<p>` tags, except block quotes and sub-headlines. Either use another
macro (`<p>` kbd:[<end>] `</p>` kbd:[<down>] kbd:[<down>] kbd:[<home>]) or this
elisp function:
[source,elisp]
--------------------------------------------------------------------------------
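(defun my/html-paragraphify-buffer ()
  "Wrap every non-empty line that does not begin with < in <p> tags."
  (interactive)
  (goto-char (point-min))
  ;; [^<\n] skips empty lines and lines that already start with a tag.
  (while (re-search-forward "^\\([^<\n].*\\)$" nil t)
    ;; FIXEDCASE t stops `replace-match' from changing the case of the
    ;; inserted tags on lines written in capitals or title case.
    (replace-match "<p>\\1</p>" t)))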
--------------------------------------------------------------------------------
Once you are done, copy the result into Calibre.
== Add footnotes
Use the method from above to copy the footnotes into the now empty Emacs buffer
and clean them up until you have 1 paragraph per line. Footnotes must be
linkable, so plain `<p>` tags are not enough: they need IDs. I like to use
`<span>1</span><p id="fn1">[…]</p>` if the book has only one footnote section,
or `<span>1</span><p id="fn1_1">[…]</p>` for per-chapter footnotes. We are going
to use a macro with a counter to generate consecutively numbered IDs. First, set
the counter to 1 with kbd:[C-x] kbd:[C-k] kbd:[C-c] `1` kbd:[RET]. Then record
this macro:

`<span>` kbd:[C-x] kbd:[C-k] kbd:[<tab>] kbd:[C-u] `-1` kbd:[C-x] kbd:[C-k]
kbd:[C-a] `</span><p id="fn` kbd:[C-x] kbd:[C-k] kbd:[<tab>] `">` kbd:[<end>]
`</p>` kbd:[<down>] kbd:[<down>] kbd:[<home>]

[NOTE]
kbd:[C-x] kbd:[C-k] kbd:[<tab>] inserts the counter and then increments it.
kbd:[C-u] `-1` kbd:[C-x] kbd:[C-k] kbd:[C-a] “adds” -1 to the counter, so that
we can use the same number again.

Call the macro until every footnote is wrapped, then copy the footnotes to
Calibre.
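
As an alternative to the counter macro, the numbering could also be done by a
small elisp function in the style of `my/html-paragraphify-buffer`. This is a
sketch under assumptions: the name `my/html-footnotify-buffer` and the `prefix`
argument are invented for this example, and the function expects the buffer to
contain nothing but footnotes, 1 per line.

[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-footnotify-buffer (&optional prefix)
  "Wrap every non-empty line in numbered footnote markup.
PREFIX is the part of the ID before the number; it defaults to \"fn\"."
  (interactive (list (read-string "ID prefix: " "fn")))
  (let ((prefix (or prefix "fn"))
        (n 0))
    (goto-char (point-min))
    (while (re-search-forward "^\\(.+\\)$" nil t)
      (setq n (1+ n))
      ;; FIXEDCASE t: insert the markup exactly as written.
      (replace-match
       (format "<span>%d</span><p id=\"%s%d\">\\1</p>" n prefix n)
       t))))
--------------------------------------------------------------------------------

With the default prefix the IDs are `fn1`, `fn2` and so on. For chapter
footnotes, enter a prefix such as `fn1_` to get `fn1_1`, `fn1_2` and so on. Run
it only once per buffer, because it wraps every non-empty line, including lines
that are already wrapped.
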
=== Add references to footnotes
The footnote references are probably superscript numbers in the PDF but ordinary
numbers in the EPUB right now. I found that they are usually preceded by a space
and followed by a space or `<`. I use Calibre's find & replace function in Regex
mode to convert them to hyperlinks.

Find: ``&nbsp;([0-9]{1,3})([ <])`` (note the leading space) +
Replace: `<sup><a href="#fn\1">\1</a></sup>\2`

Press kbd:[<f3>] to jump to the next match and kbd:[C-r] to replace it. Stepping
through the matches one by one lets you skip ordinary numbers that happen to fit
the pattern.
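
If you would rather do this step in Emacs before copying the chapter over, the
same pattern can be translated to Emacs regexp syntax. The following is a sketch
of that translation, not part of the Calibre workflow. The names
`my/footnote-ref-regexp`, `my/footnote-ref-replacement` and
`my/html-link-footnote-refs` are invented for this example.

[source,elisp]
--------------------------------------------------------------------------------
(defconst my/footnote-ref-regexp " \\([0-9]\\{1,3\\}\\)\\([ <]\\)"
  "A space, a number of 1 to 3 digits, then a space or \"<\".")

(defconst my/footnote-ref-replacement
  "<sup><a href=\"#fn\\1\">\\1</a></sup>\\2"
  "Superscript link to the footnote whose ID is \"fn\" plus the number.")

(defun my/html-link-footnote-refs ()
  "Ask for every candidate whether to turn it into a footnote link."
  (interactive)
  (goto-char (point-min))
  ;; Query instead of replacing blindly: years, page numbers and other
  ;; ordinary numbers fit the pattern as well.
  (query-replace-regexp my/footnote-ref-regexp
                        my/footnote-ref-replacement))
--------------------------------------------------------------------------------

During the query, kbd:[y] replaces the current match and kbd:[n] skips it. As in
the Calibre version, the space in front of the number is dropped.
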
== Finishing touches
Click menu:Tools[Table of Contents > Edit Table of Contents], remove the
existing entry and click btn:[Generate ToC from major headings] or btn:[Generate
ToC from all headings].

Click menu:Tools[Set semantics] and set the location of the title page,
copyright page, beginning of text and so on.

Select menu:Tools[Check book] and fix the errors.

You're done! Enjoy your cleanly formatted book. 😊

// LocalWords: Calibre