Add PDF → EPUB article.
parent efb4fb25d0
commit de98ade9c0
210
content/posts/how-i-convert-pdfs-to-epub.adoc
Normal file
@@ -0,0 +1,210 @@
---
title: "How I convert PDFs to EPUB semi-automatically"
slug: how-i-convert-pdfs-to-epub
description: "A guide to clean EPUBs from PDFs using Calibre, Emacs and time."
date: 2021-03-15T04:12:00+01:00
type: posts
draft: false
tags:
- epub
- e-books
- emacs
---

:source-highlighter: pygments

:url-calibre: https://calibre-ebook.com/
:url-calibre-convert: https://manual.calibre-ebook.com/conversion.html#pdfconversion
:url-emacs: https://www.gnu.org/software/emacs/
:wp-pdf: https://en.wikipedia.org/wiki/PDF
:wp-epub: https://en.wikipedia.org/wiki/EPUB
:wp-xhtml: https://en.wikipedia.org/wiki/XHTML
:wp-css: https://en.wikipedia.org/wiki/CSS

Sometimes e-books are only available as link:{wp-pdf}[PDF] files, and PDFs are
almost always a pain to read on e-book readers. You can use
link:{url-calibre}[Calibre] to convert them automatically, but
link:{url-calibre-convert}[the results are okay-ish at best]. If the PDF has
footnotes, forget it. Unfortunately, the books that most often come only as PDFs
are science books, and these usually have a lot of footnotes.

One option is to convert with Calibre and then fix the result, but I have found
that I get better results in less time when I create a new link:{wp-epub}[EPUB],
copy the PDF's content into link:{url-emacs}[Emacs], clean it up there and then
copy it over to Calibre. This is the process I'd like to share with you here. To
follow this recipe you will need Calibre, Emacs or another editor with keyboard
macros, and some knowledge of link:{wp-xhtml}[XHTML] and link:{wp-css}[CSS]. It
takes a long time and is tedious, but the result is a clean and enjoyable book.

== Create a new book in Calibre

Click on “Add books” → “Add empty book”, then fill in the metadata and select
“EPUB” as the format. You can add more metadata and a cover image by
right-clicking the book and selecting “Edit metadata”. Open Calibre's editor by
right-clicking the book and selecting “Edit book”. You start with a single XHTML
file, `start.xhtml`. I always use it for the title page, the copyright notice
and so on. You can force a page break between the title and the copyright notice
with CSS: add `style="page-break-after: always;"` to the last element of the
virtual “page”, or use a CSS class. To add a CSS file, click “File” → “New file”
and enter a filename ending in `.css`. Link it by right-clicking `start.xhtml`
in the file browser and selecting “Link stylesheets…”. Note that the built-in
preview does not show page breaks.

Your files should look similar to this:

.`start.xhtml`
[source,html]
--------------------------------------------------------------------------------
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">

  <head>
    <title>Meine zwei Jahre in Russland</title>
    <link rel="stylesheet" type="text/css" href="style.css"/>
  </head>

  <body>

    <div class="pagebreak center" id="title">

      <p>Emma Goldman</p>

      <h1>Meine zwei Jahre in Russland</h1>

    </div>

    <p>1. Auflage<br/>
    München, Januar 2020</p>

    <p id="copyright">Anti-Copyright (siehe S. 362)</p>

    <p>Die englische Originalausgabe erschien 1921 und 1925 in den
    USA aufgrund eines Versehens in zwei Teilen unter den Titeln
    <em>My Disillusionment in Russia</em> und <em>My Further Disillusionment in
    Russia.</em></p>

  </body>

</html>
--------------------------------------------------------------------------------

.`style.css`
[source,css]
--------------------------------------------------------------------------------
.pagebreak {
    page-break-after: always;
}
.center {
    text-align: center;
}
--------------------------------------------------------------------------------

I added the IDs “title” and “copyright” so that the semantics set in the final
step can point to them.

=== Some styling advice

Please use CSS sparingly. Most people have configured their e-book readers the
way they like, with their preferred font, font size, margins and so on. If you
override their settings, they will be annoyed. I usually only style the title
page and headings.

Do not use `<i>` or `<b>` tags to emphasize text, and do not use
`font-style: italic` or `font-weight: bold`. Use `<em>` for emphasis and
`<strong>` for importance, so that screen readers can pronounce the text
differently.

== Add text to the book

Add a new `.xhtml` file in Calibre and type in the heading of the first
chapter. Then switch to Emacs and copy the first paragraph from the PDF into a
`text-mode` buffer. The emphasis is not copied over, so you'll have to re-add
it. Ignore (but keep) the footnote numbers for now. Repeat with the remaining
paragraphs of the chapter, leaving 2 blank lines between paragraphs. The pasted
paragraphs will be broken into short lines, many of them ending in a hyphen.

.Elisp function to add HTML tags easily
[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-surround-with-tag (beg end)
  "Surround region with HTML tag."
  (interactive "*r")
  (if (region-active-p)
      (let ((tag (completing-read "Tag: "
                                  '("blockquote" "em" "strong"))))
        (insert (concat "<" tag ">"
                        (delete-and-extract-region beg end)
                        "</" tag ">")))
    (message "No active region")))
--------------------------------------------------------------------------------

Make sure that `auto-fill-mode` is disabled. Position the cursor at the start of
the buffer and press `<f3>` to start recording a macro. Press `<end>`
`<deletechar>` `SPC` (space bar) and then `<f4>` to stop recording. The macro
joins the next line onto the current one. If the line ended in a hyphen, press
`<backspace>` twice to remove the space and the hyphen. Press `<f4>` to call the
macro and repeat until you reach the end of the paragraph. Then move the cursor
to the first line of the next paragraph and repeat.
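
For readers without Emacs, the joining step can also be sketched in Python. This
is a rough illustration rather than the workflow described above: the function
name `join_wrapped_lines` is made up, and the sketch assumes that paragraphs are
separated by at least one blank line and that every hyphen at the end of a line
is a line-break hyphen.

```python
def join_wrapped_lines(text: str) -> str:
    """Turn hard-wrapped paragraphs into one line per paragraph."""
    paragraphs = []
    current = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the paragraph collected so far (if any).
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Assumption: a trailing hyphen is a line-break hyphen, so the
            # hyphen is dropped and the word halves are glued together.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    # One paragraph per line, separated by a single blank line.
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "The revo-\nlution was\nnot over.\n\n\nSecond para-\ngraph here."
    print(join_wrapped_lines(sample))
```

Words that legitimately contain a hyphen at the line break (for example
“e-book”) are glued together as well, so the output still needs a read-through.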

Now you should have a text file with 1 paragraph per line. All lines need to be
wrapped in `<p>` tags, except block quotes and subheadings. Either use another
macro (“<p>” `<end>` “</p>” `<down>` `<down>` `<home>`) or this elisp function:

[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-paragraphify-buffer ()
  "Wrap every non-empty line that does not begin with < in <p> tags."
  (interactive)
  (goto-char (point-min))
  ;; [^<\n] skips empty lines and lines that already start with a tag.
  (while (re-search-forward "^\\([^<\n].+\\)$" nil t)
    (replace-match "<p>\\1</p>")))
--------------------------------------------------------------------------------
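
The regular expression in the function above is not specific to Emacs. As a
sketch, assuming the cleaned-up text is available as a Python string, the same
substitution looks like this (the name `paragraphify` is hypothetical):

```python
import re


def paragraphify(text: str) -> str:
    """Wrap lines that do not start with "<" (and are not empty) in <p> tags."""
    # Same pattern as the elisp version; re.MULTILINE makes ^ and $ match
    # at every line, which is what Emacs does by default.
    return re.sub(r"^([^<\n].+)$", r"<p>\1</p>", text, flags=re.MULTILINE)


if __name__ == "__main__":
    sample = (
        "<h2>Title</h2>\n\n"
        "First paragraph.\n\n"
        "<blockquote>Quote</blockquote>\n\n"
        "Second paragraph."
    )
    print(paragraphify(sample))
```

Lines that already start with a tag, such as the heading and the block quote in
the sample, are left untouched.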

Once you are done, copy the result into Calibre.

== Add footnotes

Use the method from above to copy the footnotes into the now empty Emacs buffer
and clean them up until you have 1 paragraph per line. Footnotes need to be
hyperlink targets, so plain `<p>` tags are not enough: they need IDs. I like to
use `<span>1</span><p id="fn1">[…]</p>` if there is only one footnote section,
or `<span>1</span><p id="fn1_1">[…]</p>` for per-chapter footnotes. We are going
to use a macro with a counter to generate consecutively numbered IDs. First, set
the counter to 1 with `C-x C-k C-c` “1”. Then record this macro:

“<span>” `C-x C-k` `<tab>` `C-u` “-1” `C-x C-k C-a` “</span><p id="fn” `C-x C-k`
`<tab>` “">” `<end>` “</p>” `<down>` `<down>` `<home>`

`C-x C-k` `<tab>` inserts the counter and then increments it. `C-u` “-1”
`C-x C-k C-a` “adds” -1 to the counter, so that the same number can be inserted
a second time.

Call the macro until every footnote is wrapped, then copy them to Calibre.
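
The counter macro can be approximated outside Emacs as well. The following
Python sketch assumes one footnote per non-empty line. The function name
`wrap_footnotes` and its `chapter` and `start` parameters are made up for
illustration, and reading `fn1_1` as “chapter 1, footnote 1” is an assumption.

```python
from typing import Optional


def wrap_footnotes(text: str, chapter: Optional[int] = None, start: int = 1) -> str:
    """Number each non-empty line and wrap it as a linkable footnote."""
    # Assumption: per-chapter IDs look like fn<chapter>_<number>.
    prefix = f"fn{chapter}_" if chapter is not None else "fn"
    wrapped = []
    counter = start
    for line in text.splitlines():
        if not line.strip():
            # Keep blank separator lines as they are.
            wrapped.append(line)
            continue
        wrapped.append(
            f'<span>{counter}</span><p id="{prefix}{counter}">{line.strip()}</p>'
        )
        counter += 1
    return "\n".join(wrapped)


if __name__ == "__main__":
    print(wrap_footnotes("First note.\n\nSecond note."))
    print(wrap_footnotes("A chapter note.", chapter=3))
```

As in the macro, the same number is used twice per footnote: once as the visible
label and once in the ID.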

=== Add references to footnotes

The footnote references are probably superscript numbers in the PDF but plain
numbers in the EPUB right now. I found that the footnote numbers are usually
preceded by a space and followed by a space or `<`. I use the find & replace
function in Calibre in regex mode to convert them to hyperlinks.

Find: `` ([0-9]{1,3})([ <])`` (note the leading space) +
Replace: `<sup><a href="#fn\1">\1</a></sup>\2`

Press `<f3>` to jump to the next match and `C-r` to replace it.
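
To preview what the find & replace pattern does, here is a sketch that applies
the same regular expression with Python's `re` module. The function name
`link_footnote_refs` and the sample sentence are made up. Unlike stepping
through the matches in Calibre, it replaces every match blindly.

```python
import re

# Same pattern as above: a space, 1 to 3 digits, then a space or "<".
FOOTNOTE_REF = re.compile(r" ([0-9]{1,3})([ <])")


def link_footnote_refs(xhtml: str) -> str:
    """Turn bare footnote numbers into superscript links to #fn<number>."""
    # The leading space is consumed so that the superscript sits directly
    # after the preceding word; the trailing space or "<" is kept via \2.
    return FOOTNOTE_REF.sub(r'<sup><a href="#fn\1">\1</a></sup>\2', xhtml)


if __name__ == "__main__":
    sample = "<p>She left in 1921 12 and never returned. 13</p>"
    print(link_footnote_refs(sample))
```

The four-digit year in the sample is left alone because the pattern allows at
most three digits, but a phrase such as “after 12 days” would also be turned
into a link. That is a good reason to confirm each replacement by hand.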

== Finishing touches

Click “Tools” → “Table of Contents” → “Edit Table of Contents”, remove the
existing entry and click “Generate ToC from major headings” or “Generate ToC
from all headings”.

Click “Tools” → “Set semantics” and set the location of the title page,
copyright page, beginning of text and so on.

Select “Tools” → “Check book” and fix the errors.

You're done! Enjoy your cleanly formatted book. 😊


// LocalWords: Calibre