diff --git a/content/posts/how-i-convert-pdfs-to-epub.adoc b/content/posts/how-i-convert-pdfs-to-epub.adoc
new file mode 100644
index 0000000..9b7f50a
--- /dev/null
+++ b/content/posts/how-i-convert-pdfs-to-epub.adoc
@@ -0,0 +1,210 @@
+---
+title: "How I convert PDFs to EPUB semi-automatically"
+slug: how-i-convert-pdfs-to-epub
+description: "A guide to clean EPUBs from PDFs using Calibre, Emacs and time."
+date: 2021-03-15T04:12:00+01:00
+type: posts
+draft: false
+tags:
+- epub
+- e-books
+- emacs
+---
+
+:source-highlighter: pygments
+
+:url-calibre: https://calibre-ebook.com/
+:url-calibre-convert: https://manual.calibre-ebook.com/conversion.html#pdfconversion
+:url-emacs: https://www.gnu.org/software/emacs/
+:wp-pdf: https://en.wikipedia.org/wiki/PDF
+:wp-epub: https://en.wikipedia.org/wiki/EPUB
+:wp-xhtml: https://en.wikipedia.org/wiki/XHTML
+:wp-css: https://en.wikipedia.org/wiki/CSS
+
+Sometimes e-books come only in link:{wp-pdf}[PDF] format. Almost always PDFs are
+a pain to read on e-book readers. You can use link:{url-calibre}[Calibre] to
+automatically convert them, but link:{url-calibre-convert}[the results are
+okay-ish at best]. If the PDF has footnotes, forget it. Unfortunately, the type
+of books that most often come only as PDFs are science books and these usually
+have a lot of footnotes.
+
+One option is to use Calibre to convert and then fix the result, but I have
+found that I get better results in less time when I create a new
+link:{wp-epub}[EPUB], copy the PDF's content into link:{url-emacs}[Emacs], clean
+it up there and then copy it over to Calibre. This process is what I'd like to
+share with you here. You will need Calibre, Emacs or another editor with
+keyboard macros and some knowledge of link:{wp-xhtml}[XHTML] and
+link:{wp-css}[CSS] to follow this recipe. It takes a long time and is boring, but
+the result is a clean and enjoyable book.
+
+== Create a new book in Calibre
+
+Click on “Add books” → “Add empty book”. Then fill in the metadata and select
+“EPUB” as format. You can add more metadata and a cover image by right-clicking
+the book and then selecting “Edit metadata”. Open Calibre's editor by right
+clicking on the book and selecting “Edit book”. You start with a single XHTML
+file, `start.xhtml`. I always use that for the title page, the copyright notice
+and so on. You can force a page break to separate the title and the copyright
+notice with CSS: Add `style="page-break-after: always;"` to the last element of
+the virtual “page” or use a CSS class. To add a CSS file click “File” → “New
+file” and enter a filename ending with `.css`. Add the CSS file by right
+clicking on `start.xhtml` in the file browser and selecting “Link
+stylesheets…”. Note that the built-in preview does not show page breaks.
+
+Your files should look similar to this:
+
+.`start.xhtml`
+[source,html]
+--------------------------------------------------------------------------------
+
+
+
+
+    Meine zwei Jahre in Russland
+
+
+
+
+
+
+ +

Emma Goldman

+ +

Meine zwei Jahre in Russland

+ +
+ +

1. Auflage
+ München, Januar 2020

+ + + +

Die englische Originalausgabe erschien 1921 und 1925 in den + USA aufgrund eines Versehens in zwei Teilen unter den Titeln + My Disillusionment in Russia und My Further Disillusionment in + Russia.

+
+
+
+
+--------------------------------------------------------------------------------
+
+.`style.css`
+[source,css]
+--------------------------------------------------------------------------------
+.pagebreak {
+  page-break-after: always;
+}
+.center {
+  text-align: center;
+}
+--------------------------------------------------------------------------------
+
+I added the IDs “title” and “copyright” to add semantic links to them later.
+
+=== Some styling advice
+
+Please refrain from using CSS too much. Most people have configured their e-book
+readers the way they like, with the right font, font-size, margins and so on. If
+you override their settings, they will be annoyed. I usually only style the
+title page and headlines.
+
+Do not use `<i>` or `<b>` tags to emphasize, do not use `font-style: italic` or
+`font-weight: bold`. Use `<em>` for emphasis and `<strong>` for
+importance so screen readers will be able to pronounce the text differently.
+
+== Add text to the book
+
+Add a new `.xhtml` file in Calibre and write in the heading of the first
+chapter. Then switch to Emacs and copy the first paragraph from the PDF into a
+`text-mode` buffer. The emphasis will not be copied over, so you'll have to
+re-add it. We ignore (but keep) the footnote numbers for now. Repeat with the
+rest of the paragraphs of the chapter, leaving 2 blank lines between each
+paragraph. The paragraphs will be broken and likely be full of hyphens at the
+end of the lines.
+
+.elisp function to add HTML tags easily
+[source,elisp]
+--------------------------------------------------------------------------------
+(defun my/html-surround-with-tag (beg end)
+  "Surround region with HTML tag."
+  (interactive "*r")
+  (if (region-active-p)
+      (let ((tag (completing-read "Tag: "
+                                  '("blockquote" "em" "strong"))))
+        (insert (concat "<" tag ">"
+                        (delete-and-extract-region beg end)
+                        "</" tag ">")))
+    (message "No active region")))
+--------------------------------------------------------------------------------
+
+Make sure that `auto-fill-mode` is disabled. Position the cursor at the start of
+the buffer and press `<f3>` to start recording a macro. Press `<end>`
+`<delete>` `SPC` (space bar) and then `<f4>` to stop recording. If there is
+a hyphen at the end of the current line, press `<backspace>` 2 times. Press
+`<f4>` to call the macro and repeat until you are at the end of the
+paragraph. Move the cursor to the first line of the next paragraph and repeat.
+
+Now you should have a text file with 1 paragraph per line. We need to wrap all
+lines in `<p>` tags, except block quotes and sub-headlines. Either use another
+macro (“<p>” `<end>` “</p>” `` `` ``) or this elisp function:
+
+[source,elisp]
+--------------------------------------------------------------------------------
+(defun my/html-paragraphify-buffer ()
+  "Wrap every line not beginning with < or a newline in <p> tags."
+  (interactive)
+  (goto-char (point-min))
+  (while (re-search-forward "^\\([^<
+].+\\)$" nil t)
+    (replace-match "<p>\\1</p>")))
+--------------------------------------------------------------------------------
+
+Once you are done, copy the result into Calibre.
+
+== Add footnotes
+
+Use the method from above to copy the footnotes into the now empty Emacs buffer
+and clean them up until you have 1 paragraph per line. Footnotes need to be
+hyperlink-able, so we can't just wrap them in plain `<p>` tags, they need IDs. I
+like to use `1

[…]

` if there is only one
+footnote-section or `1

[…]

` for
+chapter-footnotes. We are going to use a macro with a counter to generate
+consecutively numbered IDs. First, set the counter to 1 with `C-x C-k
+C-c` “1”. Then, record this macro:
+
+“” `C-x C-k` `` `C-u` “-1” `C-x C-k C-a` “

” `` “

” `` `` ``
+
+`C-u` “-1” `C-x C-k C-a` “adds” -1 to the counter, so that we can use the same
+number again.
+
+Call the macro until every footnote is wrapped and copy them to Calibre.
+
+=== Add references to footnotes
+
+The footnotes are probably superscript numbers in the PDF but normal numbers in
+the EPUB right now. I found that the footnote-numbers are usually preceded by a
+space and followed by a space or `<`. I use the find & replace function in
+Calibre in Regex-mode to convert them to hyperlinks.
+
+Find: `` ([0-9]{1,3})([ <])`` (note the leading space)
+
+Replace: `\1\2`
+
+Press `` to search through the text and `C-r` to replace.
+
+== Finishing touches
+
+Click “Tools” → “Table of Contents” → “Edit Table of Contents”, remove the
+existing entry and click “Generate ToC from major headings” or “Generate ToC
+from all headings”.
+
+Click “Tools” → “Set semantics” and set the location of the title page,
+copyright page, beginning of text and so on.
+
+Select “Tools” → “Check book” and fix the errors.
+
+You're done! Enjoy your cleanly formatted book. 😊
+
+
+// LocalWords: Calibre
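+
+If calling the line-joining macro from “Add text to the book” over and over
+gets too tedious, the same clean-up could probably be done by a function. The
+following is only a rough sketch and not part of the recipe above: the name
+`my/join-paragraph-lines` is made up, it assumes that paragraphs are separated
+by blank lines, and it treats every hyphen at the end of a line that is followed
+by a lowercase letter as a hyphenated word, which will sometimes be wrong for
+real compound words.
+
+[source,elisp]
+--------------------------------------------------------------------------------
+(defun my/join-paragraph-lines (text)
+  "Return TEXT with every paragraph joined into a single line.
+Paragraphs are assumed to be separated by one or more blank lines."
+  (with-temp-buffer
+    (insert text)
+    ;; With a non-nil `case-fold-search', [[:lower:]] would also match
+    ;; uppercase letters, so switch it off for the hyphen heuristic.
+    (let ((case-fold-search nil))
+      (goto-char (point-min))
+      ;; Assumption: "-" at the end of a line followed by a lowercase
+      ;; letter on the next line is a word split by hyphenation.
+      (while (re-search-forward "-\n\\([[:lower:]]\\)" nil t)
+        (replace-match "\\1" t)))
+    ;; Turn every remaining single newline inside a paragraph into a
+    ;; space.  Newlines next to a blank line are left alone.
+    (goto-char (point-min))
+    (while (search-forward "\n" nil t)
+      (let ((before (char-before (1- (point))))
+            (after (char-after)))
+        (when (and before after (/= before ?\n) (/= after ?\n))
+          (delete-char -1)
+          (insert " "))))
+    (buffer-string)))
+--------------------------------------------------------------------------------
+
+To try it on the whole buffer, something like
+`(insert (my/join-paragraph-lines (delete-and-extract-region (point-min) (point-max))))`
+should work. You will still have to check the result for compound words that
+lost their hyphen.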