---
title: "How I convert PDFs to EPUB semi-automatically"
slug: how-i-convert-pdfs-to-epub
description: "A step by step guide to clean EPUBs from PDFs using Calibre, Emacs and time."
date: 2021-03-15T04:12:00+01:00
type: posts
draft: false
tags:
- EPUB
- E-books
- Emacs
toc: true
---
:source-highlighter: pygments
:experimental: true
:toc:
:toclevels: 2
:url-calibre: https://calibre-ebook.com/
:url-calibre-convert: https://manual.calibre-ebook.com/conversion.html#pdfconversion
:url-emacs: https://www.gnu.org/software/emacs/
:url-emacs-key-notation: https://www.emacswiki.org/emacs/EmacsKeyNotation
:wp-pdf: https://en.wikipedia.org/wiki/PDF
:wp-epub: https://en.wikipedia.org/wiki/EPUB
:wp-xhtml: https://en.wikipedia.org/wiki/XHTML
:wp-css: https://en.wikipedia.org/wiki/CSS

Sometimes e-books come only in link:{wp-pdf}[PDF] format.
PDFs are almost always a pain to read on e-book readers.
You can use link:{url-calibre}[Calibre] to convert them automatically, but link:{url-calibre-convert}[the results are okay-ish at best].
If the PDF has footnotes, forget it.
Unfortunately, the books that most often come only as PDFs are science books, and these usually have a lot of footnotes.

One option is to convert with Calibre and then fix the result, but I have found that I get better results in less time when I create a new link:{wp-epub}[EPUB], copy the PDF's content into link:{url-emacs}[Emacs], clean it up there and then copy it over to Calibre.
This process is what I want to share with you here.

To follow this guide you will need Calibre, Emacs (or another editor with keyboard macros) and some knowledge of link:{wp-xhtml}[XHTML] and link:{wp-css}[CSS].
The process takes a long time and is boring, but the result is a clean and enjoyable book.

[NOTE]
I will use the link:{url-emacs-key-notation}[Emacs key notation] throughout this guide.

== Create a new book in Calibre

Click on menu:Add books[Add empty book].
Then fill in the metadata and select “EPUB” as format.
You can add more metadata and a cover image by right-clicking the book and then selecting menu:Edit metadata[].

Open Calibre's editor by right-clicking the book and selecting menu:Edit book[].
You start with a single XHTML file, `start.xhtml`.
I always use that for the title page, the copyright notice and so on.
You can force a page break to separate the title and the copyright notice with CSS: add `style="page-break-after: always;"` to the last element of the virtual “page” or use a CSS class.

To add a CSS file, click menu:File[New file] and enter a filename ending with `.css`.
Add the CSS file to the document by right-clicking `start.xhtml` in the file browser and selecting menu:Link stylesheets…[].

[NOTE]
The built-in preview does not show page breaks.

Your files should look similar to this:

.`start.xhtml`
[source,html]
--------------------------------------------------------------------------------
Emma Goldman
1. Auflage
München, Januar 2020
Anti-Copyright (siehe S. 362)
Die englische Originalausgabe erschien 1921 und 1925 in den USA aufgrund eines Versehens in zwei Teilen unter den Titeln My Disillusionment in Russia und My Further Disillusionment in Russia.
--------------------------------------------------------------------------------

.`style.css`
[source,css]
--------------------------------------------------------------------------------
.pagebreak {
    page-break-after: always;
}

.center {
    text-align: center;
}
--------------------------------------------------------------------------------

I added the IDs “title” and “copyright” so that I can add semantic links to them later.

=== Some styling advice

Please use CSS sparingly.
Most people have configured their e-book readers the way they like, with the right font, font size, margins and so on.
If you override their settings, they will be annoyed.
I usually only style the title page and headlines.

Do not use `<i>` or `<b>` tags to emphasize, and do not use `font-style: italic` or `font-weight: bold`.
Use `<em>` for emphasis and `<strong>` for importance so that screen readers are able to pronounce the text differently.

== Add text to the book

Add a new `.xhtml` file in Calibre and write in the heading of the first chapter.
Then switch to Emacs and copy the first paragraph from the PDF into a `text-mode` buffer.
The emphasis will not be copied over, so you'll have to re-add it.
We ignore (but keep) the footnote numbers for now.
Repeat with the rest of the paragraphs of the chapter, leaving 2 blank lines between paragraphs.
The paragraphs will be broken across lines and will likely be full of hyphens at the ends of the lines.

.elisp function to add HTML tags easily
[source,elisp]
--------------------------------------------------------------------------------
(defun my/html-surround-with-tag (beg end)
  "Surround region with HTML tag."
  (interactive "*r")
  (if (region-active-p)
      (let ((tag (completing-read "Tag: " '("blockquote" "em" "strong"))))
        ;; Replace the region with <tag>region</tag>.
        (insert (concat "<" tag ">"
                        (delete-and-extract-region beg end)
                        "</" tag ">")))
    (message "No active region")))
--------------------------------------------------------------------------------

Make sure that `auto-fill-mode` is disabled.
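
If you would rather not repair the broken lines and end-of-line hyphens by hand, a small function can do the rough work.
The following is only a minimal sketch and not a tested part of this workflow: the name `my/join-broken-paragraphs` is made up, it assumes that paragraphs are separated by at least one blank line, and it guesses that a hyphen at the end of a line marks a word break only when the next line starts with a lowercase letter.

.Hypothetical helper to join hard-wrapped paragraphs
[source,elisp]
--------------------------------------------------------------------------------
(defun my/join-broken-paragraphs ()
  "Join the hard-wrapped lines of every paragraph in the buffer.
Hypothetical helper.  Paragraphs are assumed to be separated by at
least one blank line.  A hyphen at the end of a line is removed only
if the next line starts with a lowercase letter; this is a heuristic,
so proofread the result."
  (interactive "*")
  ;; With case folding enabled, [[:lower:]] would also match capitals.
  (let ((case-fold-search nil))
    (save-excursion
      (goto-char (point-min))
      ;; Find every newline that directly follows some text.
      (while (re-search-forward "[^\n]\n" nil t)
        ;; Skip newlines followed by a blank line or the end of the
        ;; buffer, so the paragraph separators stay untouched.
        (unless (or (eobp) (eq (char-after) ?\n))
          (delete-char -1)
          (cond
           ;; "Tren-" + "nung": drop the hyphen, join without a space.
           ((and (eq (char-before) ?-) (looking-at-p "[[:lower:]]"))
            (delete-char -1))
           ;; "Anarcho-" + "Syndikalismus": keep the hyphen, no space.
           ((eq (char-before) ?-))
           ;; Ordinary line break: replace it with a single space.
           (t (insert " "))))))))
--------------------------------------------------------------------------------

Run it with `M-x my/join-broken-paragraphs` in the `text-mode` buffer.
Dashes used as punctuation at the end of a line and compounds whose second part is lowercase will be joined incorrectly, so check hyphenated words afterwards.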
Position the cursor at the start of the buffer and press kbd:[` tags, except block quotes and sub-headlines. Either use another macro (`
kbd:[
tags."
  (interactive)
  (goto-char (point-min))
  (while (re-search-forward "^\\([^< ].+\\)$" nil t)
    (replace-match "<p>\\1</p>")))
--------------------------------------------------------------------------------

Once you are done, copy the result into Calibre.

== Add footnotes

Use the method from above to copy the footnotes into the now empty Emacs buffer and clean them up until you have 1 paragraph per line.
Footnotes need to be hyperlink-able, so we can't just wrap them in plain `<p>` tags; they need IDs.
I like to use `1
[…]
` if there is only one footnote-section or `1[…]
` for chapter-footnotes. We are going to use a macro with a counter to generate consecutively numbered IDs. First, set the counter to 1 with `kbd:[C-x] kbd:[C-k] kbd:[C-c] 1`. Then, record this macro: ` kbd:[C-x] kbd:[C-k] kbd:[ kbd:[