Use TOC to find out headline #11

tastytea · 2021-06-04T11:08:58+02:00

tastytea commented

2021-06-04 11:08:58 +02:00

In some books, headlines are not in `<h1>` to `<h6>` tags but in `<p>` tags or other elements. With the TOC we can still extract the last headline.

EPUB 2: OPF file: `<spine toc="[ID]">` refers to `<manifest><item id="[ID]" href="toc.ncx" […]`

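EPUB 3: OPF file: `<guide><reference type="toc" href="xhtml/inhalt.xhtml" […]` (optional; `<guide>` is a legacy EPUB 2 element, and EPUB 3 books normally mark the navigation document with `properties="nav"` on a manifest `<item>`)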

Idea: Grab the href to the TOC in `zip::list_spine()` and inject the headlines in `search::cleanup_text()` if we can't identify them in the usual way. Where do we store the href? Maybe we need a class, or a struct with metadata.
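
A minimal sketch of the lookup described above, assuming the OPF file is already in memory as a string. The names `attribute` and `find_toc_href` are hypothetical and not part of epubgrep, and the regex matching is only a stand-in for a real XML parser: it handles double-quoted attributes only.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: value of attribute `name` inside a single tag,
// or an empty string if the attribute is missing.
std::string attribute(const std::string &tag, const std::string &name)
{
    const std::regex re{"\\s" + name + "\\s*=\\s*\"([^\"]*)\""};
    std::smatch match;
    if (std::regex_search(tag, match, re))
    {
        return match[1].str();
    }
    return {};
}

// Hypothetical helper: href of the TOC named in the OPF text,
// or an empty string if none is found.
std::string find_toc_href(const std::string &opf)
{
    // First try <spine toc="ID"> and resolve the ID in the manifest.
    std::smatch spine;
    const std::regex spine_re{"<spine\\b[^>]*>"};
    if (std::regex_search(opf, spine, spine_re))
    {
        const std::string toc_id{attribute(spine[0].str(), "toc")};
        const std::regex item_re{"<item\\b[^>]*>"};
        for (auto it = std::sregex_iterator(opf.begin(), opf.end(), item_re);
             !toc_id.empty() && it != std::sregex_iterator(); ++it)
        {
            const std::string tag{it->str()};
            if (attribute(tag, "id") == toc_id)
            {
                return attribute(tag, "href");
            }
        }
    }

    // Fall back to <guide><reference type="toc" href="...">.
    const std::regex ref_re{"<reference\\b[^>]*>"};
    for (auto it = std::sregex_iterator(opf.begin(), opf.end(), ref_re);
         it != std::sregex_iterator(); ++it)
    {
        const std::string tag{it->str()};
        if (attribute(tag, "type") == "toc")
        {
            return attribute(tag, "href");
        }
    }
    return {};
}
```

The returned href is relative to the OPF file, so a caller would still have to resolve it against the OPF directory before opening it.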

tastytea added the enhancement label 2021-06-04 11:09:04 +02:00

tastytea commented

2021-06-04 12:24:05 +02:00

Both `zip::list_spine()` and `search::cleanup_text()` are called from `search::search()`. We could return the TOC from `list_spine()`, which would need a new return type. Something like:

``` c++
#include <string>
#include <vector>

struct epub_file
{
    std::string toc;                    // href of the TOC
    std::vector<std::string> filepaths; // files listed in the spine
};
```

tastytea commented

2021-06-04 12:31:21 +02:00

We could parse the TOC from `search()` and pass ID→headline pairs to `cleanup_text()`. If there is no ID (the href points to the file without a `#` fragment), leave the ID empty.
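
Building those pairs could look roughly like the sketch below. It assumes the TOC has already been parsed into a list of href and headline entries; `toc_entry` and `headlines_for` are hypothetical names, not existing epubgrep code.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical TOC entry as read from the TOC document.
struct toc_entry
{
    std::string href;     // e.g. "ch1.xhtml#s2" or just "ch1.xhtml"
    std::string headline; // text of the TOC entry
};

// Hypothetical helper: collect the ID→headline pairs that point into
// `file`. An href without '#' yields an empty ID.
std::vector<std::pair<std::string, std::string>>
headlines_for(const std::vector<toc_entry> &toc, const std::string &file)
{
    std::vector<std::pair<std::string, std::string>> pairs;
    for (const auto &entry : toc)
    {
        const auto hash = entry.href.find('#');
        // substr(0, npos) returns the whole href when there is no '#'.
        if (entry.href.substr(0, hash) != file)
        {
            continue;
        }
        std::string id;
        if (hash != std::string::npos)
        {
            id = entry.href.substr(hash + 1);
        }
        pairs.emplace_back(id, entry.headline);
    }
    return pairs;
}
```

A vector of pairs rather than a map keeps the TOC order and allows several entries with an empty ID.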

In `cleanup_text()` we search for the tag with the ID and replace it with the headline. If the ID is empty, we add the headline at the top.
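
A rough sketch of that injection step, under several assumptions: `inject_headlines` is a hypothetical stand-in and not the real `search::cleanup_text()`, the headline is emitted as an `<h1>` element so that the usual detection could pick it up, and the ID is matched as a plain `id="…"` substring with double quotes. As a simplification it inserts the heading before the tag that carries the ID instead of replacing the element, which avoids having to find the matching closing tag.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: put each headline into `text` as an <h1> element.
std::string inject_headlines(
    std::string text,
    const std::vector<std::pair<std::string, std::string>> &pairs)
{
    for (const auto &pair : pairs)
    {
        // The headline is not escaped here; a real version would need to.
        const std::string heading{"<h1>" + pair.second + "</h1>"};
        if (pair.first.empty())
        {
            // No ID: the headline belongs to the whole file.
            text.insert(0, heading);
            continue;
        }

        // Naive match; would also hit attributes such as data-id="…".
        const std::string needle{"id=\"" + pair.first + "\""};
        const auto attr = text.find(needle);
        if (attr == std::string::npos)
        {
            continue;
        }
        // Walk back to the start of the tag that carries the ID.
        const auto tag_start = text.rfind('<', attr);
        if (tag_start == std::string::npos)
        {
            continue;
        }
        text.insert(tag_start, heading);
    }
    return text;
}
```

Pairs whose ID does not occur in the text are skipped silently, so the sketch leaves the text unchanged when the TOC and the file do not match.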

Reference: tastytea/epubgrep#11