Use the TOC to find headlines #11

Open
opened 2021-06-04 11:08:58 +02:00 by tastytea · 2 comments
Owner

In some books, headlines are not in `<h1>` to `<h6>` tags but in `<p>` or other tags. With the TOC we can still extract the last headline.

EPUB 2: OPF file: the ID in `<spine toc="[ID]">` refers to `<manifest><item id="[ID]" href="toc.ncx" […]`

EPUB 3: OPF file: `<guide><reference type="toc" href="xhtml/inhalt.xhtml" […]` (optional; `<guide>` is a legacy element carried over from EPUB 2, and EPUB 3 books declare their navigation document with a manifest `<item>` that has `properties="nav"`)

Idea: Grab the href to the TOC in `zip::list_spine()` and inject the headlines in `search::cleanup_text()` if we can't identify them in the usual way. Where do we store the href? Maybe we need a class or a struct with metadata.
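The lookup described above could be sketched roughly as follows. This is only an illustration under loud assumptions: `attribute()` and `find_toc_href()` are hypothetical names that do not exist in epubgrep, the OPF is scanned with `std::regex` instead of a real XML parser, and only double-quoted attributes are recognised.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: returns the value of attribute `name` inside one tag.
// Assumes name="value" with double quotes and a plain attribute name.
std::optional<std::string> attribute(const std::string &tag,
                                     const std::string &name)
{
    assert(!name.empty());
    const std::regex re{"\\s" + name + "\\s*=\\s*\"([^\"]*)\""};
    std::smatch match;
    if (std::regex_search(tag, match, re))
    {
        return match[1].str();
    }
    return std::nullopt;
}

// Hypothetical helper: finds the TOC href in the text of an OPF file.
// First tries <spine toc="ID"> plus the manifest <item> with that ID,
// then falls back to <guide><reference type="toc">.
std::optional<std::string> find_toc_href(const std::string &opf)
{
    const std::regex spine_re{"<spine\\b[^>]*>"};
    const std::regex item_re{"<item\\b[^>]*>"};
    const std::regex reference_re{"<reference\\b[^>]*>"};

    std::smatch spine;
    if (std::regex_search(opf, spine, spine_re))
    {
        const std::string spine_tag{spine.str()};
        if (const auto toc_id = attribute(spine_tag, "toc"))
        {
            for (std::sregex_iterator it{opf.begin(), opf.end(), item_re}, end;
                 it != end; ++it)
            {
                const std::string tag{it->str()};
                if (attribute(tag, "id") == toc_id)
                {
                    return attribute(tag, "href");
                }
            }
        }
    }

    for (std::sregex_iterator it{opf.begin(), opf.end(), reference_re}, end;
         it != end; ++it)
    {
        const std::string tag{it->str()};
        if (attribute(tag, "type") == std::optional<std::string>{"toc"})
        {
            return attribute(tag, "href");
        }
    }

    return std::nullopt;
}
```

The returned href is relative to the OPF file, so a caller would still have to resolve it against the OPF's directory before reading the TOC from the archive.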
tastytea added the enhancement label 2021-06-04 11:09:04 +02:00
Author
Owner

`zip::list_spine()` and `search::cleanup_text()` are both called from `search::search()`. We could return the TOC from `list_spine()`. We would need a new type for that. Something like:

``` c++
#include <string>
#include <vector>

struct epub_file
{
    std::string toc;                    // href of the TOC file
    std::vector<std::string> filepaths; // files listed in the spine
};
```
Author
Owner

We could parse the TOC in `search()` and pass ID→headline pairs to `cleanup_text()`. If there is no ID (the href points to the file without a `#` fragment), leave the ID empty.

In `cleanup_text()` we search for the tag with the ID and replace it with the headline. If the ID is empty, we add the headline at the top.
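A minimal sketch of these two steps, under stated assumptions: `split_href()` and `inject_headline()` are hypothetical names, the headline is wrapped in `<h1>` so that the usual headline detection would pick it up (the comment above does not say which tag to use), and the element is located by plain string search, so nested elements of the same name, self-closing anchors and single-quoted IDs are not handled.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: splits a TOC target such as "ch1.xhtml#sec2" into
// {file path, ID}. The ID stays empty when the href contains no '#'.
std::pair<std::string, std::string> split_href(const std::string &href)
{
    const auto pos = href.find('#');
    if (pos == std::string::npos)
    {
        return {href, ""};
    }
    return {href.substr(0, pos), href.substr(pos + 1)};
}

// Hypothetical helper: replaces the element carrying id="<id>" with
// <h1>headline</h1>. With an empty ID the headline is put at the top.
// The text is returned unchanged if no matching element is found.
std::string inject_headline(const std::string &html, const std::string &id,
                            const std::string &headline)
{
    assert(!headline.empty());
    const std::string replacement{"<h1>" + headline + "</h1>"};
    if (id.empty())
    {
        return replacement + html;
    }

    // Locate the attribute, then walk back to the start of its tag.
    const std::string needle{" id=\"" + id + "\""};
    const auto attr_pos = html.find(needle);
    if (attr_pos == std::string::npos)
    {
        return html;
    }
    const auto start = html.rfind('<', attr_pos);
    if (start == std::string::npos)
    {
        return html;
    }

    // The tag name runs from after '<' to the first whitespace or '>'.
    const auto name_end = html.find_first_of(" \t\r\n>", start + 1);
    const std::string name{html.substr(start + 1, name_end - start - 1)};

    // Assume the first matching closing tag ends the element.
    const std::string closing{"</" + name + ">"};
    const auto close_pos = html.find(closing, attr_pos);
    if (close_pos == std::string::npos)
    {
        return html;
    }

    std::string result{html};
    result.replace(start, close_pos + closing.size() - start, replacement);
    return result;
}
```

With something like this, `search()` could call `split_href()` on every TOC entry, group the resulting ID→headline pairs by file, and hand each file's pairs to `cleanup_text()`, which would apply `inject_headline()` once per pair.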
Reference: tastytea/epubgrep#11