Use TOC to find out headline #11

Open
opened 1 year ago by tastytea · 2 comments
tastytea commented 1 year ago
Owner

In some books, headlines are not in <h1> - <h6> tags but in <p> tags or other elements. With the TOC we can still extract the last headline.

EPUB 2: OPF file: <spine toc="[ID]"> – <manifest><item id="[ID]" href="toc.ncx" […]

EPUB 3: OPF file: <guide><reference type="toc" href="xhtml/inhalt.xhtml" […] (optional)

Idea: Grab the href to the TOC in zip::list_spine() and inject the headlines in search::cleanup_text(), if we can't identify them in the usual way. Where do we store the href? Maybe we need a class, or a struct with metadata.
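The EPUB 2 lookup described above (read the ID from `<spine toc="[ID]">`, then find the manifest item with that ID) could be sketched roughly as follows. `find_toc_href()` is a hypothetical name, not an existing function, and the regex matching is only a shortcut for illustration: it assumes double-quoted attributes, and real code would probably reuse whatever XML parsing `zip::list_spine()` already does. Requires C++17.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper, not part of the existing code: returns the href of the
// EPUB 2 TOC (NCX) referenced by <spine toc="[ID]">, or std::nullopt if the
// OPF contents do not name one.
std::optional<std::string> find_toc_href(const std::string &opf)
{
    // Step 1: read the ID from <spine toc="[ID]">.
    const std::regex re_spine{R"re(<spine\b[^>]*\btoc="([^"]+)")re"};
    std::smatch spine;
    if (!std::regex_search(opf, spine, re_spine))
    {
        return std::nullopt;
    }
    const std::string toc_id = spine[1].str();

    // Step 2: walk the <item> elements and pick the one with that ID.
    const std::regex re_item{R"re(<item\b[^>]*>)re"};
    const std::regex re_id{R"re(\bid="([^"]*)")re"};
    const std::regex re_href{R"re(\bhref="([^"]*)")re"};
    for (std::sregex_iterator it{opf.begin(), opf.end(), re_item}, end;
         it != end; ++it)
    {
        const std::string item = it->str();
        std::smatch id;
        std::smatch href;
        if (std::regex_search(item, id, re_id) && id[1].str() == toc_id
            && std::regex_search(item, href, re_href))
        {
            return href[1].str();
        }
    }
    return std::nullopt;
}
```

Returning `std::optional` makes the "no TOC found" case explicit, so the caller can fall back to the usual headline detection.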

tastytea added the enhancement label 1 year ago
Poster
Owner

Both zip::list_spine() and search::cleanup_text() are called from search::search(). We could return the TOC from list_spine(). We would need a new type for that. Something like:

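#include <string>
#include <vector>

// Proposed return type for zip::list_spine().
struct epub_file
{
    std::string toc;                    // table of contents
    std::vector<std::string> filepaths; // files listed in the spine
};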
Poster
Owner

We could parse the TOC from search() and pass ID→headline pairs to cleanup_text(). If there is no ID (the href points to the file without a #), leave the ID empty.
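Splitting a TOC href into file and ID might look like the sketch below. `split_href()` is a hypothetical helper name; the only rule taken from the comment above is that the ID stays empty when the href has no `#`.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits a TOC href such as "chapter1.xhtml#sec2" into
// {file, ID}. The ID is left empty if the href contains no '#'.
std::pair<std::string, std::string> split_href(const std::string &href)
{
    const auto pos = href.find('#');
    if (pos == std::string::npos)
    {
        return {href, ""};
    }
    return {href.substr(0, pos), href.substr(pos + 1)};
}
```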

In cleanup_text() we search for the tag with the ID and replace it with the headline. If the ID is empty, we add the headline at the top.
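A minimal sketch of that replacement step, under several assumptions: `inject_headlines()` is a hypothetical name, the headline is written back as an `<h1>` element so the usual headline detection can pick it up (the issue does not say which form the injected headline should take), and the plain string search only handles double-quoted `id` attributes preceded by a space, with no nested elements of the same name. Requires C++17.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: replaces the element carrying each ID with an <h1>
// holding the TOC headline. An empty ID puts the headline at the top of the
// text. IDs that do not occur in the text are skipped.
std::string inject_headlines(
    std::string text,
    const std::vector<std::pair<std::string, std::string>> &headlines)
{
    for (const auto &[id, headline] : headlines)
    {
        const std::string replacement = "<h1>" + headline + "</h1>";
        if (id.empty())
        {
            text.insert(0, replacement);
            continue;
        }

        // Locate the id attribute, then the '<' that opens its tag.
        const auto attr = text.find(" id=\"" + id + "\"");
        if (attr == std::string::npos)
        {
            continue;
        }
        const auto open = text.rfind('<', attr);
        if (open == std::string::npos)
        {
            continue;
        }

        // The tag name runs from after '<' to the first whitespace.
        const auto name_end = text.find_first_of(" \t\r\n", open);
        const std::string name = text.substr(open + 1, name_end - open - 1);

        // Replace everything from the opening tag to the matching close tag.
        const std::string closing = "</" + name + ">";
        const auto close = text.find(closing, attr);
        if (close == std::string::npos)
        {
            continue;
        }
        text.replace(open, close + closing.size() - open, replacement);
    }
    return text;
}
```

Taking `text` by value and returning it keeps the function free of side effects, which makes it easy to test in isolation before wiring it into `cleanup_text()`.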
