epubgrep

Author	SHA1	Message	Date
tastytea	f8270369b6	Make whitespace-reduction a bit more efficient. We now use 2 passes instead of 3.	2021-06-08 17:30:29 +02:00
tastytea	f59c86e20d	Don't search for whitespace beyond the start/end of the text.	2021-06-06 23:48:06 +02:00
tastytea	0470acb00e	Make --raw work again.	2021-06-06 22:37:09 +02:00
tastytea	1e29608c7e	Fix positioning of matches in search::search().	2021-06-06 22:34:52 +02:00
tastytea	9708bb69c8	Don't attempt to access a pointer to nowhere.	2021-06-06 21:34:48 +02:00
tastytea	b8431019b7	Don't inject page numbers and headline-markers into the text. The metadata is recorded in position → data pairs. Closes: #13	2021-06-06 21:26:09 +02:00
tastytea	a49c500d0f	Fix <style> and <script> erasure. I didn't take into account that <script […]/> is possible.	2021-06-06 16:06:14 +02:00
tastytea	262aab6671	Add debug log for replacements.	2021-06-06 15:52:09 +02:00
tastytea	9067b387ef	Fix pagebreak-iterators. Oopsie! 😄	2021-06-06 15:50:13 +02:00
tastytea	99e1cd8e98	Re-enabled address sanitizer. Found out what was wrong: I fed boost::regex_search() the pointer to a substring that was created in-place. match[2] was a pointer to a substring inside that. The problem was that match was declared outside of the if-block, so after the if-block match[2] would point to a memory address that had already been freed. It didn't have any effect because I didn't use match afterwards. I rewrote the whole thing with iterators. Slightly less readable, slightly better performance (probably).	2021-06-05 17:45:07 +02:00
tastytea	bdf9a86651	Fix pagebreak-regex and range in which pagebreaks are searched.	2021-06-05 17:18:35 +02:00
tastytea	f1a0015f28	Disable address sanitizer. It complains about boost/regex/v5/sub_match.hpp:57:30 and I can't figure out what's wrong or how to ignore it.	2021-06-05 14:24:53 +02:00
tastytea	12e1c64fc0	Make text formatting more readable.	2021-06-05 13:34:48 +02:00
tastytea	7b4b9edfe5	Rename file names in search::matches to make it more clear.	2021-06-01 19:15:00 +02:00
tastytea	a7fae314b3	Log some progress info to log file.	2021-06-01 17:17:00 +02:00
tastytea	07915bdf87	Add lots of debug output.	2021-06-01 15:32:10 +02:00
tastytea	76ed0c9dbf	Un-escape named and numbered entities in documents before searching.	2021-05-30 23:32:35 +02:00
tastytea	7ddfe32e30	Move is_whitespace() and urldecode() to helpers.	2021-05-30 21:52:52 +02:00
tastytea	94564fa914	Strip whitespace from headlines.	2021-05-30 21:16:24 +02:00
tastytea	e7633fe134	Rename prefix to before and suffix to after.	2021-05-30 14:47:18 +02:00
tastytea	6255d665af	Replace tabs with a space in search::cleanup().	2021-05-30 14:37:05 +02:00
tastytea	d7ad180721	Use iterators in search::context() and don't return extra whitespace. Should be easier to understand now.	2021-05-30 13:45:56 +02:00
tastytea	790e60a055	Fix end-of-headline detection.	2021-05-29 23:00:16 +02:00
tastytea	37e868b3f2	Remove <style> and <script> snippets. Closes: #8	2021-05-29 18:52:03 +02:00
tastytea	00e3edb9f2	Only search files in spine, in the right order. The spine lists all content documents in their linear reading order. So we're finally getting our results in the right order! 🎉 Since we skip the images and fonts, which usually make up the most bytes in an EPUB file, the performance increase is immense. I measured 60-70% in a very short test. Closes: #1	2021-05-29 17:34:43 +02:00
tastytea	4ff796a590	Make regular expressions static variables. Fewer allocations → faster program. About 17% speed increase with 89 books on up to 3 cores, measured using the average of 4 runs. Before: ~15.5 seconds. After: ~12.8 seconds. Calls to allocation functions went down from 16,652,583 to 5,059,301.	2021-05-28 19:11:32 +02:00
tastytea	e64591f204	Rework option parsing, change --no-filename. Options are now more easily accessible; --no-filename accepts the values filesystem, in-epub or all.	2021-05-27 17:20:00 +02:00
tastytea	c376ce8466	Print the EPUB file name if more than 1 input file. Change --no-filename to mean: Don't print the EPUB file name.	2021-05-27 14:46:23 +02:00
tastytea	29ae22cc4a	Make regex const.	2021-05-27 09:46:59 +02:00
tastytea	fe02b155f5	Import std::string into epubgrep::search namespace.	2021-05-26 18:02:27 +02:00
tastytea	e1d29c5893	Don't replace stuff in search::cleanup_text() if nothing matched.	2021-05-24 20:02:27 +02:00
tastytea	09090a1c13	Fix bugs in search::context(): don't add context if words == 0; handle beginning / end of text correctly.	2021-05-24 19:57:15 +02:00
tastytea	c790c4952c	Extract page numbers.	2021-05-24 18:56:43 +02:00
tastytea	bb4a4c719f	Wrap headlines in <H> and </H> during cleanup.	2021-05-24 18:08:40 +02:00
tastytea	8ab7d0f655	Extract headlines.	2021-05-24 17:27:30 +02:00
tastytea	972ce1d0fe	Don't strip headlines.	2021-05-24 16:37:30 +02:00
tastytea	bb1a43ca92	Move cleanup_text(), document functions.	2021-05-24 16:23:07 +02:00
tastytea	84e2b387e5	Clean up text before searching.	2021-05-24 16:01:41 +02:00
tastytea	1979956f03	Add basic search functionality and context output.	2021-05-24 15:35:49 +02:00
tastytea	1f82d9927a	Add skeleton for search::search(): type for matches, type for options.	2021-05-24 07:52:36 +02:00

40 Commits
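
Commits 99e1cd8e98 and 4ff796a590 above describe two related changes: searching with iterators instead of a substring created in-place, and making regular expressions static so they are compiled only once. The following is a minimal sketch of that pattern, not the code from the repository. It uses std::regex instead of boost::regex so that it is self-contained, and the names first_title() and first_title_in_prefix() as well as the title="…" pattern are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper, not taken from epubgrep: returns the value of the
// first title="..." attribute found in the range [first, last), if any.
std::optional<std::string> first_title(std::string::const_iterator first,
                                       std::string::const_iterator last)
{
    // Static, so the pattern is compiled once instead of on every call.
    static const std::regex re_title{R"re(title="([^"]*)")re"};

    std::smatch match;
    if (std::regex_search(first, last, match, re_title))
    {
        // A sub_match only stores iterators into the searched text, so
        // copy the capture out while that text is guaranteed to be alive.
        return match[1].str();
    }
    return std::nullopt;
}

// Searches only the first `limit` characters of `text`. The range is
// expressed with iterators into `text` itself, so no temporary substring
// is created that the match could outlive.
std::optional<std::string> first_title_in_prefix(const std::string &text,
                                                 std::size_t limit)
{
    const auto length{std::min(limit, text.size())};
    const auto last{text.begin() + static_cast<std::ptrdiff_t>(length)};
    return first_title(text.begin(), last);
}
```

Because the match result never refers to a temporary, it does not matter where it is declared relative to the if-block. Returning a copied std::string costs one allocation per hit, which is probably acceptable when matches are rare compared to the amount of text scanned.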