Commit graph

13 commits

Author SHA1 Message Date
Marty Oehme 9e713193a8
refactor: Fix circular exception import 2024-06-14 15:18:22 +02:00
Marty Oehme 6b35b2f918
chore: Fix strict pyright analysis errors 2024-06-14 15:13:24 +02:00
Marty Oehme 8093259551
refactor: Remove pymupdf coupling in extraction
The library is only needed for pdf extraction which is taken care of
in its own extractor plugin. In the overall extraction routine we do not
need any knowledge of the existence of pymupdf.
2024-06-14 14:59:39 +02:00
Marty Oehme 163fd63038
fix: Fixed pocketbook extractor trying to read all files
Previously, the complete read routine ran before checking whether the
file has an XML mimetype. This meant it first tried to read any file
into memory, including PDFs and even binaries. Of course, doing
so crashed the program.
2024-01-25 21:42:34 +01:00
Marty Oehme c8e8453b68
feat: Add advanced pocketbook detection heuristic
Added heuristic which checks for the existence of a specific
meta tag written to the pocketbook XHTML file.
2024-01-24 14:57:10 +01:00
Marty Oehme 6a8f8a03bc
refactor: Extract pocketbook file opening method 2024-01-24 14:55:28 +01:00
Marty Oehme c2a5190237
refactor: Improve module availability checks
Followed the ruff (Pyflakes) suggestion to use importlib utilities directly
instead of attempting imports and catching the resulting errors.
2024-01-24 12:27:21 +01:00
Marty Oehme e7e5258b34
feat: Only activate pocketbook extractor optionally
Since we make the dependencies for pocketbook html extraction optional
as an extra, this commit ensures the extractor (and cli option) only
gets loaded when they exist.
2024-01-24 12:07:04 +01:00
Marty Oehme c53cd563b7
feat: Add pocketbook extraction 2024-01-24 08:56:21 +01:00
Marty Oehme ddb34fca7b
refactor: Move tagging by color to Annotation 2024-01-24 08:53:54 +01:00
Marty Oehme 11d570f9d8
refactor: Rename annotation content variables
Renamed the two variables describing an annotation's highlighted PDF-text and
its appended note if any exists. Previously called 'text' (for the in-PDF
highlighted content) and 'content' (for the additional supplied content).

Now they are called 'content' for the highlighted in-PDF words
and 'note' for the note appended (or not) to an annotation.
2024-01-23 09:54:36 +01:00
Marty Oehme a51205954c
refactor: Extract extractor list to extractor module 2024-01-23 09:21:46 +01:00
Marty Oehme 3b4db7b6b8
refactor: Extract PDF extractor into class
Extractor is a general protocol with the PDF extraction routine now being
one implementation of the protocol. Preparation for adding multiple
extractors (epub, djvu, or specific programmes) in the future.
2024-01-20 18:02:51 +01:00
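
The extractor-related commits above (3b4db7b6b8, e7e5258b34, c2a5190237, 163fd63038) describe a general extractor protocol, optional loading based on module availability, and a mimetype check that runs before any file is read. The following is a minimal sketch of how those ideas could fit together, not the project's actual code. The names `Extractor`, `can_process`, `run`, `PocketBookExtractor`, `module_available`, and `available_extractors` are hypothetical, and the accepted mimetype is an assumption; only `importlib.util.find_spec` and `mimetypes.guess_type` are real standard-library calls.

```python
from __future__ import annotations

import importlib.util
import mimetypes
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class Annotation:
    content: str    # text highlighted in the document
    note: str = ""  # optional note appended to the highlight


class Extractor(Protocol):
    """Hypothetical protocol; the real method names may differ."""

    def can_process(self, filename: Path) -> bool: ...

    def run(self, filename: Path) -> list[Annotation]: ...


class PocketBookExtractor:
    """Stub implementation of the protocol; parsing is omitted."""

    def can_process(self, filename: Path) -> bool:
        # Decide from the guessed mimetype alone, before opening the file,
        # so PDFs or binaries are never read into memory.
        mime, _ = mimetypes.guess_type(filename.name)
        return mime == "text/html"  # assumed mimetype for this sketch

    def run(self, filename: Path) -> list[Annotation]:
        return []  # real extraction would parse the file here


def module_available(name: str) -> bool:
    # find_spec returns None when a top-level module cannot be found,
    # so no try/except around an import statement is needed.
    return importlib.util.find_spec(name) is not None


def available_extractors(
    candidates: list[tuple[Extractor, tuple[str, ...]]],
) -> list[Extractor]:
    # Keep only the extractors whose required modules are all installed.
    return [
        extractor
        for extractor, required in candidates
        if all(module_available(module) for module in required)
    ]
```

Pairing each extractor with the modules it requires keeps the overall extraction routine unaware of any specific backend library, which matches the intent stated in commit 8093259551.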