papis-extract

Author	SHA1	Message	Date
Marty Oehme	24a4812051	chore: Update beautifulsoup4 dependency Updated dependency and, with its newly provided type hints, removed some pyright overrides. Added a cast where there was still not enough type hinting.	2025-09-12 14:53:03 +02:00
Marty Oehme	1a4b5e3a70	chore: Remove python-magic dependency It relies on the libmagic module which is not necessarily installed everywhere. Most of the functionality that we need for our purposes can be recreated with lighter-weight methods.	2025-09-12 14:53:01 +02:00
Marty Oehme	17c6fefd89	chore: Remove unnecessary imports	2025-09-12 14:53:00 +02:00
Marty Oehme	3eb7f3f1c7	feat: Add Readest extractor	2025-09-12 14:53:00 +02:00
Marty Oehme	fd71482526	chore: Log found files for extractors to debug logger	2025-09-12 14:52:59 +02:00
Marty Oehme	a9ff4152af	fix: Do not parse the last ReadEra section	2025-09-12 14:52:59 +02:00
Marty Oehme	db47ad686d	chore: Implement Annotation sort and equality dunders	2025-09-12 14:52:58 +02:00
Marty Oehme	d840609ecb	fix: Fix annotation value comparison	2025-09-12 14:44:04 +02:00
Marty Oehme	e46219151b	refactor: Use generator for PDF extractor	2025-09-12 10:55:25 +02:00
Marty Oehme	f5455b6946	chore: Fix for additional linting rules	2025-09-12 10:55:23 +02:00
Marty Oehme	96cd4929c9	chore: Format files with ruff	2025-09-12 10:55:23 +02:00
Marty Oehme	a854ef00d6	fix: Write annotations if duplicate detection is off Previously we would never add annotations if the detection is off, because we only added an empty list instead of the actual annotations and would thus break out of writing early.	2025-09-12 10:55:22 +02:00
Marty Oehme	e90a123f88	chore!: Rename force option to duplicates BREAKING CHANGE: Change the `--force/--no-force` cli option to `--duplicates/--no-duplicates` since it describes a little clearer what using it actually achieves (adding quote duplicates or not to output).	2025-09-12 10:55:21 +02:00
Marty Oehme	5f01aa1f2b	feat: Add eof heuristic for readera extractor Every exported ReadEra annotation file also _ends_ with the ubiquitous `*****` pattern, so we look for that to detect the file.	2025-09-12 10:55:21 +02:00
Marty Oehme	3ef45e24f7	feat: Add ReadEra extractor For the readera epub/pdf reader application for android and ios.	2025-09-12 10:55:20 +02:00
Marty Oehme	7a69bd509f	chore: Remove redundant cast	2024-11-30 21:48:59 +01:00
Marty Oehme	9c80281220	fix: Respect minimum color similarity option Previously we would always assign a minimum color similarity of 1.0, regardless of the option set. Now we set a minimum similarity according to the option set in the configuration, otherwise the default set for that option and fall back to a simple default value declared at the top of the file.	2024-11-30 21:45:29 +01:00
Marty Oehme	424ad34c68	refactor: Rename cli options for extractor and template Renamed the extractor selection from the cli to '--input' since it decides the various input formats that annotations are gathered from. Renamed the template selection from the cli to '--output' since it controls the output format that annotations are displayed/written in. This also somewhat more closely mirrors pandoc cli options, which are generally a good guide to follow.	2024-11-30 12:14:55 +01:00
Marty Oehme	103c2ea2fc	chore: Switch to uv packaging and hatch backend Switching this project over to the uv package manager as a pilot project for my personal use. Since this project is not yet widely used I can use it as an experimental playground for discovering uv further without interrupting anybody's workflow.	2024-11-15 11:28:50 +01:00
Marty Oehme	779519f580	fix: Only inform if no extractor finds valid files Until now whenever an extractor could not find any valid files for a document it would inform the user of this case. However, this is not very useful: if you have a pdf and an epub extractor running, it would inform you for each document which only had one of the two formats as well as those which actually did not have any valid files for any of the extractors running. This commit changes the behavior to only inform the user when none of the running extractors find a valid file, since that is the actual case a user might want to be informed about.	2024-06-14 21:50:55 +02:00
Marty Oehme	9e713193a8	refactor: Fix circular exception import	2024-06-14 15:18:22 +02:00
Marty Oehme	6b35b2f918	chore: Fix strict pyright analysis errors	2024-06-14 15:13:24 +02:00
Marty Oehme	8093259551	refactor: Remove pymupdf coupling in extraction The library is only needed for pdf extraction which is taken care of in its own extractor plugin. In the overall extraction routine we do not need any knowledge of the existence of pymupdf.	2024-06-14 14:59:39 +02:00
Marty Oehme	7261e7d80c	chore: Refactor for strict pyright analysis	2024-06-13 21:20:53 +02:00
Marty Oehme	19599a66d7	chore: Black formatting	2024-06-12 11:46:39 +02:00
Marty Oehme	c2aec7add6	feat: Notify formatters if formatting first entry This allows headers to be created by a formatter, which will only be added to the very first entry created and not to each entry. Currently for example this is used to create a csv header but not for each document in turn.	2024-06-12 11:45:35 +02:00
Marty Oehme	b5c081fbf3	feat: Change count display to lead with count The actual count is now the first item on each line, to make it easier to sort, strip, delete and compare afterwards.	2024-06-12 11:16:13 +02:00
Marty Oehme	d087c366c3	chore: Refactor markdown format string handling	2024-06-12 11:05:13 +02:00
Marty Oehme	905b20a79c	fix: Default markdown atx formatter for note exporter	2024-01-25 22:46:38 +01:00
Marty Oehme	163fd63038	fix: Fixed pocketbook extractor trying to read all files The complete read routine would work before figuring out that it is a file of xml mimetype. This means that it would try to read to memory any file as the first thing, pdfs, even binaries. Of course doing so crashed the program.	2024-01-25 21:42:34 +01:00
Marty Oehme	72ddaaf1bc	refactor: Extract exporters to separate module	2024-01-25 21:42:33 +01:00
Marty Oehme	c8e8453b68	feat: Add advanced pocketbook detection heuristic Added heuristic which checks for the existence of a specific meta tag written to the pocketbook XHTML file.	2024-01-24 14:57:10 +01:00
Marty Oehme	6a8f8a03bc	refactor: Extract pocketbook file opening method	2024-01-24 14:55:28 +01:00
Marty Oehme	86d53a19d4	chore: Fix import lint error	2024-01-24 13:39:01 +01:00
Marty Oehme	c2a5190237	refactor: Improve module availability checks Followed ruff (Pyflakes) suggestion to use importlib utils directly instead of try and erroring with imports.	2024-01-24 12:27:21 +01:00
Marty Oehme	e7e5258b34	feat: Only activate pocketbook extractor optionally Since we make the dependencies for pocketbook html extraction optional as an extra, this commit ensures the extractor (and cli option) only gets loaded when they exist.	2024-01-24 12:07:04 +01:00
Marty Oehme	f6c0189529	fix: Tag automation on tag creation Tagging by color only worked on manually invoking the `annotation.color = ()` setter. Now it works on initial instance creation.	2024-01-24 11:20:00 +01:00
Marty Oehme	67bfc30396	refactor: Switch annotation away from dataclass To ease employing getters and setters, we switch the dataclass to a normal python undecorated class.	2024-01-24 11:15:10 +01:00
Marty Oehme	f4a26292a0	fix: Remove debug print statement	2024-01-24 11:13:35 +01:00
Marty Oehme	c53cd563b7	feat: Add pocketbook extraction	2024-01-24 08:56:21 +01:00
Marty Oehme	ddb34fca7b	refactor: Move tagging by color to Annotation	2024-01-24 08:53:54 +01:00
Marty Oehme	11d570f9d8	refactor: Rename annotation content variables Renamed the two variables describing an annotation's highlighted PDF-text and its appended note if any exists. Previously called 'text' (for the in-PDF highlighted content) and 'content' (for the additional supplied content). Now they are called 'content' for the highlighted in-PDF words and 'note' for the appended note given (or not) in an annotation.	2024-01-23 09:54:36 +01:00
Marty Oehme	9169e1c98a	chore: Improve stdout newline handling Always strip all newlines off the end of all entries, and then add a single newline back between entries.	2024-01-23 09:49:31 +01:00
Marty Oehme	2880c06f53	chore: Improve cli option help texts Fixed the write option text to be without wrong negation. Show default settings for flag options.	2024-01-23 09:27:26 +01:00
Marty Oehme	a51205954c	refactor: Extract extractor list to extractor module	2024-01-23 09:21:46 +01:00
Marty Oehme	629932a5e8	feat: Loop through all chosen extractors	2024-01-23 09:10:42 +01:00
Marty Oehme	f477deea7c	feat: Add extractor cli choice Can only choose pdf for the time being, but allows additional extractors to be added in the future.	2024-01-23 08:58:32 +01:00
Marty Oehme	3b4db7b6b8	refactor: Extract PDF extractor into class Extractor is a general protocol with the PDF extraction routine now being one implementation of the protocol. Preparation for adding multiple extractors (epub, djvu, or specific programmes) in the future.	2024-01-20 18:02:51 +01:00
Marty Oehme	765de505bb	refactor: Remove AnnotatedDocument class The AnnotatedDocument class was, essentially, a simple tuple of a document and a list of annotations. While not bad in a vacuum, it is unwieldy and passing this around instead of a document, annotations, or both where necessary is more restrictive and frankly unnecessary. This commit removes the data class and any instances of its use. Instead, we now pass the individual components around to anything that needs them. This also frees us up to pass only annotations around for example. We also do not iterate through the selected papis documents to work on in each exporter anymore (since we only pass a single document), but in the main function itself. This leads to less duplication and makes the overall run function the overall single source of iteration through selected documents. Everything else only knows about a single document - the one it is operating on - which seems much neater. For now, it does not change much, but should make later work on extra exporters or extractors easier.	2024-01-20 16:36:24 +01:00
Marty Oehme	cd5f787220	chore: Update dependencies to fix single-thread warning Fixed single-threaded warning provided from the fitz pymupdf library since the issue does not exist for this new version anymore. Bump version along the way.	2024-01-18 18:26:00 +01:00

77 commits