papis-extract

Author	SHA1	Message	Date
Marty Oehme	5526b3d2c5	docs: Update limitation information	2024-05-07 10:54:11 +02:00
Marty Oehme	905b20a79c	fix: Default markdown atx formatter for note exporter	2024-01-25 22:46:38 +01:00
Marty Oehme	163fd63038	fix: Fixed pocketbook extractor trying to read all files The complete read routine would work before figuring out that it is a file of xml mimetype. This means that it would try to read to memory any file as the first thing, pdfs, even binaries. Of course doing so crashed the program.	2024-01-25 21:42:34 +01:00
Marty Oehme	72ddaaf1bc	refactor: Extract exporters to separate module	2024-01-25 21:42:33 +01:00
Marty Oehme	c8e8453b68	feat: Add advanced pocketbook detection heuristic Added heuristic which checks for the existence of a specific meta tag written to the pocketbook XHTML file.	2024-01-24 14:57:10 +01:00
Marty Oehme	6a8f8a03bc	refactor: Extract pocketbook file opening method	2024-01-24 14:55:28 +01:00
Marty Oehme	86d53a19d4	chore: Fix import lint error	2024-01-24 13:39:01 +01:00
Marty Oehme	c2a5190237	refactor: Improve module availability checks Followed ruff (Pyflakes) suggestion to use importlib utils directly instead of try and erroring with imports.	2024-01-24 12:27:21 +01:00
Marty Oehme	2f41906e6a	docs: Add extractor and install info Added extractor info for the two currently existing extractors. Added install recommendation for pipx.	2024-01-24 12:21:51 +01:00
Marty Oehme	e7e5258b34	feat: Only activate pocketbook extractor optionally Since we make the dependencies for pocketbook html extraction optional as an extra, this commit ensures the extractor (and cli option) only gets loaded when they exist.	2024-01-24 12:07:04 +01:00
Marty Oehme	e1e09b6011	chore: Format pyproject	2024-01-24 12:05:41 +01:00
Marty Oehme	f6c0189529	fix: Tag automation on tag creation Tagging by color only worked on manually invoking the `annotation.color = ()` setter. Now it works on initial instance creation.	2024-01-24 11:20:00 +01:00
Marty Oehme	333bd279b9	fix: Fix test fixtures for new annotation structure	2024-01-24 11:15:10 +01:00
Marty Oehme	67bfc30396	refactor: Switch annotation away from dataclass To ease employing getters and setters, we switch the dataclass to a normal python undecorated class.	2024-01-24 11:15:10 +01:00
Marty Oehme	f4a26292a0	fix: Remove debug print statement	2024-01-24 11:13:35 +01:00
Marty Oehme	c53cd563b7	feat: Add pocketbook extraction	2024-01-24 08:56:21 +01:00
Marty Oehme	ddb34fca7b	refactor: Move tagging by color to Annotation	2024-01-24 08:53:54 +01:00
Marty Oehme	3bd6247888	chore: Add license and dev instructions	2024-01-23 10:15:51 +01:00
Marty Oehme	11d570f9d8	refactor: Rename annotation content variables Renamed the two variables describing an annotation's highlighted PDF text and its appended note, if any exists. Previously called 'text' (for the in-PDF highlighted content) and 'content' (for the additional supplied content). Now they are called 'content' for the highlighted in-PDF words and 'note' for the appended note given (or not) in an annotation.	2024-01-23 09:54:36 +01:00
Marty Oehme	9169e1c98a	chore: Improve stdout newline handling Always strip all newlines of the end of all entries, and then add a single newline back between entries.	2024-01-23 09:49:31 +01:00
Marty Oehme	2880c06f53	chore: Improve cli option help texts Fixed the write option text to be without wrong negation. Show default settings for flag options.	2024-01-23 09:27:26 +01:00
Marty Oehme	a51205954c	refactor: Extract extractor list to extractor module	2024-01-23 09:21:46 +01:00
Marty Oehme	629932a5e8	feat: Loop through all chosen extractors	2024-01-23 09:10:42 +01:00
Marty Oehme	f477deea7c	feat: Add extractor cli choice Can only choose pdf for the time being, but allows additional extractors to be added in the future.	2024-01-23 08:58:32 +01:00
Marty Oehme	3b4db7b6b8	refactor: Extract PDF extractor into class Extractor is a general protocol with the PDF extraction routine now being one implementation of the protocol. Preparation for adding multiple extractors (epub, djvu, or specific programmes) in the future.	2024-01-20 18:02:51 +01:00
Marty Oehme	765de505bb	refactor: Remove AnnotatedDocument class The AnnotatedDocument class was, essentially, a simple tuple of a document and a list of annotations. While not bad in a vacuum, it is unwieldy and passing this around instead of a document, annotations, or both where necessary is more restrictive and frankly unnecessary. This commit removes the data class and any instances of its use. Instead, we now pass the individual components around to anything that needs them. This also frees us up to pass only annotations around for example. We also do not iterate through the selected papis documents to work on in each exporter anymore (since we only pass a single document), but in the main function itself. This leads to less duplication and makes the overall run function the overall single source of iteration through selected documents. Everything else only knows about a single document - the one it is operating on - which seems much neater. For now, it does not change much, but should make later work on extra exporters or extractors easier.	2024-01-20 16:36:24 +01:00
Marty Oehme	cd5f787220	chore: Update dependencies to fix single-thread warning Fixed single-threaded warning provided from the fitz pymupdf library since the issue does not exist for this new version anymore. Bump version along the way.	2024-01-18 18:26:00 +01:00
Marty Oehme	1ef9a91e55	test: Remove deprecated pipeline instruction	2024-01-06 11:15:51 +01:00
Marty Oehme	376282eaaa	test: Fix test running on main branch	2023-10-17 22:09:54 +02:00
Marty Oehme	5cd5a05062	chore: Fix black fmt	2023-10-17 22:07:09 +02:00
Marty Oehme	aeb18ae358	feat: Add option to force-add annotations Will turn off looking for duplicate annotations and simply add all.	2023-10-17 22:05:11 +02:00
Marty Oehme	14f1b9e75c	test: Add poetry-cov library	2023-10-17 21:16:40 +02:00
Marty Oehme	c9736a5f32	test: Add tests for formatter sad paths	2023-10-12 19:27:16 +02:00
Marty Oehme	f67ac8cdb3	chore: Fix markdown lint issues	2023-10-12 19:26:41 +02:00
Marty Oehme	2700e4adc3	test: Add code coverage dev dependency	2023-09-22 21:53:55 +02:00
Marty Oehme	1e29642cba	test: Fix formatting and annotation tests	2023-09-22 21:49:52 +02:00
Marty Oehme	ee4690f52b	feat: Add atx-style markdown Added markdown with atx style headers, can be chosen as alternative markdown template on the cli. The existing 'markdown' template will still default to setext-style headers.	2023-09-21 22:05:39 +02:00
Marty Oehme	7ee8d4911e	refactor: Make formatters functions Formatters have been classes so far which contained some data (the template to use for formatting and the annotations and documents to format) and the actual formatting logic (an execute function). However, we can inject the annotations to be formatted and the templates so far are static only, so they can be simple variables (we can think about how to inject them at another point should it come up, no bikeshedding now). This way, we can simply pass around one function per formatter, which should make the code much lighter, easier to add to and especially less stateful which means less areas of broken interactions to worry about.	2023-09-21 21:54:24 +02:00
Marty Oehme	929e70d7ac	chore: Update poetry.lock	2023-09-21 19:36:00 +02:00
Marty Oehme	31b878c9eb	refactor: Move Annotations into annotation module	2023-09-20 17:22:29 +02:00
Marty Oehme	3670f70319	docs: Add formatting documentation Added documentation on using output templates and that they will invalidate the 'existing' annotation search.	2023-09-20 09:15:00 +02:00
Marty Oehme	e511ffa48d	feat: Add CSV formatter Added formatter for csv-compatible syntax. The formatting is quite basic with no escaping happening should that be necessary. However, for an initial csv output it suffices for me.	2023-09-20 09:15:00 +02:00
Marty Oehme	5f0bc2ffad	feat: Add count formatter Added formatter which counts and outputs the number of annotations in each document.	2023-09-20 09:14:59 +02:00
Marty Oehme	5a6d672c76	refactor: Move formatting logic to formatters Formatters (previously templates) were pure data containers before, containing the 'template' for how things should be formatted using mustache. The formatting would be done a) in the exporters and b) in the annotations. This spread of formatting has now been consolidated into the Formatter, which fixes the overall spread of formatting code and now can coherently format a whole output instead of just individual annotations. A formatter contains references to all documents and contained annotations and will format everything at once by default, but the formatting function can be invoked with reference to a specific annotated document to only format that. This commit should put more separation into the concerns of exporter and formatter and make formatting a concern purely of the formatters and annotation objects.	2023-09-20 09:14:58 +02:00
Marty Oehme	66f937e2a8	test: Add local papis settings for testing	2023-09-20 09:14:55 +02:00
Marty Oehme	cbe2e7cb03	feat: Allow cli option for template choice	2023-09-20 09:14:54 +02:00
Marty Oehme	9674592a9f	docs: Add developer notes to README	2023-09-20 09:14:43 +02:00
Marty Oehme	07d4de9a46	docs: Add docstrings	2023-09-20 09:13:04 +02:00
Marty Oehme	4eb983d9e3	refactor: Move templating to separate file	2023-09-20 09:12:59 +02:00
Marty Oehme	e633c0335e	chore: Make whoosh database optional dependency	2023-09-20 09:12:54 +02:00

65 commits
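
Commit c2a5190237 describes checking module availability through importlib utilities instead of attempting an import and catching the error. A minimal sketch of that kind of check is shown here. The helper name `module_available` and the probed module names are illustrative assumptions, not code taken from the repository.

```python
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it.

    Hypothetical helper for illustration only. For dotted names, find_spec
    imports the parent package first, so this sketch is meant for top-level
    module names.
    """
    return find_spec(name) is not None


if __name__ == "__main__":
    # 'json' ships with the standard library; the second name is made up.
    for candidate in ("json", "surely_not_an_installed_module_xyz"):
        state = "available" if module_available(candidate) else "missing"
        print(f"{candidate}: {state}")
```

A check like this lets an optional extractor be registered only when its extra dependencies are installed, which matches the intent described in commit e7e5258b34.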