Commit graph

77 commits

Author SHA1 Message Date
779519f580
fix: Only inform if no extractor finds valid files
Some checks failed
ci/woodpecker/push/lint Pipeline failed
ci/woodpecker/push/static_analysis Pipeline was successful
ci/woodpecker/push/test Pipeline was successful
Until now whenever an extractor could not find any valid files for a
document it would inform the user of this case. However, this is not
very useful: if you have a pdf and an epub extractor running, it would
inform you for each document which only had one of the two formats as
well as those which actually did not have any valid files for *any* of
the extractors running.

This commit changes the behavior to only inform the user when none of
the running extractors find a valid file, since that is the actual case
a user might want to be informed about.
2024-06-14 21:50:55 +02:00
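The behaviour described in the commit above can be sketched roughly as follows. This is only an illustration under assumptions: the `Extractor` alias and `extract_all` are hypothetical names, and an extractor is modelled as a plain function that returns `None` when it sees no valid file.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in: an extractor returns the annotations it found,
# or None when the document has no file it can handle.
Extractor = Callable[[str], Optional[list[str]]]


def extract_all(
    document: str, extractors: Sequence[Extractor]
) -> tuple[list[str], bool]:
    """Run every extractor and report whether *none* found a valid file."""
    annotations: list[str] = []
    any_valid = False
    for extractor in extractors:
        found = extractor(document)
        if found is None:
            # This extractor had nothing to work with; stay quiet for now.
            continue
        any_valid = True
        annotations.extend(found)
    # Only the second value should trigger a message to the user.
    return annotations, not any_valid
```

The caller would inform the user once per document, and only when the second return value is true.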
97b7ec0dc9
chore: Update dependencies 2024-06-14 15:19:57 +02:00
9e713193a8
refactor: Fix circular exception import 2024-06-14 15:18:22 +02:00
6b35b2f918
chore: Fix strict pyright analysis errors 2024-06-14 15:13:24 +02:00
8093259551
refactor: Remove pymupdf coupling in extraction
The library is only needed for pdf extraction which is taken care of
in its own extractor plugin. In the overall extraction routine we do not
need any knowledge of the existence of pymupdf.
2024-06-14 14:59:39 +02:00
7261e7d80c
chore: Refactor for strict pyright analysis 2024-06-13 21:20:53 +02:00
19599a66d7
chore: Black formatting 2024-06-12 11:46:39 +02:00
c2aec7add6
feat: Notify formatters if formatting first entry
This allows headers to be created by a formatter, which will
*only* be added to the very first entry created and not to
each entry. Currently, for example, this is used to create
a csv header once, rather than for each document in turn.
2024-06-12 11:45:35 +02:00
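A minimal sketch of the idea in the commit above, assuming the formatter receives a boolean flag for the first entry. The function name `format_csv`, the `first` parameter and the column names are hypothetical, not taken from the project.

```python
import csv
import io


def format_csv(rows: list[tuple[str, str]], first: bool = False) -> str:
    """Format rows as csv, writing the header only for the first entry."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if first:
        # Hypothetical column names, emitted once at the very top.
        writer.writerow(["content", "note"])
    writer.writerows(rows)
    return buffer.getvalue()
```

The caller passes `first=True` for the first document it formats and leaves it unset for all later ones.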
9eb7399536
chore: Set strict typing mode for pyright lsp 2024-06-12 11:16:32 +02:00
b5c081fbf3
feat: Change count display to lead with count
The actual count is now the first item on each line,
to make it easier to sort, strip, delete and compare
afterwards.
2024-06-12 11:16:13 +02:00
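To illustrate why a leading count helps with later processing, here is a small sketch. The exact line format of the count display is an assumption, and `count_line` is a hypothetical helper.

```python
def count_line(count: int, title: str) -> str:
    """Build one output line with the count as its first item (assumed format)."""
    return f"{count} {title}"


lines = [
    count_line(3, "Paper B"),
    count_line(12, "Paper A"),
    count_line(1, "Paper C"),
]

# With the count in front, sorting only needs the first whitespace-separated field.
by_count = sorted(
    lines, key=lambda line: int(line.split(maxsplit=1)[0]), reverse=True
)
```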
d087c366c3
chore: Refactor markdown format string handling 2024-06-12 11:05:13 +02:00
c21ab4a76c
chore: Update dependencies
Some checks failed
ci/woodpecker/push/lint Pipeline failed
ci/woodpecker/push/static_analysis Pipeline failed
ci/woodpecker/push/test Pipeline was successful
Pin new versions of levenshtein and pymupdf to fix build process.

This also means the source will soon have to switch from importing fitz
to importing pymupdf.
2024-05-07 10:55:14 +02:00
5526b3d2c5
docs: Update limitation information 2024-05-07 10:54:11 +02:00
905b20a79c
fix: Default markdown atx formatter for note exporter
Some checks failed
ci/woodpecker/push/lint Pipeline failed
ci/woodpecker/push/static_analysis Pipeline failed
ci/woodpecker/push/test Pipeline was successful
2024-01-25 22:46:38 +01:00
163fd63038
fix: Fixed pocketbook extractor trying to read all files
The read routine ran in full before checking whether the file has an
xml mimetype. This means it would first try to read any file into
memory, including pdfs and even binaries. Of course, doing so crashed
the program.
2024-01-25 21:42:34 +01:00
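The fix above amounts to checking the file type before reading anything. The sketch below uses the standard library's `mimetypes` module, which guesses from the file name; this is an assumption, and the project may detect the type differently. `read_if_xml` and `XML_MIMETYPES` are hypothetical names.

```python
import mimetypes
from pathlib import Path
from typing import Optional

# Assumed set of mimetypes that count as XML-like for this sketch.
XML_MIMETYPES = {
    "text/html",
    "text/xml",
    "application/xml",
    "application/xhtml+xml",
}


def read_if_xml(path: Path) -> Optional[str]:
    """Read the file only after its guessed mimetype looks XML-like."""
    guessed, _encoding = mimetypes.guess_type(path.name)
    if guessed not in XML_MIMETYPES:
        # Bail out before touching the file contents at all.
        return None
    return path.read_text(encoding="utf-8")
```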
72ddaaf1bc
refactor: Extract exporters to separate module 2024-01-25 21:42:33 +01:00
c8e8453b68
feat: Add advanced pocketbook detection heuristic
Added heuristic which checks for the existence of a specific
meta tag written to the pocketbook XHTML file.
2024-01-24 14:57:10 +01:00
6a8f8a03bc
refactor: Extract pocketbook file opening method 2024-01-24 14:55:28 +01:00
86d53a19d4
chore: Fix import lint error 2024-01-24 13:39:01 +01:00
c2a5190237
refactor: Improve module availability checks
Followed the ruff (Pyflakes) suggestion to use importlib utils directly
instead of attempting the imports and catching the errors.
2024-01-24 12:27:21 +01:00
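The availability check described above can be sketched with `importlib.util.find_spec`, which looks a module up without importing it. The helper name `module_available` is hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    # find_spec returns None when no top-level module of that name exists.
    return importlib.util.find_spec(name) is not None
```

Note that for dotted names `find_spec` imports the parent package first and raises `ModuleNotFoundError` if the parent is missing, so this sketch is meant for top-level module names only.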
2f41906e6a
docs: Add extractor and install info
Added extractor info for the two currently existing extractors.
Added install recommendation for pipx.
2024-01-24 12:21:51 +01:00
e7e5258b34
feat: Only activate pocketbook extractor optionally
Since we make the dependencies for pocketbook html extraction optional
as an extra, this commit ensures the extractor (and cli option) only
gets loaded when they exist.
2024-01-24 12:07:04 +01:00
e1e09b6011
chore: Format pyproject 2024-01-24 12:05:41 +01:00
f6c0189529
fix: Tag automation on tag creation
Tagging by color only worked on manually invoking the
`annotation.color = ()` setter. Now it works on initial
instance creation.
2024-01-24 11:20:00 +01:00
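A rough sketch of how a property setter can make tagging work on creation as well as on later assignment. The class below is a hypothetical, trimmed-down stand-in for the annotation class, and keying the tag mapping by the raw color tuple is an assumption.

```python
from typing import Optional

Color = tuple[float, float, float]


class Annotation:
    """Hypothetical, simplified annotation; not the project's real class."""

    def __init__(
        self,
        content: str = "",
        color: Color = (0.0, 0.0, 0.0),
        color_tags: Optional[dict[Color, str]] = None,
    ) -> None:
        self.content = content
        self.tag = ""
        self._color_tags = color_tags or {}
        # Assigning through the property runs the setter below, so the tag
        # is also set on creation and not only on later reassignment.
        self.color = color

    @property
    def color(self) -> Color:
        return self._color

    @color.setter
    def color(self, value: Color) -> None:
        self._color = value
        self.tag = self._color_tags.get(value, "")
```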
333bd279b9
fix: Fix test fixtures for new annotation structure 2024-01-24 11:15:10 +01:00
67bfc30396
refactor: Switch annotation away from dataclass
To ease employing getters and setters, we switch the dataclass
to a normal python undecorated class.
2024-01-24 11:15:10 +01:00
f4a26292a0
fix: Remove debug print statement 2024-01-24 11:13:35 +01:00
c53cd563b7
feat: Add pocketbook extraction 2024-01-24 08:56:21 +01:00
ddb34fca7b
refactor: Move tagging by color to Annotation 2024-01-24 08:53:54 +01:00
3bd6247888
chore: Add license and dev instructions 2024-01-23 10:15:51 +01:00
11d570f9d8
refactor: Rename annotation content variables
Renamed the two variables describing an annotation's highlighted PDF-text and
its appended note if any exists. Previously called 'text' (for the in-PDF
highlighted content) and 'content' (for the additional supplied content).

Now they are called 'content' for the highlighted in-PDF words
and 'note' for the appended note given (or not) in an annotation.
2024-01-23 09:54:36 +01:00
9169e1c98a
chore: Improve stdout newline handling
Always strip all newlines off the end of all entries, and then add
a single newline back between entries.
2024-01-23 09:49:31 +01:00
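The newline handling described above could look roughly like this. `join_entries` is a hypothetical helper name, and the sketch assumes entries are plain strings.

```python
def join_entries(entries: list[str]) -> str:
    """Strip trailing newlines from each entry, then join with one newline."""
    return "\n".join(entry.rstrip("\n") for entry in entries)
```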
2880c06f53
chore: Improve cli option help texts
Fixed the help text of the write option, which was wrongly negated.
Show default settings for flag options.
2024-01-23 09:27:26 +01:00
a51205954c
refactor: Extract extractor list to extractor module 2024-01-23 09:21:46 +01:00
629932a5e8
feat: Loop through all chosen extractors 2024-01-23 09:10:42 +01:00
f477deea7c
feat: Add extractor cli choice
Can only choose pdf for the time being, but allows additional
extractors to be added in the future.
2024-01-23 08:58:32 +01:00
3b4db7b6b8
refactor: Extract PDF extractor into class
Extractor is a general protocol with the PDF extraction routine now being
one implementation of the protocol. Preparation for adding multiple
extractors (epub, djvu, or specific programmes) in the future.
2024-01-20 18:02:51 +01:00
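A minimal sketch of an extractor protocol with one PDF implementation, using `typing.Protocol`. The method names `can_process` and `run`, and the placeholder return values, are assumptions for illustration and not the project's actual interface.

```python
from pathlib import Path
from typing import Protocol


class Extractor(Protocol):
    """Hypothetical protocol that every extractor would satisfy."""

    def can_process(self, filename: Path) -> bool: ...

    def run(self, filename: Path) -> list[str]: ...


class PdfExtractor:
    """One implementation; it satisfies the protocol structurally."""

    def can_process(self, filename: Path) -> bool:
        return filename.suffix.lower() == ".pdf"

    def run(self, filename: Path) -> list[str]:
        # Placeholder result; a real extractor would parse the file here.
        return [f"annotation from {filename.name}"]


def extract(filename: Path, extractor: Extractor) -> list[str]:
    """Run the extractor only on files it declares it can process."""
    return extractor.run(filename) if extractor.can_process(filename) else []
```

Because protocols are structural, `PdfExtractor` does not need to inherit from `Extractor`; any class with matching methods can be added later.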
765de505bb
refactor: Remove AnnotatedDocument class
The AnnotatedDocument class was, essentially, a simple tuple of a document
and a list of annotations. While not bad in a vacuum, it is unwieldy and
passing this around instead of a document, annotations, or both where
necessary is more restrictive and frankly unnecessary.

This commit removes the data class and any instances of its use. Instead,
we now pass the individual components around to anything that needs them.
This also frees us up to pass only annotations around for example.

We also do not iterate through the selected papis documents to work on
in each exporter anymore (since we only pass a single document), but
in the main function itself. This leads to less duplication and makes
the run function the single source of iteration through
selected documents. Everything else only knows about a single document -
the one it is operating on - which seems much neater.

For now, it does not change much, but should make later work on extra
exporters or extractors easier.
2024-01-20 16:36:24 +01:00
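The resulting control flow can be sketched as below, with the run function as the only place that loops over documents. The signature of `run` and the callables passed to it are hypothetical simplifications.

```python
from typing import Callable, Iterable


def run(
    documents: Iterable[str],
    extract: Callable[[str], list[str]],
    export: Callable[[str, list[str]], None],
) -> None:
    """Iterate over the selected documents in one single place."""
    for document in documents:
        annotations = extract(document)
        if not annotations:
            continue
        # Extractor and exporter each only ever see one document.
        export(document, annotations)
```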
cd5f787220
chore: Update dependencies to fix single-thread warning
All checks were successful
ci/woodpecker/push/lint Pipeline was successful
ci/woodpecker/push/static_analysis Pipeline was successful
ci/woodpecker/push/test Pipeline was successful
Fixed the single-threaded warning emitted by the fitz (pymupdf) library,
since the issue no longer exists in the new version.
Bump version along the way.
2024-01-18 18:26:00 +01:00
1ef9a91e55
test: Remove deprecated pipeline instruction
All checks were successful
ci/woodpecker/push/lint Pipeline was successful
ci/woodpecker/push/static_analysis Pipeline was successful
ci/woodpecker/push/test Pipeline was successful
2024-01-06 11:15:51 +01:00
376282eaaa
test: Fix test running on main branch
All checks were successful
ci/woodpecker/push/lint Pipeline was successful
ci/woodpecker/push/static_analysis Pipeline was successful
ci/woodpecker/push/test Pipeline was successful
2023-10-17 22:09:54 +02:00
5cd5a05062
chore: Fix black fmt
Some checks failed
ci/woodpecker/push/test unknown status
ci/woodpecker/push/lint Pipeline was successful
ci/woodpecker/push/static_analysis Pipeline was successful
2023-10-17 22:07:09 +02:00
aeb18ae358
feat: Add option to force-add annotations
Will turn off looking for duplicate annotations and simply add all.
2023-10-17 22:05:11 +02:00
14f1b9e75c
test: Add poetry-cov library 2023-10-17 21:16:40 +02:00
c9736a5f32
test: Add tests for formatter sad paths 2023-10-12 19:27:16 +02:00
f67ac8cdb3
chore: Fix markdown lint issues 2023-10-12 19:26:41 +02:00
2700e4adc3
test: Add code coverage dev dependency 2023-09-22 21:53:55 +02:00
1e29642cba
test: Fix formatting and annotation tests 2023-09-22 21:49:52 +02:00
ee4690f52b
feat: Add atx-style markdown
Added markdown with atx style headers, can be chosen as
alternative markdown template on the cli.
The existing 'markdown' template will still default to
setext-style headers.
2023-09-21 22:05:39 +02:00
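For reference, the two markdown header styles differ as sketched below. The function names are hypothetical; the project's own templates may be structured differently.

```python
def setext_header(title: str) -> str:
    """Setext style: the title underlined with '=' (a level-1 heading)."""
    return f"{title}\n{'=' * len(title)}"


def atx_header(title: str, level: int = 1) -> str:
    """ATX style: the title prefixed with one '#' per heading level."""
    return f"{'#' * level} {title}"
```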
7ee8d4911e
refactor: Make formatters functions
Formatters have been classes so far which contained some data (the
template to use for formatting and the annotations and documents to
format) and the actual formatting logic (an execute function).

However, we can inject the annotations to be formatted and the templates
so far are static only, so they can be simple variables (we can think
about how to inject them at another point should it come up, no
bikeshedding now).

This way, we can simply pass around one function per formatter, which
should make the code much lighter, easier to add to and especially less
stateful, which means fewer areas of broken interactions to worry about.
2023-09-21 21:54:24 +02:00
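A small sketch of formatters as plain functions that can be passed around, assuming each takes a document title and its annotations. All names and the output formats here are hypothetical.

```python
from typing import Callable

# A formatter is just a function from (title, annotations) to output text.
Formatter = Callable[[str, list[str]], str]


def format_markdown(title: str, annotations: list[str]) -> str:
    """Render a header followed by one quoted line per annotation."""
    quotes = "\n".join(f"> {annotation}" for annotation in annotations)
    return f"# {title}\n\n{quotes}\n"


def format_count(title: str, annotations: list[str]) -> str:
    """Render only the number of annotations for the document."""
    return f"{len(annotations)} {title}\n"


# Choosing a formatter is a dictionary lookup; no instances or state needed.
FORMATTERS: dict[str, Formatter] = {
    "markdown": format_markdown,
    "count": format_count,
}
```

With this shape, adding a formatter means writing one function and registering it, without any class to instantiate.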