Renamed the extractor selection from the cli to '--input' since it
decides the various input formats that are used to gather annotations
from.
Renamed the template selection from the cli to '--output' since it
control the output format that annotations are displayed/written in.
This also somewhat more closely mirrors pandoc cli options, which are
generally a good guide to follow.
Switching this project over to the uv package manager as a pilot project
for my personal use.
Since this project is not yet widely used I can use it as an
experimental playground for discovering uv further without interrupting
anybody's workflow.
Until now whenever an extractor could not find any valid files for a
document it would inform the user of this case. However, this is not
very useful: if you have a pdf and an epub extractor running, it would
inform you for each document which only had one of the two formats as
well as those which actually did not have any valid files for *any* of
the extractors running.
This commit changes the behavior to only inform the user when none of
the running extractors find a valid file, since that is the actual case
a user might want to be informed about.
The library is only needed for pdf extraction which is taken care of
in its own extractor plugin. In the overall extraction routine we do not
need any knowledge of the existence of pymupdf.
This allows headers to be created by a formatter, which will
*only* be added to the very first entry created and not to
each entry. Currently for example this is used to create
a csv header but not for each document in turn.
The complete read routine would work before figuring out that it is
a file of xml mimetype. This means that it would try to read to memory
any file as the first thing, pdfs, even binaries. Of course doing
so crashed the program.
Since we make the dependencies for pocketbook html extraction optional
as an extra, this commit ensures the extractor (and cli option) only
gets loaded when they exist.
Renamed the two variables describing an annotation's highlighted PDF-text and
its appended note if any exists. Previously called 'text' (for the in-PDF
highlighted content) and 'content' (for the additional supplied content).
Now they are called 'content' for the IN PDF words, highlighted.
and 'note' for the appended note given (or not) in an annotation.
Extractor is a general protocol with the PDF extraction routine now being
one implementation of the protocol. Preparation for adding multiple
extractors (epub,djvu, or specific progammes) in the future.
The AnnotatedDocument class was, essentially, a simple tuple of a document
and a list of annotations. While not bad in a vacuum, it is unwieldy and
passing this around instead of a document, annotations, or both where
necessary is more restrictive and frankly unnecessary.
This commit removes the data class and any instances of its use. Instead,
we now pass the individual components around to anything that needs them.
This also frees us up to pass only annotations around for example.
We also do not iterate through the selected papis documents to work on
in each exporter anymore (since we only pass a single document), but
in the main function itself. This leads to less duplication and makes
the overall run function the overall single source of iteration through
selected documents. Everything else only knows about a single document -
the one it is operating on - which seems much neater.
For now, it does not change much, but should make later work on extra
exporters or extractors easier.
Fixed single-threaded warning provided from the fitz pymupdf library
since the issue does not exist for this new version anymore.
Bump version along the way.
Added markdown with atx style headers, can be chosen as
alternative markdown template on the cli.
The existing 'markdown' template will still default to
setext-style headers.
Formatters have been classes so far which contained some data (the
tamplate to use for formatting and the annotations and documents to
format) and the actual formatting logic (an execute function).
However, we can inject the annotations to be formatted and the templates
so far are static only, so they can be simple variables (we can think
about how to inject them at another point should it come up, no
bikeshedding now).
This way, we can simply pass around one function per formatter, which
should make the code much lighter, easier to add to and especially less
stateful which means less areas of broken interactions to worry about.
Added formatter for csv-compatible syntax. The formatting is quite basic
with no escaping happening should that be necessary. However, for an
initial csv output it suffices for me.
Formatters (previously templates) were pure data containers before,
continating the 'template' for how things should be formatted using
mustache. The formatting would be done a) in the exporters and b) in the
annotations.
This spread of formatting has now been consolidated into the Formatter,
which fixes the overall spread of formatting code and now can coherently
format a whole output instead of just individual annotations.
A formatter contains references to all documents and contained
annotations and will format everything at once by default, but the
formatting function can be invoked with reference to a specific
annotated document to only format that.
This commit should put more separation into the concerns of exporter and
formatter and made formatting a concern purely of the formatters and
annotation objects.