If you manage your Python environments with `pipx`, you can also run `pipx inject papis 'git+https://git.martyoeh.me/Marty/papis-extract.git'` to add it to your specific papis environment.
> This plugin is still in fairly early development.
> It does what I need it to do, but if you have a meticulously organized library *please* make backups before doing any operation which could affect your notes, or make use of the papis-included git options.
> Take care to read the Issues section of this README if you intend to run it over a large collection.
You can get additional help on the plugin command line options with the usual `papis extract --help` command.
The basic command above, `papis extract` without any options or queries, lets you select an entry in your library and extracts all annotations from all PDF files associated with it.
Add a query to limit the search, as you do with papis.
```bash
papis extract "author:Einstein"
```
This will print the extracted annotations to the command line via stdout.
If you invoke the command with the `--write` option, it will write them into your notes instead:
```bash
papis extract --write "author:Einstein"
```
The above command will create notes for the entry you select and fill them with the annotations.
If a note already exists for any of the entries, it will instead append the annotations to the end of it,
**dropping all those that it already finds in the note**.
With this duplication detection you should be able to run extract as often as you wish without doubling up your existing annotations.
**PLEASE** heed the note above and exercise caution with the `--write` option.
It is not intended to be destructive, but nevertheless create backups or put your files under version control.
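The duplicate detection described above can be pictured roughly as follows. This is only a sketch under assumptions: the function name `new_annotations`, the `min_ratio` cutoff and the use of `difflib` are illustrative choices and not taken from the plugin, whose actual comparison may differ.

```python
from difflib import SequenceMatcher


def new_annotations(note_lines, extracted, min_ratio=0.8):
    """Return only those extracted annotations not already present in the note.

    Assumption: an annotation counts as 'already present' when it is at least
    `min_ratio` similar to some line of the note (hypothetical cutoff).
    """
    fresh = []
    for annotation in extracted:
        already_there = any(
            SequenceMatcher(None, annotation, line).ratio() >= min_ratio
            for line in note_lines
        )
        if not already_there:
            fresh.append(annotation)
    return fresh


note = ["> Space and time are relative."]
extracted = [
    "> Space and time are relatlve.",     # same highlight with a recognition error
    "> Energy and mass are equivalent.",  # not yet in the note
]

note += new_annotations(note, extracted)
# Running the same extraction a second time appends nothing further.
note += new_annotations(note, extracted)
print("\n".join(note))
```

The point of the sketch is the idempotence: because near-identical annotations are skipped, repeating the run leaves the note unchanged.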
If you wish to invoke the extraction process on all entries matching the query,
use `--all`, as usual with papis:
```bash
papis extract --all "author:Einstein"
```
The above command will print out your annotations made on *all* papers by Einstein.
You can invoke the command with `--manual` to instantly edit the notes in your editor:
```bash
papis extract --write --manual "author:Einstein"
```
This will create/append annotations and drop you into the selected Einstein note.
Be aware that this becomes fairly annoying if you use the option while hundreds
of entries are being annotated, as it will open one entry after another for editing.
To extract the annotations for all your existing entries in one go, you can use:
```bash
papis extract --write --all
```
However, the warning for your notes' safety goes doubly for this command since it will touch
*most* or *all* of your notes, depending on how many entries in your library have PDFs with annotations attached.
While I have not done extensive optimizations, the process should be relatively quick even for larger libraries:
On my current laptop, extracting ~4000 annotations from ~1000 library documents takes around 90 seconds,
though this will vary with the length and size of the PDFs you have.
For smaller workloads the process should be almost instant.
Be aware that if you re-write to your notes using a completely different output format than the original, the plugin will *not* detect old annotations and drop them, so the same annotations may then appear in your notes twice.
For example, if you always highlight the most essential arguments and findings in red and always highlight things you have to follow up on in blue, you can assign the meanings 'important' and 'todo' to them respectively.
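As a rough illustration of the idea (this is not the plugin's configuration syntax), the following Python sketch shows how such a color-to-meaning mapping could be applied to extracted annotations. The names `COLOR_TAGS` and `format_annotation` as well as the `#tag` output style are hypothetical.

```python
# Hypothetical mapping from highlight color names to their meanings.
COLOR_TAGS = {"red": "important", "blue": "todo"}


def format_annotation(text, color):
    """Render an annotation as a quote, appending a tag if the color is mapped."""
    tag = COLOR_TAGS.get(color)
    quote = f"> {text}"
    return f"{quote} #{tag}" if tag else quote


print(format_annotation("Space and time are relative.", "red"))
print(format_annotation("Check the 1905 derivation.", "blue"))
print(format_annotation("An unmapped highlight.", "yellow"))
```

Highlights in a color without an assigned meaning are simply left untagged in this sketch.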
Annotations you have in notes might change if you, for example, fix small spacing mistakes, correct a letter or punctuation mark that was falsely recognized in the PDF, or make similar changes.
Generally, this should be fine as it is, but you should change this value if either new annotations are dropped though they should be added (decrease the value) or annotations are added that duplicate existing ones (increase the value).
---
`minimum_similarity_content` sets the similarity required between an annotation's note and its in-PDF written words for the two to be viewed as one. Any annotation that has both and is *under* the minimum similarity will be added in the following form:
```markdown
> my annotation
Note: my additional thoughts
```
That is, the extractor detects additional written words by whoever annotated and adds them to the extraction.
The option should generally not take too much tuning, but it is there if you need it.
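The decision described above can be sketched as follows. This is a minimal illustration, assuming a `difflib` ratio as the similarity measure and an arbitrary value of 0.75; the function name `format_annotation` is hypothetical and the plugin's own measure and default may differ.

```python
from difflib import SequenceMatcher


def format_annotation(highlighted, written, minimum_similarity_content):
    """Quote the highlighted text; add the written note only if it differs enough.

    Assumption: similarity is a ratio between 0 and 1 computed with difflib.
    """
    if highlighted and written:
        ratio = SequenceMatcher(None, highlighted, written).ratio()
        if ratio < minimum_similarity_content:
            # Both exist and differ: keep the quote and the separate note.
            return f"> {highlighted}\nNote: {written}"
    # Only one of them exists, or they are similar enough to count as one.
    return f"> {highlighted or written}"


print(format_annotation("my annotation", "my additional thoughts", 0.75))
print(format_annotation("my annotation", "my annotation", 0.75))
```

In the first call the two texts differ clearly, so both are kept; in the second they are identical and only the quote remains.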
---
`minimum_similarity_color` sets the required similarity of highlight/annotation colors to be recognized as the 'pure' versions of themselves for color mapping (see 'automatic tagging'). With a low required similarity dark green and light green, for example, will both be recognized simply as 'green' while a high similarity will not match them, instead only matching closer matches to a pure (0, 255, 0) green value.
This should generally be an alright default, but it is here to be changed, for example, if you work with a lot of different annotation colors (where dark purple and light purple may have different meanings) and get false positives in automatic tag recognition, or if no tags are recognized at all.
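To make the effect of the threshold concrete, here is a small sketch of one way such color matching could work. It assumes similarity is derived from Euclidean distance in RGB space and scaled to the range 0 to 1; the names `PURE_COLORS` and `color_name` and the distance measure are assumptions for illustration, not the plugin's implementation.

```python
import math

# Hypothetical table of 'pure' reference colors.
PURE_COLORS = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
}
# Largest possible distance in RGB space (black to white).
MAX_DISTANCE = math.dist((0, 0, 0), (255, 255, 255))


def color_name(rgb, minimum_similarity_color):
    """Return the closest pure color name, or None if it is not similar enough."""
    best_name, best_similarity = None, 0.0
    for name, pure in PURE_COLORS.items():
        similarity = 1 - math.dist(rgb, pure) / MAX_DISTANCE
        if similarity > best_similarity:
            best_name, best_similarity = name, similarity
    return best_name if best_similarity >= minimum_similarity_color else None


dark_green = (0, 128, 0)
print(color_name(dark_green, 0.7))   # lenient threshold: matched as green
print(color_name(dark_green, 0.9))   # strict threshold: no match
print(color_name((0, 255, 0), 0.9))  # pure green always matches
```

With the lenient value dark green collapses into plain 'green', while the strict value only accepts colors close to the pure reference.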
First, a note in general: There is the functionality to run this plugin over your whole library in a single command and also in a way that makes permanent changes to that library.
This is intended and, in my view, an important aspect of what this plugin provides and of the batch functionality of CLI programs in general.
However, it can also lead to frustrating clean-up time if something messes up or, in the worst case, data loss.
The extractors use certain heuristics to ascertain which files they can operate on, but these are not fail-safe.
Take the note at the top of this README to heart and always have backups on hand before larger operations.
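One simple way to have such a backup on hand is to copy the library directory before a large run. The sketch below is only an illustration using the Python standard library; the function name `backup_library` and the example paths are hypothetical, and version control (such as the papis-included git options mentioned above) serves the same purpose.

```python
import shutil
import time
from pathlib import Path


def backup_library(library_dir, backup_root):
    """Copy the whole library directory into a timestamped folder and return its path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{Path(library_dir).name}-{stamp}"
    shutil.copytree(library_dir, target)
    return target


if __name__ == "__main__":
    # Hypothetical locations; adjust both to your own setup.
    library = Path.home() / "documents" / "papers"
    backups = Path.home() / "backups"
    if library.is_dir():
        print("Backup written to", backup_library(library, backups))
```

Running this before `papis extract --write --all` leaves an untouched copy of your notes to fall back on.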