Compare commits
No commits in common. "d8706e84f69dc27e61c03e49983273cb0583667d" and "dcf367b385ce2c2a2f3e23f98dd0d2014322a6b1" have entirely different histories.
d8706e84f6...dcf367b385
4 changed files with 25 additions and 20 deletions
37
README.md
@@ -1,9 +1,6 @@
# pubs-extract

[![status-badge](https://ci.martyoeh.me/api/badges/Marty/pubs-extract/status.svg)](https://ci.martyoeh.me/Marty/pubs-extract)

Quickly extract annotations from your pdf files with the help of the pubs bibliography manager.
Easily organize your highlights and thoughts next to your documents.

## Installation:

@@ -22,9 +19,6 @@ active = extract
To check if everything is working you can do `pubs --help` which should show you the new extract command.
You will be set up with the default options but if you want to change anything, read on in configuration below.

> **Note**
> This plugin is in fairly early development. It does what I need it to do, but if you have a meticulously organized library *please* make backups before doing any operation on your notes, or make use of the pubs-included git plugin.

## Configuration:

In your pubs configuration file:
@@ -156,7 +150,7 @@ Pull requests tackling one of these areas of course very welcome.

## Issues

A note on the extraction: Highlights in pdfs can be somewhat difficult to parse
A note on the extraction. Highlights in pdfs can be somewhat difficult to parse
(as are most things in them). Sometimes they contain the selected text that is written on the
page, sometimes they contain the annotator's thoughts as a note, sometimes they contain nothing.
This plugin makes an effort to find the right combination and extract the written words,
@@ -170,12 +164,25 @@ or even cut a few off.

I am not sure if there is much I can do about this.

---
## Roadmap:

If you spot a bug or have an idea feel free to open an issue.\
I might be slow to respond but will consider them all!

Pull requests are warmly welcomed.\
If they are big changes or additions let's talk about them in an issue first.

Thanks for using my software ❤️
- [x] extracts highlights and annotations from a doc file (e.g. using PyMuPDF)
- [x] puts those in the annotation file of a doc in a customizable format
- [x] option to have it automatically run after a file is added?
- option to have it run whenever a pdf in the library was updated?
- [ ] needs some way to delimit where it puts stuff and user stuff is in note
- [ ] one way is to have it look at `> [17] here be extracted annotation from page seventeen` annotations and put it in between
- [x] another, probably simpler first, is to just append missing annotations to the end of the note
- [ ] use similarity search instead of literal search for existing annotation (Levenshtein)?
- [x] some highlights (or annotations in general) do not contain text as content
- [x] pymupdf can extract the content of the underlying rectangle (mostly)
- [x] issue is that sometimes the highlight contents are in content, sometimes a user comment instead
- [x] we could have a comparison function which estimates how 'close' the two text snippets are and act accordingly -> using Levenshtein distance
- [ ] sometimes the underlying rectangle is empty too, what to do then? discard annotation?
- [x] config option to map colors in annotations to meaning ('read', 'important', 'extra') in pubs
- [x] colors are given in very exact 0.6509979 RGB values, meaning we could once again estimate if a color is 'close enough' in distance to tag it accordingly -> using Euclidean distance
- [ ] support custom colors by setting a float tuple in configuration
- [x] make invoking the command run a query if corresponding option provided (or whatever) in pubs syntax and use resulting papers
- [x] confirm for many papers?
- [ ] warning when the amount of annotations in file is different than the amount extracted?
- [ ] tests tests tests tests tests, lah-di-dah
@@ -87,8 +87,7 @@ class Annotation:
    def colorname(self):
        """Return the stringified version of the annotation color.

        Finds the closest named color to the annotation and returns it,
        using Euclidean distance between the two color vectors.
        Finds the closest named color to the annotation and returns it.
        """
        annot_colors = (
            self.colors.get("stroke") or self.colors.get("fill") or (0.0, 0.0, 0.0)
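The `colorname` docstring in the hunk above describes picking the closest named color by Euclidean distance. The rest of the method is not shown in this diff, so the following is only a minimal sketch of that idea. The palette `NAMED_COLORS` and the helper `closest_color_name` are hypothetical names chosen for illustration, not the plugin's actual code.

```python
import math

# Hypothetical palette: the plugin's real color table is not shown in this diff.
NAMED_COLORS = {
    "red": (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "yellow": (1.0, 1.0, 0.0),
}


def closest_color_name(rgb, palette=NAMED_COLORS):
    """Return the palette name whose RGB vector is nearest to `rgb`.

    Nearness is the Euclidean distance between the two three-component
    vectors. Ties resolve to the first matching entry in the palette.
    """
    return min(palette, key=lambda name: math.dist(palette[name], rgb))


if __name__ == "__main__":
    # Annotation colors come as inexact floats, so exact matching would fail.
    print(closest_color_name((0.98, 0.95, 0.1)))
```

Because the lookup is a nearest-neighbour search, an annotation color always maps to some name, even when it is far from every palette entry. A real implementation might add a maximum distance beyond which the color counts as unnamed.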
@@ -225,8 +225,7 @@ class ExtractPlugin(PapersPlugin):
        that is only the written words, sometimes that is only
        annotation notes, sometimes it is both. Runs a similarity
        comparison between strings to find out whether they
        should both be included or are doubling up, using
        Levenshtein distance.
        should both be included or are doubling up.
        """
        content = annotation.info["content"].replace("\n", " ")
        written = page.get_textbox(annotation.rect).replace("\n", " ")
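The docstring in the hunk above mentions a Levenshtein-based similarity comparison between the annotation's `content` and the `written` page text, but the comparison itself is outside the shown context. As a rough sketch of how such a check could work, here is a pure-Python version. The function names and the `threshold` value of 0.75 are assumptions for illustration; the plugin may well use a library and a different cutoff.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to a 0.0 (unrelated) to 1.0 (identical) score."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def merge_texts(written: str, content: str, threshold: float = 0.75) -> str:
    """Keep both texts unless they look like the same passage twice.

    `threshold` is an assumed cutoff, not a value taken from the plugin.
    """
    written, content = written.strip(), content.strip()
    if not content:
        return written
    if not written:
        return content
    if similarity(written, content) >= threshold:
        return written  # near-duplicates: one copy is enough
    return f"{written}\n{content}"
```

With this sketch, `merge_texts("hello world", "hello world.")` collapses to a single copy, while a highlight paired with an unrelated comment keeps both strings.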
@@ -1,6 +1,6 @@
[tool.poetry]
name = "pubs-extract"
version = "0.2.0"
version = "0.1.0"
description = "A pdf annotation extraction plugin for pubs bibliography manager"
authors = ["Marty Oehme <marty.oehme@gmail.com>"]
license = "LGPL-3.0"