Add extraction for no-content and note highlights
parent d14a95e18b
commit c9f286fc33
2 changed files with 23 additions and 3 deletions
13
README.md
@@ -62,6 +62,18 @@ What follows is a not-very-sorted train of though on where the plugin is at and
 could see myself taking it one day, provided I find the time.
 Pull requests tackling one of these areas of course very welcome.
 
+## Issues
+
+A note on the extraction. Highlights in PDFs are somewhat difficult to parse
+(as are most things in them). Sometimes they contain the selected text that is written on the
+page, sometimes they contain the annotator's thoughts as a note, sometimes they contain nothing.
+This plugin makes an effort to find the right combination and extract the written words,
+as well as any additional notes made - but things *will* slip through or extract weirdly every now
+and again.
+
+The easiest extraction is provided if your program writes the selection itself into the highlight
+content, because then we can just use that. It is harder to parse if it does not.
+
 ## Roadmap:
 
 - [x] extracts highlights and annotations from a doc file (e.g. using PyMuPDF)
@@ -79,6 +91,7 @@ Pull requests tackling one of these areas of course very welcome.
 - [ ] colors are given in very exact 0.6509979 RGB values, meaning we could once again estimate if a color is 'close enough' in distance to tag it accordingly
 - [ ] make invoking the command run a query if corresponding option provided (or whatever) in pubs syntax and use resulting papers
 - [ ] confirm for many papers?
+- [ ] warning when the number of annotations in the file differs from the number extracted?
 
 ## Things that would also be nice in pubs in general and don't really belong in this repository
 
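The roadmap item about colors being 'close enough' could be approached with a plain Euclidean distance in RGB space. The following is only a sketch of that idea, not something this commit implements: the palette `NAMED_COLORS`, the function `closest_color_name` and the `max_distance` threshold are all hypothetical, and it assumes the RGB triple (floats between 0 and 1) has already been read from the annotation.

```python
import math

# Hypothetical reference palette; names and values are illustrative only.
NAMED_COLORS = {
    "red": (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "yellow": (1.0, 1.0, 0.0),
}


def closest_color_name(rgb, max_distance=0.5):
    """Return the nearest reference color name, or None if nothing is close enough."""
    # Pick the palette entry with the smallest Euclidean distance to rgb.
    name, reference = min(
        NAMED_COLORS.items(), key=lambda item: math.dist(rgb, item[1])
    )
    # Reject matches that are too far away to count as the same color.
    if math.dist(rgb, reference) <= max_distance:
        return name
    return None
```

A slightly off yellow such as `(0.98, 0.95, 0.1)` maps to `"yellow"`, while a mid grey `(0.5, 0.5, 0.5)` is equally far from every palette entry and falls outside the threshold, so it stays untagged. Euclidean RGB distance is a crude proxy for perceived similarity; it is used here only because it is simple.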
@@ -129,13 +129,20 @@ class ExtractPlugin(PapersPlugin):
         with fitz.Document(filename) as doc:
             for page in doc:
                 for annot in page.annots():
-                    content = annot.get_text() or annot.info["content"].replace(
-                        "\n", ""
-                    )
+                    content = self._retrieve_annotation_content(page, annot)
                     if content:
                         annotations.append(f"[{(page.number or 0) + 1}] {content}")
         return annotations
 
+    def _retrieve_annotation_content(self, page, annotation):
+        content = annotation.info["content"].replace("\n", " ")
+        written = page.get_textbox(annotation.rect).replace("\n", " ")
+        if written in content:
+            return content
+        elif content:
+            return f"{written}\nNote: {content}"
+        return written
+
     def _to_stdout(self, annotated_papers):
         """Write annotations to stdout.
 
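To make the three cases from the README note easier to follow, here is a standalone restatement of the decision logic in the new `_retrieve_annotation_content` method. It is a sketch, not the plugin's code: the function name `combine_highlight_text` is hypothetical, and it takes plain strings so the cases can be tried without PyMuPDF or a PDF at hand.

```python
def combine_highlight_text(written: str, content: str) -> str:
    """Merge the text under a highlight with the annotation's stored content.

    written: text found on the page inside the highlight rectangle.
    content: text stored in the annotation itself (selection, note, or empty).
    """
    written = written.replace("\n", " ")
    content = content.replace("\n", " ")
    if written in content:
        # The stored content already includes the page text (or no page
        # text was found), so the content alone is enough.
        return content
    if content:
        # The content differs from the page text, so treat it as a note.
        return f"{written}\nNote: {content}"
    # Nothing stored in the annotation: fall back to the page text.
    return written


print(combine_highlight_text("Deep learning", "check this claim"))
```

The substring test is exact, so a stored selection that differs from the page text by so much as a hyphen or an extra space is treated as a separate note and both strings are emitted. That is one way things can "extract weirdly", as the README puts it.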