papis-extract/papis_extract/extraction.py

import re
from pathlib import Path
from typing import Protocol

import papis.config
import papis.document
import papis.logging
from papis.document import Document

from papis_extract.annotation import Annotation
from papis_extract.exceptions import ExtractionError

logger = papis.logging.get_logger(__name__)


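class Extractor(Protocol):
    # Whether this extractor can handle the given file.
    def can_process(self, filename: Path) -> bool: ...

    # Extract all annotations from the file; may raise ExtractionError.
    def run(self, filename: Path) -> list[Annotation]: ...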


def start(
    extractor: Extractor,
    document: Document,
) -> list[Annotation]:
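    """Extract all annotations from a single document.

    Runs the extractor on every file of the papis document that it
    can process and returns the collected annotations. Files that
    fail with an ExtractionError are reported and skipped.
    """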
    annotations: list[Annotation] = []
    file_available: bool = False

    for file in document.get_files():
        fname = Path(file)
        if not extractor.can_process(fname):
            continue
        file_available = True

        try:
            annotations.extend(extractor.run(fname))
        except ExtractionError as e:
            print(f"File extraction errors for {file}.\n{e}")

    if not file_available:
        # strip curly braces from the description, or the papis logger gets upset
        desc = re.sub("[{}]", "", papis.document.describe(document))
        logger.info(f"No {type(extractor)} file for document: {desc}")

    return annotations
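
A minimal sketch of how a class can satisfy the Extractor protocol and be driven by a loop like the one in start(). It is illustrative only: `TxtExtractor`, `collect`, and the two-field `Annotation` are hypothetical stand-ins so the example runs without papis installed, and the real `Annotation` and `ExtractionError` in papis_extract may look different.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class Annotation:
    """Stand-in for papis_extract.annotation.Annotation (assumed shape)."""

    file: str
    content: str


class ExtractionError(Exception):
    """Stand-in for papis_extract.exceptions.ExtractionError."""


class Extractor(Protocol):
    def can_process(self, filename: Path) -> bool: ...

    def run(self, filename: Path) -> list[Annotation]: ...


class TxtExtractor:
    """Hypothetical extractor: every line starting with '> ' is an annotation."""

    def can_process(self, filename: Path) -> bool:
        return filename.suffix == ".txt"

    def run(self, filename: Path) -> list[Annotation]:
        try:
            lines = filename.read_text(encoding="utf-8").splitlines()
        except OSError as e:
            # Surface I/O problems as the error type the caller handles.
            raise ExtractionError(str(e)) from e
        return [
            Annotation(str(filename), line[2:])
            for line in lines
            if line.startswith("> ")
        ]


def collect(extractor: Extractor, files: list[str]) -> list[Annotation]:
    """Simplified version of the loop in start(), taking file paths directly."""
    annotations: list[Annotation] = []
    for file in files:
        fname = Path(file)
        if not extractor.can_process(fname):
            continue
        try:
            annotations.extend(extractor.run(fname))
        except ExtractionError as e:
            print(f"File extraction errors for {file}.\n{e}")
    return annotations
```

Because Extractor is a Protocol, `TxtExtractor` does not need to inherit from it; having matching `can_process` and `run` methods is enough for a type checker to accept it wherever an Extractor is expected.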

Commits touching the lines above (from the blame view), oldest first:

2023-08-28 08:28:06 +00:00  initial commit
2023-08-28 10:53:03 +00:00  Add debug logging for extractor
2023-08-29 10:40:36 +00:00  Move all extraction logic into extractor module. The publicly accessible default interface only contains the command line command interface and a single run function.
2023-08-29 20:23:52 +00:00  Change annotation color to simple rgb tuple
2024-01-20 15:34:10 +00:00  refactor: Remove AnnotatedDocument class. The AnnotatedDocument class was, essentially, a simple tuple of a document and a list of annotations. While not bad in a vacuum, it is unwieldy and passing this around instead of a document, annotations, or both where necessary is more restrictive and frankly unnecessary. This commit removes the data class and any instances of its use. Instead, we now pass the individual components around to anything that needs them. This also frees us up to pass only annotations around for example. We also do not iterate through the selected papis documents to work on in each exporter anymore (since we only pass a single document), but in the main function itself. This leads to less duplication and makes the overall run function the overall single source of iteration through selected documents. Everything else only knows about a single document - the one it is operating on - which seems much neater. For now, it does not change much, but should make later work on extra exporters or extractors easier.
2024-01-20 17:02:18 +00:00  refactor: Extract PDF extractor into class. Extractor is a general protocol with the PDF extraction routine now being one implementation of the protocol. Preparation for adding multiple extractors (epub, djvu, or specific programmes) in the future.
2024-01-23 07:58:32 +00:00  feat: Add extractor cli choice. Can only choose pdf for the time being, but allows additional extractors to be added in the future.
2024-01-23 08:10:42 +00:00  feat: Loop through all chosen extractors
2024-01-24 07:55:43 +00:00  feat: Add pocketbook extraction
2024-06-12 09:46:39 +00:00  chore: Black formatting
2024-06-14 12:59:39 +00:00  refactor: Remove pymupdf coupling in extraction. The library is only needed for pdf extraction which is taken care of in its own extractor plugin. In the overall extraction routine we do not need any knowledge of the existence of pymupdf.
2024-06-14 13:18:02 +00:00  refactor: Fix circular exception import