papis-extract/papis_extract/extractor.py

import re
from pathlib import Path
from typing import Protocol

import fitz
import papis.config
import papis.document
import papis.logging
from papis.document import Document

from papis_extract.annotation import Annotation
from papis_extract.extractors.pdf import PdfExtractor

logger = papis.logging.get_logger(__name__)


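class Extractor(Protocol):
    """Interface every annotation extractor has to implement."""

    def can_process(self, filename: Path) -> bool:
        """Return True if this extractor can handle the given file."""
        ...

    def run(self, filename: Path) -> list[Annotation]:
        """Extract and return all annotations found in the given file."""
        ...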


def start(
    document: Document,
) -> list[Annotation]:
    """Extract all annotations from the passed document.

    Returns all annotations found in the files attached to the
    papis document passed in.
    """

    pdf_extractor: Extractor = PdfExtractor()

    annotations: list[Annotation] = []
    file_available: bool = False
    for file in document.get_files():
        fname = Path(file)
        if not pdf_extractor.can_process(fname):
            continue
        file_available = True

        try:
            annotations.extend(pdf_extractor.run(fname))
        except fitz.FileDataError as e:
            # report through the module logger instead of printing to stdout
            logger.error("File structure errors for %s.\n%s", file, e)

    if not file_available:
        # have to remove curlys or papis logger gets upset
        desc = re.sub("[{}]", "", papis.document.describe(document))
        logger.warning(f"Did not find suitable file for document: {desc}")

    return annotations


extractors: dict[str, Extractor] = {
    "pdf": PdfExtractor(),
}
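
The `extractors` registry above is keyed by name, while `start` currently uses the PDF extractor directly. As a rough sketch of how a registry of `Extractor` implementations could be consulted per file, the following stand-alone example may help. It does not import papis or fitz: `Annotation` here is a simplified stand-in, and `SuffixExtractor`, `extract_all`, and the `"epub"` entry are hypothetical names introduced only for illustration, not part of papis-extract.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class Annotation:
    """Simplified stand-in for papis_extract.annotation.Annotation (assumption)."""

    file: str
    content: str


class Extractor(Protocol):
    def can_process(self, filename: Path) -> bool: ...

    def run(self, filename: Path) -> list[Annotation]: ...


class SuffixExtractor:
    """Hypothetical extractor: claims files by suffix, returns one dummy annotation."""

    def __init__(self, suffix: str) -> None:
        self.suffix = suffix

    def can_process(self, filename: Path) -> bool:
        # Compare case-insensitively so "b.EPUB" matches ".epub".
        return filename.suffix.lower() == self.suffix

    def run(self, filename: Path) -> list[Annotation]:
        return [Annotation(file=str(filename), content=f"dummy note from {filename.name}")]


def extract_all(files: list[Path], registry: dict[str, Extractor]) -> list[Annotation]:
    """Run the first registered extractor that accepts each file."""
    annotations: list[Annotation] = []
    for fname in files:
        for extractor in registry.values():
            if extractor.can_process(fname):
                annotations.extend(extractor.run(fname))
                break  # one extractor per file is enough in this sketch
    return annotations


# SuffixExtractor never subclasses Extractor; structural typing is sufficient.
registry: dict[str, Extractor] = {
    "pdf": SuffixExtractor(".pdf"),
    "epub": SuffixExtractor(".epub"),
}
found = extract_all([Path("a.pdf"), Path("b.EPUB"), Path("c.txt")], registry)
```

Because `Extractor` is a `Protocol`, any class with matching `can_process` and `run` methods can be placed in the registry without inheriting from it. Files that no extractor accepts (`c.txt` here) are simply skipped.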