papis-extract/papis_extract/extraction.py

from pathlib import Path
from typing import Protocol

import papis.logging
from papis.document import Document

from papis_extract.annotation import Annotation
from papis_extract.exceptions import ExtractionError

logger = papis.logging.get_logger(__name__)


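class Extractor(Protocol):
    """Interface implemented by every annotation extractor."""

    # Return True if this extractor can handle the given file.
    def can_process(self, filename: Path) -> bool: ...

    # Return the file's annotations; start() expects ExtractionError on failure.
    def run(self, filename: Path) -> list[Annotation]: ...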


def start(
    extractor: Extractor,
    document: Document,
) -> list[Annotation] | None:
    """Extract all annotations from the passed document.

    Returns all annotations found in the files of the papis
    document passed in (empty list if there are no annotations).
    If the document has no files that the extractor can process,
    returns None instead.
    """
    annotations: list[Annotation] = []
    file_available: bool = False

    for file in document.get_files():
        fname = Path(file)
        if not extractor.can_process(fname):
            continue
        file_available = True

        try:
            annotations.extend(extractor.run(fname))
        except ExtractionError as e:
            logger.error(f"File extraction errors for {file}. File may be damaged.\n{e}")

    if not file_available:
        return None

    return annotations
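
A minimal, self-contained sketch of how an extractor might plug into the `start` routine above. Everything here is a stand-in so the example runs without papis installed: the `Annotation` fields, `FakeDocument`, the `TxtExtractor` class, and the `collect` helper are all hypothetical and are not part of papis-extract. `collect` mirrors the control flow of `start` but leaves out logging and `ExtractionError` handling.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class Annotation:
    """Stand-in for papis_extract.annotation.Annotation (fields are assumed)."""

    file: str
    content: str


class Extractor(Protocol):
    def can_process(self, filename: Path) -> bool: ...

    def run(self, filename: Path) -> list[Annotation]: ...


class TxtExtractor:
    """Hypothetical extractor: accepts only .txt files, yields one dummy note."""

    def can_process(self, filename: Path) -> bool:
        return filename.suffix == ".txt"

    def run(self, filename: Path) -> list[Annotation]:
        return [Annotation(file=str(filename), content="dummy note")]


class FakeDocument:
    """Stand-in for papis.document.Document; only get_files() is needed."""

    def __init__(self, files: list[str]) -> None:
        self._files = files

    def get_files(self) -> list[str]:
        return self._files


def collect(extractor: Extractor, document: FakeDocument) -> list[Annotation] | None:
    """Same loop as start(), without logging or error handling."""
    annotations: list[Annotation] = []
    file_available = False

    for file in document.get_files():
        fname = Path(file)
        if not extractor.can_process(fname):
            continue  # this extractor does not handle the file type
        file_available = True
        annotations.extend(extractor.run(fname))

    # None signals "nothing this extractor could process",
    # which is distinct from an empty list ("processed, no annotations").
    return annotations if file_available else None
```

Because `Extractor` is a `Protocol`, `TxtExtractor` does not need to inherit from it; having matching `can_process` and `run` methods is enough. A caller can tell "no processable file" (`None`) apart from "processable file without annotations" (`[]`), which is what lets the calling code report only documents that no extractor could handle.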