AmazonTextractPDFParser

class langchain_community.document_loaders.parsers.pdf.AmazonTextractPDFParser(
textract_features: Sequence[int] | None = None,
client: Any | None = None,
*,
linearization_config: TextLinearizationConfig | None = None,
)

Send PDF files to Amazon Textract and parse them.

Multi-page PDFs must reside on S3 to be parsed.

The AmazonTextractPDFLoader calls the [Amazon Textract Service](https://aws.amazon.com/textract/) to convert PDFs into a Document structure. Single-page and multi-page documents are supported, up to 3000 pages and 512 MB in size.

For the call to succeed, an AWS account is required, with requirements similar to those of the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

Apart from the AWS configuration, it works much like the other PDF loaders, and it also supports JPEG, PNG, TIFF, and non-native PDF formats.

```python
from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()
```

One feature is the linearization of the output. When the LAYOUT, FORMS, or TABLES features are used together with Textract:

```python
from langchain_community.document_loaders import AmazonTextractPDFLoader

# you can mix and match each of the features
loader = AmazonTextractPDFLoader(
    "example_data/alejandro_rosalez_sample-small.jpeg",
    textract_features=["TABLES", "LAYOUT"],
)
documents = loader.load()
```

the generated output presents the text in reading order and tries to render tables in a tabular structure and key/value pairs with a colon (key: value). This helps most LLMs achieve better accuracy when processing these texts.

Document objects are returned with metadata that includes the source and, under page, a 1-based page index. Note that page is the index of the result returned from Textract, not necessarily the page number printed in the document.
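As a minimal sketch of how the page metadata might be used, the helper below orders documents by their 1-based page index and joins the text. The name join_pages is illustrative and not part of LangChain. It only assumes that each document exposes page_content and a metadata dict with a page key, as described above. The sample objects are stand-ins, so no Textract call is made.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def join_pages(documents: Iterable[Any], separator: str = "\n\n") -> str:
    """Concatenate page texts in ascending order of metadata["page"]."""
    ordered = sorted(documents, key=lambda doc: doc.metadata["page"])
    return separator.join(doc.page_content for doc in ordered)


# Stand-in objects shaped like the Documents described above (hypothetical data).
pages = [
    SimpleNamespace(page_content="second", metadata={"source": "s3://bucket/doc.pdf", "page": 2}),
    SimpleNamespace(page_content="first", metadata={"source": "s3://bucket/doc.pdf", "page": 1}),
]
print(join_pages(pages, separator=" | "))  # prints: first | second
```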

Initializes the parser.

Parameters:
  • textract_features (Optional[Sequence[int]]) – Features to be used for extraction. Each feature should be passed as an int that conforms to the Textract_Features enum; see the amazon-textract-caller package.

  • client (Optional[Any]) – boto3 Textract client

  • linearization_config (Optional[TextLinearizationConfig]) – Config to be used for linearization of the output. It should be an instance of TextLinearizationConfig from the textractor package.

Methods

__init__([textract_features, client, ...])

Initializes the parser.

lazy_parse(blob)

Iterates over the Blob pages and returns an Iterator with a Document for each page, like the other parsers. For a multi-page document, blob.path has to be set to the S3 URI; for single-page documents, blob.data is used.

parse(blob)

Eagerly parse the blob into a document or documents.

__init__(
textract_features: Sequence[int] | None = None,
client: Any | None = None,
*,
linearization_config: TextLinearizationConfig | None = None,
) -> None

Initializes the parser.

Parameters:
  • textract_features (Optional[Sequence[int]]) – Features to be used for extraction. Each feature should be passed as an int that conforms to the Textract_Features enum; see the amazon-textract-caller package.

  • client (Optional[Any]) – boto3 Textract client

  • linearization_config (Optional[TextLinearizationConfig]) – Config to be used for linearization of the output. It should be an instance of TextLinearizationConfig from the textractor package.

Return type:

None
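The three parameters could be combined roughly as follows. This is a sketch, not a reference implementation: build_parser is a hypothetical helper, the import paths for Textract_Features and TextLinearizationConfig assume the amazon-textract-caller and amazon-textract-textractor packages are installed, and the two hide_* options are assumed to exist in the installed textractor version. Credentials are expected to come from the usual AWS configuration.

```python
from typing import Any, Optional, Sequence


def build_parser(feature_names: Sequence[str], region_name: Optional[str] = None) -> Any:
    """Create an AmazonTextractPDFParser from feature names such as "TABLES"."""
    # Imports are deferred so this module loads even without the AWS extras installed.
    import boto3
    from langchain_community.document_loaders.parsers.pdf import AmazonTextractPDFParser
    from textractcaller import Textract_Features
    from textractor.data.text_linearization_config import TextLinearizationConfig

    # The parser expects ints that conform to the Textract_Features enum.
    features = [Textract_Features[name].value for name in feature_names]

    # Assumption: credentials are resolved through the standard AWS configuration.
    client = boto3.client("textract", region_name=region_name)

    # Assumption: these options exist in the installed textractor version.
    config = TextLinearizationConfig(hide_header_layout=True, hide_footer_layout=True)

    return AmazonTextractPDFParser(
        textract_features=features,
        client=client,
        linearization_config=config,
    )
```

Calling build_parser(["TABLES", "LAYOUT"]) would then return a parser configured for both features; an unknown feature name raises a KeyError from the enum lookup.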

lazy_parse(
blob: Blob,
) -> Iterator[Document]

Iterates over the Blob pages and returns an Iterator with a Document for each page, like the other parsers. For a multi-page document, blob.path has to be set to the S3 URI; for single-page documents, blob.data is used.

Parameters:

blob (Blob)

Return type:

Iterator[Document]
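The S3-versus-local rule above might be applied like this when building the Blob. The helpers is_s3_uri and iter_textract_pages are illustrative names, not LangChain APIs, and the Blob import path assumes a recent langchain_core release. The parser argument is assumed to be an already constructed AmazonTextractPDFParser.

```python
from typing import Any, Iterator
from urllib.parse import urlparse


def is_s3_uri(location: str) -> bool:
    """Return True for locations of the form s3://bucket/key."""
    parsed = urlparse(location)
    return parsed.scheme == "s3" and bool(parsed.netloc)


def iter_textract_pages(parser: Any, location: str) -> Iterator[Any]:
    """Yield one Document per page from an S3 URI or a local single-page file."""
    # Deferred import so this sketch can be loaded without LangChain installed.
    from langchain_core.documents.base import Blob

    if is_s3_uri(location):
        # Multi-page case: only the path is set; the document is read from S3.
        blob = Blob(path=location)
    else:
        # Single-page case: the bytes of the local file are used.
        blob = Blob.from_path(location)
    yield from parser.lazy_parse(blob)
```

Because iter_textract_pages is a generator, nothing is sent to Textract until the first item is requested.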

parse(blob: Blob) -> list[Document]

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters:

blob (Blob) – Blob instance

Returns:

List of documents

Return type:

list[Document]
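The difference between the eager and lazy methods can be illustrated with a stand-in class. CountingParser below is hypothetical and makes no Textract call; it only shows the general pattern in which parse materializes everything that lazy_parse yields, while a consumer of lazy_parse can stop early. It says nothing about when the Textract request itself is made.

```python
from itertools import islice
from typing import Iterator, List


class CountingParser:
    """Stand-in parser that records how many pages were actually produced."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.produced = 0

    def lazy_parse(self, blob: object) -> Iterator[str]:
        for text in self.pages:
            self.produced += 1
            yield text

    def parse(self, blob: object) -> List[str]:
        # Eager variant: materialize every page into a list.
        return list(self.lazy_parse(blob))


parser = CountingParser(["page 1", "page 2", "page 3"])

# Lazy: take only the first page; later pages are never produced.
first_page = list(islice(parser.lazy_parse(None), 1))
lazy_count = parser.produced

# Eager: every page is produced before the call returns.
parser.produced = 0
all_pages = parser.parse(None)
eager_count = parser.produced
```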