AmazonTextractPDFLoader
- class langchain_community.document_loaders.pdf.AmazonTextractPDFLoader(
- file_path: str | PurePath,
- textract_features: Sequence[str] | None = None,
- client: Any | None = None,
- credentials_profile_name: str | None = None,
- region_name: str | None = None,
- endpoint_url: str | None = None,
- headers: dict | None = None,
- *,
- linearization_config: TextLinearizationConfig | None = None,
- )
Load PDF files from a local file system, HTTP or S3.
To authenticate, the AWS client automatically loads credentials using the methods described at https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
To use a specific credential profile, pass its name from the ~/.aws/credentials file as credentials_profile_name.
Make sure the credentials / roles used have the required policies to access the Amazon Textract service.
Example
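A minimal usage sketch, assuming `langchain-community`, `boto3`, and `amazon-textract-caller` are installed and AWS credentials are configured. The helper names `loader_kwargs` and `load_pdf`, and the bucket and key in the commented call, are hypothetical and only illustrate how the constructor arguments fit together.

```python
from typing import Any, Dict, List, Optional, Sequence


def loader_kwargs(
    file_path: str,
    textract_features: Optional[Sequence[str]] = None,
    credentials_profile_name: Optional[str] = None,
    region_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Build constructor arguments, leaving out options that were not set."""
    kwargs: Dict[str, Any] = {"file_path": file_path}
    if textract_features is not None:
        kwargs["textract_features"] = list(textract_features)
    if credentials_profile_name is not None:
        kwargs["credentials_profile_name"] = credentials_profile_name
    if region_name is not None:
        kwargs["region_name"] = region_name
    return kwargs


def load_pdf(**kwargs: Any) -> List[Any]:
    """Create the loader and return its Documents (this calls Amazon Textract)."""
    # Imported here so loader_kwargs can be used without the AWS dependencies.
    from langchain_community.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader(**kwargs)
    return loader.load()


# Hypothetical call; needs valid AWS credentials and an existing S3 object:
# docs = load_pdf(**loader_kwargs("s3://my-bucket/my-file.pdf", region_name="us-east-1"))
```

Passing only the options that are set keeps the loader on its defaults, including the default boto3 credential chain.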
Initialize the loader.
- Parameters:
file_path (str | PurePath) – A file, URL, or S3 path for the input file
textract_features (Sequence[str] | None) – Features to be used for extraction. Each feature should be passed as a str that matches a name in the Textract_Features enum; see the amazon-textract-caller package
client (Any | None) – boto3 Textract client (Optional)
credentials_profile_name (str | None) – AWS profile name, if not default (Optional)
region_name (str | None) – AWS region, e.g. us-east-1 (Optional)
endpoint_url (str | None) – Endpoint URL for the Textract service (Optional)
linearization_config (TextLinearizationConfig | None) – Config to be used for linearization of the output. Should be an instance of TextLinearizationConfig from the textractor package
headers (dict | None)
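Since `file_path` accepts a local file, a URL, or an S3 path, it can help to check which kind a given value is before calling the loader. The sketch below is an illustration only: `source_kind` is a hypothetical helper, and it is not the logic the loader itself uses.

```python
from urllib.parse import urlparse


def source_kind(file_path: str) -> str:
    """Classify a path as 's3', 'http', or 'local' from its URL scheme."""
    scheme = urlparse(str(file_path)).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "http"
    # Anything else (no scheme, or a drive letter) is treated as a local path.
    return "local"
```

For example, `source_kind("s3://my-bucket/my-file.pdf")` yields `"s3"`, while a plain filesystem path yields `"local"`.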
Attributes
source
Methods
__init__(file_path[, textract_features, ...]) – Initialize the loader.
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
lazy_load() – Lazy load documents.
load() – Load given path as pages.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(
- file_path: str | PurePath,
- textract_features: Sequence[str] | None = None,
- client: Any | None = None,
- credentials_profile_name: str | None = None,
- region_name: str | None = None,
- endpoint_url: str | None = None,
- headers: dict | None = None,
- *,
- linearization_config: TextLinearizationConfig | None = None,
- )
Initialize the loader.
- Parameters:
file_path (str | PurePath) – A file, URL, or S3 path for the input file
textract_features (Sequence[str] | None) – Features to be used for extraction. Each feature should be passed as a str that matches a name in the Textract_Features enum; see the amazon-textract-caller package
client (Any | None) – boto3 Textract client (Optional)
credentials_profile_name (str | None) – AWS profile name, if not default (Optional)
region_name (str | None) – AWS region, e.g. us-east-1 (Optional)
endpoint_url (str | None) – Endpoint URL for the Textract service (Optional)
linearization_config (TextLinearizationConfig | None) – Config to be used for linearization of the output. Should be an instance of TextLinearizationConfig from the textractor package
headers (dict | None)
- Return type:
None
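Because `textract_features` takes plain strings that must match enum names, a typo is easy to miss until the loader is constructed. The sketch below checks names up front. `FeatureName` is a local stand-in, and its member names are an assumption about what `Textract_Features` in amazon-textract-caller defines; confirm them against the installed package before relying on this.

```python
from enum import Enum
from typing import List, Sequence


class FeatureName(Enum):
    """Local stand-in for Textract_Features; member names are assumed."""

    FORMS = 1
    TABLES = 2
    QUERIES = 3
    SIGNATURES = 4
    LAYOUT = 5


def check_features(features: Sequence[str]) -> List[str]:
    """Return the features unchanged, or raise if any name is not recognized."""
    unknown = [name for name in features if name not in FeatureName.__members__]
    if unknown:
        raise ValueError(f"Unknown Textract feature(s): {unknown}")
    return list(features)
```

The check is case-sensitive, so `"tables"` is rejected while `"TABLES"` passes.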
- async alazy_load() -> AsyncIterator[Document]
A lazy loader for Documents.
- Return type:
AsyncIterator[Document]
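An `AsyncIterator` is consumed with `async for`, which allows stopping early without materializing every Document. The sketch below shows that pattern. `FakeLoader` is a stand-in that only mimics the shape of `alazy_load()` so the example runs without AWS access, and `first_n` is a hypothetical helper.

```python
import asyncio
from typing import Any, AsyncIterator, List


class FakeLoader:
    """Stand-in exposing an alazy_load() async generator like a document loader."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = list(pages)

    async def alazy_load(self) -> AsyncIterator[str]:
        for page in self._pages:
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield page


async def first_n(loader: Any, n: int) -> List[Any]:
    """Collect at most n items from loader.alazy_load()."""
    collected: List[Any] = []
    if n <= 0:
        return collected
    async for doc in loader.alazy_load():
        collected.append(doc)
        if len(collected) >= n:
            break
    return collected
```

With a real loader, the same `first_n` helper could be awaited from inside an existing event loop, or run with `asyncio.run` from synchronous code.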
- load_and_split(
- text_splitter: TextSplitter | None = None,
- ) -> list[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered to be deprecated!
- Parameters:
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
list[Document]
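Since `load_and_split` should be considered deprecated, one alternative is to load first and then split explicitly. The sketch below assumes the `langchain-text-splitters` package is installed when no splitter is supplied; `load_then_split` is a hypothetical helper, not part of the loader.

```python
from typing import Any, List, Optional


def load_then_split(loader: Any, text_splitter: Optional[Any] = None) -> List[Any]:
    """Load Documents, then split them into chunks with the given splitter."""
    if text_splitter is None:
        # Same default that load_and_split documents; imported only when needed.
        from langchain_text_splitters import RecursiveCharacterTextSplitter

        text_splitter = RecursiveCharacterTextSplitter()
    docs = loader.load()
    return text_splitter.split_documents(docs)
```

Keeping the two steps separate makes it straightforward to inspect the loaded Documents before choosing chunking settings.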