Document Processing
ragbits.document_search.ingestion.document_processor.DocumentProcessorRouter
The DocumentProcessorRouter routes each document to the correct provider based on the document metadata, such as the document type.
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/document_processor.py
from_dict_to_providers_config
staticmethod
Creates a ProvidersConfig from a dictionary that maps document types to the provider configuration.
PARAMETER | DESCRIPTION |
---|---|
dict_config | The dictionary with configuration. |
RETURNS | DESCRIPTION |
---|---|
ProvidersConfig | ProvidersConfig object. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If a provider class can't be found or is not the correct type. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/document_processor.py
from_config
classmethod
from_config(providers: ProvidersConfig | None = None) -> DocumentProcessorRouter
Creates a DocumentProcessorRouter from a configuration. If no configuration is provided, the default configuration is used. If a configuration is provided, it is merged with the default configuration, overriding the default values for the document types that it defines.
Example of the configuration:
{
    DocumentType.TXT: YourCustomProviderClass(),
    DocumentType.PDF: UnstructuredProvider(),
}
PARAMETER | DESCRIPTION |
---|---|
providers | The dictionary with the providers configuration, mapping the document types to the provider class. |
RETURNS | DESCRIPTION |
---|---|
DocumentProcessorRouter | The DocumentProcessorRouter. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/document_processor.py
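The merge behavior described for from_config can be pictured with plain dictionaries. This is a minimal sketch under stated assumptions, not the library's implementation: `DocumentType`, `DEFAULT_PROVIDERS`, and `merge_providers` are hypothetical stand-ins, and providers are represented by strings only to keep the example self-contained.

```python
from enum import Enum


class DocumentType(Enum):
    """Hypothetical stand-in for the library's document type enum."""

    TXT = "txt"
    PDF = "pdf"
    MD = "md"


# Hypothetical defaults; the real default mapping lives in the library.
DEFAULT_PROVIDERS = {
    DocumentType.TXT: "default-text-provider",
    DocumentType.PDF: "default-pdf-provider",
    DocumentType.MD: "default-text-provider",
}


def merge_providers(overrides=None):
    """Return the defaults with user overrides applied per document type."""
    merged = dict(DEFAULT_PROVIDERS)  # copy so the defaults stay untouched
    merged.update(overrides or {})    # only the listed types are replaced
    return merged


merged = merge_providers({DocumentType.TXT: "custom-text-provider"})
```

In this sketch only the TXT entry changes; PDF and MD keep their default providers, which mirrors the "override only what is defined" rule stated above.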
get_provider
get_provider(document_meta: DocumentMeta) -> BaseProvider
Get the provider for the document.
PARAMETER | DESCRIPTION |
---|---|
document_meta | The document metadata. |
RETURNS | DESCRIPTION |
---|---|
BaseProvider | The provider for processing the document. |
RAISES | DESCRIPTION |
---|---|
ValueError | If no provider is found for the document type. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/document_processor.py
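The lookup performed by get_provider, including the ValueError for unmapped types, might look roughly like the sketch below. `DocumentType`, `DocumentMeta`, and `SketchRouter` are hypothetical stand-ins written for this example; the real classes carry more state and behavior.

```python
from dataclasses import dataclass
from enum import Enum


class DocumentType(Enum):
    """Hypothetical stand-in for the library's document type enum."""

    TXT = "txt"
    PDF = "pdf"


@dataclass
class DocumentMeta:
    """Hypothetical stand-in; only the document type matters for routing."""

    document_type: DocumentType


class SketchRouter:
    """Maps document types to providers and looks them up on request."""

    def __init__(self, providers):
        self._providers = providers

    def get_provider(self, document_meta):
        provider = self._providers.get(document_meta.document_type)
        if provider is None:
            # Mirrors the documented behavior: unmapped types raise ValueError.
            raise ValueError(
                f"No provider found for the document type {document_meta.document_type}"
            )
        return provider


router = SketchRouter({DocumentType.TXT: "text-provider"})
txt_provider = router.get_provider(DocumentMeta(DocumentType.TXT))

try:
    router.get_provider(DocumentMeta(DocumentType.PDF))
    missing_error = None
except ValueError as exc:
    missing_error = exc
```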
Providers
ragbits.document_search.ingestion.providers.base.BaseProvider
Bases: WithConstructionConfig, ABC
A base class for the document processing providers.
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
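A path in the "module.submodule:factory_name" format can be resolved with the standard library alone. The helper below, `resolve_factory`, is a hypothetical illustration rather than the function the library uses, and it raises ValueError where the library documents InvalidConfigError. The target "collections:OrderedDict" is chosen only because it is always importable.

```python
import importlib


def resolve_factory(factory_path: str):
    """Resolve "module.submodule:factory_name" to the callable it names."""
    module_name, sep, attr_name = factory_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(
            f"Expected 'module.submodule:factory_name', got {factory_path!r}"
        )
    module = importlib.import_module(module_name)  # import the module part
    return getattr(module, attr_name)              # fetch the named attribute


# Standard-library target used purely for illustration.
factory = resolve_factory("collections:OrderedDict")
instance = factory(a=1)
```

A real implementation would additionally check that the object returned by the factory is an instance of the expected class, as the RAISES table above describes.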
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take higher precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
process
abstractmethod
async
process(document_meta: DocumentMeta) -> Sequence[Element | IntermediateElement]
Process the document.
PARAMETER | DESCRIPTION |
---|---|
document_meta | The document to process. |
RETURNS | DESCRIPTION |
---|---|
Sequence[Element | IntermediateElement] | The list of elements extracted from the document. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/base.py
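The contract above (an abstract async process plus a document type check) can be sketched without the library. Everything in the block is a hypothetical stand-in: `SketchBaseProvider`, `UppercaseProvider`, and the simplified `DocumentMeta` and `TextElement` dataclasses exist only for this example, and the real DocumentMeta does not carry its content inline.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class DocumentMeta:
    """Simplified stand-in: a type tag and the raw text."""

    document_type: str
    content: str


@dataclass
class TextElement:
    """Simplified stand-in for an extracted element."""

    content: str


class DocumentTypeNotSupportedError(Exception):
    """Stand-in for the error raised on unsupported document types."""


class SketchBaseProvider(ABC):
    SUPPORTED_DOCUMENT_TYPES: set = set()

    @abstractmethod
    async def process(self, document_meta):
        """Subclasses turn a document into a sequence of elements."""

    def validate_document_type(self, document_type):
        if document_type not in self.SUPPORTED_DOCUMENT_TYPES:
            raise DocumentTypeNotSupportedError(document_type)


class UppercaseProvider(SketchBaseProvider):
    SUPPORTED_DOCUMENT_TYPES = {"txt"}

    async def process(self, document_meta):
        # Reject unsupported types before doing any work.
        self.validate_document_type(document_meta.document_type)
        return [TextElement(content=document_meta.content.upper())]


elements = asyncio.run(UppercaseProvider().process(DocumentMeta("txt", "hello")))
```

Because process is a coroutine, callers either await it inside async code or, as here, drive it with asyncio.run.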
validate_document_type
validate_document_type(document_type: DocumentType) -> None
Check if the provider supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type. |
RAISES | DESCRIPTION |
---|---|
DocumentTypeNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/base.py
ragbits.document_search.ingestion.providers.dummy.DummyProvider
Bases: BaseProvider
This is a mock provider that returns a TextElement with the content of the document. It should be used for testing purposes only.
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take higher precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
validate_document_type
validate_document_type(document_type: DocumentType) -> None
Check if the provider supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type. |
RAISES | DESCRIPTION |
---|---|
DocumentTypeNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/base.py
process
async
process(document_meta: DocumentMeta) -> list[Element | IntermediateElement]
Process the text document.
PARAMETER | DESCRIPTION |
---|---|
document_meta | The document to process. |
RETURNS | DESCRIPTION |
---|---|
list[Element | IntermediateElement] | List with a single TextElement containing the content of the document. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/dummy.py
ragbits.document_search.ingestion.providers.unstructured.UnstructuredDefaultProvider
UnstructuredDefaultProvider(partition_kwargs: dict | None = None, chunking_kwargs: dict | None = None, api_key: str | None = None, api_server: str | None = None, use_api: bool = False, ignore_images: bool = False)
Bases: BaseProvider
A provider that uses the Unstructured API or local SDK to process the documents.
Initialize the UnstructuredDefaultProvider.
PARAMETER | DESCRIPTION |
---|---|
partition_kwargs | The additional arguments for the partitioning. Refer to the Unstructured API documentation for the available options: https://docs.unstructured.io/api-reference/api-services/api-parameters |
chunking_kwargs | The additional arguments for the chunking. |
api_key | The API key to use for the Unstructured API. If not specified, the UNSTRUCTURED_API_KEY environment variable will be used. |
api_server | The API server URL to use for the Unstructured API. If not specified, the UNSTRUCTURED_SERVER_URL environment variable will be used. |
use_api | Whether to use the Unstructured API. If False, the local version of the Unstructured library is used. |
ignore_images | If True, images will be skipped. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/default.py
SUPPORTED_DOCUMENT_TYPES
class-attribute
instance-attribute
SUPPORTED_DOCUMENT_TYPES = {TXT, MD, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV, HTML, EPUB, ORG, ODT, RST, RTF, TSV, XML}
partition_kwargs
instance-attribute
partition_kwargs = partition_kwargs or DEFAULT_PARTITION_KWARGS
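The `or` fallback shown in this assignment has one consequence worth spelling out: any falsy value, including an empty dictionary, selects the defaults. The sketch below demonstrates the idiom in isolation. The contents of `DEFAULT_PARTITION_KWARGS` here are invented for the example, and `pick_partition_kwargs` is a hypothetical helper, not part of the library.

```python
# Hypothetical default values; the library defines its own.
DEFAULT_PARTITION_KWARGS = {"strategy": "hi_res"}


def pick_partition_kwargs(partition_kwargs=None):
    """Apply the same `value or default` idiom as the attribute above."""
    return partition_kwargs or DEFAULT_PARTITION_KWARGS


from_none = pick_partition_kwargs(None)                   # falls back to defaults
from_empty = pick_partition_kwargs({})                    # {} is falsy: defaults again
from_custom = pick_partition_kwargs({"strategy": "fast"}) # non-empty dict is kept
```

Note that the fallback returns the shared default object rather than a copy, so mutating the result would change the defaults for later callers in this sketch.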
client
property
Get the UnstructuredClient instance. If the client is not initialized, it will be created.
RETURNS | DESCRIPTION |
---|---|
UnstructuredClient | The UnstructuredClient instance. |
RAISES | DESCRIPTION |
---|---|
ValueError | If the UNSTRUCTURED_API_KEY environment variable is not set. |
ValueError | If the UNSTRUCTURED_SERVER_URL environment variable is not set. |
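A lazily created client with an environment-variable fallback could be structured as follows. This is a sketch under stated assumptions: `SketchClient` and `SketchProvider` are hypothetical stand-ins, and the exact order of checks in the library may differ. Only the two environment variable names come from the parameter descriptions above.

```python
import os


class SketchClient:
    """Hypothetical stand-in for the real API client."""

    def __init__(self, api_key: str, server_url: str) -> None:
        self.api_key = api_key
        self.server_url = server_url


class SketchProvider:
    def __init__(self, api_key=None, api_server=None):
        self._api_key = api_key
        self._api_server = api_server
        self._client = None  # created on first access

    @property
    def client(self) -> SketchClient:
        if self._client is None:
            # Explicit arguments win; the environment is the fallback.
            api_key = self._api_key or os.environ.get("UNSTRUCTURED_API_KEY")
            if not api_key:
                raise ValueError("UNSTRUCTURED_API_KEY is not set")
            server_url = self._api_server or os.environ.get("UNSTRUCTURED_SERVER_URL")
            if not server_url:
                raise ValueError("UNSTRUCTURED_SERVER_URL is not set")
            self._client = SketchClient(api_key, server_url)
        return self._client


provider = SketchProvider(api_key="example-key", api_server="https://example.invalid")
first = provider.client
second = provider.client  # reuses the instance created above
```

Deferring construction to first access means a provider that only uses local processing never needs the credentials at all.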
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take higher precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
validate_document_type
validate_document_type(document_type: DocumentType) -> None
Check if the provider supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type. |
RAISES | DESCRIPTION |
---|---|
DocumentTypeNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/base.py
process
async
process(document_meta: DocumentMeta) -> Sequence[Element | IntermediateElement]
Process the document using the Unstructured API.
PARAMETER | DESCRIPTION |
---|---|
document_meta | The document to process. |
RETURNS | DESCRIPTION |
---|---|
Sequence[Element | IntermediateElement] | The list of elements extracted from the document. |
RAISES | DESCRIPTION |
---|---|
DocumentTypeNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/default.py
ragbits.document_search.ingestion.providers.unstructured.UnstructuredImageProvider
UnstructuredImageProvider(partition_kwargs: dict | None = None, chunking_kwargs: dict | None = None, api_key: str | None = None, api_server: str | None = None, use_api: bool = False)
Bases: UnstructuredDefaultProvider
A specialized provider that handles PNG and JPG images using Unstructured.
Initialize the UnstructuredImageProvider.
PARAMETER | DESCRIPTION |
---|---|
partition_kwargs | The additional arguments for the partitioning. Refer to the Unstructured API documentation for the available options: https://docs.unstructured.io/api-reference/api-services/api-parameters |
chunking_kwargs | The additional arguments for the chunking. |
api_key | The API key to use for the Unstructured API. If not specified, the UNSTRUCTURED_API_KEY environment variable will be used. |
api_server | The API server URL to use for the Unstructured API. If not specified, the UNSTRUCTURED_SERVER_URL environment variable will be used. |
use_api | Whether to use the Unstructured API. If False, the provider will only use the local processing. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/images.py
partition_kwargs
instance-attribute
partition_kwargs = partition_kwargs or DEFAULT_PARTITION_KWARGS
client
property
Get the UnstructuredClient instance. If the client is not initialized, it will be created.
RETURNS | DESCRIPTION |
---|---|
UnstructuredClient | The UnstructuredClient instance. |
RAISES | DESCRIPTION |
---|---|
ValueError | If the UNSTRUCTURED_API_KEY environment variable is not set. |
ValueError | If the UNSTRUCTURED_SERVER_URL environment variable is not set. |
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take higher precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
process
async
process(document_meta: DocumentMeta) -> Sequence[Element | IntermediateElement]
Process the document using the Unstructured API.
PARAMETER | DESCRIPTION |
---|---|
document_meta | The document to process. |
RETURNS | DESCRIPTION |
---|---|
Sequence[Element | IntermediateElement] | The list of elements extracted from the document. |
RAISES | DESCRIPTION |
---|---|
DocumentTypeNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/default.py
validate_document_type
validate_document_type(document_type: DocumentType) -> None
Check if the provider supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type. |
RAISES | DESCRIPTION |
---|---|
DocumentTypeNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/base.py
ragbits.document_search.ingestion.providers.unstructured.UnstructuredPdfProvider
UnstructuredPdfProvider(partition_kwargs: dict | None = None, chunking_kwargs: dict | None = None, api_key: str | None = None, api_server: str | None = None, use_api: bool = False)
Bases: UnstructuredImageProvider
A specialized provider that handles PDFs using Unstructured.
Initialize the UnstructuredPdfProvider.
PARAMETER | DESCRIPTION |
---|---|
partition_kwargs | The additional arguments for the partitioning. Refer to the Unstructured API documentation for the available options: https://docs.unstructured.io/api-reference/api-services/api-parameters |
chunking_kwargs | The additional arguments for the chunking. |
api_key | The API key to use for the Unstructured API. If not specified, the UNSTRUCTURED_API_KEY environment variable will be used. |
api_server | The API server URL to use for the Unstructured API. If not specified, the UNSTRUCTURED_SERVER_URL environment variable will be used. |
use_api | Whether to use the Unstructured API. If False, the provider will only use the local processing. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/images.py
partition_kwargs
instance-attribute
partition_kwargs = partition_kwargs or DEFAULT_PARTITION_KWARGS
client
property
Get the UnstructuredClient instance. If the client is not initialized, it will be created.
RETURNS | DESCRIPTION |
---|---|
UnstructuredClient | The UnstructuredClient instance. |
RAISES | DESCRIPTION |
---|---|
ValueError | If the UNSTRUCTURED_API_KEY environment variable is not set. |
ValueError | If the UNSTRUCTURED_SERVER_URL environment variable is not set. |
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take higher precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
process
async
process(document_meta: DocumentMeta) -> Sequence[Element | IntermediateElement]
Process the document using the Unstructured API.
PARAMETER | DESCRIPTION |
---|---|
document_meta | The document to process. |
RETURNS | DESCRIPTION |
---|---|
Sequence[Element | IntermediateElement] | The list of elements extracted from the document. |
RAISES | DESCRIPTION |
---|---|
DocumentTypeNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/providers/unstructured/default.py
validate_document_type
validate_document_type(document_type: DocumentType) -> None
Check if the provider supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type. |
RAISES | DESCRIPTION |
---|---|
DocumentTypeNotSupportedError | If the document type is not supported. |