Document Parsers
ragbits.document_search.ingestion.parsers.router.DocumentParserRouter
DocumentParserRouter(parsers: Mapping[DocumentType, DocumentParser] | None = None)
Bases: WithConstructionConfig
The class responsible for routing the document to the correct parser based on the document type.
Initialize the DocumentParserRouter instance.
PARAMETER | DESCRIPTION |
---|---|
parsers | The mapping of document types to their parsers, used to override the default Unstructured parsers. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/router.py
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
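The "module.submodule:factory_name" format can be illustrated with a minimal sketch. The helper `resolve_factory` below is hypothetical and is not part of ragbits; it only shows how such a string could be split and resolved with the standard library. The standard-library `collections:Counter` stands in for a real factory function.

```python
import importlib
from typing import Any, Callable


def resolve_factory(factory_path: str) -> Callable[..., Any]:
    """Hypothetical helper: resolve "module.submodule:factory_name" to a callable."""
    module_path, sep, factory_name = factory_path.partition(":")
    if not sep or not module_path or not factory_name:
        raise ValueError(
            f"Expected 'module.submodule:factory_name', got {factory_path!r}"
        )
    # Import the module part, then look up the factory by name.
    module = importlib.import_module(module_path)
    factory = getattr(module, factory_name)
    if not callable(factory):
        raise TypeError(f"{factory_path!r} does not point to a callable")
    return factory


# Stand-in for a real factory: calling it returns a new instance.
make_counter = resolve_factory("collections:Counter")
instance = make_counter()
```

In this sketch a malformed path fails early with `ValueError`; the library itself reports configuration problems with `InvalidConfigError`, as listed above.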
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initialize the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | The DocumentParserRouter. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If any of the provided parsers cannot be initialized. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/router.py
get
get(document_type: DocumentType) -> DocumentParser
Get the parser for the document.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type. |
RETURNS | DESCRIPTION |
---|---|
DocumentParser | The parser for processing the document. |
RAISES | DESCRIPTION |
---|---|
ParserNotFoundError | If no parser is found for the document type. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/router.py
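The routing behaviour described above (user-supplied parsers override the defaults, and a missing entry raises an error) can be sketched as follows. Every class here is a self-contained stand-in rather than the ragbits implementation: `SketchRouter` is hypothetical, the parsers are plain strings, and only three document types are modelled. Merging the overrides over a default mapping is an assumption about how the override works.

```python
from enum import Enum


class DocumentType(str, Enum):
    """Stand-in for the library's DocumentType; only three members are modelled."""

    TXT = "txt"
    MD = "md"
    PDF = "pdf"


class ParserNotFoundError(Exception):
    """Stand-in error raised when no parser is registered for a document type."""

    def __init__(self, document_type: DocumentType) -> None:
        super().__init__(f"No parser found for the document type {document_type.value}")
        self.document_type = document_type


class SketchRouter:
    """Hypothetical router: user-supplied parsers override the defaults."""

    def __init__(self, defaults, parsers=None):
        # Later keys win, so entries in `parsers` replace the defaults.
        self._parsers = {**defaults, **(parsers or {})}

    def get(self, document_type):
        parser = self._parsers.get(document_type)
        if parser is None:
            raise ParserNotFoundError(document_type)
        return parser


router = SketchRouter(
    defaults={
        DocumentType.TXT: "default-text-parser",
        DocumentType.MD: "default-text-parser",
    },
    parsers={DocumentType.MD: "custom-markdown-parser"},
)
```

With this setup, `router.get(DocumentType.MD)` yields the overriding entry, while `router.get(DocumentType.PDF)` raises `ParserNotFoundError` because nothing is registered for that type.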
ragbits.document_search.ingestion.parsers.base.DocumentParser
Bases: WithConstructionConfig, ABC
Base class for document parsers, responsible for converting the document into a list of elements.
supported_document_types
class-attribute
instance-attribute
supported_document_types: set[DocumentType] = set()
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
parse
abstractmethod
async
Parse the document.
PARAMETER | DESCRIPTION |
---|---|
document | The document to parse. |
RETURNS | DESCRIPTION |
---|---|
list[Element] | The list of elements extracted from the document. |
RAISES | DESCRIPTION |
---|---|
ParserError | If the parsing of the document failed. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/base.py
validate_document_type
classmethod
validate_document_type(document_type: DocumentType) -> None
Check if the parser supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type to validate against the parser. |
RAISES | DESCRIPTION |
---|---|
ParserDocumentNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/base.py
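The contract described above (a class-level set of supported types, a validating classmethod, and an async `parse` that returns a list of elements) can be sketched with stand-in classes. Nothing below imports ragbits: `SketchParser`, `SketchDocument` and `ParagraphParser` are hypothetical, document types are plain strings, and elements are plain strings rather than the library's `Element` objects.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


class ParserDocumentNotSupportedError(Exception):
    """Stand-in for the error of the same name."""


@dataclass
class SketchDocument:
    """Hypothetical document: a type label plus raw text content."""

    document_type: str
    content: str


class SketchParser(ABC):
    """Stand-in base class mirroring the interface described above."""

    supported_document_types: set[str] = set()

    @classmethod
    def validate_document_type(cls, document_type: str) -> None:
        if document_type not in cls.supported_document_types:
            raise ParserDocumentNotSupportedError(
                f"{cls.__name__} does not support {document_type!r}"
            )

    @abstractmethod
    async def parse(self, document: SketchDocument) -> list[str]:
        """Convert the document into a list of elements."""


class ParagraphParser(SketchParser):
    """Hypothetical parser that emits one element per paragraph."""

    supported_document_types = {"txt", "md"}

    async def parse(self, document: SketchDocument) -> list[str]:
        # Reject unsupported types before doing any work.
        self.validate_document_type(document.document_type)
        return [p.strip() for p in document.content.split("\n\n") if p.strip()]


elements = asyncio.run(
    ParagraphParser().parse(SketchDocument("txt", "First.\n\nSecond."))
)
```

Calling `validate_document_type` at the top of `parse` keeps the supported-type check in one place, so each subclass only has to declare its `supported_document_types`.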
ragbits.document_search.ingestion.parsers.base.TextDocumentParser
Bases: DocumentParser
Simple parser that maps a text to the text element.
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
validate_document_type
classmethod
validate_document_type(document_type: DocumentType) -> None
Check if the parser supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type to validate against the parser. |
RAISES | DESCRIPTION |
---|---|
ParserDocumentNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/base.py
parse
async
Parse the document.
PARAMETER | DESCRIPTION |
---|---|
document | The document to parse. |
RETURNS | DESCRIPTION |
---|---|
list[Element] | A list containing a text element with the text content. |
RAISES | DESCRIPTION |
---|---|
ParserDocumentNotSupportedError | If the document type is not supported by the parser. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/base.py
ragbits.document_search.ingestion.parsers.base.ImageDocumentParser
Bases: DocumentParser
Simple parser that maps an image to the image element.
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
validate_document_type
classmethod
validate_document_type(document_type: DocumentType) -> None
Check if the parser supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type to validate against the parser. |
RAISES | DESCRIPTION |
---|---|
ParserDocumentNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/base.py
parse
async
Parse the document.
PARAMETER | DESCRIPTION |
---|---|
document | The document to parse. |
RETURNS | DESCRIPTION |
---|---|
list[Element] | A list containing an image element with the image content. |
RAISES | DESCRIPTION |
---|---|
ParserDocumentNotSupportedError | If the document type is not supported by the parser. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/base.py
ragbits.document_search.ingestion.parsers.unstructured.UnstructuredDocumentParser
UnstructuredDocumentParser(partition_kwargs: dict | None = None, chunking_kwargs: dict | None = None, api_key: str | None = None, api_server: str | None = None, use_api: bool = False, ignore_images: bool = False)
Bases: DocumentParser
Parser that uses the Unstructured API or local SDK to process the documents.
Initialize the UnstructuredDocumentParser instance.
PARAMETER | DESCRIPTION |
---|---|
partition_kwargs | The additional arguments for the partitioning. Refer to the Unstructured API documentation for the available options: https://docs.unstructured.io/api-reference/api-services/api-parameters |
chunking_kwargs | The additional arguments for the chunking. |
api_key | The API key to use for the Unstructured API. If not specified, the UNSTRUCTURED_API_KEY environment variable will be used. |
api_server | The API server URL to use for the Unstructured API. If not specified, the UNSTRUCTURED_SERVER_URL environment variable will be used. |
use_api | Whether to use the Unstructured API. If False, the local version of the Unstructured library is used. |
ignore_images | If True, images will be skipped. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/unstructured.py
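The fallback described for `api_key` and `api_server` (use the explicit argument, otherwise read an environment variable) can be sketched as below. The helper `resolve_setting` is hypothetical and not part of ragbits. Raising `ValueError` when neither source provides a value is an assumption made for this sketch, and a plain dictionary stands in for the process environment so the example does not depend on real credentials.

```python
import os


def resolve_setting(explicit, env_var, environ=None):
    """Hypothetical helper: prefer the explicit argument, else the environment variable."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    value = environ.get(env_var)
    if value is None:
        # Assumption for this sketch: a missing value is treated as an error.
        raise ValueError(f"Pass a value explicitly or set {env_var}")
    return value


# A fake environment keeps the example independent of the real one.
fake_env = {
    "UNSTRUCTURED_API_KEY": "key-from-env",
    "UNSTRUCTURED_SERVER_URL": "https://example.invalid",
}
api_key = resolve_setting(None, "UNSTRUCTURED_API_KEY", fake_env)
api_server = resolve_setting(
    "https://override.invalid", "UNSTRUCTURED_SERVER_URL", fake_env
)
```

Here `api_key` comes from the fake environment because no explicit value was passed, while `api_server` keeps the explicit value even though the variable is set.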
supported_document_types
class-attribute
instance-attribute
supported_document_types = {TXT, MD, PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV, HTML, EPUB, ORG, ODT, RST, RTF, TSV, JSON, XML, JPG, PNG}
subclass_from_config
classmethod
Initializes the class with the provided configuration. May return a subclass of the class, if requested by the configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A model containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The class can't be found or is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
subclass_from_factory
classmethod
Creates the class using the provided factory function. May return a subclass of the class, if requested by the factory.
PARAMETER | DESCRIPTION |
---|---|
factory_path | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided factory function. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | The factory can't be found or the object returned is not a subclass of the current class. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
preferred_subclass
classmethod
preferred_subclass(config: CoreConfig, factory_path_override: str | None = None, yaml_path_override: Path | None = None) -> Self
Tries to create an instance by looking at the project's component preferences, either from YAML or from the factory. Takes optional overrides for both, which take precedence.
PARAMETER | DESCRIPTION |
---|---|
config | The CoreConfig instance containing preferred factory and configuration details. |
factory_path_override | A string representing the path to the factory function in the format of "module.submodule:factory_name". |
yaml_path_override | The path to the YAML file containing the Ragstack instance configuration. |
RAISES | DESCRIPTION |
---|---|
InvalidConfigError | If the default factory or configuration can't be found. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
from_config
classmethod
Initializes the class with the provided configuration.
PARAMETER | DESCRIPTION |
---|---|
config | A dictionary containing configuration details for the class. |
RETURNS | DESCRIPTION |
---|---|
Self | An instance of the class initialized with the provided configuration. |
Source code in packages/ragbits-core/src/ragbits/core/utils/config_handling.py
validate_document_type
classmethod
validate_document_type(document_type: DocumentType) -> None
Check if the parser supports the document type.
PARAMETER | DESCRIPTION |
---|---|
document_type | The document type to validate against the parser. |
RAISES | DESCRIPTION |
---|---|
ParserDocumentNotSupportedError | If the document type is not supported. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/ingestion/parsers/base.py
parse
async
Parse the document using the Unstructured API.
PARAMETER | DESCRIPTION |
---|---|
document | The document to parse. |
RETURNS | DESCRIPTION |
---|---|
list[Element] | The list of elements extracted from the document. |
RAISES | DESCRIPTION |
---|---|
ParserDocumentNotSupportedError | If the document type is not supported by the parser. |