Documents and Elements
ragbits.document_search.documents.document.Document
Bases: BaseModel
An object representing a document which is downloaded and stored locally.
from_document_meta
classmethod
from_document_meta(document_meta: DocumentMeta, local_path: Path) -> Document
Create a document from document metadata. The concrete class of the returned object depends on the document type.
PARAMETER | DESCRIPTION |
---|---|
document_meta | The document metadata. TYPE: DocumentMeta |
local_path | The local path to the document. TYPE: Path |
RETURNS | DESCRIPTION |
---|---|
Document | The document. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/document.py
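The type-based dispatch described above can be illustrated with a minimal, standalone sketch. It does not use ragbits: `DocType`, `BaseDoc`, `TextDoc`, and `make_document` are hypothetical stand-ins, and the enum members are invented because the real `DocumentType` values are not listed in this section.

```python
from enum import Enum
from pathlib import Path


class DocType(str, Enum):
    # Hypothetical members; the real DocumentType values are not shown here.
    TXT = "txt"
    PDF = "pdf"


class BaseDoc:
    """Stand-in for a generic document stored at a local path."""

    def __init__(self, doc_type: DocType, local_path: Path) -> None:
        self.doc_type = doc_type
        self.local_path = local_path


class TextDoc(BaseDoc):
    """Stand-in for a text-specific document class."""


# Types without a dedicated class fall back to BaseDoc.
_DOC_CLASSES: dict[DocType, type[BaseDoc]] = {DocType.TXT: TextDoc}


def make_document(doc_type: DocType, local_path: Path) -> BaseDoc:
    # Pick the class registered for this type, then build the object.
    cls = _DOC_CLASSES.get(doc_type, BaseDoc)
    return cls(doc_type, local_path)
```

Because the enum also derives from `str`, its members compare equal to their plain string values, which keeps them easy to serialize.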
ragbits.document_search.documents.document.DocumentType
Bases: str, Enum
Document types that can be parsed.
ragbits.document_search.documents.document.DocumentMeta
Bases: BaseModel
An object representing document metadata.
fetch
async
fetch() -> Document
Fetch the document from its source (potentially remote) and create an object to interface with it. The concrete class of the returned object depends on the document type.
RETURNS | DESCRIPTION |
---|---|
Document | The document. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/document.py
create_text_document_from_literal
classmethod
create_text_document_from_literal(content: str) -> DocumentMeta
Create a text document from literal content.
PARAMETER | DESCRIPTION |
---|---|
content | The content of the document. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
DocumentMeta | The document metadata. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/document.py
from_local_path
classmethod
from_local_path(local_path: Path) -> DocumentMeta
Create document metadata from a local path.
PARAMETER | DESCRIPTION |
---|---|
local_path | The local path to the document. TYPE: Path |
RETURNS | DESCRIPTION |
---|---|
DocumentMeta | The document metadata. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/document.py
from_source
async
classmethod
from_source(source: Source) -> DocumentMeta
Create document metadata from a source.
PARAMETER | DESCRIPTION |
---|---|
source | The source from which the document is fetched. TYPE: Source |
RETURNS | DESCRIPTION |
---|---|
DocumentMeta | The document metadata. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/document.py
ragbits.document_search.documents.element.Element
Bases: BaseModel, ABC
An object representing an element in a document.
id
property
Retrieve the ID of the element, primarily used to represent the element's data.
RETURNS | DESCRIPTION |
---|---|
str | A string representing the element. |
key
property
Get the representation of the element for embedding.
RETURNS | DESCRIPTION |
---|---|
str | None | The representation for embedding. |
text_representation
abstractmethod
property
Get the text representation of the element.
RETURNS | DESCRIPTION |
---|---|
str | None | The text representation. |
image_representation
property
Get the image representation of the element.
RETURNS | DESCRIPTION |
---|---|
bytes | None | The image representation. |
get_id_components
Create a dictionary of key-value pairs of ID components.
RETURNS | DESCRIPTION |
---|---|
dict | A dictionary of ID components. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/element.py
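How the ID components are combined into `id` is not shown in this section. As an illustration only, one common way to derive a deterministic identifier from a dictionary of components is to hash a canonical serialization. The function name `stable_id` and the component keys used with it are hypothetical.

```python
import hashlib
import json


def stable_id(components: dict[str, str]) -> str:
    # Sort keys so the same components always serialize identically,
    # regardless of insertion order.
    payload = json.dumps(components, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this approach, two elements with identical components get the same ID, and any change to a component value yields a different one.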
from_vector_db_entry
classmethod
from_vector_db_entry(db_entry: VectorStoreEntry) -> Element
Create an element from a vector database entry.
PARAMETER | DESCRIPTION |
---|---|
db_entry | The vector database entry. TYPE: VectorStoreEntry |
RETURNS | DESCRIPTION |
---|---|
Element | The element. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/element.py
to_vector_db_entry
to_vector_db_entry() -> VectorStoreEntry
Create a vector database entry from the element.
RETURNS | DESCRIPTION |
---|---|
VectorStoreEntry | The vector database entry. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/element.py
ragbits.document_search.documents.sources.Source
Bases: BaseModel, ABC
An object representing a source.
class_identifier
classmethod
source_type
fetch
abstractmethod
async
from_uri
abstractmethod
async
classmethod
from_uri(path: str) -> Sequence[Source]
Create Source instances from a URI path.
The path can contain glob patterns (asterisks) to match multiple sources, but pattern support varies by source type. Each source implementation defines which patterns it supports:
- LocalFileSource: Supports full glob patterns ('**', '*', etc.) via Path.glob
- GCSSource: Supports simple prefix matching with '*' at the end of the path
- HuggingFaceSource: Does not support glob patterns
PARAMETER | DESCRIPTION |
---|---|
path | The path part of the URI (after protocol://). Pattern support depends on source type. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[Source] | A sequence of Source objects matching the path pattern. |
RAISES | DESCRIPTION |
---|---|
ValueError | If the path contains a pattern that this source type does not support. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/base.py
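Since `from_uri` receives only the part of the URI after `protocol://`, a caller has to split the full URI first. The helper below is a minimal sketch of that split. `split_uri` is a hypothetical name, not a ragbits function, and how ragbits itself routes a protocol to a source class is not covered here.

```python
def split_uri(uri: str) -> tuple[str, str]:
    """Split '<protocol>://<path>' into (protocol, path)."""
    # partition() splits at the first '://' only, so nested URLs such as
    # 'git://https://...' keep their inner scheme in the path part.
    protocol, sep, path = uri.partition("://")
    if not sep or not protocol or not path:
        raise ValueError(f"Expected '<protocol>://<path>', got {uri!r}")
    return protocol, path
```

The returned path is what a source-specific `from_uri` would then interpret, including any glob pattern it supports.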
ragbits.document_search.documents.sources.AzureBlobStorageSource
Bases: Source
An object representing an Azure Blob Storage dataset source.
class_identifier
classmethod
source_type
fetch
async
Download the blob to a temporary local file and return the file path.
RETURNS | DESCRIPTION |
---|---|
Path | Path to the downloaded file. |
RAISES | DESCRIPTION |
---|---|
SourceNotFoundError | If the blob source is not available. |
SourceConnectionError | If the blob service connection is not available. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/azure.py
from_uri
async
classmethod
from_uri(path: str) -> Sequence[AzureBlobStorageSource]
Parse an Azure Blob Storage URI and return the matching AzureBlobStorageSource instances.
PARAMETER | DESCRIPTION |
---|---|
path | The Azure Blob Storage URI. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[AzureBlobStorageSource] | The sources parsed from the Azure Blob Storage URI. |
RAISES | DESCRIPTION |
---|---|
ValueError | If the Azure Blob Storage URI is invalid. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/azure.py
list_sources
async
classmethod
list_sources(account_name: str, container: str, blob_name: str = '') -> list[AzureBlobStorageSource]
List all sources in the given Azure container, matching the prefix.
PARAMETER | DESCRIPTION |
---|---|
account_name | The Azure storage account name. TYPE: str |
container | The Azure container name. TYPE: str |
blob_name | The prefix to match. TYPE: str DEFAULT: '' |
RETURNS | DESCRIPTION |
---|---|
list[AzureBlobStorageSource] | List of source objects. |
RAISES | DESCRIPTION |
---|---|
ImportError | If the required 'azure-storage-blob' package is not installed. |
SourceConnectionError | If there's an error connecting to Azure. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/azure.py
ragbits.document_search.documents.sources.GCSSource
Bases: Source
An object representing a GCS file source.
id
property
Get unique identifier of the object in the source.
RETURNS | DESCRIPTION |
---|---|
str | Unique identifier. |
class_identifier
classmethod
source_type
set_storage
classmethod
Set the storage client for all instances.
PARAMETER | DESCRIPTION |
---|---|
storage |
The
TYPE:
|
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/gcs.py
fetch
async
Fetch the file from Google Cloud Storage and store it locally.
The file is downloaded to a local directory specified by local_dir. If the file already exists locally, it will not be downloaded again. If the file doesn't exist locally, it will be fetched from GCS.
The local directory is determined by the environment variable LOCAL_STORAGE_DIR. If this environment variable is not set, a temporary directory is used.
RETURNS | DESCRIPTION |
---|---|
Path | The local path to the downloaded file. |
RAISES | DESCRIPTION |
---|---|
ImportError | If the 'gcp' extra is not installed. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/gcs.py
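The caching behavior described for `fetch` (resolve a local directory from `LOCAL_STORAGE_DIR`, fall back to a temporary directory, and skip the download when the file already exists) can be sketched without any GCS dependency. Everything below is an illustration under assumptions: `resolve_local_dir`, `fetch_cached`, the `download` callable, and the `ragbits_sketch` subdirectory name are hypothetical and do not reflect the library's internals.

```python
import os
import tempfile
from collections.abc import Callable
from pathlib import Path


def resolve_local_dir() -> Path:
    # LOCAL_STORAGE_DIR is the variable named above; if it is unset,
    # fall back to a directory under the system temp location.
    configured = os.environ.get("LOCAL_STORAGE_DIR")
    if configured:
        return Path(configured)
    return Path(tempfile.gettempdir()) / "ragbits_sketch"  # hypothetical name


def fetch_cached(object_name: str, download: Callable[[], bytes], local_dir: Path) -> Path:
    # Reuse the local copy when present; otherwise download and store it.
    target = local_dir / object_name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download())
    return target
```

Passing the download step in as a callable keeps the sketch testable: a second call for the same object never invokes it.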
list_sources
async
classmethod
list_sources(bucket: str, prefix: str = '') -> list[GCSSource]
List all sources in the given GCS bucket, matching the prefix.
PARAMETER | DESCRIPTION |
---|---|
bucket | The GCS bucket. TYPE: str |
prefix | The prefix to match. TYPE: str DEFAULT: '' |
RETURNS | DESCRIPTION |
---|---|
list[GCSSource] | List of source objects. |
RAISES | DESCRIPTION |
---|---|
ImportError | If the required 'gcloud-aio-storage' package is not installed. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/gcs.py
from_uri
async
classmethod
from_uri(path: str) -> Sequence[GCSSource]
Create GCSSource instances from a URI path.
Supports simple prefix matching with '*' at the end of the path. For example:
- "bucket/folder/*" - matches all files in the folder
- "bucket/folder/prefix*" - matches all files starting with prefix
More complex patterns like '**' or '?' are not supported.
PARAMETER | DESCRIPTION |
---|---|
path | The path part of the URI (after gcs://). Can end with '*' for pattern matching. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[GCSSource] | A sequence of GCSSource objects matching the pattern. |
RAISES | DESCRIPTION |
---|---|
ValueError | If an unsupported pattern is used. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/gcs.py
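The trailing-asterisk rule above can be sketched as a small validator that turns a path into a bucket and a prefix. This is an assumption-based illustration, not the library's parser: `parse_prefix_pattern` and `match_names` are hypothetical names, and treating a path without `*` as a plain prefix is a simplification made here.

```python
def parse_prefix_pattern(path: str) -> tuple[str, str]:
    """Split 'bucket/prefix*' into (bucket, prefix), rejecting other globs."""
    if "**" in path or "?" in path:
        raise ValueError("Only a single trailing '*' is supported")
    if "*" in path[:-1]:
        raise ValueError("'*' is only supported at the end of the path")
    # Drop the trailing '*', then split off the bucket name.
    bucket, _, prefix = path.rstrip("*").partition("/")
    if not bucket:
        raise ValueError("Missing bucket name")
    return bucket, prefix


def match_names(names: list[str], prefix: str) -> list[str]:
    # Prefix matching: keep object names that start with the prefix.
    return [name for name in names if name.startswith(prefix)]
```

For example, `parse_prefix_pattern("bucket/folder/prefix*")` yields the bucket `bucket` and the prefix `folder/prefix`, which can then be handed to a listing call such as `list_sources(bucket, prefix)`.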
ragbits.document_search.documents.sources.GitSource
Bases: Source
An object representing a file in a Git repository.
id
property
Get the source ID, which is a unique identifier of the object.
RETURNS | DESCRIPTION |
---|---|
str | The source ID. |
class_identifier
classmethod
source_type
fetch
async
Clone the Git repository and return the path to the specific file.
RETURNS | DESCRIPTION |
---|---|
Path | The local path to the specific file in the cloned repository. |
RAISES | DESCRIPTION |
---|---|
SourceNotFoundError | If the repository cannot be cloned or the file doesn't exist. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/git.py
list_sources
async
classmethod
list_sources(repo_url: str, file_pattern: str = '**/*', branch: str | None = None) -> list[GitSource]
List all files in the repository matching the pattern.
PARAMETER | DESCRIPTION |
---|---|
repo_url | URL of the git repository. TYPE: str |
file_pattern | The glob pattern to match files. TYPE: str DEFAULT: '**/*' |
branch | Optional branch name. TYPE: str | None DEFAULT: None |
RETURNS | DESCRIPTION |
---|---|
list[GitSource] | List of GitSource objects, one for each file matching the pattern. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/git.py
from_uri
async
classmethod
from_uri(uri: str) -> Sequence[GitSource]
Create GitSource instances from a URI path.
Supported URI formats:
- git://https://github.com/username/repo.git:path/to/file.txt
- git://https://github.com/username/repo.git:branch:path/to/file.txt
- git@github.com:username/repo.git:path/to/file.txt
- git@github.com:username/repo.git:branch:path/to/file.txt
PARAMETER | DESCRIPTION |
---|---|
uri | The URI path in the format described above. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[GitSource] | A sequence containing a GitSource instance. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/git.py
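To make the four URI shapes above concrete, here is a rough parsing sketch. It is not the ragbits implementation: `GitRef` and `parse_git_uri` are hypothetical names, and the sketch assumes the repository URL always ends in `.git`, as it does in every listed format.

```python
from typing import NamedTuple, Optional


class GitRef(NamedTuple):
    repo_url: str
    branch: Optional[str]
    file_path: str


def parse_git_uri(uri: str) -> GitRef:
    # The HTTPS forms carry a leading 'git://'; the SSH forms do not.
    if uri.startswith("git://"):
        uri = uri[len("git://"):]
    # Assumption: '.git:' marks the end of the repository URL.
    repo, sep, rest = uri.partition(".git:")
    if not sep or not rest:
        raise ValueError(f"Unsupported Git URI: {uri!r}")
    # What remains is either 'path' or 'branch:path'.
    head, colon, tail = rest.partition(":")
    if colon:
        return GitRef(repo + ".git", head, tail)
    return GitRef(repo + ".git", None, rest)
```

A URI without a branch segment leaves `branch` as `None`, mirroring the optional `branch` argument of `list_sources`.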
ragbits.document_search.documents.sources.HuggingFaceSource
Bases: Source
An object representing a Hugging Face dataset source.
id
property
Get unique identifier of the object in the source.
RETURNS | DESCRIPTION |
---|---|
str | Unique identifier. |
class_identifier
classmethod
source_type
fetch
async
Fetch the file from Hugging Face and store it locally.
RETURNS | DESCRIPTION |
---|---|
Path | The local path to the downloaded file. |
RAISES | DESCRIPTION |
---|---|
ImportError | If the 'huggingface' extra is not installed. |
SourceConnectionError | If the source connection fails. |
SourceNotFoundError | If the source document is not found. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/hf.py
from_uri
async
classmethod
from_uri(path: str) -> Sequence[HuggingFaceSource]
Create HuggingFaceSource instances from a URI path.
Pattern matching is not supported. The path must be in the format: huggingface://dataset_path/split/row
PARAMETER | DESCRIPTION |
---|---|
path | The path part of the URI (after huggingface://). TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[HuggingFaceSource] | A sequence containing a single HuggingFaceSource. |
RAISES | DESCRIPTION |
---|---|
ValueError | If the path contains patterns or has an invalid format. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/hf.py
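The `dataset_path/split/row` format can be illustrated with a short parsing sketch. Two details are assumptions made for the example rather than facts from this section: that `row` is a non-negative integer index, and that `dataset_path` may itself contain slashes (so the split is taken from the right). `parse_hf_path` is a hypothetical name.

```python
def parse_hf_path(path: str) -> tuple[str, str, int]:
    """Parse 'dataset_path/split/row' into its three parts."""
    if "*" in path or "?" in path:
        raise ValueError("Glob patterns are not supported")
    # Split from the right so a dataset path like 'org/name' stays intact.
    parts = path.rsplit("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError("Expected 'dataset_path/split/row'")
    dataset_path, split, row = parts
    if not row.isdecimal():
        raise ValueError("Row must be a non-negative integer")
    return dataset_path, split, int(row)
```

Rejecting glob characters up front matches the documented behavior of raising `ValueError` when the path contains patterns.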
list_sources
async
classmethod
list_sources(path: str, split: str) -> list[HuggingFaceSource]
List all sources in the given Hugging Face repository.
PARAMETER | DESCRIPTION |
---|---|
path | Path or name of the dataset. TYPE: str |
split | Dataset split. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
list[HuggingFaceSource] | List of source objects. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/hf.py
ragbits.document_search.documents.sources.LocalFileSource
Bases: Source
An object representing a local file source.
id
property
Get unique identifier of the object in the source.
RETURNS | DESCRIPTION |
---|---|
str | Unique identifier. |
class_identifier
classmethod
source_type
fetch
async
Fetch the source.
RETURNS | DESCRIPTION |
---|---|
Path | The local path to the object fetched from the source. |
RAISES | DESCRIPTION |
---|---|
SourceNotFoundError | If the source document is not found. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/local.py
list_sources
classmethod
list_sources(path: Path, file_pattern: str = '*') -> list[LocalFileSource]
List all sources in the given directory, matching the file pattern.
PARAMETER | DESCRIPTION |
---|---|
path | The path to the directory. TYPE: Path |
file_pattern | The file pattern to match. TYPE: str DEFAULT: '*' |
RETURNS | DESCRIPTION |
---|---|
list[LocalFileSource] | List of source objects. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/local.py
from_uri
async
classmethod
from_uri(path: str) -> Sequence[LocalFileSource]
Create LocalFileSource instances from a URI path.
Supports full glob patterns via Path.glob:
- "**/*.txt" - all .txt files in any subdirectory
- "*.py" - all Python files in the current directory
- "**/*" - all files in any subdirectory
- '?' matches exactly one character
PARAMETER | DESCRIPTION |
---|---|
path | The path part of the URI (after file://). Pattern support depends on source type. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[LocalFileSource] | A sequence of LocalFileSource objects. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/local.py
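The glob patterns listed above behave as they do in the standard library's `Path.glob`, so they can be tried out without ragbits. The sketch below builds a throwaway directory tree and runs three of the patterns against it; `list_matching` and the file names are made up for the example.

```python
import tempfile
from pathlib import Path


def list_matching(root: Path, pattern: str) -> list[str]:
    # Return matching files as sorted POSIX-style paths relative to root.
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.py", "b.txt", "sub/c.txt", "sub/d1.txt"):
        (root / name).write_text("x")

    top_level_py = list_matching(root, "*.py")     # only files directly in root
    all_txt = list_matching(root, "**/*.txt")      # root and every subdirectory
    one_char = list_matching(root, "sub/?.txt")    # '?' stands for one character
```

Note that `**` also matches zero directories, so `**/*.txt` includes `b.txt` in the root as well as the files under `sub`.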
ragbits.document_search.documents.sources.S3Source
Bases: Source
An object representing an AWS S3 Storage dataset source.
class_identifier
classmethod
source_type
fetch
async
Download a file in the given bucket_name with the given key.
RETURNS | DESCRIPTION |
---|---|
Path | The local path to the downloaded file. |
RAISES | DESCRIPTION |
---|---|
ClientError | If the file doesn't exist or credentials are incomplete. |
NoCredentialsError | If no credentials are available. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/s3.py
list_sources
async
classmethod
list_sources(bucket_name: str, prefix: str) -> Sequence[S3Source]
List all files under the given bucket name and with the given prefix.
PARAMETER | DESCRIPTION |
---|---|
bucket_name | The name of the S3 bucket to use. TYPE: str |
prefix | The path to the files and prefix to look for. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[S3Source] | The sequence of AWS S3 sources. |
RAISES | DESCRIPTION |
---|---|
ClientError | If the source doesn't exist. |
NoCredentialsError | If no credentials are available. |
PartialCredentialsError | If credentials are incomplete. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/s3.py
from_uri
async
classmethod
from_uri(path: str) -> Sequence[S3Source]
Create S3Source instances from a URI path.
The supported path formats are:
s3://
PARAMETER | DESCRIPTION |
---|---|
path | The URI path. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[S3Source] | A sequence of S3Source instances. |
RAISES | DESCRIPTION |
---|---|
ValueError | If the path has an invalid format. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/s3.py
ragbits.document_search.documents.sources.WebSource
Bases: Source
An object representing a Web dataset source.
class_identifier
classmethod
source_type
fetch
async
Download the file available at the given URL.
RETURNS | DESCRIPTION |
---|---|
Path | The local path to the downloaded file. |
RAISES | DESCRIPTION |
---|---|
WebDownloadError | If the download failed. |
SourceNotFoundError | If the URL is invalid. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/web.py
list_sources
async
classmethod
list_sources(url: str) -> Sequence[WebSource]
List the file under the given URL.
PARAMETER | DESCRIPTION |
---|---|
url | The URL to the file. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[WebSource] | A sequence containing the web source. |
Source code in packages/ragbits-document-search/src/ragbits/document_search/documents/sources/web.py
from_uri
async
classmethod
from_uri(uri: str) -> Sequence[WebSource]
Create WebSource instances from a URI path.
The supported uri format is:
PARAMETER | DESCRIPTION |
---|---|
uri | The URI path. Needs to include the protocol. TYPE: str |
RETURNS | DESCRIPTION |
---|---|
Sequence[WebSource] | A sequence containing a WebSource instance. |