
How-To: Load dataset with sources

Ragbits provides an abstraction for handling datasets. The Source component defines how Ragbits interacts with a data source, for example how to download or query its contents.

Supported sources

Ragbits currently supports the following sources.

Source                URI Schema                                                                       Class
Azure Blob Storage    azure://https://account_name.blob.core.windows.net/<container-name>|<blob-name>  AzureBlobStorageSource
Google Cloud Storage  gcs://<bucket-name>/<prefix>                                                     GCSSource
Git                   git://<https-url>|<ssh-url>                                                      GitSource
Hugging Face          hf://<dataset-path>/<split>/<row>                                                HuggingFaceSource
Local file            file://<file-path>|<blob-pattern>                                                LocalFileSource
Amazon S3             s3://<bucket-name>/<prefix>                                                      S3Source
Web                   web://<https-url>                                                                WebSource
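
Every URI pattern in the table has the shape `<protocol>://<path>`. As a rough illustration of how such a string maps to a source class, the sketch below splits a URI at its first `://` and looks the protocol up in a dictionary copied from the table. The names `SUPPORTED_SOURCES`, `split_uri`, and `source_class_name` are hypothetical helpers written for this example; this is not Ragbits' own resolver.

```python
# Illustrative only: a hand-written lookup based on the table above,
# not the resolver that Ragbits itself uses.
SUPPORTED_SOURCES = {
    "azure": "AzureBlobStorageSource",
    "gcs": "GCSSource",
    "git": "GitSource",
    "hf": "HuggingFaceSource",
    "file": "LocalFileSource",
    "s3": "S3Source",
    "web": "WebSource",
}


def split_uri(uri: str) -> tuple[str, str]:
    """Split '<protocol>://<path>' at the first '://' only."""
    protocol, separator, path = uri.partition("://")
    if not separator or not protocol or not path:
        raise ValueError(f"Expected '<protocol>://<path>', got: {uri!r}")
    return protocol, path


def source_class_name(uri: str) -> str:
    """Return the class name listed in the table for the URI's protocol."""
    protocol, _ = split_uri(uri)
    try:
        return SUPPORTED_SOURCES[protocol]
    except KeyError:
        raise ValueError(f"Unsupported protocol: {protocol!r}") from None
```

Splitting only at the first `://` matters for the Azure pattern, whose path part is itself an `https://` URL.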

Custom source

To define a new source, extend the Source class.

from pathlib import Path
from typing import ClassVar

from typing_extensions import Self

from ragbits.core.sources import Source


class CustomSource(Source):
    """
    Source that downloads a file from the web.
    """

    protocol: ClassVar[str] = "custom"
    source_url: str
    ...

    @property
    def id(self) -> str:
        """
        Source unique identifier.
        """
        return f"{self.protocol}:{self.source_url}"

    @classmethod
    async def from_uri(cls, uri: str) -> list[Self]:
        """
        Create source instances from a URI path.

        Args:
            uri: The URI path.

        Returns:
            The list of sources.
        """
        return [cls(...), ...]

    async def fetch(self) -> Path:
        """
        Download the file at the given URL.

        Returns:
            The local path to the downloaded file.
        """
        ...
        return Path(f"/tmp/{self.source_url}")
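
To make the shape of this interface easier to try out, here is a minimal, self-contained sketch. `DemoSource` is a hypothetical stand-in that mirrors the three members above (`id`, `from_uri`, `fetch`) but deliberately does not inherit from the Ragbits `Source` class, so it runs without the library installed. It assumes one URI maps to exactly one source and that `fetch` may write a placeholder file to a temporary directory instead of downloading anything. `load_all` is likewise a hypothetical helper, not part of Ragbits.

```python
import asyncio
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import ClassVar


@dataclass
class DemoSource:
    """Hypothetical stand-in mirroring the CustomSource interface (not a Ragbits class)."""

    source_url: str
    protocol: ClassVar[str] = "custom"

    @property
    def id(self) -> str:
        # Same identifier format as in the CustomSource example.
        return f"{self.protocol}:{self.source_url}"

    @classmethod
    async def from_uri(cls, uri: str) -> list["DemoSource"]:
        # Assumption for this sketch: one URI yields exactly one source.
        return [cls(source_url=uri)]

    async def fetch(self) -> Path:
        # Placeholder for a real download: write a small file to a temp directory.
        target_dir = Path(tempfile.mkdtemp(prefix="demo-source-"))
        file_name = Path(self.source_url).name or "download"
        target = target_dir / file_name
        target.write_text(f"contents of {self.source_url}\n", encoding="utf-8")
        return target


async def load_all(uri: str) -> dict[str, Path]:
    """Create sources from a URI, fetch them concurrently, and map id -> local path."""
    sources = await DemoSource.from_uri(uri)
    paths = await asyncio.gather(*(source.fetch() for source in sources))
    return {source.id: path for source, path in zip(sources, paths)}


if __name__ == "__main__":
    for source_id, path in asyncio.run(load_all("reports/summary.txt")).items():
        print(source_id, "->", path)
```

Because `from_uri` returns a list, a single URI can expand into several sources (for example, a pattern that matches many files), and each of them can then be fetched independently.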