How-To: Load dataset from sources

Ragbits provides an abstraction for handling datasets. The Source component is designed to define interactions with any data source, such as downloading and querying.

Supported sources

Ragbits currently supports the following sources.

Source                URI Schema                                                                         Class
Azure Blob Storage    azure://https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>  AzureBlobStorageSource
Google Cloud Storage  gcs://<bucket-name>/<prefix>                                                       GCSSource
Git                   git://<https-url>|<ssh-url>                                                        GitSource
Hugging Face          hf://<dataset-path>/<split>/<row>                                                  HuggingFaceSource
Local file            local://<file-path>|<blob-pattern>                                                 LocalFileSource
Amazon S3             s3://<bucket-name>/<prefix>                                                        S3Source
Web                   web://<https-url>                                                                  WebSource
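
To make the URI layout above concrete, here is a minimal sketch of how the scheme part of a URI could be separated from the rest and matched to a class name. The lookup table and the helper names `split_source_uri` and `describe` are hypothetical and only illustrate the table; this is not how Ragbits itself resolves sources.

```python
# Illustrative only: a hypothetical lookup table built from the schemes listed above.
# It is not Ragbits' own registry or resolution logic.
SCHEME_TO_CLASS_NAME = {
    "azure": "AzureBlobStorageSource",
    "gcs": "GCSSource",
    "git": "GitSource",
    "hf": "HuggingFaceSource",
    "local": "LocalFileSource",
    "s3": "S3Source",
    "web": "WebSource",
}


def split_source_uri(uri: str) -> tuple[str, str]:
    """Split '<scheme>://<rest>' into (scheme, rest) and validate the scheme."""
    # Only the first '://' is used, so nested URLs such as
    # 'web://https://example.com' keep their inner 'https://' intact.
    scheme, separator, rest = uri.partition("://")
    if not separator or scheme not in SCHEME_TO_CLASS_NAME:
        raise ValueError(f"Unsupported source URI: {uri!r}")
    return scheme, rest


def describe(uri: str) -> str:
    """Return a short human-readable description of which class a URI maps to."""
    scheme, rest = split_source_uri(uri)
    return f"{SCHEME_TO_CLASS_NAME[scheme]} <- {rest}"


print(describe("s3://my-bucket/data/"))
print(describe("web://https://example.com/a.txt"))
```

Splitting on the first `://` only is what allows the Azure, Git, and Web forms to embed a full HTTPS URL after the scheme.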

Custom source

To define a new source, extend the Source class.

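from collections.abc import Iterable
from pathlib import Path
from typing import ClassVar

from typing_extensions import Self

from ragbits.core.sources import Source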


class CustomSource(Source):
    """
    Source that downloads a file from the web.
    """

    protocol: ClassVar[str] = "custom"
    source_url: str
    ...

    @property
    def id(self) -> str:
        """
        Get the source identifier.
        """
        return f"{self.protocol}:{self.source_url}"

    async def fetch(self) -> Path:
        """
        Download a file for the given url.

        Returns:
            The local path to the downloaded file.
        """
        ...
        return Path(f"/tmp/{self.source_url}")

    @classmethod
    async def list_sources(cls, source_url: str) -> Iterable[Self]:
        """
        List all sources from the given storage.

        Args:
            source_url: The source url to list sources from.

        Returns:
            The iterable of Source objects.
        """
        ...
        return [cls(...), ...]

    @classmethod
    async def from_uri(cls, uri: str) -> Iterable[Self]:
        """
        Create source instances from a URI path.

        Args:
            uri: The URI path.

        Returns:
            The iterable of Source objects matching the path pattern.
        """
        ...
        return await cls.list_sources(...)
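
The skeleton above leaves the method bodies elided. As a rough illustration of how they might be filled in, the sketch below implements the same four members for files in a local directory. It does not import Ragbits: `SourceStandIn` is a hypothetical stand-in that only mirrors the members shown above, and `DirectorySource`, its `dir` protocol, and the `demo` helper are made-up names. It also assumes that the URI passed to `from_uri` is a plain directory path.

```python
import asyncio
import tempfile
from abc import ABC, abstractmethod
from collections.abc import Iterable
from dataclasses import dataclass
from pathlib import Path
from typing import ClassVar


class SourceStandIn(ABC):
    """Hypothetical stand-in mirroring the members shown above; not the Ragbits class."""

    protocol: ClassVar[str]

    @property
    @abstractmethod
    def id(self) -> str: ...

    @abstractmethod
    async def fetch(self) -> Path: ...


@dataclass(frozen=True)
class DirectorySource(SourceStandIn):
    """Treats every file in a local directory as one source (illustrative only)."""

    protocol: ClassVar[str] = "dir"
    source_url: str

    @property
    def id(self) -> str:
        return f"{self.protocol}:{self.source_url}"

    async def fetch(self) -> Path:
        # Nothing to download here: the file is already local, so just validate it.
        path = Path(self.source_url)
        if not path.is_file():
            raise FileNotFoundError(path)
        return path

    @classmethod
    async def list_sources(cls, source_url: str) -> Iterable["DirectorySource"]:
        # One source per regular file, sorted for a deterministic order.
        root = Path(source_url)
        return [cls(str(p)) for p in sorted(root.iterdir()) if p.is_file()]

    @classmethod
    async def from_uri(cls, uri: str) -> Iterable["DirectorySource"]:
        # Note `cls`, not `self`: a classmethod has no instance to refer to.
        return await cls.list_sources(uri)


async def demo() -> list[str]:
    """Create two files, list them as sources, and read each fetched file."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "a.txt").write_text("alpha")
        (Path(tmp) / "b.txt").write_text("beta")
        sources = await DirectorySource.from_uri(tmp)
        return [(await source.fetch()).read_text() for source in sources]


print(asyncio.run(demo()))  # prints ['alpha', 'beta']
```

A real custom source would subclass the Ragbits Source class instead of the stand-in and would typically download remote content in `fetch` rather than return an existing local path.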

Hint

To use a custom source via the CLI, make sure that the custom source class is registered in pyproject.toml. You can find information on how to do this here.