How-To: Setup and Query Google Drive Source#

This guide shows how to set up Google Drive as a source in Ragbits and use it to download individual files and entire folders.

Prerequisite Setup#

1. Enable Google Drive API#

First, you need to enable the Google Drive API in your Google Cloud project:

  1. Go to the Google Cloud Console
  2. Select your project (or create a new one)
  3. Navigate to APIs & Services > Library
  4. Search for "Google Drive API"
  5. Click on "Google Drive API" and click Enable

2. Create a Service Account#

To authenticate with Google Drive programmatically, you'll need a service account:

  1. In Google Cloud Console, go to IAM & Admin > Service Accounts
  2. Click Create Service Account
  3. Enter a name (e.g., "ragbits-google-drive")
  4. Add a description (optional)
  5. Click Create and Continue
  6. Skip role assignment for now (click Continue)
  7. Click Done

3. Generate Service Account Key#

Now you need to create and download the JSON credentials file:

  1. In the Service Accounts list, click on your newly created service account
  2. Go to the Keys tab
  3. Click Add Key > Create new key
  4. Select JSON format
  5. Click Create
  6. The JSON file will be downloaded automatically
  7. Save this file securely (e.g., as service-account-key.json)

Security Note

Keep your service account key file secure and never commit it to version control. Consider using environment variables or secure secret management.

4. Grant Access to Google Drive Files/Folders#

Since the service account is not a regular user, you need to share the Google Drive files or folders with the service account:

  1. Open the JSON key file and copy the client_email value (it looks like your-service@project.iam.gserviceaccount.com)
  2. In Google Drive, right-click on the file or folder you want to access
  3. Click Share
  4. Paste the service account email and set permissions (Viewer is sufficient for reading)
  5. Click Send
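
To double-check which address to share with, you can read it straight from the key file. A minimal sketch (the file name is just an example):

import json

# Load the service account key and print the email address to share files/folders with
with open("service-account-key.json") as f:
    key_info = json.load(f)

print(key_info["client_email"])  # e.g. your-service@project.iam.gserviceaccount.com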

Basic Usage#

Setting Up Credentials#

from ragbits.core.sources.google_drive import GoogleDriveSource

# Set the path to your service account key file
GoogleDriveSource.set_credentials_file_path("path/to/service-account-key.json")

Example: Download Files from Google Drive#

import asyncio
from ragbits.core.sources.google_drive import GoogleDriveSource

async def download_google_drive_files():
    # Set credentials file path
    GoogleDriveSource.set_credentials_file_path("service-account-key.json")

    # Example 1: Download a single file by ID
    file_id = "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms"  # Example Google Sheets ID
    sources = await GoogleDriveSource.from_uri(file_id)

    for source in sources:
        if not source.is_folder:
            local_path = await source.fetch()
            print(f"Downloaded: {source.file_name} to {local_path}")

    # Example 2: Download all files from a folder (non-recursive)
    folder_id = "your-folder-id-here"  # Replace with your own folder ID
    sources = await GoogleDriveSource.from_uri(f"{folder_id}/*")

    for source in sources:
        if not source.is_folder:
            local_path = await source.fetch()
            print(f"Downloaded: {source.file_name} to {local_path}")

    # Example 3: Download all files recursively from a folder
    sources = await GoogleDriveSource.from_uri(f"{folder_id}/**")

    for source in sources:
        if not source.is_folder:
            try:
                local_path = await source.fetch()
                print(f"Downloaded: {source.file_name} to {local_path}")
            except Exception as e:
                print(f"Failed to download {source.file_name}: {e}")

# Run the example
asyncio.run(download_google_drive_files())

URI Patterns#

The Google Drive source supports several URI patterns:

| Pattern | Description | Example |
|---------|-------------|---------|
| <file_id> | Single file or folder by ID | 1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms |
| <folder_id>/* | All files directly in the folder | 1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/* |
| <folder_id>/<prefix>* | Files in the folder whose names start with the prefix | 1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/report* |
| <folder_id>/** | All files in the folder, recursively | 1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/** |
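
The prefix pattern is not covered by the examples above; here is a minimal sketch of how it might be used (the folder ID and prefix are placeholders):

import asyncio
from ragbits.core.sources.google_drive import GoogleDriveSource

async def download_reports():
    GoogleDriveSource.set_credentials_file_path("service-account-key.json")

    # Fetch only files in the folder whose names start with "report"
    sources = await GoogleDriveSource.from_uri("your-folder-id-here/report*")
    for source in sources:
        if not source.is_folder:
            local_path = await source.fetch()
            print(f"Downloaded: {source.file_name} to {local_path}")

asyncio.run(download_reports())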

Environment Variables#

You can also set up credentials using environment variables:

# Set the service account key as JSON string
export GOOGLE_DRIVE_CLIENTID_JSON='{"type": "service_account", "project_id": "...", ...}'

# Or set the path to the key file
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
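
If you prefer to configure this from Python instead of the shell, a minimal sketch (assuming the variables are read when the Google Drive source is first used):

import os
from pathlib import Path

# Option 1: point the standard Google credentials variable at the key file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

# Option 2: pass the key contents directly as a JSON string
os.environ["GOOGLE_DRIVE_CLIENTID_JSON"] = Path("/path/to/service-account-key.json").read_text()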

Advanced Example: Processing Documents#

import asyncio
from ragbits.core.sources.google_drive import GoogleDriveSource

async def process_drive_documents():
    """Example of processing documents from Google Drive."""

    # Set up credentials
    GoogleDriveSource.set_credentials_file_path("service-account-key.json")

    # Define the folder containing documents
    documents_folder_id = "your-folder-id-here"

    try:
        # Get all files from the folder recursively
        sources = await GoogleDriveSource.from_uri(f"{documents_folder_id}/**")

        processed_count = 0
        skipped_count = 0

        for source in sources:
            if source.is_folder:
                print(f"Skipping folder: {source.file_name}")
                continue

            # Filter by file type (example: only process text and document files)
            if source.mime_type in [
                'text/plain',
                'application/pdf',
                'application/vnd.google-apps.document',
                'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
            ]:
                try:
                    local_path = await source.fetch()
                    print(f" Processed: {source.file_name} (Type: {source.mime_type})")

                    # Here you could add your document processing logic
                    # For example: extract text, analyze content, etc.

                    processed_count += 1
                except Exception as e:
                    print(f" Failed to process {source.file_name}: {e}")
            else:
                print(f"  Skipped: {source.file_name} (Type: {source.mime_type})")
                skipped_count += 1

        print(f"\n Summary:")
        print(f"   Processed: {processed_count} files")
        print(f"   Skipped: {skipped_count} files")

    except Exception as e:
        print(f"Error accessing Google Drive: {e}")

# Run the example
asyncio.run(process_drive_documents())

Troubleshooting#

Common Issues#

  1. "Service account info was not in the expected format"

    • Make sure you're using a service account key file, not OAuth2 client credentials
    • Verify the JSON file contains required fields: client_email, private_key, token_uri
  2. "File not found" or "Permission denied"

    • Ensure the file/folder is shared with your service account email
    • Check that the file ID is correct
    • Verify the service account has at least "Viewer" permissions
  3. "Google Drive API not enabled"

    • Enable the Google Drive API in Google Cloud Console
    • Wait a few minutes for the API to be fully activated
  4. "Quota exceeded"

    • Google Drive API has usage limits
    • Implement rate limiting in your code
    • Consider upgrading your Google Cloud quotas if needed
  5. "Export size limit exceeded"

    • Google Workspace files (Docs, Sheets, etc.) have a 9MB export limit
    • Large Google Workspace files may fail to download
    • Consider splitting large documents or using alternative export methods
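
For quota errors, a minimal retry sketch with exponential backoff around fetch(); the attempt count and delays are arbitrary, and in practice you may want to check the error for a rate-limit status before retrying:

import asyncio
from ragbits.core.sources.google_drive import GoogleDriveSource

async def fetch_with_backoff(source: GoogleDriveSource, attempts: int = 5):
    """Retry source.fetch() with exponential backoff on transient failures."""
    delay = 1.0
    for attempt in range(attempts):
        try:
            return await source.fetch()
        except Exception as e:  # ideally narrow this to quota/rate-limit errors
            if attempt == attempts - 1:
                raise
            print(f"Fetch failed ({e}), retrying in {delay:.0f}s...")
            await asyncio.sleep(delay)
            delay *= 2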

Getting File/Folder IDs#

You can find Google Drive file or folder IDs in several ways:

  1. From the URL: When viewing a file in Google Drive, the ID is in the URL:

    https://drive.google.com/file/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/view
                                    ^--- This is the file ID ---^
    

  2. Right-click method: Right-click → "Get link" → Extract ID from the shareable link

  3. Programmatically: Use the Google Drive API to search and list files
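
For the programmatic route, a minimal sketch using the official google-api-python-client (installed separately; not part of Ragbits) with the same service account key:

from google.oauth2 import service_account
from googleapiclient.discovery import build

# Authenticate with the same service account key used above
credentials = service_account.Credentials.from_service_account_file(
    "service-account-key.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
drive = build("drive", "v3", credentials=credentials)

# List files visible to the service account whose names contain "report"
response = drive.files().list(
    q="name contains 'report'",
    fields="files(id, name, mimeType)",
    pageSize=50,
).execute()

for item in response.get("files", []):
    print(f"{item['name']} ({item['mimeType']}): {item['id']}")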

Configuration Options#

Local Storage Directory#

By default, downloaded files are stored in a temporary directory. You can customize this:

import os

# Set custom download directory
os.environ["LOCAL_STORAGE_DIR"] = "/path/to/your/download/directory"

Supported File Types#

The Google Drive source automatically handles various file types:

  • Google Workspace files: Automatically exported to common formats (Docs → DOCX, Sheets → XLSX, etc.)
  • Regular files: Downloaded as-is
  • Large files: Handled with resumable downloads for reliability

File Size Limitations

Google Workspace files (Google Docs, Sheets, Slides, etc.) have a 9MB export limit when converting to standard formats (DOCX, XLSX, PPTX). Files larger than this limit may fail to download. For large documents, consider:

  • Breaking them into smaller documents
  • Using Google's native format instead of exporting
  • Accessing them directly through the Google Workspace APIs
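
If oversized exports are expected, one pragmatic pattern is to treat export failures of Google Workspace files as non-fatal and keep going. A minimal sketch using only the attributes shown earlier in this guide:

import asyncio
from ragbits.core.sources.google_drive import GoogleDriveSource

async def fetch_tolerating_export_limits(folder_id: str):
    GoogleDriveSource.set_credentials_file_path("service-account-key.json")
    sources = await GoogleDriveSource.from_uri(f"{folder_id}/**")

    for source in sources:
        if source.is_folder:
            continue
        is_workspace_file = source.mime_type.startswith("application/vnd.google-apps")
        try:
            local_path = await source.fetch()
            print(f"Downloaded: {source.file_name} to {local_path}")
        except Exception as e:
            if is_workspace_file:
                print(f"Skipping {source.file_name}; export likely exceeded the size limit: {e}")
            else:
                raise

asyncio.run(fetch_tolerating_export_limits("your-folder-id-here"))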

Best Practices#

  1. Security: Store service account keys securely and rotate them regularly
  2. Permissions: Follow the principle of least privilege and grant only the permissions you need
  3. Error Handling: Always implement proper error handling for network and API failures
  4. Rate Limiting: Respect Google Drive API quotas and implement appropriate delays
  5. Monitoring: Log operations for debugging and monitoring purposes