# Design Document: Object Storage with Bunny CDN
This is a technical design document for implementing object (e.g. audio file) storage with Bunny CDN. This directory (`api/docs`) contains other similar files, notably `architecture.md` and `domain.md`. When you have worked through the change described here, please update `architecture.md`.
## The problem
Language Learning App has audio as a core component, which requires files to be delivered to the end user. When developing locally, these files have been stored in a MinIO service, mimicking an S3-like storage bucket.
Using this approach on a deployed instance (e.g. on a VPS using Docker) would result in high bandwidth usage and therefore a high cost. Using a dedicated, EU-based service like Bunny allows us to offload the delivery of content to a third party at reduced cost (great!).
## The current implementation
Object storage was one of the first features built into this software in its MVP state; as such, it does not fit within the current architecture.
Right now `api/app/storage.py` contains some helper functions, notably the `upload_audio` and `download_audio` functions.
Users (through the web client) retrieve the media through two URLs (detailed in `api/app/routers/media.py`):
- `GET /media/adventure-audio/{filename:path}` for the choose-your-own-adventure file names
- `GET /media/{filename:path}`, used for the summary transcriptions
## The solution
We are going to use Bunny (bunny.net) as the CDN for all objects in deployed environments (right now, just production — in the future preprod or staging may exist).
Locally, for development purposes, we retain the use of MinIO. To decide which backend to use, we introduce an environment variable `STORAGE_PROVIDER` with a default value of `local` and an accepted alternative of `bunny`.
In situations where we use `local`, the existing `/media/..` proxy endpoints are returned when constructing audio URLs (e.g. in `api/app/routers/bff/articles.py` and `api/app/routers/bff/adventure.py`). When we use `bunny`, the Bunny CDN URL is returned directly so the request is never proxied through our service.
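The provider switch described above could be sketched roughly as follows. This is only an illustration under assumptions: `StorageProvider`, `read_storage_provider`, and `audio_url` are hypothetical names that do not exist in the codebase, and URL signing for Bunny is deliberately left out here.

```python
import os
from enum import Enum


class StorageProvider(str, Enum):
    LOCAL = "local"
    BUNNY = "bunny"


def read_storage_provider(env=None) -> StorageProvider:
    """Read STORAGE_PROVIDER, defaulting to "local"; reject unknown values."""
    env = os.environ if env is None else env
    raw = env.get("STORAGE_PROVIDER", "local").strip().lower()
    try:
        return StorageProvider(raw)
    except ValueError:
        raise ValueError(f"Unsupported STORAGE_PROVIDER: {raw!r}") from None


def audio_url(provider: StorageProvider, path: str, cdn_base_url: str) -> str:
    """Hypothetical helper: proxy endpoint locally, direct CDN URL for Bunny."""
    if provider is StorageProvider.LOCAL:
        # Served through the existing /media/.. proxy routes.
        return f"/media/{path}"
    # Unsigned CDN URL; signing is a separate concern and omitted here.
    return f"{cdn_base_url.rstrip('/')}/{path}"
```

Failing on an unknown value, rather than silently falling back to `local`, would make a misconfigured deployment visible at startup.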
### Client interface
We will create a `BunnyClient` in `api/app/outbound/bunny/bunny_client.py` and extract the current MinIO logic into a `MinioClient` in `api/app/outbound/minio/minio_client.py`. Both implement a shared `StorageClient` protocol.
The interface is **generic** — the clients are storage adapters and must not encode domain concepts. Path construction (which directory, which filename) is the responsibility of the caller (the service layer), not the client.
```python
from typing import Protocol


class StorageClient(Protocol):
    def upload(self, path: str, data: bytes) -> bool: ...
    def get_url(self, path: str) -> str: ...
    def delete(self, path: str) -> bool: ...
```
Services construct paths using hardcoded directory prefixes (e.g. `"adventure-audio/"`, `"audio/"`). These are constants, not environment variables — they are not environment-specific and do not belong in config.
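To illustrate the split of responsibilities, here is a minimal sketch in which a service owns the path prefix and the client only stores bytes. `InMemoryStorageClient` and `AdventureAudioService` are hypothetical names invented for this example (the in-memory client is the kind of test double the protocol makes possible), and the `memory://` URLs are placeholders.

```python
from typing import Protocol


class StorageClient(Protocol):
    def upload(self, path: str, data: bytes) -> bool: ...
    def get_url(self, path: str) -> str: ...
    def delete(self, path: str) -> bool: ...


class InMemoryStorageClient:
    """Hypothetical test double; satisfies StorageClient structurally."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def upload(self, path: str, data: bytes) -> bool:
        self.objects[path] = data
        return True

    def get_url(self, path: str) -> str:
        return f"memory://{path}"

    def delete(self, path: str) -> bool:
        # True only if something was actually removed.
        return self.objects.pop(path, None) is not None


# Directory prefixes are service-level constants, not configuration.
ADVENTURE_AUDIO_PREFIX = "adventure-audio/"


class AdventureAudioService:
    """Hypothetical service: builds paths, delegates storage to the client."""

    def __init__(self, storage: StorageClient) -> None:
        self._storage = storage

    def save_audio(self, filename: str, data: bytes) -> str:
        path = f"{ADVENTURE_AUDIO_PREFIX}{filename}"
        if not self._storage.upload(path, data):
            raise RuntimeError(f"Upload failed for {path}")
        return self._storage.get_url(path)
```

Because the client never sees the word "adventure", the same adapter can serve any future content type without modification.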
### Factory and instantiation
A factory function reads `STORAGE_PROVIDER` and returns the appropriate `StorageClient` implementation. The client is instantiated **once at app startup** (e.g. in `main.py`) as a module-level singleton — not per-request. This is consistent with how other outbound clients (`AnthropicClient`, `GeminiClient`, etc.) are handled.
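A factory along these lines might look like the sketch below. The `MinioClient` and `BunnyClient` classes here are stubs standing in for the real adapters, and `create_storage_client` and `storage_client` are assumed names, not ones taken from the codebase.

```python
import os
from typing import Callable, Protocol


class StorageClient(Protocol):
    def upload(self, path: str, data: bytes) -> bool: ...
    def get_url(self, path: str) -> str: ...
    def delete(self, path: str) -> bool: ...


class _StubClient:
    """Placeholder behaviour so this sketch runs without real backends."""

    scheme = "stub"

    def upload(self, path: str, data: bytes) -> bool:
        return True

    def get_url(self, path: str) -> str:
        return f"{self.scheme}://{path}"

    def delete(self, path: str) -> bool:
        return True


class MinioClient(_StubClient):
    scheme = "minio"


class BunnyClient(_StubClient):
    scheme = "bunny"


# Maps each accepted STORAGE_PROVIDER value to a client constructor.
_PROVIDERS: dict[str, Callable[[], StorageClient]] = {
    "local": MinioClient,
    "bunny": BunnyClient,
}


def create_storage_client(env=None) -> StorageClient:
    """Return the client selected by STORAGE_PROVIDER (default "local")."""
    env = os.environ if env is None else env
    provider = env.get("STORAGE_PROVIDER", "local")
    try:
        return _PROVIDERS[provider]()
    except KeyError:
        raise ValueError(f"Unsupported STORAGE_PROVIDER: {provider!r}") from None


# Created once when the module is imported, then shared by all requests.
storage_client = create_storage_client()
```

Accepting an optional `env` mapping keeps the factory testable without mutating the process environment.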
### Bunny configuration
Bunny requires the following environment variables:
- `BUNNY_ZONE` — the storage zone name (the zone `languagelearningapp` has been created in the Bunny UI). No "DEFAULT" suffix; there is one zone.
- `BUNNY_API_KEY` — the Bunny API key for upload/delete operations.
- `BUNNY_CDN_BASE_URL` — the public CDN hostname used to construct delivery URLs.
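One way to load these variables is a small settings object that fails fast when anything is missing. This is a sketch only: `BunnySettings` and `from_env` are hypothetical names, and the project may already have its own configuration mechanism.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_REQUIRED = ("BUNNY_ZONE", "BUNNY_API_KEY", "BUNNY_CDN_BASE_URL")


@dataclass(frozen=True)
class BunnySettings:
    zone: str
    api_key: str
    cdn_base_url: str

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "BunnySettings":
        """Build settings from the environment, naming every missing variable."""
        env = os.environ if env is None else env
        missing = [name for name in _REQUIRED if not env.get(name)]
        if missing:
            raise RuntimeError("Missing Bunny configuration: " + ", ".join(missing))
        return cls(
            zone=env["BUNNY_ZONE"],
            api_key=env["BUNNY_API_KEY"],
            # Normalise so callers can always append "/<path>".
            cdn_base_url=env["BUNNY_CDN_BASE_URL"].rstrip("/"),
        )
```

Since these values are only needed when `STORAGE_PROVIDER` is `bunny`, it seems reasonable to validate them only in that branch so local development needs no Bunny credentials.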
### Signed vs. public URLs
Because audio files are user-specific (i.e. one user should not be able to use another user's audio URL), Bunny signed URLs are required. Public CDN URLs are shareable by anyone who has the link.
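As per Bunny's own documentation, the recommended approach is their `token.py` helper (note that `token` is also the name of a Python standard-library module, so the file may need to be vendored under a different name to avoid shadowing it):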
```py
from token import sign_url

url = sign_url(
    "https://myzone.b-cdn.net/videos/stream1/playlist.m3u8",
    "your-security-key",
    expiration_time=3600,
    is_directory=True,
    path_allowed="/videos/stream1/",
    countries_allowed="GB",
)
```
`get_url(path)` on the `BunnyClient` must generate a time-limited (pick a sensible default for audio content here) signed URL using the Bunny Token Authentication feature. The MinIO implementation would use pre-signed S3 URLs for consistency.
Create a sibling method, `get_public_url`, that explicitly creates public URLs for any future public content.
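The relationship between the two methods could be sketched as below. To avoid guessing at Bunny's signing internals, the signer is injected as a plain callable; `BunnyUrlBuilder`, `Signer`, and the one-hour default are assumptions made for this example. The security key is passed in explicitly because this document does not name an environment variable for it.

```python
from typing import Callable
from urllib.parse import quote

# Assumed default lifetime for signed audio URLs, in seconds.
DEFAULT_URL_TTL_SECONDS = 3600

# Hypothetical signer shape: (url, security_key, ttl_seconds) -> signed URL.
Signer = Callable[[str, str, int], str]


class BunnyUrlBuilder:
    """Sketch of the URL half of a BunnyClient; upload/delete are omitted."""

    def __init__(
        self,
        cdn_base_url: str,
        security_key: str,
        signer: Signer,
        ttl_seconds: int = DEFAULT_URL_TTL_SECONDS,
    ) -> None:
        self._base = cdn_base_url.rstrip("/")
        self._security_key = security_key
        self._signer = signer
        self._ttl = ttl_seconds

    def get_public_url(self, path: str) -> str:
        """Unsigned URL, shareable by anyone who has the link."""
        # quote() keeps "/" intact while escaping spaces and similar.
        return f"{self._base}/{quote(path.lstrip('/'))}"

    def get_url(self, path: str) -> str:
        """Time-limited signed URL, produced by the injected signer."""
        return self._signer(self.get_public_url(path), self._security_key, self._ttl)
```

In production the signer would presumably wrap Bunny's `sign_url` from the example above, passing the TTL as `expiration_time`; in tests it can be a trivial fake.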
### Misc
`pcm_to_wav()` currently lives in `api/app/storage.py` but is a Gemini output concern. Move it to the Gemini client module (`api/app/outbound/gemini/`) when carrying out this refactor.
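For orientation, a PCM-to-WAV helper of this kind can be written with the standard-library `wave` module. The sketch below is not the existing implementation: the signature and the default sample rate, channel count, and sample width are assumptions and should be checked against the actual Gemini audio output.

```python
import io
import wave


def pcm_to_wav(
    pcm: bytes,
    sample_rate: int = 24000,  # assumed default, verify against the source audio
    channels: int = 1,         # assumed mono
    sample_width: int = 2,     # bytes per sample, i.e. 16-bit PCM
) -> bytes:
    """Wrap raw PCM frames in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```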