466 lines
19 KiB
Markdown
466 lines
19 KiB
Markdown
# Technical Design: Persistent Job Queue with Procrastinate
|
||
|
||
**Status:** Draft — drafted by LLM, reviewed by human developer.
|
||
**Scope:** Migration of adventure entry pipeline from in-process `asyncio.Queue` to Procrastinate (PostgreSQL-backed), plus groundwork for future scheduled jobs.
|
||
|
||
---
|
||
|
||
## Problem
|
||
|
||
`app/worker.py` is a plain `asyncio.Queue` running inside the API process. Its limitations:
|
||
|
||
- **No persistence.** Any enqueued jobs are silently lost if the API process restarts or is redeployed.
|
||
- **No retry.** A transient failure (network blip calling Anthropic/DeepL/Gemini) permanently sets the entry status to `'error'`.
|
||
- **No scheduling.** We want to run nightly jobs (e.g. news digest generation) on a cron trigger.
|
||
- **Contention.** Long-running LLM + TTS + NLP pipelines share the same process and event loop as the HTTP API.
|
||
|
||
---
|
||
|
||
## Solution: Procrastinate + separate worker container
|
||
|
||
[Procrastinate](https://procrastinate.readthedocs.io) is a Python asyncio task queue backed by PostgreSQL. It uses `LISTEN/NOTIFY` for fast job dispatch with a polling fallback. No new infrastructure is needed — the existing PostgreSQL instance is the queue backing store.
|
||
|
||
A dedicated `worker` Docker container is added to every compose file. It shares the same Docker image as `api` (same `./api` build context) but runs a different command. Both containers connect to the same PostgreSQL instance.
|
||
|
||
```
|
||
┌─────────────────────┐ defer_async() ┌──────────────────────┐
|
||
│ api (FastAPI) │ ────────────────→ │ PostgreSQL │
|
||
│ port 8000 │ │ procrastinate_jobs │
|
||
└─────────────────────┘ │ procrastinate_events │
|
||
└──────────────────────┘
|
||
┌─────────────────────┐ LISTEN/NOTIFY + │
|
||
│ worker │ ←──────────────── │
|
||
│ (Procrastinate) │ polling fallback │
|
||
└─────────────────────┘ └
|
||
```
|
||
|
||
---
|
||
|
||
## Files Changed
|
||
|
||
### New: `api/app/tasks.py`
|
||
|
||
Single source of truth for all task definitions. Both the API (to _defer_ tasks) and the worker (to _execute_ them) import this module.
|
||
|
||
```python
|
||
import uuid
|
||
import logging
|
||
import procrastinate
|
||
from procrastinate.contrib.sqlalchemy import SQLAlchemyAsyncConnector
|
||
|
||
from .config import settings
|
||
from .outbound.postgres.database import engine, AsyncSessionLocal
|
||
from .outbound.anthropic.anthropic_client import AnthropicClient
|
||
from .outbound.deepl.deepl_client import DeepLClient
|
||
from .outbound.gemini.gemini_client import GeminiClient
|
||
from .outbound.spacy.spacy_client import SpacyClient
|
||
from .outbound.postgres.repositories.adventure_repository import (
|
||
PostgresAdventureRepository,
|
||
PostgresAdventureEntryRepository,
|
||
PostgresAdventureEntryChoiceRepository,
|
||
PostgresAdventureEntryDecisionRepository,
|
||
PostgresAdventureEntryTranslationRepository,
|
||
PostgresAdventureEntryAudioRepository,
|
||
)
|
||
from .domain.services.adventure_service import AdventureService
|
||
|
||
logger = logging.getLogger(__name__)
|
||
|
||
procrastinate_app = procrastinate.App(
|
||
connector=SQLAlchemyAsyncConnector(engine)
|
||
)
|
||
|
||
|
||
def _make_adventure_service(db) -> AdventureService:
|
||
"""
|
||
Moved here from adventures.py so the worker can construct the service
|
||
without importing the router module.
|
||
"""
|
||
if settings.stub_generation:
|
||
from .routers.api.adventures import ( # avoid circular import at module level
|
||
_StubAnthropicClient, _StubDeepLClient, _StubGeminiClient, _StubSpacyClient
|
||
)
|
||
anthropic = _StubAnthropicClient()
|
||
deepl = _StubDeepLClient()
|
||
gemini = _StubGeminiClient()
|
||
spacy = _StubSpacyClient()
|
||
else:
|
||
anthropic = AnthropicClient.new(settings.anthropic_api_key)
|
||
deepl = DeepLClient(settings.deepl_api_key)
|
||
gemini = GeminiClient(settings.gemini_api_key)
|
||
spacy = SpacyClient()
|
||
|
||
return AdventureService(
|
||
adventure_repo=PostgresAdventureRepository(db),
|
||
entry_repo=PostgresAdventureEntryRepository(db),
|
||
choice_repo=PostgresAdventureEntryChoiceRepository(db),
|
||
decision_repo=PostgresAdventureEntryDecisionRepository(db),
|
||
translation_repo=PostgresAdventureEntryTranslationRepository(db),
|
||
audio_repo=PostgresAdventureEntryAudioRepository(db),
|
||
anthropic_client=anthropic,
|
||
deepl_client=deepl,
|
||
gemini_client=gemini,
|
||
spacy_client=spacy,
|
||
)
|
||
|
||
|
||
@procrastinate_app.task(queue="adventure_pipeline")
|
||
async def generate_adventure_entry(
|
||
adventure_id: str, entry_id: str, user_id: str
|
||
) -> None:
|
||
async with AsyncSessionLocal() as db:
|
||
service = _make_adventure_service(db)
|
||
await service.run_entry_pipeline(
|
||
uuid.UUID(adventure_id),
|
||
uuid.UUID(entry_id),
|
||
uuid.UUID(user_id),
|
||
)
|
||
```
|
||
|
||
**EDITOR'S NOTE** There's no good reason why the stubs should live in `routers.api.adventures`, and therefore risk circular dependencies. They should be moved to the `outbound.SERVICE_NAME` (e.g. `outbound.bunny.stub_bunny_client`). This will involve updating the dependencies in the API router.
|
||
|
||
**Notes on retry strategy:**
|
||
`run_entry_pipeline` currently catches all exceptions internally and writes `status='error'` to the DB — it never raises. From Procrastinate's point of view the job always succeeds, so retry is not wired up in this first pass. This preserves the existing behaviour exactly.
|
||
|
||
A follow-up improvement (out of scope here) would be to remove the internal catch-all, let exceptions propagate, and configure:
|
||
|
||
```python
|
||
@procrastinate_app.task(
|
||
queue="adventure_pipeline",
|
||
retry=procrastinate.RetryStrategy(
|
||
max_attempts=3,
|
||
wait_minimum=30,
|
||
wait_multiplier=2,
|
||
wait_jitter=30,
|
||
),
|
||
)
|
||
```
|
||
|
||
with an `on_abort` hook responsible for writing the `'error'` status after all attempts are exhausted.
|
||
|
||
---
|
||
|
||
### New: `api/app/worker_main.py`
|
||
|
||
The worker process entrypoint. The Docker `command` points here.
|
||
|
||
```python
|
||
import asyncio
|
||
import logging
|
||
from . import tasks # side-effect: registers all task definitions
|
||
|
||
logger = logging.getLogger(__name__)
|
||
|
||
|
||
async def _run() -> None:
|
||
async with tasks.procrastinate_app.open_async():
|
||
logger.info("Procrastinate worker started, queue=adventure_pipeline")
|
||
await tasks.procrastinate_app.run_worker_async(
|
||
queues=["adventure_pipeline"],
|
||
concurrency=4,
|
||
)
|
||
|
||
|
||
if __name__ == "__main__":
|
||
logging.basicConfig(level=logging.INFO)
|
||
asyncio.run(_run())
|
||
```
|
||
|
||
Run command (in docker-compose): `python -m app.worker_main`
|
||
|
||
---
|
||
|
||
### New: Alembic migration `YYYYMMDD_0019_add_procrastinate_schema.py`
|
||
|
||
Procrastinate manages its own schema independently, but we embed it in Alembic to keep all DB changes in one place and ensure it runs as part of `alembic upgrade head`.
|
||
|
||
```python
|
||
from alembic import op
|
||
import procrastinate.contrib.sqlalchemy
|
||
|
||
def upgrade() -> None:
|
||
# Procrastinate provides its DDL via the schema manager.
|
||
# Run: `procrastinate schema --app=app.tasks.procrastinate_app print-sql`
|
||
# to get the SQL and paste it here, or use:
|
||
op.execute(procrastinate.contrib.sqlalchemy.SQLAlchemyAsyncConnector.get_schema_manager().get_schema())
|
||
|
||
def downgrade() -> None:
|
||
op.execute("DROP SCHEMA procrastinate CASCADE;")
|
||
# or the equivalent table-by-table drops if procrastinate uses public schema
|
||
```
|
||
|
||
**Alternative (simpler):** Add `await tasks.procrastinate_app.schema.apply_schema_async()` in `worker_main.py` before starting the worker. This runs Procrastinate's own migration tool, which is idempotent. It's less consistent with the project's Alembic-only convention but simpler to maintain as Procrastinate is upgraded.
|
||
|
||
**REVIEW NOTE** - yes, let's stick to Alembic, two solutions for migration management would add complexity.
|
||
|
||
---
|
||
|
||
### Modified: `api/app/main.py`
|
||
|
||
Remove the `worker_loop` asyncio task; open the Procrastinate connector in lifespan so that `defer_async` calls from API routes work.
|
||
|
||
```python
|
||
# Before
|
||
from . import worker
|
||
|
||
@asynccontextmanager
|
||
async def lifespan(app: FastAPI):
|
||
init_storage()
|
||
setup_observability(app)
|
||
worker_task = asyncio.create_task(worker.worker_loop())
|
||
yield
|
||
worker_task.cancel()
|
||
try:
|
||
await worker_task
|
||
except asyncio.CancelledError:
|
||
pass
|
||
|
||
# After
|
||
from . import tasks
|
||
|
||
@asynccontextmanager
|
||
async def lifespan(app: FastAPI):
|
||
async with tasks.procrastinate_app.open_async():
|
||
init_storage()
|
||
setup_observability(app)
|
||
yield
|
||
```
|
||
|
||
The `import asyncio` line can be removed from `main.py` if it has no other uses after this change.
|
||
|
||
---
|
||
|
||
### Modified: `api/app/routers/api/adventures.py`
|
||
|
||
Two changes:
|
||
|
||
1. **Remove** the `_make_service`, `_run_entry_pipeline_task`, and stub client definitions (they move to `tasks.py`). Keep `_make_service` as a thin shim that delegates to `tasks._make_adventure_service` if any remaining synchronous use in the router still needs it (e.g. the `can_translate_to` check in `create_adventure` — this needs a `DeepLClient` instance, which is currently built inline anyway, so no change needed there).
|
||
|
||
2. **Replace `worker.enqueue(...)` with `defer_async`** in the two endpoints that trigger pipeline work:
|
||
|
||
**`POST /adventures` (create_adventure)**
|
||
|
||
```python
|
||
# Before
|
||
await worker.enqueue(
|
||
partial(
|
||
_run_entry_pipeline_task, uuid.UUID(adventure.id), uuid.UUID(first_entry.id), user_id
|
||
)
|
||
)
|
||
|
||
# After
|
||
await tasks.generate_adventure_entry.defer_async(
|
||
adventure_id=str(adventure.id),
|
||
entry_id=str(first_entry.id),
|
||
user_id=str(user_id),
|
||
)
|
||
```
|
||
|
||
**`POST /adventures/{adventure_id}/decisions` (record_decision)**
|
||
|
||
```python
|
||
# Before
|
||
await worker.enqueue(
|
||
partial(
|
||
_run_entry_pipeline_task,
|
||
uuid.UUID(next_entry.adventure_id),
|
||
uuid.UUID(next_entry.id),
|
||
user_id,
|
||
)
|
||
)
|
||
|
||
# After
|
||
await tasks.generate_adventure_entry.defer_async(
|
||
adventure_id=str(next_entry.adventure_id),
|
||
entry_id=str(next_entry.id),
|
||
user_id=str(user_id),
|
||
)
|
||
```
|
||
|
||
Procrastinate task arguments must be JSON-serialisable. `uuid.UUID` objects are converted to `str` at the call site; the task function converts them back with `uuid.UUID(...)`.
|
||
|
||
---
|
||
|
||
### Deleted: `api/app/worker.py`
|
||
|
||
Removed entirely once the migration is complete. No other callers exist outside `adventures.py` and `main.py`.
|
||
|
||
---
|
||
|
||
## Docker Compose Changes
|
||
|
||
### `docker-compose.yml` (base / local dev)
|
||
|
||
Add a `worker` service after `api`:
|
||
|
||
```yaml
|
||
worker:
|
||
build: ./api
|
||
volumes:
|
||
- ./api:/app:z # hot-reload on code change (same as api)
|
||
command: python -m worker.main
|
||
environment:
|
||
DATABASE_URL: postgresql+asyncpg://${POSTGRES_USER:-langlearn}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB:-langlearn}
|
||
JWT_SECRET: ${JWT_SECRET}
|
||
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
|
||
DEEPL_API_KEY: ${DEEPL_API_KEY}
|
||
DEEPGRAM_API_KEY: ${DEEPGRAM_API_KEY}
|
||
GEMINI_API_KEY: ${GEMINI_API_KEY}
|
||
PYTHONPATH: /app
|
||
STORAGE_ENDPOINT_URL: http://storage:9000
|
||
STORAGE_ACCESS_KEY: ${STORAGE_ACCESS_KEY:-langlearn}
|
||
STORAGE_SECRET_KEY: ${STORAGE_SECRET_KEY}
|
||
STORAGE_BUCKET: ${STORAGE_BUCKET:-langlearn}
|
||
OTEL_SERVICE_NAME: ${OTEL_SERVICE_NAME:-language-learning-worker}
|
||
depends_on:
|
||
db:
|
||
condition: service_healthy
|
||
storage:
|
||
condition: service_healthy
|
||
restart: unless-stopped
|
||
```
|
||
|
||
The worker does not need `ports`, the Prometheus exporter config, or `API_BASE_URL`. OTEL service name is changed so traces are distinguishable in Grafana.
|
||
|
||
### `docker-compose-dev.yml`
|
||
|
||
Same addition as above. The `volumes: - ./api:/app:z` mount means worker code reloads on save — but note that `python -m worker.main` does **not** hot-reload automatically (unlike uvicorn). For local dev, just restart the worker container after code changes: `docker compose restart worker`.
|
||
|
||
If hot-reload matters during development, the command can be wrapped with `watchfiles`:
|
||
|
||
```yaml
|
||
command: watchfiles --filter python "python -m worker.main" /app
|
||
```
|
||
|
||
(This requires `watchfiles` in the Python dependencies.)
|
||
|
||
### `docker-compose-prod.yml`
|
||
|
||
Add a `worker` service. The production command does _not_ run `alembic upgrade head` because migrations are already applied by the `api` container's startup command. The worker just starts:
|
||
|
||
```yaml
|
||
worker:
|
||
build: ./api
|
||
command: python -m worker.main
|
||
environment:
|
||
DATABASE_URL: postgresql+asyncpg://${POSTGRES_USER:-langlearn}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB:-langlearn}
|
||
JWT_SECRET: ${JWT_SECRET}
|
||
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
|
||
DEEPL_API_KEY: ${DEEPL_API_KEY}
|
||
DEEPGRAM_API_KEY: ${DEEPGRAM_API_KEY}
|
||
GEMINI_API_KEY: ${GEMINI_API_KEY}
|
||
PYTHONPATH: /app
|
||
STORAGE_PROVIDER: bunny
|
||
BUNNY_ZONE: ${BUNNY_ZONE}
|
||
BUNNY_API_KEY: ${BUNNY_API_KEY}
|
||
BUNNY_CDN_BASE_URL: ${BUNNY_CDN_BASE_URL}
|
||
BUNNY_TOKEN_AUTH_KEY: ${BUNNY_TOKEN_AUTH_KEY}
|
||
BUNNY_STORAGE_ENDPOINT: ${BUNNY_STORAGE_ENDPOINT}
|
||
OTEL_SERVICE_NAME: ${OTEL_SERVICE_NAME:-language-learning-worker}
|
||
depends_on:
|
||
db:
|
||
condition: service_healthy
|
||
restart: unless-stopped
|
||
deploy:
|
||
resources:
|
||
limits:
|
||
cpus: "1"
|
||
memory: 1G
|
||
```
|
||
|
||
The worker depends on `db` only (no `storage` healthcheck needed since storage is Bunny CDN in prod, not a local container).
|
||
|
||
### `docker-compose.test.yml`
|
||
|
||
Add a `worker` service. Critically, it must receive `STUB_GENERATION: "true"` so it uses stub clients, matching what the API does in tests.
|
||
|
||
```yaml
|
||
worker:
|
||
build: ./api
|
||
command: python -m worker.main
|
||
environment:
|
||
DATABASE_URL: postgresql+asyncpg://langlearn_test:testpassword@db:5432/langlearn_test
|
||
JWT_SECRET: test-jwt-secret-not-for-production
|
||
ANTHROPIC_API_KEY: test-key
|
||
DEEPL_API_KEY: test-key
|
||
DEEPGRAM_API_KEY: test-key
|
||
GEMINI_API_KEY: test-key
|
||
STORAGE_ENDPOINT_URL: http://storage:9000
|
||
STORAGE_ACCESS_KEY: langlearn_test
|
||
STORAGE_SECRET_KEY: testpassword123
|
||
STORAGE_BUCKET: langlearn-test
|
||
PYTHONPATH: /app
|
||
STUB_GENERATION: "true"
|
||
depends_on:
|
||
db:
|
||
condition: service_healthy
|
||
storage:
|
||
condition: service_healthy
|
||
```
|
||
|
||
No healthcheck needed — the worker has no HTTP endpoint, and if it starts late, pending jobs simply wait in the queue until it picks them up.
|
||
|
||
---
|
||
|
||
## Testing
|
||
|
||
### Integration tests — no changes required
|
||
|
||
`tests/test_adventures.py` already polls with `_wait_for_adventure_status(client, id, "active", timeout=30)`. This pattern is compatible with the worker being a separate process: the test enqueues a job, the worker processes it asynchronously, and the test polls until the adventure status reflects completion.
|
||
|
||
With stub generation, the pipeline completes in milliseconds. The 30-second timeout is more than sufficient even accounting for worker container startup time.
|
||
|
||
The one risk is a **startup race**: if the first test creates an adventure before the worker container has opened its Procrastinate connection, the job sits in the queue unprocessed until the worker is ready. Since `docker compose up --wait` waits for containers with healthchecks to pass (i.e. `api` is healthy before tests run), and the worker starts immediately after `db` is healthy (a prerequisite already met before `api` is healthy), the worker will typically be ready before the first test fires. No action needed, but if this proves flaky in CI, adding a short `pg_isready`-style healthcheck to the worker is the fix.
|
||
|
||
### What the tests implicitly verify after migration
|
||
|
||
- `POST /adventures` returns `201` and adventure status is `'awaiting_first_entry'` ✓
|
||
- Worker picks up the job and calls `run_entry_pipeline` ✓
|
||
- Adventure transitions to `'active'`, entry to `'complete'` (polled via `_wait_for_adventure_status`) ✓
|
||
- Decision endpoint triggers a second pipeline job ✓
|
||
|
||
The existing tests cover all of this already. No new tests are required for the migration itself, though a test that verifies behaviour on worker restart (jobs not lost) would be a nice addition.
|
||
|
||
---
|
||
|
||
## Implementation Order
|
||
|
||
1. DONE: Add `procrastinate` (and `procrastinate[sqlalchemy]`) to `api/pyproject.toml` / `requirements`.
|
||
2. Write Alembic migration for Procrastinate schema. Run it locally.
|
||
3. Move stub services into their `app/outbound` directories
|
||
4. Create `app/tasks.py` with the `procrastinate_app` instance and the `generate_adventure_entry` task.
|
||
5. Create `app/worker_main.py`.
|
||
6. Modify `app/main.py`: remove `worker_loop`, add `procrastinate_app.open_async()` to lifespan.
|
||
7. Modify `app/routers/api/adventures.py`: replace `worker.enqueue(...)` with `tasks.generate_adventure_entry.defer_async(...)`.
|
||
8. Add `worker` service to all four compose files.
|
||
9. Run the test suite: `docker compose -f docker-compose.test.yml up --build --wait -d && pytest`.
|
||
10. Delete `app/worker.py`.
|
||
|
||
Steps 1–4 can be done before touching the API, so the migration can be tested end-to-end before cutting over.
|
||
|
||
---
|
||
|
||
## Open Questions
|
||
|
||
1. [ANSWERED] As mentioned above, so this small refactor, it makes sense. **Stub client placement.** The stub classes inside `adventures.py` need to be reachable from `tasks.py`. The proposal above lazy-imports them; the cleaner fix is to extract them to `app/outbound/stubs.py`. Doing this in the same PR keeps scope small but is worth doing if it avoids the circular-import smell.
|
||
|
||
2. **Worker concurrency.** `concurrency=4` is a placeholder. Adventure pipeline jobs are I/O-heavy (network calls), not CPU-heavy, so higher concurrency is fine. Tune based on Anthropic/DeepL API rate limits.
|
||
|
||
3. **Procrastinate schema management.** Procrastinate has its own versioned migration system (separate from Alembic). When upgrading Procrastinate in future, run `procrastinate schema --app=app.tasks.procrastinate_app migrate` (or wrap it in an Alembic migration). Don't forget this step.
|
||
|
||
4. **Observability.** Procrastinate emits structured log lines per job. These will appear in Loki automatically. A future improvement would be to add the `job_id` to the OpenTelemetry trace context inside `generate_adventure_entry`.
|
||
|
||
---
|
||
|
||
## Future: Scheduled Jobs (News Digest)
|
||
|
||
With Procrastinate in place, cron-style jobs are first-class citizens. Once `tasks.py` exists, adding a nightly job is:
|
||
|
||
```python
|
||
@procrastinate_app.periodic(cron="0 2 * * *") # 2am daily UTC
|
||
async def generate_nightly_news_digest() -> None:
|
||
async with AsyncSessionLocal() as db:
|
||
await NewsDigestService(db, ...).run()
|
||
```
|
||
|
||
The worker process runs periodic tasks automatically; no additional scheduler container is needed. Procrastinate tracks the last fire time in the `procrastinate_periodic_defers` table, so missed runs (e.g. worker was down) fire once on the next startup.
|