# Technical Design: Persistent Job Queue with Procrastinate **Status:** Draft — drafted by LLM, reviewed by human developer. **Scope:** Migration of adventure entry pipeline from in-process `asyncio.Queue` to Procrastinate (PostgreSQL-backed), plus groundwork for future scheduled jobs. --- ## Problem `app/worker.py` is a plain `asyncio.Queue` running inside the API process. Its limitations: - **No persistence.** Any enqueued jobs are silently lost if the API process restarts or is redeployed. - **No retry.** A transient failure (network blip calling Anthropic/DeepL/Gemini) permanently sets the entry status to `'error'`. - **No scheduling.** We want to run nightly jobs (e.g. news digest generation) on a cron trigger. - **Contention.** Long-running LLM + TTS + NLP pipelines share the same process and event loop as the HTTP API. --- ## Solution: Procrastinate + separate worker container [Procrastinate](https://procrastinate.readthedocs.io) is a Python asyncio task queue backed by PostgreSQL. It uses `LISTEN/NOTIFY` for fast job dispatch with a polling fallback. No new infrastructure is needed — the existing PostgreSQL instance is the queue backing store. A dedicated `worker` Docker container is added to every compose file. It shares the same Docker image as `api` (same `./api` build context) but runs a different command. Both containers connect to the same PostgreSQL instance. ``` ┌─────────────────────┐ defer_async() ┌──────────────────────┐ │ api (FastAPI) │ ────────────────→ │ PostgreSQL │ │ port 8000 │ │ procrastinate_jobs │ └─────────────────────┘ │ procrastinate_events │ └──────────────────────┘ ┌─────────────────────┐ LISTEN/NOTIFY + │ │ worker │ ←──────────────── │ │ (Procrastinate) │ polling fallback │ └─────────────────────┘ └ ``` --- ## Files Changed ### New: `api/app/tasks.py` Single source of truth for all task definitions. Both the API (to _defer_ tasks) and the worker (to _execute_ them) import this module. ```python import uuid import logging import procrastinate from procrastinate.contrib.sqlalchemy import SQLAlchemyAsyncConnector from .config import settings from .outbound.postgres.database import engine, AsyncSessionLocal from .outbound.anthropic.anthropic_client import AnthropicClient from .outbound.deepl.deepl_client import DeepLClient from .outbound.gemini.gemini_client import GeminiClient from .outbound.spacy.spacy_client import SpacyClient from .outbound.postgres.repositories.adventure_repository import ( PostgresAdventureRepository, PostgresAdventureEntryRepository, PostgresAdventureEntryChoiceRepository, PostgresAdventureEntryDecisionRepository, PostgresAdventureEntryTranslationRepository, PostgresAdventureEntryAudioRepository, ) from .domain.services.adventure_service import AdventureService logger = logging.getLogger(__name__) procrastinate_app = procrastinate.App( connector=SQLAlchemyAsyncConnector(engine) ) def _make_adventure_service(db) -> AdventureService: """ Moved here from adventures.py so the worker can construct the service without importing the router module. """ if settings.stub_generation: from .routers.api.adventures import ( # avoid circular import at module level _StubAnthropicClient, _StubDeepLClient, _StubGeminiClient, _StubSpacyClient ) anthropic = _StubAnthropicClient() deepl = _StubDeepLClient() gemini = _StubGeminiClient() spacy = _StubSpacyClient() else: anthropic = AnthropicClient.new(settings.anthropic_api_key) deepl = DeepLClient(settings.deepl_api_key) gemini = GeminiClient(settings.gemini_api_key) spacy = SpacyClient() return AdventureService( adventure_repo=PostgresAdventureRepository(db), entry_repo=PostgresAdventureEntryRepository(db), choice_repo=PostgresAdventureEntryChoiceRepository(db), decision_repo=PostgresAdventureEntryDecisionRepository(db), translation_repo=PostgresAdventureEntryTranslationRepository(db), audio_repo=PostgresAdventureEntryAudioRepository(db), anthropic_client=anthropic, deepl_client=deepl, gemini_client=gemini, spacy_client=spacy, ) @procrastinate_app.task(queue="adventure_pipeline") async def generate_adventure_entry( adventure_id: str, entry_id: str, user_id: str ) -> None: async with AsyncSessionLocal() as db: service = _make_adventure_service(db) await service.run_entry_pipeline( uuid.UUID(adventure_id), uuid.UUID(entry_id), uuid.UUID(user_id), ) ``` **EDITOR'S NOTE** There's no good reason why the stubs should live in `routers.api.adventures`, and therefore risk circular dependencies. They should be moved to the `outbound.SERVICE_NAME` (e.g. `outbound.bunny.stub_bunny_client`). This will involve updating the dependencies in the API router. **Notes on retry strategy:** `run_entry_pipeline` currently catches all exceptions internally and writes `status='error'` to the DB — it never raises. From Procrastinate's point of view the job always succeeds, so retry is not wired up in this first pass. This preserves the existing behaviour exactly. A follow-up improvement (out of scope here) would be to remove the internal catch-all, let exceptions propagate, and configure: ```python @procrastinate_app.task( queue="adventure_pipeline", retry=procrastinate.RetryStrategy( max_attempts=3, wait_minimum=30, wait_multiplier=2, wait_jitter=30, ), ) ``` with an `on_abort` hook responsible for writing the `'error'` status after all attempts are exhausted. --- ### New: `api/app/worker_main.py` The worker process entrypoint. The Docker `command` points here. ```python import asyncio import logging from . import tasks # side-effect: registers all task definitions logger = logging.getLogger(__name__) async def _run() -> None: async with tasks.procrastinate_app.open_async(): logger.info("Procrastinate worker started, queue=adventure_pipeline") await tasks.procrastinate_app.run_worker_async( queues=["adventure_pipeline"], concurrency=4, ) if __name__ == "__main__": logging.basicConfig(level=logging.INFO) asyncio.run(_run()) ``` Run command (in docker-compose): `python -m app.worker_main` --- ### New: Alembic migration `YYYYMMDD_0019_add_procrastinate_schema.py` Procrastinate manages its own schema independently, but we embed it in Alembic to keep all DB changes in one place and ensure it runs as part of `alembic upgrade head`. ```python from alembic import op import procrastinate.contrib.sqlalchemy def upgrade() -> None: # Procrastinate provides its DDL via the schema manager. # Run: `procrastinate schema --app=app.tasks.procrastinate_app print-sql` # to get the SQL and paste it here, or use: op.execute(procrastinate.contrib.sqlalchemy.SQLAlchemyAsyncConnector.get_schema_manager().get_schema()) def downgrade() -> None: op.execute("DROP SCHEMA procrastinate CASCADE;") # or the equivalent table-by-table drops if procrastinate uses public schema ``` **Alternative (simpler):** Add `await tasks.procrastinate_app.schema.apply_schema_async()` in `worker_main.py` before starting the worker. This runs Procrastinate's own migration tool, which is idempotent. It's less consistent with the project's Alembic-only convention but simpler to maintain as Procrastinate is upgraded. **REVIEW NOTE** - yes, let's stick to Alembic, two solutions for migration management would add complexity. --- ### Modified: `api/app/main.py` Remove the `worker_loop` asyncio task; open the Procrastinate connector in lifespan so that `defer_async` calls from API routes work. ```python # Before from . import worker @asynccontextmanager async def lifespan(app: FastAPI): init_storage() setup_observability(app) worker_task = asyncio.create_task(worker.worker_loop()) yield worker_task.cancel() try: await worker_task except asyncio.CancelledError: pass # After from . import tasks @asynccontextmanager async def lifespan(app: FastAPI): async with tasks.procrastinate_app.open_async(): init_storage() setup_observability(app) yield ``` The `import asyncio` line can be removed from `main.py` if it has no other uses after this change. --- ### Modified: `api/app/routers/api/adventures.py` Two changes: 1. **Remove** the `_make_service`, `_run_entry_pipeline_task`, and stub client definitions (they move to `tasks.py`). Keep `_make_service` as a thin shim that delegates to `tasks._make_adventure_service` if any remaining synchronous use in the router still needs it (e.g. the `can_translate_to` check in `create_adventure` — this needs a `DeepLClient` instance, which is currently built inline anyway, so no change needed there). 2. **Replace `worker.enqueue(...)` with `defer_async`** in the two endpoints that trigger pipeline work: **`POST /adventures` (create_adventure)** ```python # Before await worker.enqueue( partial( _run_entry_pipeline_task, uuid.UUID(adventure.id), uuid.UUID(first_entry.id), user_id ) ) # After await tasks.generate_adventure_entry.defer_async( adventure_id=str(adventure.id), entry_id=str(first_entry.id), user_id=str(user_id), ) ``` **`POST /adventures/{adventure_id}/decisions` (record_decision)** ```python # Before await worker.enqueue( partial( _run_entry_pipeline_task, uuid.UUID(next_entry.adventure_id), uuid.UUID(next_entry.id), user_id, ) ) # After await tasks.generate_adventure_entry.defer_async( adventure_id=str(next_entry.adventure_id), entry_id=str(next_entry.id), user_id=str(user_id), ) ``` Procrastinate task arguments must be JSON-serialisable. `uuid.UUID` objects are converted to `str` at the call site; the task function converts them back with `uuid.UUID(...)`. --- ### Deleted: `api/app/worker.py` Removed entirely once the migration is complete. No other callers exist outside `adventures.py` and `main.py`. --- ## Docker Compose Changes ### `docker-compose.yml` (base / local dev) Add a `worker` service after `api`: ```yaml worker: build: ./api volumes: - ./api:/app:z # hot-reload on code change (same as api) command: python -m worker.main environment: DATABASE_URL: postgresql+asyncpg://${POSTGRES_USER:-langlearn}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB:-langlearn} JWT_SECRET: ${JWT_SECRET} ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY} DEEPL_API_KEY: ${DEEPL_API_KEY} DEEPGRAM_API_KEY: ${DEEPGRAM_API_KEY} GEMINI_API_KEY: ${GEMINI_API_KEY} PYTHONPATH: /app STORAGE_ENDPOINT_URL: http://storage:9000 STORAGE_ACCESS_KEY: ${STORAGE_ACCESS_KEY:-langlearn} STORAGE_SECRET_KEY: ${STORAGE_SECRET_KEY} STORAGE_BUCKET: ${STORAGE_BUCKET:-langlearn} OTEL_SERVICE_NAME: ${OTEL_SERVICE_NAME:-language-learning-worker} depends_on: db: condition: service_healthy storage: condition: service_healthy restart: unless-stopped ``` The worker does not need `ports`, the Prometheus exporter config, or `API_BASE_URL`. OTEL service name is changed so traces are distinguishable in Grafana. ### `docker-compose-dev.yml` Same addition as above. The `volumes: - ./api:/app:z` mount means worker code reloads on save — but note that `python -m worker.main` does **not** hot-reload automatically (unlike uvicorn). For local dev, just restart the worker container after code changes: `docker compose restart worker`. If hot-reload matters during development, the command can be wrapped with `watchfiles`: ```yaml command: watchfiles --filter python "python -m worker.main" /app ``` (This requires `watchfiles` in the Python dependencies.) ### `docker-compose-prod.yml` Add a `worker` service. The production command does _not_ run `alembic upgrade head` because migrations are already applied by the `api` container's startup command. The worker just starts: ```yaml worker: build: ./api command: python -m worker.main environment: DATABASE_URL: postgresql+asyncpg://${POSTGRES_USER:-langlearn}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB:-langlearn} JWT_SECRET: ${JWT_SECRET} ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY} DEEPL_API_KEY: ${DEEPL_API_KEY} DEEPGRAM_API_KEY: ${DEEPGRAM_API_KEY} GEMINI_API_KEY: ${GEMINI_API_KEY} PYTHONPATH: /app STORAGE_PROVIDER: bunny BUNNY_ZONE: ${BUNNY_ZONE} BUNNY_API_KEY: ${BUNNY_API_KEY} BUNNY_CDN_BASE_URL: ${BUNNY_CDN_BASE_URL} BUNNY_TOKEN_AUTH_KEY: ${BUNNY_TOKEN_AUTH_KEY} BUNNY_STORAGE_ENDPOINT: ${BUNNY_STORAGE_ENDPOINT} OTEL_SERVICE_NAME: ${OTEL_SERVICE_NAME:-language-learning-worker} depends_on: db: condition: service_healthy restart: unless-stopped deploy: resources: limits: cpus: "1" memory: 1G ``` The worker depends on `db` only (no `storage` healthcheck needed since storage is Bunny CDN in prod, not a local container). ### `docker-compose.test.yml` Add a `worker` service. Critically, it must receive `STUB_GENERATION: "true"` so it uses stub clients, matching what the API does in tests. ```yaml worker: build: ./api command: python -m worker.main environment: DATABASE_URL: postgresql+asyncpg://langlearn_test:testpassword@db:5432/langlearn_test JWT_SECRET: test-jwt-secret-not-for-production ANTHROPIC_API_KEY: test-key DEEPL_API_KEY: test-key DEEPGRAM_API_KEY: test-key GEMINI_API_KEY: test-key STORAGE_ENDPOINT_URL: http://storage:9000 STORAGE_ACCESS_KEY: langlearn_test STORAGE_SECRET_KEY: testpassword123 STORAGE_BUCKET: langlearn-test PYTHONPATH: /app STUB_GENERATION: "true" depends_on: db: condition: service_healthy storage: condition: service_healthy ``` No healthcheck needed — the worker has no HTTP endpoint, and if it starts late, pending jobs simply wait in the queue until it picks them up. --- ## Testing ### Integration tests — no changes required `tests/test_adventures.py` already polls with `_wait_for_adventure_status(client, id, "active", timeout=30)`. This pattern is compatible with the worker being a separate process: the test enqueues a job, the worker processes it asynchronously, and the test polls until the adventure status reflects completion. With stub generation, the pipeline completes in milliseconds. The 30-second timeout is more than sufficient even accounting for worker container startup time. The one risk is a **startup race**: if the first test creates an adventure before the worker container has opened its Procrastinate connection, the job sits in the queue unprocessed until the worker is ready. Since `docker compose up --wait` waits for containers with healthchecks to pass (i.e. `api` is healthy before tests run), and the worker starts immediately after `db` is healthy (a prerequisite already met before `api` is healthy), the worker will typically be ready before the first test fires. No action needed, but if this proves flaky in CI, adding a short `pg_isready`-style healthcheck to the worker is the fix. ### What the tests implicitly verify after migration - `POST /adventures` returns `201` and adventure status is `'awaiting_first_entry'` ✓ - Worker picks up the job and calls `run_entry_pipeline` ✓ - Adventure transitions to `'active'`, entry to `'complete'` (polled via `_wait_for_adventure_status`) ✓ - Decision endpoint triggers a second pipeline job ✓ The existing tests cover all of this already. No new tests are required for the migration itself, though a test that verifies behaviour on worker restart (jobs not lost) would be a nice addition. --- ## Implementation Order 1. DONE: Add `procrastinate` (and `procrastinate[sqlalchemy]`) to `api/pyproject.toml` / `requirements`. 2. Write Alembic migration for Procrastinate schema. Run it locally. 3. Move stub services into their `app/outbound` directories 4. Create `app/tasks.py` with the `procrastinate_app` instance and the `generate_adventure_entry` task. 5. Create `app/worker_main.py`. 6. Modify `app/main.py`: remove `worker_loop`, add `procrastinate_app.open_async()` to lifespan. 7. Modify `app/routers/api/adventures.py`: replace `worker.enqueue(...)` with `tasks.generate_adventure_entry.defer_async(...)`. 8. Add `worker` service to all four compose files. 9. Run the test suite: `docker compose -f docker-compose.test.yml up --build --wait -d && pytest`. 10. Delete `app/worker.py`. Steps 1–4 can be done before touching the API, so the migration can be tested end-to-end before cutting over. --- ## Open Questions 1. [ANSWERED] As mentioned above, so this small refactor, it makes sense. **Stub client placement.** The stub classes inside `adventures.py` need to be reachable from `tasks.py`. The proposal above lazy-imports them; the cleaner fix is to extract them to `app/outbound/stubs.py`. Doing this in the same PR keeps scope small but is worth doing if it avoids the circular-import smell. 2. **Worker concurrency.** `concurrency=4` is a placeholder. Adventure pipeline jobs are I/O-heavy (network calls), not CPU-heavy, so higher concurrency is fine. Tune based on Anthropic/DeepL API rate limits. 3. **Procrastinate schema management.** Procrastinate has its own versioned migration system (separate from Alembic). When upgrading Procrastinate in future, run `procrastinate schema --app=app.tasks.procrastinate_app migrate` (or wrap it in an Alembic migration). Don't forget this step. 4. **Observability.** Procrastinate emits structured log lines per job. These will appear in Loki automatically. A future improvement would be to add the `job_id` to the OpenTelemetry trace context inside `generate_adventure_entry`. --- ## Future: Scheduled Jobs (News Digest) With Procrastinate in place, cron-style jobs are first-class citizens. Once `tasks.py` exists, adding a nightly job is: ```python @procrastinate_app.periodic(cron="0 2 * * *") # 2am daily UTC async def generate_nightly_news_digest() -> None: async with AsyncSessionLocal() as db: await NewsDigestService(db, ...).run() ``` The worker process runs periodic tasks automatically; no additional scheduler container is needed. Procrastinate tracks the last fire time in the `procrastinate_periodic_defers` table, so missed runs (e.g. worker was down) fire once on the next startup.