
Domain

This document explains the core domain concepts of the API — how it models second language acquisition and the processes a learner goes through. It is written for both human developers and LLMs working on this codebase.


What this system does

A user is learning a foreign language (currently French, with Spanish, Italian, and German planned). They encounter words in reading material, add those words to a personal vocabulary bank, resolve any ambiguity about which specific meaning they encountered, and then practise those words via spaced-repetition flashcards.

The system models that full cycle: from a raw word in context, through dictionary lookup and disambiguation, to a durable flashcard that can be studied repeatedly.


Linguistic concepts

Before reading the entity descriptions, these distinctions are essential. Conflating them is the most common source of modelling mistakes in this codebase.

Lemma, wordform, and token

  • A lemma is the canonical dictionary form of a word: aller, banque, bon.
  • A wordform is an inflected surface form: allons, allais, banques, bonne. Wordforms are derived from a lemma by applying grammatical rules.
  • A token is what spaCy returns when it processes a sentence — it is a wordform in context. spaCy provides both the raw token text and its lemma.

Dictionary entries are keyed by lemma. Wordforms point back to their lemma. These are different things and must not be conflated — a user might encounter allons in an article, but what they are learning is the lemma aller.

Senses

A single lemma can have multiple senses: bank (finance), bank (river), bank (verb, to lean). Each sense is a distinct row in dictionary_sense with its own gloss (definition/translation). The user learns a specific sense, not a bare headword.

When a user adds a word with multiple senses, the system cannot know which meaning they encountered. It creates a bank entry with disambiguation_status = "pending" and waits for the user to select the correct sense. This process is called disambiguation.

Part-of-speech normalisation

The dictionary source (kaikki/Wiktextract) uses its own POS labels: "noun", "verb", "past participle", "proverb", "phrase". spaCy uses Universal Dependencies tags: NOUN, VERB, ADJ, ADV. These do not map 1-to-1.

Both are stored: pos_raw holds the kaikki string exactly as it appears in the source data; pos_normalised holds the UD-compatible tag computed at import time. The pos_normalised field is what enables joining spaCy output against dictionary rows.

Gender

French, Spanish, Italian, and German nouns have grammatical gender. Learners must know the gender — le banc not just banc. Gender is extracted from the kaikki tags array at import time and stored as a first-class column (gender: text) on dictionary_lemma. Possible values are "masculine", "feminine", "neuter", "common", or null for parts of speech that do not inflect by gender.

The bilingual mapping

This system uses the English-language Wiktionary (via kaikki). An important structural fact: the gloss on a sense IS the English translation. There is no separate translations table. Because Wiktionary describes foreign words in English, the headword is the target-language word and the gloss is its English meaning:

  • dictionary_lemma.headword = "bisque" (French)
  • dictionary_sense.gloss = "advantage" (English meaning)

This means:

  • FR → EN (recognition): look up lemma by French headword → sense → gloss is the English meaning.
  • EN → FR (production): full-text search on dictionary_sense.gloss for the English term → linked lemma headword is the French word.
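Both directions can be sketched against a simplified version of the schema (in-memory sqlite, with a plain LIKE standing in for full-text search):

```python
import sqlite3

# Simplified column subset of dictionary_lemma / dictionary_sense.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dictionary_lemma (id INTEGER PRIMARY KEY,
                                   headword TEXT, language TEXT);
    CREATE TABLE dictionary_sense (id INTEGER PRIMARY KEY,
                                   lemma_id INTEGER REFERENCES dictionary_lemma,
                                   gloss TEXT);
    INSERT INTO dictionary_lemma VALUES (1, 'bisque', 'fr');
    INSERT INTO dictionary_sense VALUES (1, 1, 'advantage');
""")

# FR -> EN (recognition): French headword -> gloss is the English meaning.
gloss, = db.execute("""
    SELECT s.gloss FROM dictionary_sense s
    JOIN dictionary_lemma l ON l.id = s.lemma_id
    WHERE l.headword = ? AND l.language = 'fr'
""", ("bisque",)).fetchone()

# EN -> FR (production): search glosses -> linked headword is the French word.
headword, = db.execute("""
    SELECT l.headword FROM dictionary_lemma l
    JOIN dictionary_sense s ON s.lemma_id = l.id
    WHERE s.gloss LIKE ?
""", ("%advantage%",)).fetchone()
```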

The bilingual dictionary

The dictionary is a read-only reference dataset, populated once by an import script (scripts/import_dictionary.py) from kaikki JSONL dumps. It is never written to by the application at runtime.

dictionary_lemma

One row per lemma+POS combination. The (headword, language) pair is indexed but not unique — bank has multiple lemma rows because it is both a noun and a verb.

Key fields: headword, language (ISO 639-1 code, e.g. "fr"), pos_raw, pos_normalised, gender, tags.

dictionary_sense

One row per meaning of a lemma. The gloss is a short English definition that serves as both the disambiguation label and the translation. sense_index preserves the ordering from the source data (Wiktionary's first sense is usually the most common).

Key fields: lemma_id (FK → dictionary_lemma), sense_index, gloss, topics, tags.

dictionary_wordform

One row per inflected form. Populated from the forms array in the kaikki JSONL. Enables the NLP pipeline to resolve an inflected token back to its lemma without relying on spaCy's lemmatisation being perfect.

Key fields: lemma_id (FK → dictionary_lemma), form, tags (e.g. ["plural"], ["first person plural", "present indicative"]).

dictionary_lemma_raw

Stores the full original kaikki JSON record for each lemma, one row per lemma, separate from the main lemma table to avoid bloating lookup queries. Used for reprocessing if the import logic changes.


The user account

User

Standard authentication entity: email, hashed password, is_active, is_email_verified. There is no User domain model — the ORM entity (User in user_entity.py) is used directly by AccountService and user_repository. This is the only entity in the codebase that does not follow the entity→domain-model pattern, reflecting its purely infrastructural role.

LearnableLanguage

Records which language pair a user is studying and their self-reported proficiency levels. A user can study multiple language pairs simultaneously (e.g. EN→FR at B1 and EN→ES at A2). Proficiencies follow the CEFR scale (A1, A2, B1, B2, C1, C2).

Key fields: user_id, source_language, target_language, proficiencies: list[str].

This entity lives in learnable_languages and is managed by AccountService.add_learnable_language / remove_learnable_language.

UserLanguagePair

A lightweight pairing of source and target language, used to scope vocab bank entries. Where LearnableLanguage is a profile concept (what am I learning, at what level), UserLanguagePair is an operational concept (which direction does this vocabulary entry belong to).

Key fields: user_id, source_lang, target_lang. Unique per user per direction.


The vocab bank

The vocab bank is the central concept of the system. It is the user's personal list of words they are actively learning. Even when words "graduate" to learned or well known, they stay in the vocab bank.

Each user has their own vocab bank.

Items can be added to a vocab bank either by the user (e.g. by identifying a word they don't know in some natural-language text, translating it in the app, then adding it) or by the system (e.g. when the user selects predefined "packs" of words).

LearnableWordBankEntry

Each LearnableWordBankEntry signifies a word or phrase that a user has added to their bank, i.e. something they have identified and want to learn.

This is the bridge between the reference dictionary and the user's personal study material.

Key fields:

  • surface_text: The exact text the user encountered or typed (e.g. "allons", "avoir l'air"). Always stored, even if dictionary lookup fails.
  • sense_id: FK → dictionary_sense. Null until disambiguation is resolved. The specific meaning the user is learning.
  • wordform_id: FK → dictionary_wordform. Set when the entry originated from the NLP pipeline and the inflected form was found in the wordform table. Null for manually entered headwords.
  • is_phrase: True for multi-word expressions. Phrase entries bypass dictionary lookup and never resolve to a single sense.
  • entry_pathway: How the word entered the bank: "manual", "highlight", "nlp_extraction", or "pack".
  • disambiguation_status: See below.
  • language_pair_id: FK → user_language_pair. Which direction this entry belongs to.

Disambiguation status lifecycle

                     ┌─────────────┐
     (0 or >1 sense) │   pending   │ ◄── always starts here for phrases
          ┌──────────└─────────────┘──────────┐
          │                │                  │
   user picks         (1 sense found     user skips
    a sense           at add time)
          │                │                  │
          ▼                ▼                  ▼
     ┌──────────┐   ┌───────────────┐   ┌─────────┐
     │ resolved │   │ auto_resolved │   │ skipped │
     └──────────┘   └───────────────┘   └─────────┘
  • pending: No sense assigned. Occurs when zero or multiple dictionary senses were found, or when the entry is a phrase. The user must visit the disambiguation UI.
  • auto_resolved: Exactly one sense was found at add time; it was assigned automatically without user interaction.
  • resolved: The user was presented with multiple candidates and chose one.
  • skipped: The user chose not to disambiguate. The entry persists in the bank but cannot generate flashcards.

Only entries with disambiguation_status of "auto_resolved" or "resolved" have a sense_id and can generate flashcards.
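The lifecycle above implies a simple assignment rule at add time, which can be sketched as follows (the function name is illustrative):

```python
# Sketch of initial disambiguation_status assignment at add time,
# per the lifecycle diagram above.
def initial_status(sense_count: int, is_phrase: bool) -> str:
    if is_phrase:
        return "pending"        # phrases always start pending
    if sense_count == 1:
        return "auto_resolved"  # unique sense, assigned without user input
    return "pending"            # 0 or >1 senses need the disambiguation UI
```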


Flashcards

A flashcard is a study card; its physical-world analogue is a piece of paper with writing on both sides. A learner looks at one side and attempts to recall what is on the other. For example, for a French learner, one side might show "to go (v)" and the other "aller".

At the core of Language Learning App is the idea that Flashcards are a good primitive for improving recall over time. They should complement, not replace, immersion or exposure to foreign-language text. They allow users to focus on one thing at a time, as opposed to the more cognitively demanding experience of reading.

A User can have many Flashcards in their "bank", and flashcards can be arranged into "packs" of themes. Flashcards can be created in multiple ways:

  1. Users can "open" (i.e. copy) Flashcards in pre-constructed Packs. These might be, for example, "100 most common French verbs, infinitive forms" or "Food and ingredients, French words". These packs are built and maintained by the system administrators, and updates to the parent pack can trickle down to the child Flashcards in a User's account.
  2. Users can generate their own flashcards in the Web App using the dedicated Flashcard Interface.
  3. When a Learner is reading (or listening to) foreign-language content, they may look up a specific word for translation. When they do so, they have the chance to automatically create a flashcard.
  4. Users can duplicate pre-existing Flashcards.

Flashcard content

The idea of a Flashcard starts with its paper analogue, but the system adds functionality on, and around, it to make it maximally useful to the learner.

For example, a user may be trying to learn a single headword, so the system uses generative AI to generate multiple possible pieces of context text, because in real life you will see a word in many contexts.

Furthermore, we use generative AI to generate audio (text-to-speech) so the user can hear the word on its own as well as in the wider context text.

It is possible to have "simple" text flashcards which are just a source-language word and a target-language word ("to go (v)" -> "aller"). It is also possible to have contextual text in both the source and the target, e.g. "he wants [to go] to the cinema" -> "il veut [aller] au cinéma".

For flashcards with more context text, the prompt could be e.g. "il veut _____ au cinéma (to go, v)", alongside the whole original source text.

It is important to have text-to-speech for both the answer (e.g. "aller") and the whole context text ("il veut aller au cinéma"), because a big part of the premise of Language Learning App is that you can't just learn a language one word at a time.

We should design our Flashcard model with the idea that more than one element in the context text could be questioned. E.g. a user may wish to have "he wants [to go] [to the cinema]" and be presented with "il veut _____ __ ______". Within this single Flashcard we help the learner learn a number of words, each linked to separate wordforms and lemmas.
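One way such a multi-blank card could be modelled is sketched below; the class and field names (ClozeSpan, ContextCard) are hypothetical, not the actual model:

```python
from dataclasses import dataclass

@dataclass
class ClozeSpan:
    start: int   # character offset into target_text
    end: int
    lemma: str   # dictionary lemma this span teaches (illustrative values below)

@dataclass
class ContextCard:
    source_text: str        # English context with bracketed targets
    target_text: str        # full French answer text
    spans: list[ClozeSpan]  # one entry per questioned element

    def prompt(self) -> str:
        """Blank out every questioned span in the target text."""
        out, last = [], 0
        for s in sorted(self.spans, key=lambda s: s.start):
            out.append(self.target_text[last:s.start])
            out.append("_" * (s.end - s.start))
            last = s.end
        out.append(self.target_text[last:])
        return "".join(out)

card = ContextCard(
    source_text="he wants [to go] [to the cinema]",
    target_text="il veut aller au cinéma",
    spans=[ClozeSpan(8, 13, "aller"),
           ClozeSpan(14, 16, "au"),
           ClozeSpan(17, 23, "cinéma")],
)
# card.prompt() -> "il veut _____ __ ______"
```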

Posing Questions / Prompts

Presenting just a single word prompt to the user may not be enough to generate an accurate response, especially without context text.

Notably, many European languages have gender and tense agreement where English does not.

For example, consider "went", the past tense of "go". If you showed a learner "went" and asked for the French translation, you could receive multiple viable responses: "allé" (as in the passé composé "est allé") is the most likely, but "allai" (passé simple, first person) is also possible.

Therefore, the cue word for a Flashcard can:

  1. Show the user explicit context: "Went (v, past tense)"
  2. Show the user context text: "Went. Je suis _____"
  3. Some mixture of the two

The same is true for plurality and gender on e.g. adjectives: "young" could be "jeune" or "jeunes".

Linking to the Bilingual Dictionary

Two cards are typically generated per bank entry — one in each direction:

  • target_to_en (recognition): prompt = lemma.headword (e.g. "bisque"), answer = sense.gloss (e.g. "advantage"). The learner sees the French word and must produce the English meaning.
  • en_to_target (production): prompt = sense.gloss (e.g. "advantage"), answer = lemma.headword (e.g. "bisque"). The learner sees the English meaning and must produce the French word.
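A minimal sketch of this two-card generation (direction names follow the card_direction enumeration in this document; the function itself is illustrative):

```python
# Sketch: generate both card directions from a resolved bank entry's
# lemma headword and sense gloss.
def generate_cards(headword: str, gloss: str) -> list[dict]:
    return [
        {"direction": "target_to_en", "prompt": headword, "answer": gloss},
        {"direction": "en_to_target", "prompt": gloss, "answer": headword},
    ]

cards = generate_cards("bisque", "advantage")
```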

Fluency, familiarity, and struggle

Ideally, over time, a User becomes familiar with the words in their vocab bank. They do this through e.g. Flashcards, and possibly also through exposure to the word in Articles and other natural-language content.

It is also possible that a user consistently struggles with a certain word in their vocab bank, or with a certain class of words (e.g. uses of the subjunctive).

The system takes an event-driven approach to recording fluency, with periodic roll-ups or aggregations of state to represent a learner's familiarity. The exact nature of this system has not yet been designed.

FlashcardEvent

An immutable record of something that happened during a study session. Events are append-only — they are never updated, only inserted.

Event types:

  • shown: The card was displayed to the user.
  • answered: The user submitted a response. user_response holds the free-text answer as typed; no automatic grading is done at this layer.
  • skipped: The user swiped past the card without answering.

The spaced-repetition scheduling algorithm (not yet implemented) will consume these events to determine when each card should next be shown.

TranslatedArticleEvent

These are immutable records of something that happened with regard to an article: for example, the user marked it as read or played, loaded a TranslatedArticle in the Web UI that contained a word, or attempted to translate a word.


NLP pipeline integration

When a user highlights a word in an article, the client sends a spaCy token payload to POST /api/vocab/from-token. The DictionaryLookupService resolves the token to dictionary candidates using a three-stage fallback:

Stage 1 — wordform table (most precise) The inflected surface form (e.g. "allons") is looked up in dictionary_wordform. If found, the linked lemma's senses are returned. Because the lookup was via the wordform table, wordform_id is pre-populated on the resulting bank entry, preserving the link between what the user actually saw and the dictionary lemma it belongs to.

Stage 2 — lemma + UD POS If no wordform row exists, the spaCy-provided lemma (e.g. "aller") is looked up against dictionary_lemma.headword, filtered by pos_normalised (the UD POS tag from spaCy). The POS filter reduces false matches for homographs that share a headword but differ in part of speech.

Stage 3 — lemma only Drops the POS filter as a last resort. Returns all senses for the headword regardless of part of speech.
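The three stages can be sketched over in-memory stand-ins for the dictionary tables (the data, glosses, and noun sense are illustrative):

```python
# Toy stand-ins for dictionary_wordform and dictionary_lemma/_sense.
# A wordform row points at one specific lemma row (headword + POS).
WORDFORMS = {"allons": ("aller", "VERB")}
LEMMAS = {
    ("aller", "VERB"): ["to go"],
    ("aller", "NOUN"): ["one-way ticket"],  # illustrative homograph
}

def lookup(surface: str, spacy_lemma: str, pos_ud: str) -> list[str]:
    # Stage 1: exact wordform match (most precise)
    if surface in WORDFORMS:
        return LEMMAS.get(WORDFORMS[surface], [])
    # Stage 2: spaCy lemma filtered by UD POS
    if (spacy_lemma, pos_ud) in LEMMAS:
        return LEMMAS[(spacy_lemma, pos_ud)]
    # Stage 3: lemma only, POS filter dropped as a last resort
    return [g for (hw, _), gs in LEMMAS.items()
            if hw == spacy_lemma for g in gs]
```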

The endpoint response includes both the created bank entry and the full list of sense candidates, so the client can immediately render the disambiguation UI if disambiguation_status == "pending".


The full learner journey

1. Account setup
   User registers → adds a LearnableLanguage (e.g. EN→FR, B1)
   A UserLanguagePair is created to scope their vocab entries.

2. Word discovery
   User reads an article and encounters an unfamiliar word.
   
   Option A — manual entry:
     POST /api/vocab  { surface_text: "banque", language_pair_id: ... }
     VocabService looks up senses for "banque" in dictionary_lemma.
   
   Option B — article highlight (NLP):
     spaCy processes the article and returns a token payload.
     POST /api/vocab/from-token  { surface: "allons", spacy_lemma: "aller", pos_ud: "VERB", ... }
     DictionaryLookupService: wordform "allons" → lemma "aller" → senses.

3. Disambiguation
   If exactly 1 sense → status = auto_resolved, sense_id set immediately.
   If 0 or >1 senses → status = pending.
   
   GET /api/vocab/pending-disambiguation
   User sees list of candidate senses with glosses.
   PATCH /api/vocab/{entry_id}/sense  { sense_id: "..." }
   Status → resolved.

4. Flashcard generation
   POST /api/vocab/{entry_id}/flashcards
   FlashcardService reads sense.gloss + lemma.headword.
   Creates 2 flashcards: target_to_en and en_to_target.

5. Study session
   GET /api/flashcards  — fetch cards to study.
   POST /api/flashcards/{id}/events  { event_type: "shown" }
   POST /api/flashcards/{id}/events  { event_type: "answered", user_response: "bank" }
   Events accumulate for future SRS scheduling.

Entity relationships

users
  └── learnable_languages          (what languages, at what proficiency)
  └── user_language_pair           (operational scope for vocab entries)
        └── learnable_word_bank_entry
              ├── dictionary_sense (nullable — the specific meaning being learned)
              │     └── dictionary_lemma
              │           └── dictionary_wordform
              ├── dictionary_wordform (nullable — the exact inflected form encountered)
              └── flashcard
                    └── flashcard_event

dictionary_lemma
  ├── dictionary_sense             (one or many meanings)
  ├── dictionary_wordform          (inflected forms)
  └── dictionary_lemma_raw         (original kaikki JSON, for reprocessing)

Key enumerations

disambiguation_status

"pending" | "auto_resolved" | "resolved" | "skipped"

entry_pathway

"manual" | "highlight" | "nlp_extraction" | "pack"

card_direction

"target_to_en" | "en_to_target"

prompt_modality

"text" | "audio"

event_type

"shown" | "answered" | "skipped"

pos_normalised (UD tags used in this codebase)

NOUN | VERB | ADJ | ADV | DET | PRON | ADP | CCONJ | SCONJ | INTJ | NUM | PART | PROPN | PUNCT | SYM