# Domain

This document explains the core domain concepts of the API — how it models second language acquisition and the processes a learner goes through. It is written for both human developers and LLMs working on this codebase.

---

## What this system does

A user is learning a foreign language (currently French, with Spanish, Italian, and German planned). They encounter words in reading material, add those words to a personal vocabulary bank, resolve any ambiguity about which specific meaning they encountered, and then practise those words via spaced-repetition flashcards.

The system models that full cycle: from a raw word in context, through dictionary lookup and disambiguation, to a durable flashcard that can be studied repeatedly.

---

## Linguistic concepts

Before reading the entity descriptions, these distinctions are essential. Conflating them is the most common source of modelling mistakes in this codebase.

### Lemma, wordform, and token

- A **lemma** is the canonical dictionary form of a word: *aller*, *banque*, *bon*.
- A **wordform** is an inflected surface form: *allons*, *allais*, *banques*, *bonne*. Wordforms are derived from a lemma by applying grammatical rules.
- A **token** is what spaCy returns when it processes a sentence — it is a wordform in context. spaCy provides both the raw token text and its lemma.

Dictionary entries are keyed by lemma. Wordforms point back to their lemma. These are different things and must not be conflated — a user might encounter *allons* in an article, but what they are learning is the lemma *aller*.

### Senses

A single lemma can have multiple **senses**: *bank (finance)*, *bank (river)*, *bank (verb, to lean)*. Each sense is a distinct row in `dictionary_sense` with its own gloss (definition/translation). The user learns a specific sense, not a bare headword.

When a user adds a word with multiple senses, the system cannot know which meaning they encountered.
It creates a bank entry with `disambiguation_status = "pending"` and waits for the user to select the correct sense. This process is called **disambiguation**.

### Part-of-speech normalisation

The dictionary source (kaikki/Wiktextract) uses its own POS labels: "noun", "verb", "past participle", "proverb", "phrase". spaCy uses Universal Dependencies tags: NOUN, VERB, ADJ, ADV. These do not map 1-to-1.

Both are stored: `pos_raw` holds the kaikki string exactly as it appears in the source data; `pos_normalised` holds the UD-compatible tag computed at import time. The `pos_normalised` field is what enables joining spaCy output against dictionary rows.

### Gender

French, Spanish, Italian, and German nouns have grammatical gender. Learners must know the gender — *le banc*, not just *banc*. Gender is extracted from the kaikki `tags` array at import time and stored as a first-class column (`gender: text`) on `dictionary_lemma`. Possible values are `"masculine"`, `"feminine"`, `"neuter"`, `"common"`, or `null` for parts of speech that do not inflect by gender.

### The bilingual mapping

This system uses the English-language Wiktionary (via kaikki). An important structural fact: **the gloss on a sense IS the English translation**. There is no separate translations table.

Because Wiktionary describes foreign words in English, the headword is the target-language word and the gloss is its English meaning:

- `dictionary_lemma.headword = "bisque"` (French)
- `dictionary_sense.gloss = "advantage"` (English meaning)

This means:

- **FR → EN** (recognition): look up lemma by French headword → sense → gloss is the English meaning.
- **EN → FR** (production): full-text search on `dictionary_sense.gloss` for the English term → linked lemma headword is the French word.

---

## The bilingual dictionary

The dictionary is a read-only reference dataset, populated once by an import script (`scripts/import_dictionary.py`) from kaikki JSONL dumps.
It is never written to by the application at runtime.

### `dictionary_lemma`

One row per lemma+POS combination. The `(headword, language)` pair is indexed but not unique — *bank* has multiple lemma rows because it is both a noun and a verb.

Key fields: `headword`, `language` (ISO 639-1 code, e.g. `"fr"`), `pos_raw`, `pos_normalised`, `gender`, `tags`.

### `dictionary_sense`

One row per meaning of a lemma. The `gloss` is a short English definition that serves as both the disambiguation label and the translation. `sense_index` preserves the ordering from the source data (Wiktionary's first sense is usually the most common).

Key fields: `lemma_id` (FK → `dictionary_lemma`), `sense_index`, `gloss`, `topics`, `tags`.

### `dictionary_wordform`

One row per inflected form. Populated from the `forms` array in the kaikki JSONL. Enables the NLP pipeline to resolve an inflected token back to its lemma without relying on spaCy's lemmatisation being perfect.

Key fields: `lemma_id` (FK → `dictionary_lemma`), `form`, `tags` (e.g. `["plural"]`, `["first person plural", "present indicative"]`).

### `dictionary_lemma_raw`

Stores the full original kaikki JSON record for each lemma, one row per lemma, separate from the main lemma table to avoid bloating lookup queries. Used for reprocessing if the import logic changes.

---

## The user account

### User

Standard authentication entity: email, hashed password, `is_active`, `is_email_verified`. There is no `User` domain model — the ORM entity (`User` in `user_entity.py`) is used directly by `AccountService` and `user_repository`. This is the only entity in the codebase that does not follow the entity→domain-model pattern, reflecting its purely infrastructural role.

### `LearnableLanguage`

Records which language pair a user is studying and their self-reported proficiency levels. A user can study multiple language pairs simultaneously (e.g. EN→FR at B1 and EN→ES at A2). Proficiencies follow the CEFR scale (A1, A2, B1, B2, C1, C2).
Key fields: `user_id`, `source_language`, `target_language`, `proficiencies: list[str]`. This entity lives in `learnable_languages` and is managed by `AccountService.add_learnable_language` / `remove_learnable_language`.

### `UserLanguagePair`

A lightweight pairing of source and target language, used to scope vocab bank entries. Where `LearnableLanguage` is a profile concept (what am I learning, at what level), `UserLanguagePair` is an operational concept (which direction does this vocabulary entry belong to).

Key fields: `user_id`, `source_lang`, `target_lang`. Unique per user per direction.

---

## The vocab bank

The vocab bank is the central concept of the system. It is the user's personal list of words they are actively learning. Each user has their own vocab bank, and words stay in it even after they "graduate" to _learned_ or _well known_. Items can be put into a vocab bank either by the user (e.g. by identifying a word they don't know in some natural language text, translating it in the app, then adding it), or by the system (e.g. when the user selects predefined "packs" of words).

### `LearnableWordBankEntry`

Each `LearnableWordBankEntry` signifies a word or phrase that a user has added to their bank, i.e. something they have identified that they want to learn. This is the bridge between the reference dictionary and the user's personal study material.

Key fields:

| Field | Description |
|---|---|
| `surface_text` | The exact text the user encountered or typed (e.g. `"allons"`, `"avoir l'air"`). Always stored, even if dictionary lookup fails. |
| `sense_id` | FK → `dictionary_sense`. Null until disambiguation is resolved. The specific meaning the user is learning. |
| `wordform_id` | FK → `dictionary_wordform`. Set when the entry originated from the NLP pipeline and the inflected form was found in the wordform table. Null for manually-entered headwords. |
| `is_phrase` | True for multi-word expressions. Phrase entries bypass dictionary lookup and never resolve to a single sense. |
| `entry_pathway` | How the word entered the bank: `"manual"`, `"highlight"`, `"nlp_extraction"`, or `"pack"`. |
| `disambiguation_status` | See below. |
| `language_pair_id` | FK → `user_language_pair`. Which direction this entry belongs to. |

### Disambiguation status lifecycle

```
              ┌─────────────┐  (0 or >1 sense)
              │   pending   │ ◄── always starts here for phrases
   ┌──────────└─────────────┘──────────┐
   │                │                  │
user picks    (1 sense found       user skips
a sense        at add time)            │
   │                │                  │
   ▼                ▼                  ▼
┌──────────┐  ┌───────────────┐  ┌─────────┐
│ resolved │  │ auto_resolved │  │ skipped │
└──────────┘  └───────────────┘  └─────────┘
```

- **`pending`**: No sense assigned. Occurs when zero or multiple dictionary senses were found, or when the entry is a phrase. The user must visit the disambiguation UI.
- **`auto_resolved`**: Exactly one sense was found at add time; it was assigned automatically without user interaction.
- **`resolved`**: The user was presented with multiple candidates and chose one.
- **`skipped`**: The user chose not to disambiguate. The entry persists in the bank but cannot generate flashcards.

Only entries with `disambiguation_status` of `"auto_resolved"` or `"resolved"` have a `sense_id` and can generate flashcards.

---

## Flashcards

A flashcard is a study card; its analogue in the physical world is a piece of paper with writing on both sides. A learner looks at one side and attempts to recall what is on the other. For example, for a French learner, one side would have the word "to go (v)" and the other would have "aller".

At the core of Language Learning App is the idea that Flashcards are a good primitive for improving recall over time. They should complement, not replace, immersion or exposure to foreign-language text. They allow users to focus on one thing at a time, as opposed to the more cognitively demanding experience of reading.
A User can have many Flashcards in their "bank", and flashcards can be arranged into themed "packs". Flashcards can be created in multiple ways:

1. Users can "open" (i.e. copy) Flashcards in pre-constructed Packs. These might be, for example, "100 most common French verbs, infinitive forms" or "Food and ingredients, French words". These packs are built and maintained by the system administrators, and it is possible for updates to the parent pack to trickle down to the child Flashcards in a User's account.
2. Users can generate their own flashcards in the Web App using the dedicated Flashcard Interface.
3. When a Learner is reading (or listening to) foreign language content, they may look up a specific word for translation. When they do so, they have the chance to automatically create a flashcard.
4. Users can duplicate pre-existing Flashcards.

### Flashcard content

The idea of a Flashcard starts with its paper analogue, but the system adds a lot of functionality on, and around, it to make it maximally useful to the learner. For example, a user may be trying to learn a single headword, so the system uses generative AI to generate multiple possible pieces of context text — because in real life, you will see a word in many contexts. Furthermore, we use generative AI to generate audio (text-to-speech) so the user can hear the word, as well as the wider context text.

It is possible to have "simple" text flashcards which are _just_ a source language word and a target language word ("to go (v)" → "aller"). It is also possible to have contextual text in both the source and the target, e.g. "he wants [to go] to the cinema" → "il veut [aller] au cinéma". For these flashcards with more context text, it might be possible to present the user with e.g. "il veut _____ au cinéma (to go, v)" as the prompt, as well as the whole original source text.

It is important to have Text To Speech for both the answer (e.g. "aller") and the whole context text ("il veut aller au cinéma"), because a big part of the premise of Language Learning App is that you can't just learn a language one word at a time.

We should design our Flashcard model with the idea that more than one element in the context text could be questioned on. E.g. a user may wish to have "he wants [to go] [to the cinema]" and be presented "il veut _____ __ ______". Within this single Flashcard we are helping the learner learn a number of words, each linked to separate wordforms and lemmas.

### Posing Questions / Prompts

Presenting just a single word as the prompt may not be enough to elicit an accurate response, especially without context text. Notably, European languages have gender and tense agreement where English might not. For example, consider "went", the past tense of "go". If you showed a learner "went" and asked for the French translation, you could receive multiple viable answers: "allé" (past participle) is the most likely response, but "allai" (simple past, first person singular) is also possible. Therefore, the cue for a Flashcard can:

1. Show the user explicit context: "went (v, past participle)"
2. Show the user context text: "Went. Je suis _____"
3. Some mixture of the two

The same is true for plurality and gender on e.g. adjectives: "young" could be "jeune" or "jeunes".

### Linking to the Bilingual Dictionary

Two cards are typically generated per bank entry — one in each direction:

- **`target_to_en`** (recognition): prompt = `lemma.headword` (e.g. `"bisque"`), answer = `sense.gloss` (e.g. `"advantage"`). The learner sees the French word and must produce the English meaning.
- **`en_to_target`** (production): prompt = `sense.gloss` (e.g. `"advantage"`), answer = `lemma.headword` (e.g. `"bisque"`). The learner sees the English meaning and must produce the French word.
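The two-direction generation described above might look like the following sketch. The `Flashcard` dataclass and `generate_card_pair` function are hypothetical illustrations, not the actual `FlashcardService` API:

```python
from dataclasses import dataclass


@dataclass
class Flashcard:
    direction: str  # "target_to_en" (recognition) or "en_to_target" (production)
    prompt: str
    answer: str


def generate_card_pair(headword: str, gloss: str) -> list[Flashcard]:
    """Build both directional cards for a resolved bank entry."""
    return [
        # Recognition: see the French word, produce the English meaning.
        Flashcard(direction="target_to_en", prompt=headword, answer=gloss),
        # Production: see the English meaning, produce the French word.
        Flashcard(direction="en_to_target", prompt=gloss, answer=headword),
    ]
```

For the running example, `generate_card_pair("bisque", "advantage")` yields a recognition card prompting "bisque" and a production card prompting "advantage".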
## Fluency, familiarity, and struggle

Ideally, over time, a User becomes familiar with the words in their vocab bank. They will do this through e.g. Flashcards, and possibly also through exposure to the word in Articles and natural language content they generate. It is also possible that a user consistently struggles with a certain word in their vocab bank, or with a certain class of words (e.g. use of the subjunctive mood).

The System takes an event-driven approach to recording fluency, with periodic roll-ups or aggregations of state to represent a learner's familiarity. The exact nature of this system has not yet been thought through or designed.

### `FlashcardEvent`

An immutable record of something that happened during a study session. Events are append-only — they are never updated, only inserted.

Event types:

- **`shown`**: The card was displayed to the user.
- **`answered`**: The user submitted a response. `user_response` holds the free-text answer as typed; no automatic grading is done at this layer.
- **`skipped`**: The user swiped past the card without answering.

The spaced-repetition scheduling algorithm (not yet implemented) will consume these events to determine when each card should next be shown.

### `TranslatedArticleEvent`

An immutable record of something that happened with regard to an article — for example, the user marked an article as read or played, loaded a TranslatedArticle in the WebUI that contained a word, or attempted to translate a word.

---

## NLP pipeline integration

When a user highlights a word in an article, the client sends a spaCy token payload to `POST /api/vocab/from-token`. The `DictionaryLookupService` resolves the token to dictionary candidates using a three-stage fallback:

**Stage 1 — wordform table (most precise)**

The inflected surface form (e.g. `"allons"`) is looked up in `dictionary_wordform`. If found, the linked lemma's senses are returned.
Because the lookup was via the wordform table, `wordform_id` is pre-populated on the resulting bank entry, preserving the link between what the user actually saw and the dictionary lemma it belongs to.

**Stage 2 — lemma + UD POS**

If no wordform row exists, the spaCy-provided lemma (e.g. `"aller"`) is looked up against `dictionary_lemma.headword`, filtered by `pos_normalised` (the UD POS tag from spaCy). The POS filter reduces false matches for homographs that share a headword but differ in part of speech.

**Stage 3 — lemma only**

Drops the POS filter as a last resort. Returns all senses for the headword regardless of part of speech.

The endpoint response includes both the created bank entry and the full list of sense candidates, so the client can immediately render the disambiguation UI if `disambiguation_status == "pending"`.

---

## The full learner journey

```
1. Account setup
   User registers → adds a LearnableLanguage (e.g. EN→FR, B1)
   A UserLanguagePair is created to scope their vocab entries.

2. Word discovery
   User reads an article and encounters an unfamiliar word.

   Option A — manual entry:
     POST /api/vocab { surface_text: "banque", language_pair_id: ... }
     VocabService looks up senses for "banque" in dictionary_lemma.

   Option B — article highlight (NLP):
     spaCy processes the article and returns a token payload.
     POST /api/vocab/from-token { surface: "allons", spacy_lemma: "aller", pos_ud: "VERB", ... }
     DictionaryLookupService: wordform "allons" → lemma "aller" → senses.

3. Disambiguation
   If exactly 1 sense → status = auto_resolved, sense_id set immediately.
   If 0 or >1 senses → status = pending.
     GET /api/vocab/pending-disambiguation
     User sees list of candidate senses with glosses.
     PATCH /api/vocab/{entry_id}/sense { sense_id: "..." }
     Status → resolved.

4. Flashcard generation
   POST /api/vocab/{entry_id}/flashcards
   FlashcardService reads sense.gloss + lemma.headword.
   Creates 2 flashcards: target_to_en and en_to_target.

5. Study session
   GET /api/flashcards — fetch cards to study.
   POST /api/flashcards/{id}/events { event_type: "shown" }
   POST /api/flashcards/{id}/events { event_type: "answered", user_response: "bank" }
   Events accumulate for future SRS scheduling.
```

---

## Entity relationships

```
users
└── learnable_languages              (what languages, at what proficiency)
    └── user_language_pair           (operational scope for vocab entries)
        └── learnable_word_bank_entry
            ├── dictionary_sense     (nullable — the specific meaning being learned)
            │   └── dictionary_lemma
            │       └── dictionary_wordform
            ├── dictionary_wordform  (nullable — the exact inflected form encountered)
            └── flashcard
                └── flashcard_event

dictionary_lemma
├── dictionary_sense        (one or many meanings)
├── dictionary_wordform     (inflected forms)
└── dictionary_lemma_raw    (original kaikki JSON, for reprocessing)
```

---

## Key enumerations

### `disambiguation_status`

`"pending"` | `"auto_resolved"` | `"resolved"` | `"skipped"`

### `entry_pathway`

`"manual"` | `"highlight"` | `"nlp_extraction"` | `"pack"`

### `card_direction`

`"target_to_en"` | `"en_to_target"`

### `prompt_modality`

`"text"` | `"audio"`

### `event_type`

`"shown"` | `"answered"` | `"skipped"`

### `pos_normalised` (UD tags used in this codebase)

`NOUN` | `VERB` | `ADJ` | `ADV` | `DET` | `PRON` | `ADP` | `CCONJ` | `SCONJ` | `INTJ` | `NUM` | `PART` | `PROPN` | `PUNCT` | `SYM`