Domain
This document explains the core domain concepts of the API — how it models second language acquisition and the processes a learner goes through. It is written for both human developers and LLMs working on this codebase.
What this system does
A user is learning a foreign language (currently French, with Spanish, Italian, and German planned). They encounter words in reading material, add those words to a personal vocabulary bank, resolve any ambiguity about which specific meaning they encountered, and then practise those words via spaced-repetition flashcards.
The system models that full cycle: from a raw word in context, through dictionary lookup and disambiguation, to a durable flashcard that can be studied repeatedly.
Linguistic concepts
Before reading the entity descriptions, these distinctions are essential. Conflating them is the most common source of modelling mistakes in this codebase.
Lemma, wordform, and token
- A lemma is the canonical dictionary form of a word: aller, banque, bon.
- A wordform is an inflected surface form: allons, allais, banques, bonne. Wordforms are derived from a lemma by applying grammatical rules.
- A token is what spaCy returns when it processes a sentence — it is a wordform in context. spaCy provides both the raw token text and its lemma.
Dictionary entries are keyed by lemma. Wordforms point back to their lemma. These are different things and must not be conflated — a user might encounter allons in an article, but what they are learning is the lemma aller.
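The distinction can be made concrete with a small sketch. The `Token` dataclass below is purely illustrative of the payload shape — the real token comes from spaCy, which exposes the raw text and the lemma as separate attributes:

```python
from dataclasses import dataclass

# Hypothetical payload shape for illustration -- the real token
# comes from spaCy, which provides both text and lemma.
@dataclass(frozen=True)
class Token:
    text: str   # the wordform as it appeared in the article, e.g. "allons"
    lemma: str  # the canonical dictionary form, e.g. "aller"

# The user highlights "allons" in "Nous allons au marché."
token = Token(text="allons", lemma="aller")

# Dictionary entries are keyed by lemma, so the lookup key is the
# lemma, not the wordform the user actually saw.
lookup_key = token.lemma
```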
Senses
A single lemma can have multiple senses: bank (finance), bank (river), bank (verb, to lean). Each sense is a distinct row in dictionary_sense with its own gloss (definition/translation). The user learns a specific sense, not a bare headword.
When a user adds a word with multiple senses, the system cannot know which meaning they encountered. It creates a bank entry with disambiguation_status = "pending" and waits for the user to select the correct sense. This process is called disambiguation.
Part-of-speech normalisation
The dictionary source (kaikki/Wiktextract) uses its own POS labels: "noun", "verb", "past participle", "proverb", "phrase". spaCy uses Universal Dependencies tags: NOUN, VERB, ADJ, ADV. These do not map 1-to-1.
Both are stored: pos_raw holds the kaikki string exactly as it appears in the source data; pos_normalised holds the UD-compatible tag computed at import time. The pos_normalised field is what enables joining spaCy output against dictionary rows.
Gender
French, Spanish, Italian, and German nouns have grammatical gender. Learners must know the gender — le banc not just banc. Gender is extracted from the kaikki tags array at import time and stored as a first-class column (gender: text) on dictionary_lemma. Possible values are "masculine", "feminine", "neuter", "common", or null for parts of speech that do not inflect by gender.
The bilingual mapping
This system uses the English-language Wiktionary (via kaikki). An important structural fact: the gloss on a sense IS the English translation. There is no separate translations table. Because Wiktionary describes foreign words in English, the headword is the target-language word and the gloss is its English meaning:
dictionary_lemma.headword = "bisque"    (French)
dictionary_sense.gloss    = "advantage" (English meaning)
This means:
- FR → EN (recognition): look up lemma by French headword → sense → gloss is the English meaning.
- EN → FR (production): full-text search on dictionary_sense.gloss for the English term → linked lemma headword is the French word.
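Both directions can be demonstrated against a minimal in-memory schema. This is a simplified sketch — the real tables carry more columns, and production lookups in the EN → FR direction would use full-text search rather than a LIKE scan:

```python
import sqlite3

# Minimal in-memory sketch of the two lookup directions.
# Schema is simplified; real tables carry more columns.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dictionary_lemma (id INTEGER PRIMARY KEY, headword TEXT, language TEXT);
    CREATE TABLE dictionary_sense (id INTEGER PRIMARY KEY, lemma_id INTEGER, gloss TEXT);
    INSERT INTO dictionary_lemma VALUES (1, 'bisque', 'fr');
    INSERT INTO dictionary_sense VALUES (1, 1, 'advantage');
""")

# FR -> EN (recognition): French headword -> sense -> English gloss.
gloss, = db.execute("""
    SELECT s.gloss FROM dictionary_lemma l
    JOIN dictionary_sense s ON s.lemma_id = l.id
    WHERE l.headword = ? AND l.language = ?
""", ("bisque", "fr")).fetchone()

# EN -> FR (production): search glosses for the English term -> headword.
# (Real code would use full-text search, not LIKE.)
headword, = db.execute("""
    SELECT l.headword FROM dictionary_sense s
    JOIN dictionary_lemma l ON l.id = s.lemma_id
    WHERE s.gloss LIKE ?
""", ("%advantage%",)).fetchone()
```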
The bilingual dictionary
The dictionary is a read-only reference dataset, populated once by an import script (scripts/import_dictionary.py) from kaikki JSONL dumps. It is never written to by the application at runtime.
dictionary_lemma
One row per lemma+POS combination. The (headword, language) pair is indexed but not unique — bank has multiple lemma rows because it is both a noun and a verb.
Key fields: headword, language (ISO 639-1 code, e.g. "fr"), pos_raw, pos_normalised, gender, tags.
dictionary_sense
One row per meaning of a lemma. The gloss is a short English definition that serves as both the disambiguation label and the translation. sense_index preserves the ordering from the source data (Wiktionary's first sense is usually the most common).
Key fields: lemma_id (FK → dictionary_lemma), sense_index, gloss, topics, tags.
dictionary_wordform
One row per inflected form. Populated from the forms array in the kaikki JSONL. Enables the NLP pipeline to resolve an inflected token back to its lemma without relying on spaCy's lemmatisation being perfect.
Key fields: lemma_id (FK → dictionary_lemma), form, tags (e.g. ["plural"], ["first person plural", "present indicative"]).
dictionary_lemma_raw
Stores the full original kaikki JSON record for each lemma, one row per lemma, separate from the main lemma table to avoid bloating lookup queries. Used for reprocessing if the import logic changes.
The user account
User
Standard authentication entity: email, hashed password, is_active, is_email_verified. There is no User domain model — the ORM entity (User in user_entity.py) is used directly by AccountService and user_repository. This is the only entity in the codebase that does not follow the entity→domain-model pattern, reflecting its purely infrastructural role.
LearnableLanguage
Records which language pair a user is studying and their self-reported proficiency levels. A user can study multiple language pairs simultaneously (e.g. EN→FR at B1 and EN→ES at A2). Proficiencies follow the CEFR scale (A1, A2, B1, B2, C1, C2).
Key fields: user_id, source_language, target_language, proficiencies: list[str].
This entity lives in learnable_languages and is managed by AccountService.add_learnable_language / remove_learnable_language.
UserLanguagePair
A lightweight pairing of source and target language, used to scope vocab bank entries. Where LearnableLanguage is a profile concept (what am I learning, at what level), UserLanguagePair is an operational concept (which direction does this vocabulary entry belong to).
Key fields: user_id, source_lang, target_lang. Unique per user per direction.
The vocab bank
The vocab bank is the central concept of the system. It is the user's personal list of words they are actively learning.
LearnableWordBankEntry
One row per word or phrase that a user has added to their bank. This is the bridge between the reference dictionary and the user's personal study material.
Key fields:
| Field | Description |
|---|---|
| surface_text | The exact text the user encountered or typed (e.g. "allons", "avoir l'air"). Always stored, even if dictionary lookup fails. |
| sense_id | FK → dictionary_sense. Null until disambiguation is resolved. The specific meaning the user is learning. |
| wordform_id | FK → dictionary_wordform. Set when the entry originated from the NLP pipeline and the inflected form was found in the wordform table. Null for manually-entered headwords. |
| is_phrase | True for multi-word expressions. Phrase entries bypass dictionary lookup and never resolve to a single sense. |
| entry_pathway | How the word entered the bank: "manual", "highlight", "nlp_extraction", or "pack". |
| disambiguation_status | See below. |
| language_pair_id | FK → user_language_pair. Which direction this entry belongs to. |
Disambiguation status lifecycle
┌─────────────┐
(0 or >1 sense) │ pending │ ◄── always starts here for phrases
┌──────────└─────────────┘──────────┐
│ │ │
user picks (1 sense found user skips
a sense at add time)
│ │ │
▼ ▼ ▼
┌──────────┐ ┌───────────────┐ ┌─────────┐
│ resolved │ │ auto_resolved │ │ skipped │
└──────────┘ └───────────────┘ └─────────┘
- pending: No sense assigned. Occurs when zero or multiple dictionary senses were found, or when the entry is a phrase. The user must visit the disambiguation UI.
- auto_resolved: Exactly one sense was found at add time; it was assigned automatically without user interaction.
- resolved: The user was presented with multiple candidates and chose one.
- skipped: The user chose not to disambiguate. The entry persists in the bank but cannot generate flashcards.
Only entries with disambiguation_status of "auto_resolved" or "resolved" have a sense_id and can generate flashcards.
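The status rules above reduce to a small decision function. This is a hedged sketch — the real logic lives in the vocab service, and the function names here are illustrative:

```python
# Sketch of the status rules described above (illustrative names --
# the real logic lives in the vocab service).
def initial_status(sense_count: int, is_phrase: bool) -> str:
    if is_phrase:
        return "pending"        # phrases always start pending
    if sense_count == 1:
        return "auto_resolved"  # unambiguous: sense assigned at add time
    return "pending"            # 0 or >1 senses: user must disambiguate

def can_generate_flashcards(status: str) -> bool:
    # Only resolved entries have a sense_id to build cards from.
    return status in ("auto_resolved", "resolved")
```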
Flashcards
A flashcard is a study card derived from a resolved vocab bank entry. It carries pre-computed prompt and answer text so the study session does not need to re-query the dictionary.
Flashcard
Two cards are typically generated per bank entry — one in each direction:
- target_to_en (recognition): prompt = lemma.headword (e.g. "bisque"), answer = sense.gloss (e.g. "advantage"). The learner sees the French word and must produce the English meaning.
- en_to_target (production): prompt = sense.gloss (e.g. "advantage"), answer = lemma.headword (e.g. "bisque"). The learner sees the English meaning and must produce the French word.
Key fields: bank_entry_id, user_id, source_lang, target_lang, prompt_text, answer_text, prompt_context_text (optional sentence context), answer_context_text, card_direction, prompt_modality ("text" or "audio").
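The two-card rule can be sketched as a pure function over the resolved entry's headword and gloss. Field names follow the table above; the real FlashcardService also fills the context, language, and modality fields:

```python
# Sketch of the two-card generation rule (field names follow the doc;
# the real FlashcardService also fills context/language/modality fields).
def build_flashcards(headword: str, gloss: str) -> list[dict]:
    return [
        {"card_direction": "target_to_en",   # recognition
         "prompt_text": headword, "answer_text": gloss},
        {"card_direction": "en_to_target",   # production
         "prompt_text": gloss, "answer_text": headword},
    ]

cards = build_flashcards("bisque", "advantage")
```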
FlashcardEvent
An immutable record of something that happened during a study session. Events are append-only — they are never updated, only inserted.
Event types:
- shown: The card was displayed to the user.
- answered: The user submitted a response. user_response holds the free-text answer as typed; no automatic grading is done at this layer.
- skipped: The user swiped past the card without answering.
The spaced-repetition scheduling algorithm (not yet implemented) will consume these events to determine when each card should next be shown.
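Since the scheduler is not yet implemented, any concrete algorithm is speculative — but a purely hypothetical Leitner-style sketch shows how an append-only event stream could be folded into a next-due date:

```python
from datetime import datetime, timedelta

# HYPOTHETICAL sketch only -- the scheduler is not yet implemented.
# A Leitner-style fold: double the interval per consecutive "answered",
# reset on "skipped"; "shown" events are ignored here.
def next_due(events: list[dict], now: datetime) -> datetime:
    streak = 0
    for ev in events:
        if ev["event_type"] == "answered":
            streak += 1
        elif ev["event_type"] == "skipped":
            streak = 0
    return now + timedelta(days=2 ** streak if streak else 0)
```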
NLP pipeline integration
When a user highlights a word in an article, the client sends a spaCy token payload to POST /api/vocab/from-token. The DictionaryLookupService resolves the token to dictionary candidates using a three-stage fallback:
Stage 1 — wordform table (most precise)
The inflected surface form (e.g. "allons") is looked up in dictionary_wordform. If found, the linked lemma's senses are returned. Because the lookup was via the wordform table, wordform_id is pre-populated on the resulting bank entry, preserving the link between what the user actually saw and the dictionary lemma it belongs to.
Stage 2 — lemma + UD POS
If no wordform row exists, the spaCy-provided lemma (e.g. "aller") is looked up against dictionary_lemma.headword, filtered by pos_normalised (the UD POS tag from spaCy). The POS filter reduces false matches for homographs that share a headword but differ in part of speech.
Stage 3 — lemma only
As a last resort, the POS filter is dropped and all senses for the headword are returned regardless of part of speech.
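The three stages condense to a short fallback chain. This is an illustrative sketch of DictionaryLookupService's flow — the dicts stand in for real table queries, and the function signature is assumed:

```python
# Condensed sketch of the three-stage fallback (illustrative signature;
# the dicts stand in for queries against the real tables).
def lookup(surface: str, spacy_lemma: str, pos_ud: str,
           wordforms: dict, lemmas: list[dict]) -> list[dict]:
    # Stage 1: exact inflected form in dictionary_wordform (most precise)
    if surface in wordforms:
        return wordforms[surface]
    # Stage 2: spaCy lemma against headword, filtered by UD POS
    hits = [l for l in lemmas
            if l["headword"] == spacy_lemma and l["pos_normalised"] == pos_ud]
    # Stage 3: lemma only -- POS filter dropped as a last resort
    if not hits:
        hits = [l for l in lemmas if l["headword"] == spacy_lemma]
    return hits
```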
The endpoint response includes both the created bank entry and the full list of sense candidates, so the client can immediately render the disambiguation UI if disambiguation_status == "pending".
The full learner journey
1. Account setup
User registers → adds a LearnableLanguage (e.g. EN→FR, B1)
A UserLanguagePair is created to scope their vocab entries.
2. Word discovery
User reads an article and encounters an unfamiliar word.
Option A — manual entry:
POST /api/vocab { surface_text: "banque", language_pair_id: ... }
VocabService looks up senses for "banque" in dictionary_lemma.
Option B — article highlight (NLP):
spaCy processes the article and returns a token payload.
POST /api/vocab/from-token { surface: "allons", spacy_lemma: "aller", pos_ud: "VERB", ... }
DictionaryLookupService: wordform "allons" → lemma "aller" → senses.
3. Disambiguation
If exactly 1 sense → status = auto_resolved, sense_id set immediately.
If 0 or >1 senses → status = pending.
GET /api/vocab/pending-disambiguation
User sees list of candidate senses with glosses.
PATCH /api/vocab/{entry_id}/sense { sense_id: "..." }
Status → resolved.
4. Flashcard generation
POST /api/vocab/{entry_id}/flashcards
FlashcardService reads sense.gloss + lemma.headword.
Creates 2 flashcards: target_to_en and en_to_target.
5. Study session
GET /api/flashcards — fetch cards to study.
POST /api/flashcards/{id}/events { event_type: "shown" }
POST /api/flashcards/{id}/events { event_type: "answered", user_response: "bank" }
Events accumulate for future SRS scheduling.
Entity relationships
users
└── learnable_languages (what languages, at what proficiency)
└── user_language_pair (operational scope for vocab entries)
└── learnable_word_bank_entry
├── dictionary_sense (nullable — the specific meaning being learned)
│ └── dictionary_lemma
│ └── dictionary_wordform
├── dictionary_wordform (nullable — the exact inflected form encountered)
└── flashcard
└── flashcard_event
dictionary_lemma
├── dictionary_sense (one or many meanings)
├── dictionary_wordform (inflected forms)
└── dictionary_lemma_raw (original kaikki JSON, for reprocessing)
Key enumerations
disambiguation_status
"pending" | "auto_resolved" | "resolved" | "skipped"
entry_pathway
"manual" | "highlight" | "nlp_extraction" | "pack"
card_direction
"target_to_en" | "en_to_target"
prompt_modality
"text" | "audio"
event_type
"shown" | "answered" | "skipped"
pos_normalised (UD tags used in this codebase)
NOUN | VERB | ADJ | ADV | DET | PRON | ADP | CCONJ | SCONJ | INTJ | NUM | PART | PROPN | PUNCT | SYM
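For reference, the string enumerations above can be pinned down as Python type hints. This is a hypothetical typing sketch — the actual code may model these as database enums or constants instead:

```python
from typing import Literal, get_args

# Hypothetical typing sketch of the enumerations above -- the actual
# codebase may use DB enums or constants instead of Literal types.
DisambiguationStatus = Literal["pending", "auto_resolved", "resolved", "skipped"]
EntryPathway = Literal["manual", "highlight", "nlp_extraction", "pack"]
CardDirection = Literal["target_to_en", "en_to_target"]
PromptModality = Literal["text", "audio"]
EventType = Literal["shown", "answered", "skipped"]
```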