Introduce structured linguistic data around entries in a "choose your own adventure". The purpose is to create a structured pathway that starts with the user reading or listening to an entry, continues with putting words in their vocab bank / word bank, and possibly ends with creating flashcards from those words, to help them learn the words.
The app already has the idea of an Adventure (i.e. a single story), which has many Entries, each of which has Possible Choices (4, for now). The user selects a choice and the story continues to be generated.
Entries are generated by an LLM (Claude), have a translation text generated by the DeepL translator, and are converted to audio by another LLM (Google's Gemini).
1. Use the SpaCy natural language processing pipeline to break down the generated (i.e. foreign language) text for an entry into its sentences and their parts of speech.
2. Translate these sentences one at a time, then pass each translation through the same SpaCy pipeline.
3. We need to end up with a data structure of `paragraphs`, each of which has 1..n `sentences`, where the tokens (words) in each sentence have gone through part-of-speech tagging as well as lemmatisation (both are already configured where SpaCy is used elsewhere).
4. This structured data should be stored alongside the full text as it is currently generated in the API, i.e. we need both the structured linguistic data and the original body text.
The `AdventureService` (`/app/domain/service/adventure_service.py`) contains a method called `run_entry_pipeline`. This is the highly asynchronous orchestrator of calls to various external parties (e.g. LLMs, translators, TTS), and we should use this existing entrypoint to run the new code.
Running the NLP pipeline in SpaCy won't get us the paragraphs, so we may need to split the incoming raw text by the `\n\n` separator, and then call the pipeline on each paragraph in turn.
We will therefore need a new JSON field on the `AdventureEntryEntity`, which I think we should call `story_text_linguistic_data`, which should look like the following:
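One possible shape, sketched in Python as `TypedDict`s so the nesting is explicit. The key names (`source`, `translation`, `tokens`, `pos`, `lemma`) and the Spanish example sentence are assumptions for illustration, and the token values are hand-written rather than real SpaCy output.

```python
import json
from typing import TypedDict


class Token(TypedDict):
    text: str
    pos: str    # coarse part-of-speech tag, e.g. "NOUN"
    lemma: str


class AnalysedText(TypedDict):
    text: str
    tokens: list[Token]


class Sentence(TypedDict):
    source: AnalysedText       # generated (foreign language) sentence
    translation: AnalysedText  # its translation, run through the same pipeline


class Paragraph(TypedDict):
    sentences: list[Sentence]


class StoryTextLinguisticData(TypedDict):
    paragraphs: list[Paragraph]


# Hand-written illustrative values, not real pipeline output.
example: StoryTextLinguisticData = {
    "paragraphs": [
        {
            "sentences": [
                {
                    "source": {
                        "text": "El gato duerme.",
                        "tokens": [
                            {"text": "El", "pos": "DET", "lemma": "el"},
                            {"text": "gato", "pos": "NOUN", "lemma": "gato"},
                            {"text": "duerme", "pos": "VERB", "lemma": "dormir"},
                            {"text": ".", "pos": "PUNCT", "lemma": "."},
                        ],
                    },
                    "translation": {
                        "text": "The cat sleeps.",
                        "tokens": [
                            {"text": "The", "pos": "DET", "lemma": "the"},
                            {"text": "cat", "pos": "NOUN", "lemma": "cat"},
                            {"text": "sleeps", "pos": "VERB", "lemma": "sleep"},
                            {"text": ".", "pos": "PUNCT", "lemma": "."},
                        ],
                    },
                }
            ]
        }
    ]
}

# The value that would be stored in the JSON column.
serialised = json.dumps(example, ensure_ascii=False)
```

Keeping the source and translated sentence side by side in one object is a design choice that would make the click-through between languages straightforward; a flatter layout would work too.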
We will then need to feed this data through to the front-end, which will use it to build a more structured set of data in the UI. This will support a better "translate" experience: clicking on a single word in the target language and going to the relevant word(s) in the source language; adding words from that translation via a more automated pathway, with the option for manual intervention; and linking words with their dictionary entries, which we already have.
This may have an impact on performance. Can we therefore introduce a simple tracing mechanism into the `run_entry_pipeline` method, to give visibility of how long it takes (in seconds) to run each individual step? Can we store this as JSON in the `AdventureEntryEntity`? We'll need to create a migration to add those fields. I imagine some data that looks like:
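A minimal sketch of such a tracer and the JSON it would produce. The class name `StepTracer`, the step names, and the JSON keys are all hypothetical, and the `asyncio.sleep` calls merely stand in for the real pipeline steps.

```python
import asyncio
import json
import time
from contextlib import contextmanager


class StepTracer:
    """Collects wall-clock durations (in seconds) for named pipeline steps."""

    def __init__(self) -> None:
        self.steps: dict[str, float] = {}

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the step raises, so failures are still timed.
            self.steps[name] = round(time.perf_counter() - start, 3)

    def as_json(self) -> str:
        # The sum equals elapsed wall time only when steps run sequentially;
        # for concurrent steps it overstates the wall-clock total.
        summed = round(sum(self.steps.values()), 3)
        return json.dumps({"steps": self.steps, "summed_seconds": summed})


async def _demo() -> StepTracer:
    tracer = StepTracer()
    with tracer.step("generate_text"):
        await asyncio.sleep(0.01)  # stands in for the LLM call
    with tracer.step("linguistic_analysis"):
        await asyncio.sleep(0.01)  # stands in for the SpaCy pipeline
    return tracer


tracer = asyncio.run(_demo())
trace_json = tracer.as_json()
```

The resulting string has the form `{"steps": {"generate_text": ..., "linguistic_analysis": ...}, "summed_seconds": ...}` and could be written to the new JSON column as-is.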
The use of LLMs creates a cost for Language Learning App for each entry that is generated (initial generation, translation, text-to-speech). This will likely be as high as 50-60p per adventure; multiplied across users, this could add up to a lot of money.
Users who wish to operate on the subscription model will get a certain number of Adventure entries per subscription period. We should round this up to the nearest adventure (you don't want to be waiting for your next renewal to finish an adventure).
For this reason, it's very important that the system tracks the costs (in money, and in tokens) taken to generate the content for an adventure, so these figures can be adjusted to reflect reality.
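As a rough sketch of what per-entry cost tracking could look like: the class names, field names, and step labels below are hypothetical, and the token counts and prices are placeholders rather than real figures.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class StepUsage:
    step: str          # e.g. "generation", "tts"
    input_tokens: int
    output_tokens: int
    cost_gbp: Decimal  # Decimal avoids float rounding errors when summing money


@dataclass
class EntryUsage:
    steps: list[StepUsage] = field(default_factory=list)

    def record(self, step: str, input_tokens: int, output_tokens: int, cost_gbp: str) -> None:
        self.steps.append(StepUsage(step, input_tokens, output_tokens, Decimal(cost_gbp)))

    @property
    def total_tokens(self) -> int:
        return sum(s.input_tokens + s.output_tokens for s in self.steps)

    @property
    def total_cost_gbp(self) -> Decimal:
        return sum((s.cost_gbp for s in self.steps), Decimal("0"))


usage = EntryUsage()
usage.record("generation", input_tokens=1200, output_tokens=400, cost_gbp="0.012")  # placeholder values
usage.record("tts", input_tokens=400, output_tokens=0, cost_gbp="0.020")            # placeholder values
```

Summing `EntryUsage` totals across all entries of an adventure would give the per-adventure figure that the subscription allowance can be checked against.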