• brucethemoose@lemmy.world
    6 hours ago

    Mistral likely does “prompt enhancement,” aka feeding your prompt to an LLM first and asking it to expand it with more words.

    So internally, a Mistral text LLM is probably writing out “sure! Here’s a long prompt with no dog: …” and then that part is fed to the image generator.

    Other “LLMs” are truly multimodal and generate the image output themselves, so the image generation step still sees the word “dog” in the input.
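
To make the difference concrete, here's a toy sketch of the two paths. Everything in it is made up for illustration: `enhance_prompt` is a crude stand-in for the text LLM (a regex, not a model), `words_seen_by_image_model` just lists the words that would reach the image generator, and none of this is Mistral's actual code. It only shows why a rewrite step can keep "dog" out of the image model's input, while a direct path can't.

```python
import re


def enhance_prompt(user_prompt: str) -> str:
    """Hypothetical stand-in for the text LLM that rewrites the prompt.

    A real LLM would understand the negation and describe the scene
    without naming the excluded thing. This toy version just deletes
    any "no X" / "with no X" / "without X" clause.
    """
    cleaned = re.sub(
        r",?\s*\b(?:with no|without|no)\s+\w+",
        "",
        user_prompt,
        flags=re.IGNORECASE,
    )
    return f"A detailed, photorealistic image of {cleaned.strip()}, soft natural lighting"


def words_seen_by_image_model(prompt: str) -> set[str]:
    """Hypothetical: the set of words that end up in the image model's input."""
    return set(re.findall(r"[a-z]+", prompt.lower()))


user_prompt = "a park bench with no dog"

# Path 1: prompt enhancement first, then the image generator.
enhanced_words = words_seen_by_image_model(enhance_prompt(user_prompt))

# Path 2: the raw prompt goes straight to the model that makes the image.
direct_words = words_seen_by_image_model(user_prompt)

print("dog" in enhanced_words)  # the rewritten prompt never mentions it
print("dog" in direct_words)    # the raw prompt still contains it
```

In the first path the image generator never gets the token at all, which is a plausible reason it wouldn't draw one; in the second, "dog" is right there in the input.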