Mistral likely does “prompt enhancement,” i.e. it feeds your prompt to a text LLM first and asks it to expand it into a longer, more detailed prompt.
So internally, a Mistral text LLM is probably writing out “Sure! Here’s a long prompt with no dog: …” and then only the part after the colon is fed to the image generator, which never sees the word “dog” at all.
Other “LLMs” are truly multimodal and generate the image output themselves, so the word “dog” is still in the input they condition on.
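
To make the difference concrete, here is a minimal sketch of the two paths. Everything in it is an assumption for illustration: `fake_text_llm`, `extract_enhanced_prompt`, and `image_generator_input` are hypothetical names, the LLM reply is hard-coded, and none of this reflects Mistral's actual pipeline. It only shows what text would reach the image generator in each case.

```python
def fake_text_llm(user_prompt: str) -> str:
    """Hypothetical stand-in for the text LLM that expands the prompt.

    A real model would generate this reply; it is hard-coded here and
    ignores user_prompt, so the example stays runnable offline.
    """
    return (
        "Sure! Here's a long prompt with no dog: "
        "An empty sunlit park with green grass, a wooden bench, and tall oak trees."
    )


def extract_enhanced_prompt(llm_reply: str) -> str:
    """Keep only the text after the first colon, dropping the chatty preamble."""
    _, _, rest = llm_reply.partition(":")
    return rest.strip()


def image_generator_input(user_prompt: str, use_enhancement: bool) -> str:
    """Return the text the image generator would actually be conditioned on."""
    if use_enhancement:
        # Assumed "prompt enhancement" path: the image model sees only the rewrite.
        return extract_enhanced_prompt(fake_text_llm(user_prompt))
    # Assumed multimodal path: the original wording reaches the model directly.
    return user_prompt


user_prompt = "a park with no dog"
print(image_generator_input(user_prompt, use_enhancement=True))
print(image_generator_input(user_prompt, use_enhancement=False))
```

In the first case the negated word has been rewritten away before the image model runs; in the second it is still sitting in the input.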