How would you generate a picture of Noun + Noun in the first place in order to train the LLM with what it would look like? What's happening during that 1 estimated second?
Use any of the image generation models (e.g. Nanobanana, Midjourney, or ChatGPT) to generate a picture of a noun on a noun. Simonw's test is to have a language (text) model generate a Scalable Vector Graphics (SVG) file, which the language model has to do by writing out curves and colors as markup: e.g., draw a cubic spline from point 150,100 to 200,300, using width 20, color orange.
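A toy sketch of what that looks like in practice (this is not simonw's actual harness, and the coordinates are just the hypothetical ones from the comment above): the text model emits SVG markup token by token, and a browser rasterizes it. It never "sees" pixels at all.

```python
# Hypothetical example of the kind of SVG a text model writes out:
# a cubic curve from 150,100 to 200,300, stroke width 20, orange.
# The control points (170,180 and 180,220) are made up for illustration.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="400" height="400">
  <path d="M 150,100 C 170,180 180,220 200,300"
        stroke="orange" stroke-width="20" fill="none"/>
</svg>"""

# The model only ever produces this text; rendering happens elsewhere.
print(svg)
```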
That hypothetical second is freaking fascinating. It's a denoising algorithm followed by a bunch of linear algebra, and out pops a picture of a pelican on a bicycle. Stable Diffusion does this quite handily. https://stablediffusionweb.com/image/6520628-pelican-bicycle...
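To make the "denoising plus linear algebra" claim concrete, here is a toy numpy sketch of the core idea only, not Stable Diffusion itself: start from pure Gaussian noise and take repeated small steps toward an image. In a real diffusion model the step direction `target - x` is replaced by a learned neural network's noise prediction, conditioned on the text prompt; the `target` array here is just a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "image" the process should converge to; in a real model there
# is no fixed target, only a learned denoiser conditioned on the prompt.
target = np.ones((8, 8))

# Start from pure Gaussian noise, as diffusion sampling does.
x = rng.standard_normal((8, 8))

for step in range(50):
    # Each denoising step is just linear algebra: nudge x toward target.
    x = x + 0.1 * (target - x)

error = np.abs(x - target).mean()
```

After 50 steps the residual noise has shrunk geometrically (by a factor of 0.9 per step), which is why a GPU can run the whole loop in about a second.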