
  • As I understand it, CLIP (and other text encoders in diffusion models) aren’t trained like LLMs, exactly. They’re trained on image/text pairings, which you get from the metadata creators upload with their photos to Adobe Stock. OpenAI trained CLIP on alt text from scraped images, but I assume Adobe would want to train their own text encoder on the more extensive tags on the stock images it’s already using.
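
    The pairing objective is contrastive: each image embedding gets pulled toward the embedding of its own caption/tags and pushed away from every other caption in the batch. Here’s a minimal sketch of that CLIP-style loss in PyTorch; the encoders, batch size, and temperature are placeholder assumptions for illustration, not Adobe’s (unpublished) setup.

    ```python
    # Minimal sketch of a CLIP-style contrastive loss over a batch of (image, caption) pairs.
    # The encoders that would produce these embeddings are omitted; in practice they're a
    # vision transformer and a text transformer. Shapes and temperature are made up.
    import torch
    import torch.nn.functional as F

    def clip_loss(image_emb, text_emb, temperature=0.07):
        # Cosine similarity between every image and every caption in the batch
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.T / temperature

        # The "right" caption for image i is caption i (the one it was uploaded with)
        targets = torch.arange(len(logits), device=logits.device)

        # Symmetric cross-entropy: match images to captions and captions to images
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

    # Toy batch: 8 pairs of 512-dim embeddings standing in for encoder outputs
    loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
    ```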

    All that said, Adobe hasn’t published their entire architecture. And there were some reports during the training of Firefly 1 back in '22 that they weren’t filtering out AI-generated images in the training set. At the time, those made up ~5% of the full stock library. Currently, AI images make up about half of Adobe Stock, though filtering them out seems to work well. We don’t know whether they were included in the training data for later versions of Firefly. There’s an incentive for Adobe to filter them out, since AI trained on AI tends to lose its tails (the ability to handle edge cases well), and that would be pretty devastating for something like generative fill.
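
    The “losing its tails” effect shows up even in a toy version of that feedback loop: fit a simple model to heavy-tailed data, then let each generation train only on the previous generation’s output. The sketch below (distributions and sample sizes are invented for illustration) shows the rare extreme values vanishing once the model only ever sees its own samples.

    ```python
    # Toy illustration of "AI trained on AI loses its tails": each generation fits a
    # Gaussian to the previous generation's samples and resamples from that fit.
    # The heavy tails of the original data disappear almost immediately, and the
    # spread slowly drifts as sampling errors compound.
    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.standard_t(df=3, size=5000)             # heavy-tailed "real" data

    for gen in range(10):
        mu, sigma = samples.mean(), samples.std()         # this generation's "model"
        p999 = np.percentile(np.abs(samples), 99.9)       # how extreme the rare cases get
        print(f"gen {gen}: sigma={sigma:.2f}, 99.9th percentile of |x|={p999:.2f}")
        samples = rng.normal(mu, sigma, size=5000)        # next generation sees only generated data
    ```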

    I figure we want to encourage companies to do better, whatever that looks like. For a monopolistic giant, Adobe seems to have at least done better here. And at some point, they have to rely on the artists uploading stock photos to be honest. Not just about AI, but about release forms, photo-shoot working conditions, local laws being followed while shooting, etc. They do have some incentive to be honest, since Adobe pays them, but I don’t doubt there are issues there too.




  • Here’s a metaphor/framework I’ve found useful but am trying to refine, so feedback welcome.

    Visualize the deforming rubber sheet model commonly used to depict masses distorting spacetime. Your goal is to roll a ball onto the sheet from one side such that it rolls into a stable or slowly decaying orbit around a specific mass. You begin by aiming for a mass on the outer perimeter of the sheet, but with each roll you must aim for a mass further toward the center. The longer you play, the more masses sit between you and your goal, to be rolled past or slingshotted around. As soon as you fail to hit a goal, you lose; until then, you can keep playing indefinitely.

    The model’s latent space is the sheet. The way the prompt is worded is your aiming/rolling of the ball. The response is the path the ball takes. And the good (useful, correct, original, whatever your goal was) response/inference is the path that settles into an orbit around the mass you’re aiming for. As the context window grows, the path becomes more constrained, and there are more pitfalls the model can fall into. Eventually you lose: there’s a phase transition, and the model starts going way off the rails. This phase transition was formalized mathematically in this paper from August.
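
    Not the paper’s formalism, but here’s a toy picture of the “masses as attractors” part under made-up assumptions: a 2D stand-in for the latent space with a few wells, where two nearby starting points (two wordings of the same prompt) settle into different basins. The well positions and start points are invented purely for illustration.

    ```python
    # Toy basin-of-attraction picture: gradient descent on a surface with several
    # "masses" (Gaussian wells). Where the ball starts determines which well it
    # settles into, the way small prompt changes can steer a response.
    import numpy as np

    wells = np.array([[0.0, 0.0], [3.0, 1.0], [-2.0, 2.5]])   # hypothetical attractor centres

    def grad_U(x, width=1.0):
        """Gradient of U(x) = -sum_i exp(-|x - w_i|^2 / (2 * width^2))."""
        g = np.zeros(2)
        for w in wells:
            d = x - w
            g += d / width**2 * np.exp(-d @ d / (2 * width**2))
        return g

    def roll(start, lr=0.1, steps=500):
        """The 'ball': plain gradient descent from a chosen start point."""
        x = np.array(start, dtype=float)
        for _ in range(steps):
            x -= lr * grad_U(x)
        return x

    # Two slightly different "prompts" end up captured by different masses
    print(roll([1.4, 0.4]))   # settles near (0, 0)
    print(roll([1.8, 0.7]))   # settles near (3, 1)
    ```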

    The masses are attractors that have been studied at different levels of abstraction. And the metaphor/framework seems to work at different levels as well, as if the deformed rubber sheet is a fractal with self-similarity across scale.

    One level up: the sheet becomes the trained alignment, the masses become potential roles the LLM can play, and the rolled ball is the RLHF or fine-tuning. So we see the same kind of phase transition in prompting (from useful to hallucinatory), in pre-training (poisoned training data), and in post-training (switching roles/alignments).

    Two levels down: the sheet becomes the network architecture itself, the masses become potential next words, and the rolled ball is the transformer’s forward pass.
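
    A quick sketch of what the wells look like at that level, with made-up numbers: a softmax over hypothetical next-word logits. Lowering the sampling temperature deepens the wells around the likeliest words; raising it flattens the sheet so unlikely paths open up.

    ```python
    # The "masses as potential next words" level: a temperature-scaled softmax over
    # invented logits for five candidate tokens. Low temperature concentrates the
    # probability mass (deep wells); high temperature flattens the distribution.
    import numpy as np

    logits = np.array([4.0, 3.5, 1.0, 0.2, -1.0])   # hypothetical scores for 5 candidate words

    def next_word_distribution(logits, temperature=1.0):
        z = logits / temperature
        z -= z.max()                 # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    print(next_word_distribution(logits, temperature=1.0))   # top words dominate, but the tail keeps some mass
    print(next_word_distribution(logits, temperature=0.3))   # nearly all mass collapses onto the top two
    ```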

    In reality, the rubber sheet has like 40,000 dimensions, and I’m sure a ton is lost in the reduction.