Can't you just give it a photo of a dog, and then say "use this dog in this or t...

artemisart · 2024-12-09T21:31:07 1733779867

Yes, the idea works and was explored with DreamBooth/Textual Inversion for image diffusion models.

https://dreambooth.github.io/ https://textual-inversion.github.io/

minimaxir · 2024-12-09T21:43:20 1733780600

Both of those are of course out of date and require significant training instead of just feeding it a single image.

InstantID (https://replicate.com/zsxkib/instant-id) fixes that issue.

AuryGlenz · 2024-12-10T16:26:26 1733847986

DreamBooth-style training is in no way out of date.

If you just want a face, InstantID/PuLID work, but the results are not going to be very varied. Doing actual training means you can get any perspective, lighting, style, expression, etc., and have the whole body be accurate.

alpha_squared · 2024-12-09T21:09:07 1733778547

How would that even work? A dog has physical features (legs, nose, eyes, ears, etc.) that it uses to interact with the world around it (ground, tree, grass, sounds, etc.). And each one of those things has physical structures that compose senses (nervous system, optic nerves, etc.). There are layers upon layers of intricate complexity that took eons to develop, and a single photo cannot encapsulate that level of complexity and density of information. Even a 3D scan can't capture that level of information. There is an implicit understanding of the physical world that helps us make sense of images. For example, a dog with all four paws standing on grass is within the bounds of possibility; a dog with six paws, two of which are on its head, is outside the bounds of possibility. An image generator doesn't understand that obvious delineation and just approximates likelihood.

int_19h · 2024-12-09T21:36:30 1733780190

A single photo doesn't have to capture all that complexity. It's carried by all those countless dog photos and videos in the training set of the model.

krainboltgreene · 2024-12-10T09:39:40 1733823580

Actually, it does have to capture all of that complexity because it's a photon-based analysis of reality. You cannot take a photo without doing that.