Generating transparent images using Stable Diffusion XL (github.com/layerdiffusion)
216 points by tracyhenry on March 3, 2024 | 13 comments



Looking at the “woman, messy hair, high quality” image, the hair farther from her head looks wrong in much the way that iPhone portrait mode messes up hair. I wonder if this is an example of an AI training on partially AI-generated data and reproducing its artifacts.


I just looked at the photograph. What is wrong with it?

https://private-user-images.githubusercontent.com/161511761/...


The specific issue that bothers me is the incorrect imitation bokeh. The hair farther from the head is too blurry despite being well within what should be the focal plane.

This is inherent to what I think is Apple’s model of depth of field: Apple takes a picture that is fairly sharp everywhere and generates an ordinary RGB image plus a depth map (an estimated distance for each pixel). Then it applies some sort of blur that depends on depth.

This is a decent approximation if the scene contains opaque, pixel-sized or larger objects, so that each pixel's content actually has a well-defined depth. But hair tends to be much thinner than a pixel, and a pixel containing both hair and background can't be represented correctly by a single depth value.
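To make this concrete, here's a minimal sketch (my own toy illustration, not Apple's actual pipeline) of depth-map-driven synthetic bokeh. The key limitation is that each pixel gets exactly one depth, so a pixel that is part hair strand and part distant background can only be blurred as one or the other:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def synthetic_bokeh(rgb, depth, focal_depth, max_sigma=8.0, n_levels=5):
        """Naive depth-dependent blur: blur strength grows with distance
        from the focal plane. Each pixel has exactly one depth, so
        sub-pixel detail like hair inherits whatever depth the
        estimator assigned to its pixel."""
        out = np.zeros_like(rgb, dtype=float)
        dist = np.abs(depth - focal_depth)           # distance from focal plane
        edges = np.linspace(0.0, dist.max(), n_levels + 1)
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (dist >= lo) & (dist <= hi)       # pixels in this depth band
            sigma = max_sigma * hi / (edges[-1] + 1e-8)
            blurred = gaussian_filter(rgb, sigma=(sigma, sigma, 0))
            out[mask] = blurred[mask]
        return out

A pixel straddling a wispy strand and the far background lands in a single depth band, so it is either blurred with the background or kept sharp with the face, never a mix. That is roughly the pasted-on look in the sample image.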

This was an issue in older (circa 2000?) Z-buffered rendering — if you naively render hair and then render an object behind the person based on the Z data from the hair rendering, you get very wrong-looking hair. It turns out that just having a GPU that can handle a zillion vertices doesn’t mean that rendering each hair independently gives good results!
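For comparison, here's a toy single-pixel composite (my own illustration, not tied to any particular renderer) of why a hard Z-test goes wrong on sub-pixel hair. Coverage-aware blending mixes strand and background; a binary depth test commits the whole pixel to whichever surface wins:

    # A hair strand covering 30% of the pixel, in front of a bright background.
    hair_color, hair_depth, coverage = (0.1, 0.1, 0.1), 1.0, 0.3
    bg_color, bg_depth = (0.9, 0.8, 0.7), 5.0

    # Coverage-aware (alpha) blend: a soft, partially transparent edge.
    blended = tuple(coverage * h + (1 - coverage) * b
                    for h, b in zip(hair_color, bg_color))

    # Hard Z-test: the strand's depth claims the whole pixel, so the
    # background is rejected entirely and the pixel goes fully dark.
    z_tested = hair_color if hair_depth < bg_depth else bg_color

    print(blended)   # ≈ (0.66, 0.59, 0.52) -> natural-looking edge
    print(z_tested)  # (0.1, 0.1, 0.1)      -> hard, aliased edge

The same thing happens in reverse with a depth map used for fake bokeh: one depth per pixel means the pixel gets either the background's blur or the subject's sharpness, with no in-between.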


I think it's how it fills in the "waves" in the hair. Notice how, each time one bunch of hair crosses another, at least one set of strands curves unnaturally? In natural hair, the valleys and peaks where the hair bends should be similar; here they are not.



Paper: https://arxiv.org/abs/2402.17113

The author Lvmin Zhang is the same person behind ControlNet.


It's amazing how much they've contributed to imagegen. I started using Forge recently and it's a great speedup over regular sd-webui. https://github.com/lllyasviel/stable-diffusion-webui-forge


Also the creator of Fooocus, an open-source alternative to Midjourney. It's amazing how much one person can contribute to a field in such a short span of time.


The partial alpha blending support for translucent materials (glass, plastic, etc.) is really cool.

I'd be curious to see how well this plays with inpainting. Apparently img2img is also on the author's to-do list.


Good AI rotoscoping is welcome any time.


It's not too far off. Vid2vid is already decent at keeping character consistency when configured correctly. Background/environment flickering is harder to control, but that makes sense given that the process currently runs img2img on successive frames. I think we'll soon see new models with temporal convolution that make video-to-video transformations absolutely stunning.


Reactions:

1 - The way the dog at the end gets a reflection off the floor is pretty nice.

2 - I wonder how this compares, in terms of latency/complexity, with a ComfyUI pipeline that just does typical edge detection/masking to achieve the transparency effect. However, I don't think that method would work with the glass example as shown.


Apache 2.0, the beauty of open source. Nice.



