SD3 API (from 2 months ago) and SD3m comparison

By: /u/Mean_Ship4545

27 June 2024 at 13:18

Some time ago, when the SD3 API was released and we still hoped the open model would be on par with its performance, a series of prompts was tried and the results compared to MJ and Dall-E.

For reference, here are the links to the results of this comparison:

https://www.reddit.com/r/StableDiffusion/comments/1c94ojx/sd3\_first\_impression\_from\_prompt\_list\_comparison/

https://www.reddit.com/r/StableDiffusion/comments/1c94698/sd3\_first\_impression\_from\_prompt\_list\_comparison/

https://www.reddit.com/r/StableDiffusion/comments/1c93h5k/sd3\_first\_impression\_from\_prompt\_list\_comparison/

https://www.reddit.com/r/StableDiffusion/comments/1c92acf/sd3\_first\_impression\_from\_prompt\_list\_comparison/

Now that it's possible (not certain, but a possibility) that SD3m is the only model we'll get, I thought it would be useful to rerun the prompts from these threads, generate 8 images for each and comment on the results.

TLDR: the SD3m model is FAR FAR FAR worse than the API of two months ago.

Test 1 : Inside a steampunk workshop, a young cute redhead inventor, wearing blue overall and a glowing blue tattoo on her left shoulder, is working on a mechanical spider

This one gave OK results compared to the SD3 API/Dall-E, but with much less variation in the mechanical spiders, more hesitation over the number of legs they should have, and failures with the location of the tattoo. It can fail to put it on the correct arm or, worse, put it over the clothing or make it the wrong color. Interestingly, the API made the inventor wear only overalls, while in 7 out of 8 cases the medium model added white underclothing. It's more realistic, but it's interesting that it avoided showing more skin than necessary. Hands are generally garbled, which is sad since they were supposedly a strong point of SD3.

The best out of 8 was this one:

https://preview.redd.it/bfv8qopw549d1.png?width=1024&format=png&auto=webp&s=1970a41737b3ba0fceebea129a6926b2240cfe6e

Test 2 : A fluffy blue cat with black bat wings is flying in a steampunk workshop, breathing fire at a mouse

In this case, the API failed to have the cat breathe fire from its mouth, and the SD3m model fails as well. But it also failed, in 6 out of 8 cases, to have a cat with two bat wings. The best outcome is meh: it has all the elements, but the positioning fails hard.

https://preview.redd.it/du6f8suy549d1.png?width=1024&format=png&auto=webp&s=339222d04d04d44b428c6731784bcc4f8c0403fd

Test 3 : A trio of typical D&D adventurer are looking through the bushes at a forest clearing in which a gothic manor is standing. In the night sky, three moons can be seen, the large green one, the small red one and the white one

https://preview.redd.it/droxl5c4649d1.png?width=1024&format=png&auto=webp&s=35e93b5086d9179e7ac985e6c8c66742c11d68ae

In this one, I can't help but notice that the 8 images are _very_ close, with the model displaying little variety. The API did better, as did D3. For example, all the characters have white hair, as if the typical D&D party was recruited among retirement home escapees. Same with the manor, which doesn't display a lot of variation. As for prompt adherence, I couldn't get 3 moons of the right colours. Generally, I got 3 white moons. This is severely disappointing, as prompt adherence was supposed to be a strong suit of this model.

Test 4 : A dynamic image depicting a naval engagement between an 18th century man-of-war and a 20th century battleship. The scene shows the man-of-war with its tall sails and cannons, juxtaposed against the formidable steel structure of the modern battleship equipped with large gun turrets. The ocean around them is turbulent, illustrating the clash of eras in naval warfare. The background features stormy skies and high waves, enhancing the dramatic effect of this historical and technological confrontation. This image blends historical accuracy with imaginative interpretation, showcasing the stark contrast in naval technology.

1 out of SIXTEEN displayed a wooden ship and a steel ship. All the others had two steel warships. It's a fail and a big step back from the API model.

https://preview.redd.it/i68x6t67649d1.png?width=1024&format=png&auto=webp&s=ca85d126bcf081b211b88135200c8a7ddaa19aaf

Test 5 : The breathtaking view of the Garden Dome in a space station orbiting Uranus, with passengers sitting and having coffee

https://preview.redd.it/223adld8649d1.png?width=1024&format=png&auto=webp&s=8219326892affa6676a45874a0ce78b2bd1d15b8

MUCH less interesting images than the API. Faces and hands are bad. More focus on people having coffee than on representing Uranus (0 out of 8). I should try asking for Jupiter, because maybe SAI thought it was unsafe and unethical to look at Uranus?

Test 6 : An orc and an elf swordfighting. The elf wields a katana, the orc a crude bone saber. The orc is wearing a loincloth, the elf an intricate silvery plate armor

This one is awful. I got 0 elves out of 8 generations. Only two orcs battling, disregarding the intricate silvery armor and the weapon descriptions. Exceptionally, here is the (slightly) worst out of 8, but they are all awful:

https://preview.redd.it/u7n5ydsa649d1.png?width=1024&format=png&auto=webp&s=061b14ba6462564f57a20901bc5ad828330e3e80

Test 7 : A man juggling with three balls, one red, one blue, one green, while holding one one foot clad in a yellow boo

Another awful one. SD3m can't do poses. The best out of 8 was this one...

https://preview.redd.it/mztm3isc649d1.png?width=1024&format=png&auto=webp&s=c35082d410fe09f7d96dfec2a1f34d1594833255

but the average generation was more like this one :

https://preview.redd.it/4yuartne649d1.png?width=1024&format=png&auto=webp&s=d802617a87858cfc9b2e198bf3bdd51c7ba9d398

Test 8 : A man doing a handstand while riding a bicycle in front of a mirror

This one generated body horror. The API AND Dall-E didn't do well on this one, so I won't post images but it is awful.

Test 9 : A woman wearing a 18th century attire, on all four, facing the viewer, on a table in a pirate tavern

https://preview.redd.it/fdnrnvjg649d1.png?width=1024&format=png&auto=webp&s=59e3f97db34343bc0bce0c3236624fd126d03a8f

The fact that this is the best out of 8 should suffice to say that most of my prompt was ignored, despite it being extremely safe for work (18th century dresses cover everything). I never got an image of the woman on the table. Neither did I get a pirate tavern, unless those were places of learning (I got books on the table in 6 cases out of 8).

Test 10 : A defeated trio of SS soldiers on the East Front, looking sad

https://preview.redd.it/ek7jucvi649d1.png?width=1024&format=png&auto=webp&s=d9bc10d67be1542bcf4e5482e607854934422747

No evocation of the East Front, no indication of them being SS or defeated. I got a trio of random soldiers. Another big fail.

Test 11 : A vivid depiction of the Easter procession in Sevilla, highlighting penitents wearing their iconic pointed hoods. The scene is set in the historic streets of Sevilla, with penitents dressed in traditional robes and hoods, creating a solemn and reflective atmosphere. The procession includes ornate pasos (floats) carrying religious icons, surrounded by a crowd of onlookers. The architecture of Sevilla, with its intricate details and historic charm, forms the backdrop, emphasizing the deep religious and cultural significance of this annual event.

A mix of body horror, penitents without eyes and strange things.

https://preview.redd.it/0f2i5f4m649d1.png?width=1024&format=png&auto=webp&s=7beb2312d3035ece829fbc6c6478887a608a658e

https://preview.redd.it/tyj3x3en649d1.png?width=1024&format=png&auto=webp&s=6824cc9a6b781221f3d8d5682037444c7c804238

Test 12: A detailed picture of a sexy catgirl doing a handstand over a table

100% fails. Body horror generally. D3 does much better, despite being heavily censored, which some claim SD3 isn't.

https://preview.redd.it/slihca3p649d1.png?width=1024&format=png&auto=webp&s=add024516473cb610b9a42773bba50a7a9822642

Test 13 : a bulky man in the halasana yoga pose, cheered by a pair of cherleaders.

https://preview.redd.it/rsw9vepq649d1.png?width=1024&format=png&auto=webp&s=2e2a3c81afd5a8f875559a5bcc37bc1cb34c866a

Body horror mostly. Interestingly, it got the cheerleaders...

Test 14 : a person holding a foot with his or her hands, his or her face obviously in pain

https://preview.redd.it/6rljb26t649d1.png?width=1024&format=png&auto=webp&s=89e047cbc00ed204b79d5537d05f9b7c3b8e83e5

All are body-horror level... Admittedly Dall-E can't do it quite right either, but at least it has a semblance of adhering to the prompt. Or it draws a foot.

Maybe SD3m can be saved with finetunes, but it behaves so badly compared to base SDXL that I wonder if it's worth trying to improve a 2B model as nerfed on anatomy and dynamic poses as this one.
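For anyone who wants to put the counts quoted above side by side, here is a minimal sketch that turns them into success rates. It only uses figures stated explicitly in this post (tests 2, 4, 5 and 6); the names `COUNTS`, `success_rate` and `summarize` are made up for illustration, and the pass/fail criteria are my paraphrase of the comments above, not a formal scoring scheme.

```python
from fractions import Fraction

# (test, criterion, successes, attempts), taken from the counts quoted in the post.
# Tests without an explicit count are deliberately left out.
COUNTS = [
    ("Test 2", "cat has two bat wings", 2, 8),  # "failed in 6 out of 8 cases"
    ("Test 4", "one wooden ship and one steel ship", 1, 16),
    ("Test 5", "Uranus is represented", 0, 8),
    ("Test 6", "an elf appears", 0, 8),
]


def success_rate(successes, attempts):
    """Return the exact success rate as a fraction."""
    return Fraction(successes, attempts)


def summarize(counts):
    """Build one human-readable line per test."""
    lines = []
    for test, criterion, successes, attempts in counts:
        rate = float(success_rate(successes, attempts))
        lines.append(f"{test}: {criterion}: {successes}/{attempts} ({rate:.1%})")
    return lines


if __name__ == "__main__":
    for line in summarize(COUNTS):
        print(line)
```

With only 8 (or 16) images per prompt these rates are rough indications, not statistics.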
