
Hey everyone! Let's revisit SD 1.5 and test tri-structured prompts

18 February 2025 at 11:44

Since I have weaker hardware, I never really stopped using it, and sometimes I come across techniques I’ve never seen before.

For example, I recently found a new model (Contra Base) and decided to run it through my standard set of prompts for comparison (I have a special spreadsheet for that). The author mentioned something called a tri-structured prompt for image generation, which they used during training. I didn’t recall seeing this approach before, so I decided to check it out.

Here’s a link to the article. In short:

  • The initial prompt sets a general description of the image.
  • Based on keywords in that description, additional clarification branches are created, separated by the BREAK command.
  • The same keyword is also added to the negative prompt via BREAK, which neutralizes its impact during generation. However, tokens after that keyword are processed within its context.
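
To make the scheme concrete, here is a minimal Python sketch of how such a prompt pair could be assembled. This is my own helper, not anything from the article; the function name and structure are illustrative only:

```python
# Assemble a tri-structured prompt pair (illustrative helper, not from
# the article): main description first, then one BREAK-separated branch
# per keyword; the same keywords go into the negative prompt, each in
# its own BREAK chunk, to neutralize their direct impact.
def tri_structured(main: str, branches: dict[str, str]) -> tuple[str, str]:
    positive = " BREAK ".join([main] + [f"{kw} {detail}" for kw, detail in branches.items()])
    negative = " BREAK ".join(branches)
    return positive, negative

pos, neg = tri_structured(
    "interesting view on the clown in the room juggles candle and ball",
    {"clown": "dressed in blue pants and red socks",
     "candle": "is red with gold candlestick"},
)
print(pos)  # ... BREAK clown dressed in blue pants and red socks BREAK candle is red ...
print(neg)  # clown BREAK candle
```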

I'll assume you roughly understand this logic and won't dive into the finer details.

BREAK is a familiar syntax for those using Automatic1111 and similar UIs. But I work in ComfyUI, where this is handled using a combination of two prompts.

https://preview.redd.it/gmr9c788uvje1.jpg?width=1757&format=pjpg&auto=webp&s=d2f925ea2af029dad2bc902ebc8933256e331b84

However, having everything in a single text field is much more convenient - both for reading and editing, especially when using multiple BREAK separators. So I decided to use the word BREAK directly inside the prompt. Fortunately, ComfyUI has nodes that recognize this syntax.
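
Under the hood, the idea is roughly this (a simplified sketch; `clip` and its `.encode()` are hypothetical stand-ins, not a real ComfyUI API): each BREAK chunk gets its own 77-token encoding, and the conditionings are concatenated, so tokens in different chunks don't share a context window.

```python
import torch

# Conceptual sketch of what BREAK does (simplified): encode each chunk
# separately, then concatenate the conditionings along the token axis.
def encode_with_break(clip, prompt: str) -> torch.Tensor:
    chunks = [c.strip() for c in prompt.split("BREAK")]
    conds = [clip.encode(c) for c in chunks]   # each: (1, 77, 768) for SD 1.5
    return torch.cat(conds, dim=1)             # (1, 77 * len(chunks), 768)
```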

Let’s play around and see what results we get. First, we need to set up the basics. Let's start with the default pipeline.

For some reason, I wanted to generate a clown in an unusual environment, so I wrote this:

interesting view on the clown in the room juggles candle and ball BREAK view downward, wide angle, backlighting, ominous BREAK clown dressed in blue pants and red socks BREAK room in an old Gothic castle with a large window BREAK candle is red with gold candlestick BREAK ball is from billiard and has a black color BREAK window a huge stained-glass with scenes from the Bible 

Not very readable, right? And we still need to write the negative prompt, as described in the article. A more readable format could look something like this:

https://preview.redd.it/hy9dyl4vuvje1.jpg?width=2257&format=pjpg&auto=webp&s=6c7e11caf7c283f2d294734e80339f814aabb77c

After some thought, I decided the prompt format should be more intuitive and visually clear while still being easy to process. So I restructured it like this:

interesting view on the clown in the room juggles candle and ball _view downward, wide angle, backlighting, ominous _clown dressed in blue pants and red socks _room in an old Gothic castle with a large window _candle is red with gold candlestick _ball is from billiard and has a black color __window a huge stained-glass with scenes from the Bible 

The first line is the main prompt, with clarifying details listed below as keyword branches. I added underscores for better readability. They don’t affect generation significantly, but I’ll remove them before processing.
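
Stripping the underscores before encoding is trivial; a small helper along these lines (my own sketch, assuming the branches sit on separate lines prefixed with underscores) would do:

```python
# Turn the underscore layout back into one BREAK-separated prompt:
# leading underscores mark branch lines; strip them and join with BREAK.
def underscores_to_break(text: str) -> str:
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    return " BREAK ".join(ln.lstrip("_").strip() for ln in lines)
```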

For comparison, of course, I decided to test what would be generated without BREAK commands to see how much impact they have. Let's begin! I want a resolution above 768 pixels, which will give us repetitions and duplications unless we apply some additional workarounds to the model...

https://preview.redd.it/56ajxjtbvvje1.jpg?width=1930&format=pjpg&auto=webp&s=527fffb0453ec39a729844392160f49eb7b915be

As expected! One noticeable difference: the BREAK prompt includes a negative prompt, while the standard one does not. The negative prompt slightly desaturates the image. So, let’s add some utilities to improve color consistency and overall coherence in larger images. I don’t want to use upscalers - my goal is different.

To keep it simple, I added:

  • PatchModelAddDownscale (Kohya Deep Shrink)
  • FreeU
  • ResAdapter
  • Detail Daemon Sampler

https://preview.redd.it/u0ojt8kgvvje1.jpg?width=1931&format=pjpg&auto=webp&s=5d67cce64d0afa572b627229b70c3edc594f34a4

Much better results! Not perfect, but definitely more interesting.

Then I remembered that in SD 1.5 I can use an external text encoder instead of the one from the loaded model. Flux works well for this. Using it, I got these results:

https://preview.redd.it/qxsnfvpjvvje1.jpg?width=1931&format=pjpg&auto=webp&s=e81626faac3c3075de9a087acc7ba0321887a28a

What conclusions can be drawn about using this prompting method? "It's not so simple." I think no one would argue with that. Some things improved, while others were lost.

By the way, my clown just kept juggling, no matter how much I tweaked the prompt. But I didn’t stress over it too much.

One key takeaway: increasing the number of “layers” indefinitely is a bad idea. The more nested branches there are, the less weight each one carries in the overall prompt, which leads to detail loss. So in my opinion, 3 to 4 clarifications are the optimal amount.

A smaller number of branches gives clearer prompt following

Now, let’s try other prompt variations for better comparison.

detailed view on pretty lady standing near cadillac _view is downward, from ground and with wide angle _lady is 22-yo in a tight red dress __dress is red and made up of many big red fluffy knots _standing with her hands on the car and in pin-up pose with her back to the camera _cadillac in green colors and retro design, it is old model 

https://preview.redd.it/sb0n3k48wvje1.jpg?width=1931&format=pjpg&auto=webp&s=79de30ab70b63c3e56777b64cc39a0c1438352c3

While working with this, I discovered that Kohya Deep Shrink sometimes swaps colors - turning the dress green and the car red. It seems to depend on the final image resolution. Different samplers also handle this prompt differently (who would’ve thought, right?).

Another interesting detail: I clearly specified that the dress should be fluffy with large knots. In the general prompt, this token is considered, but since there are many layers, its weight is diluted, resulting in just a plain red dress. Also, the base prompt tends to generate a doll-like figure, while the branches produce a more realistic image.

Let’s try another one:

detailed painting of landscape of rock and town under that _landscape of high red rock wall with carving of cat silhouette _rock is a giant silhouette of cat, carved into the slopes _town consists of small wooden houses that rise in tiers up the cliff _painting with oil in expressionist style, three-dimensional painting, and vibrant colors 

https://preview.redd.it/pjmlo87nwvje1.jpg?width=1931&format=pjpg&auto=webp&s=74ccaf03b088a6f80ee50d824d5d42d074d7498e

No cats here. And no painterly effect from the branches. My guess? Since the painting-style tokens are placed in just one out of five branches, their total weight is only one-fifth of the overall prompt.

Let’s test this by increasing the weight of that branch. With a small boost, no visible changes. But if we overdo it (e.g., 1.6), abstract painting tokens dominate, making the image completely off-topic.

Weight of the painting branch is 1.55
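
A back-of-the-envelope way to see why the boost changes so little (my own simplification, not a measurement of how ComfyUI actually merges conditionings): if n equally sized branches split the conditioning, a branch boosted by weight w holds roughly w / (n - 1 + w) of the total.

```python
# Rough share of one branch among n BREAK branches when boosted by w
# (simplified model; real conditioning-merge behavior may differ).
def branch_share(n: int, w: float = 1.0) -> float:
    return w / (n - 1 + w)

print(f"{branch_share(5):.2f}")        # 0.20 - one of five equal branches
print(f"{branch_share(5, 1.55):.2f}")  # 0.28 - the boosted painting branch
print(f"{branch_share(5, 1.6):.2f}")   # 0.29 - barely more share, yet the tokens already overshoot
```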

Conclusion: this method is not suitable for defining overall art style.

And finally, let’s wrap up with a cat holding a sign. Of course, SD 1.5 won’t magically generate perfect text, but splitting into branches does improve results.

cat with hold big poster-board with label _cat is small, fluffy and ginger _poster-board is white, holded by front paws _label is ("SD 1.5 is KING":1.3) 

https://preview.redd.it/was5qfa2xvje1.jpg?width=1931&format=pjpg&auto=webp&s=13f3f3dd13a5aaed947e9edd8bd19624cdaed8c8

Final thoughts

In my opinion, this prompting technique can be useful for refining a few specific elements, but it doesn't work as the original article described. More branches = less influence per branch = loss of control.

Right now, I think there are better ways to add complexity and detail to SD 1.5 models. For example, ELLA handles more intricate prompts much better. To test this, I used the same prompts with ELLA and the same seed values:

https://preview.redd.it/0zbvz1hfxvje1.jpg?width=2900&format=pjpg&auto=webp&s=7940ab202826a42517e762af8db3b0dffb5068e2

If anyone wants to experiment, I’ve uploaded my setup here. Let me know your thoughts or if you see any flaws in my approach.

Happy generating! 🎨🚀

submitted by /u/mr-asa
[link] [comments]

Flux Tech Details by Robin Rombach (CEO, Black Forest Labs)

18 February 2025 at 13:49

https://www.youtube.com/watch?v=nrKKLJXBSw0

I made a summary, since I couldn't digest it all myself.

FLUX: Flow Matching for Content Creation at Scale - Detailed Summary (Formatted)

Speaker: Robin Rombach (creator of Latent Diffusion, CEO of Black Forest Labs)
Lecture topic: Flux - content creation model using flow matching
Focus of lecture: Detailed methodology of Flux, comparison of flow matching vs. diffusion models, and future directions in generative modeling.
Context: TUM AI Lecture Series

Key Highlights:

  • Latent Diffusion Influence: Rombach emphasized the impact of Latent Diffusion (15,000+ citations) and its role in establishing text-to-image generation as a standard.
  • Dual Impact: Rombach's contributions span both academia and industry, notably including his work on Stable Diffusion at Stability AI.

Flux: Methodology and Foundations

  • Developed by: Black Forest Labs
  • Core Techniques: Flow Matching and Distillation for efficient content creation.
  • Latent Generative Modeling Paradigm:
    • Motivation: Separates perceptually relevant information into a lower-dimensional space.
    • Benefit: Improves computational efficiency and simplifies the generative task.
    • Contrast: Compared to end-to-end learning and auto-regressive latent models (e.g., Gemini 2 image generation).
  • Flux Architecture (Two-Stage):
    1. Adversarial Autoencoder:
      • Function: Compresses images into latent space.
      • Key Feature: Removes imperceptible details and separates texture from structure.
      • Addresses: "Getting lost in details" issue of likelihood-based models.
      • Advantage: Adversarial component ensures sharper reconstructions than standard autoencoders.
    2. Flow Matching based Generative Model (in Latent Space):
      • Technique: Rectified Flow Matching.
      • Goal: Transforms noise samples (normal distribution) into complex image samples.

Flux's Flow Matching Implementation:

  • Simplified Training: Direct interpolation between data and noise samples.
    • Benefit: Concise loss function and implementation.
  • Optimized Time-Step Sampling: Log-normal distribution for time-steps (t).
    • Down-weights: Trivial time steps (t=0, t=1).
    • Focuses Computation: On informative noise levels.
  • Resolution-Aware Training & Inference:
    • Adaptation: Adjusts noise schedules and sampling steps based on image dimensionality.
    • Improvement: Enhanced high-resolution generation.
    • Addresses Limitation: Suboptimal uniform Euler step sampling for varying resolutions.
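
To make the "direct interpolation" concrete, here is a minimal PyTorch-style sketch of a rectified-flow training step (my own reconstruction from the summary, not BFL's code; the mid-heavy timestep draw below uses a logit-normal approximation of the down-weighting described above):

```python
import torch

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """One rectified-flow training step (sketch). x0: autoencoder latents (B, C, H, W).
    `model(xt, t)` is assumed to predict the velocity (x1 - x0)."""
    b = x0.shape[0]
    # Mid-heavy timestep sampling: sigmoid of a normal draw concentrates t
    # away from the trivial ends t=0 and t=1.
    t = torch.sigmoid(torch.randn(b, device=x0.device))
    t_ = t.view(b, 1, 1, 1)

    x1 = torch.randn_like(x0)       # pure-noise endpoint
    xt = (1 - t_) * x0 + t_ * x1    # straight-line interpolation between data and noise
    v_target = x1 - x0              # constant velocity of that straight path

    v_pred = model(xt, t)
    return torch.mean((v_pred - v_target) ** 2)
```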

Architectural Enhancements in Flux:

  • Parallel Attention (Transformer Blocks):
    • Inspiration: Vision Transformers.
    • Benefit: Hardware efficiency via fused attention and MLP operations (single matrix multiplication).
  • RoPE Embeddings (Relative Positional Embeddings):
    • Advantage: Flexibility across different aspect ratios and resolutions.
    • Impact: Improved generalization.
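
A minimal sketch of what a parallel attention block looks like (assumed ViT-style layout; Flux's real block also carries separate text/image streams, modulation, and RoPE, all omitted here):

```python
import torch
import torch.nn as nn

class ParallelTransformerBlock(nn.Module):
    """Attention and MLP read the same normalized input and their outputs
    are summed, so the input projections can be fused into one matmul."""
    def __init__(self, dim: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)   # parallel residual branches
```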

Flux Model Variants & Distillation:

  • Flux Pro: Proprietary API model.
  • Flux Dev: Open-weights, distilled.
  • Flux Schnell: Open-source, 4-step distilled.
    • Differentiation: Trade-offs between quality and efficiency.
  • Adversarial Distillation for Acceleration:
    • Technique: Distills pre-trained diffusion model (teacher) into faster student model.
    • Loss Function: Adversarial Loss.
    • Latent Adversarial Diffusion Distillation: Operates in latent space, avoiding pixel-space decoding.
      • Benefits: Scalability to higher resolutions, retains teacher model flexibility.
      • Addresses: Quality-diversity trade-off, potentially improving visual quality.
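
Roughly, the distillation objective pairs a few-step student with a latent-space discriminator; a heavily simplified sketch (hypothetical signatures, not BFL's implementation):

```python
import torch

def ladd_student_step(student, discriminator, noise):
    """Student update (sketch): few-step generation in latent space; the
    discriminator scores latents directly, so no pixel-space decode is needed."""
    fake = student(noise)
    # Non-saturating GAN loss: push discriminator scores on fakes toward "real".
    return -torch.mean(torch.log(torch.sigmoid(discriminator(fake)) + 1e-8))

def ladd_discriminator_step(discriminator, real_latents, fake_latents):
    """Discriminator update: teacher/data latents vs. student outputs."""
    d_real = torch.sigmoid(discriminator(real_latents))
    d_fake = torch.sigmoid(discriminator(fake_latents.detach()))
    return -torch.mean(torch.log(d_real + 1e-8) + torch.log(1 - d_fake + 1e-8))
```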

Applications & Future Directions:

  • Practical Applications:
    • Image Inpainting (Flux Fill)
    • Iterative Image Enlargement
    • Scene Composition
    • Retexturing (Depth Maps, etc.)
    • Image Variation (Flux Redux)
  • Future Research:
    • Zero-Shot Personalization & Text-Based Editing (Customization)
    • Streaming & Controllable Video Generation
    • Interactive 3D Content Creation

Black Forest Labs - Startup Learnings:

  • Critical Importance of Model Scaling: For real-world deployment.
  • Emphasis on: Robust Distillation Techniques and Efficient Parallelization (ZeRO, FSDP).
  • Evaluation Shift: Application-specific performance and user preference are prioritized over traditional metrics (FID).
  • Methodological Simplicity: Key for practical scalability and debugging.

Conclusion:

  • Flux represents a significant advancement in content creation through efficient flow matching and distillation techniques.
  • Future research directions promise even more powerful and versatile generative models.
  • Black Forest Labs emphasizes practical scalability and user-centric evaluation in their development process.
submitted by /u/Badjaniceman

ZLUDA, NVIDIA to AMD: takes more VRAM?

18 February 2025 at 12:43

So I recently got a 7900 XTX and figured jumping from 8 GB (RTX 2070) to 24 GB of VRAM would be great, but I'm finding I'm even more restricted when it comes to resolutions now.

I was wondering if this is normal for the AMD Windows experience or if I did something wrong somewhere.

It'd also be great to know if Linux has the same issue.

For example, trying to generate at 1600x1600 gets me: "Tried to allocate 47.69 GiB. GPU"

I'm using /likelovewant/'s Forge guide for the AMD install, since it was the only one that seemed to work for me on Windows.

submitted by /u/Ixillius

Disney/Pixar 'competition' - prize to be won

18 February 2025 at 14:27

Hi all,

I'm looking for someone who can turn the picture of me and my friends included in this post into the Disney/Pixar art style.

https://imgur.com/a/Y5iwaAz

I would like the text 'Congratulations' written somewhere on the image, similar to what you might see on a celebration card or a movie poster.

I'll leave the rest up to your imagination.

The 'prize' I am offering is £15/$20 USD for the best picture, and a further £100/$125 USD if you're willing to teach me the technique you used to create it.

submitted by /u/Alexone_

Problems with LORA in FLUX Forge

18 February 2025 at 13:52

Can anyone help me figure out how to get a decent render from FLUX using Forge if I want to apply a LORA to my renders? I'm running FLUX Dev fp16 GGUF (flux1-dev-Q1_0.gguf) on Forge and I have thousands of LORAs downloaded, but I can't use a single one of them because each and every one seems to completely overcook the model, even when I drop the weight to 0.1. The moment I remove the LORA, the images render cleanly, but not with the results I want, because I can't use the LORAs.

I have pretty much preserved all the base defaults for FLUX Forge and have my CFG at 1. In terms of hardware, I'm running an RTX 4090 on an i9-14900 with 48 GB RAM. I set up my Forge using this guide: Getting Started with Flux & Forge

I've attached an image comparison of my typical experience (the reference image is based on THIS RENDER from Civit, which uses the LORA).

https://preview.redd.it/gt4rr74hkwje1.png?width=1767&format=png&auto=webp&s=317a75262c9d1bc262b8c1bad62cbe95e95cb75d

submitted by /u/MomTrainer

PC config upgrade suggestion

18 February 2025 at 13:26

Hi everyone, I'd like your suggestions for upgrading (or rebuilding from scratch) my workstation.
Currently I have a relatively old config:
- Intel 8700
- Mobo: MSI H370 Bazooka
- GPU: AMD Radeon RX 580 Sapphire Pulse 8 GB
- 36 GB (4 slots) of DDR4 RAM
- various M.2 and SATA SSDs

What I'm looking for: to upgrade my config with the possibility of adding a second GPU, keeping at least the CPU, the AMD GPU (and possibly the RAM), in order to use Comfy or Forge with Flux, with no AMD restrictions!

Is there any configuration I can go for, with new components (meaning no second-hand parts), that also makes sense in terms of durability?
I would like to spend as little as possible (hence keeping some old components).

I hope I was clear enough.
Thanks everyone!

Edit: I need to run both GPUs at the same time, with no restrictions. For this I need to change my mobo, since the current one is not compatible.

submitted by /u/Cautious_Basil_7065