
While testing T5 on SDXL, some questions about the choice of text encoders regarding human anatomical features

16 February 2025 at 09:14

I have been experimenting with T5 as a text encoder in SDXL. Since SDXL isn't trained on T5, completely replacing clip_g wasn't possible without fine-tuning. Instead, I added T5 to clip_g in two ways: 1) merging T5 with clip_g (25:75) and 2) replacing the earlier layers of clip_g with T5.
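For reference, here is a rough sketch of what a 25:75 weighted merge looks like in code. This is not the actual script used here; it assumes the T5 tensors have already been projected and renamed to line up with clip_g's keys and shapes (that alignment step is the hard part and is not shown), and the file names are placeholders.

```python
# A minimal sketch of a 25:75 weighted merge between two aligned state dicts.
# Assumes the T5 tensors were already projected/renamed to match clip_g's keys
# and shapes (not shown here). File names are placeholders.
import torch
from safetensors.torch import load_file, save_file

clip_g = load_file("clip_g.safetensors")
t5_aligned = load_file("t5_aligned_to_clip_g.safetensors")  # hypothetical file

merged = {}
for key, g_tensor in clip_g.items():
    t_tensor = t5_aligned.get(key)
    if t_tensor is not None and t_tensor.shape == g_tensor.shape:
        merged[key] = 0.75 * g_tensor + 0.25 * t_tensor  # 25 percent T5, 75 percent clip_g
    else:
        merged[key] = g_tensor  # keep clip_g weights where no counterpart exists

save_file(merged, "clip_g_t5_merge_25_75.safetensors")
```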

While testing them, I noticed something interesting: certain anatomical features were removed in the T5 merge. I didn't notice this at first, but it became more noticeable while testing Pony variants, and I became curious about why that was the case.

After some research, I realized that some LLMs have built-in censorship, whereas the latest models tend to handle this through online filtering. So I tested this with T5, Gemma2 2B, and Qwen2.5 1.5B (just using them as LLMs with a prompt and a text response).

As it turned out, T5 and Gemma2 have built-in censorship (Gemma2 refuses to answer anything related to human anatomy), whereas Qwen has very light censorship (no problems with human anatomy, but it gets skittish about describing certain physiological phenomena relating to various reproductive activities). Qwen2.5 behaved similarly to Gemini 2 when used through the API with all the safety filters off.
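A rough sketch of that kind of prompt/response check with Hugging Face pipelines is below. The exact checkpoints used in the post aren't stated, so the repo ids here are assumptions (Flan-T5 stands in for plain T5, and the Gemma model is gated on Hugging Face).

```python
# A rough sketch of probing each model with the same prompt and comparing
# the text responses. Model ids are assumptions, not the ones used in the post.
from transformers import pipeline

prompt = "Describe human anatomy for a figure-drawing reference."

t5 = pipeline("text2text-generation", model="google/flan-t5-base")
print(t5(prompt)[0]["generated_text"])

qwen = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")
print(qwen(prompt, max_new_tokens=200)[0]["generated_text"])

# google/gemma-2-2b-it is gated; the same pattern applies once access is granted.
```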

The more current models such as Flux and SD 3.5 use T5 without fine-tuning to preserve its rich semantic understanding. That is reasonable enough. What I am curious about is why anyone would want to use a censored LLM for an image-generation AI, since that will undoubtedly limit its ability to express the visual representation. What puzzles me even more is that Lumina2 is using Gemma2, which is heavily censored.

At the moment, I am no longer testing T5 and am instead figuring out how to apply Qwen2.5 to SDXL. The complication is that Qwen2.5 is a decoder-only model, which means the same transformer layers are used for both encoding and decoding.
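One common way to use a decoder-only model as a text encoder is to take its hidden states as conditioning embeddings. The sketch below only shows that extraction step; the checkpoint id is an assumption, and how (or whether) those embeddings can be mapped into SDXL's clip_g embedding space is exactly the open problem and is not solved here.

```python
# A minimal sketch of pulling hidden states out of a decoder-only LLM (Qwen2.5)
# to use as text-conditioning embeddings. Mapping them to SDXL's clip_g space
# would still require a learned projection, which is not shown.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "a photo of a person standing on a beach at sunset"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Last hidden layer: (batch, seq_len, hidden_dim). The hidden size differs from
# clip_g's 1280, so a projection layer would be needed on top of this.
text_embeddings = out.hidden_states[-1]
print(text_embeddings.shape)
```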

submitted by /u/OldFisherman8

Major Updates to SDXL Quantization Notebook, V-Prediction Custom node, Clip_G Quantization, SDXL quantized model repo updates, and more

8 February 2025 at 10:40

1) SDXL Component Extraction and Unet Quantization Notebook Update

Notebook Link: https://colab.research.google.com/drive/1xRwSht2tc82O8jrQQG4cl5LH1xdhiCyn?usp=sharing

CivitAI Tutorial Link: https://civitai.com/articles/10417/major-update-to-the-sdxl-gguf-conversion-colab-notebook-new-vprediction-custom-node-and-more

Key Features of the update:

a) Removal of ComfyUI for component extraction: no more installing dependencies or tunneling for the UI; everything now runs from the scripts alone.

b) CPU only: no GPU is needed; the entire process runs on the CPU.

c) Compatibility with ComfyUI: the extracted components are renamed and restructured to match the ComfyUI tensor naming convention and structure (see the sketch after this list).

d) Dynamic path management: other than the four input fields, there is nothing to modify in the scripts. You can run the entire notebook all at once.
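The sketch below illustrates the prefix-based extraction idea only; it is not the notebook's actual code. The key prefixes shown are the conventional ones for single-file SDXL checkpoints, the file path is a placeholder, and the ComfyUI-specific renaming is only hinted at in a comment.

```python
# A rough sketch of splitting a single-file SDXL checkpoint into components by
# key prefix, entirely on CPU. Not the notebook's actual code; prefixes are the
# usual ones for SDXL single-file checkpoints.
from safetensors.torch import load_file, save_file

checkpoint = load_file("sdxl_checkpoint.safetensors")  # placeholder path

prefixes = {
    "unet": "model.diffusion_model.",
    "vae": "first_stage_model.",
    "clip_l": "conditioner.embedders.0.transformer.",
    "clip_g": "conditioner.embedders.1.model.",
}

for component, prefix in prefixes.items():
    tensors = {k[len(prefix):]: v for k, v in checkpoint.items() if k.startswith(prefix)}
    # ComfyUI expects its own key names for the text encoders; a renaming pass
    # would go here before saving.
    save_file(tensors, f"{component}.safetensors")
```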

Side note: one of the stumbling blocks in getting this done was the existence of two tensor keys in the clip_l: logit_scale and text_projection.weight. They were not present inside the checkpoint but were present in the ComfyUI-extracted clip_l.

I looked through the codebase to find out what the convention for adding these layers was but couldn't find it (technically, Gemini couldn't find it after looking through all the reference code I provided). According to Qwen, Gemini, and o3-mini, it may be done through random initialization, and they asked me if I wanted it done that way.

However, these keys were present in clip_g. So I went through a bunch of trial and error adapting the clip_g keys, modified to fit clip_l using SVD (courtesy of Qwen). As a result, my clip_l differs from the ComfyUI-extracted clip_l. I wouldn't say it's better, but it's not worse either. Here are some comparisons:

https://preview.redd.it/ufbe68yw7vhe1.jpg?width=3648&format=pjpg&auto=webp&s=119d3ad14ee4cc24616987f04808a0f136e006f0

Models used: nnPony 3D Mix and Suzanne's XL Mix

In hindsight, it might have been better not to include logit_scale in clip_l, but I left it as is because of my plan to quantize clip_g, where that logit_scale may help align it closer to the checkpoint version after quantization.
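The post doesn't spell out the exact SVD procedure, but one plausible reading is a truncated-SVD resize of clip_g's 1280x1280 text_projection down to clip_l's 768x768 shape. The sketch below only illustrates that reading; the key names follow the side note above, and the file name is a placeholder.

```python
# One possible reading of the SVD trick (the exact procedure isn't given in the
# post): rank-truncate clip_g's 1280x1280 text_projection.weight to its top 768
# singular components and slice it down to clip_l's 768x768 shape.
import torch
from safetensors.torch import load_file

clip_g = load_file("clip_g.safetensors")           # placeholder path
w_g = clip_g["text_projection.weight"].float()     # (1280, 1280)

U, S, Vh = torch.linalg.svd(w_g)
w_l = U[:768, :768] @ torch.diag(S[:768]) @ Vh[:768, :768]   # (768, 768)

# logit_scale is a scalar, so it can simply be copied over.
logit_scale = clip_g["logit_scale"].clone()
```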

----------------------------------------------------------------------------------------------------------------------

2) V_Prediction Custom Node

For v-prediction models (such as NoobAI), the quantized model won't work because it will not be recognized as such and will always default to the EPS (epsilon) sampling mode. I made a small custom node that lets you set the parametrization to v_prediction or epsilon (courtesy of Gemini). A rough sketch of how such a node is typically structured follows after the links below.

Custom Node Link: https://github.com/magekinnarus/ComfyUI-V-Prediction-Node

Model used: NoobAI-XL
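This is not the linked node's actual code; it is a minimal sketch modeled on ComfyUI's built-in ModelSamplingDiscrete node, showing how the parametrization can be patched onto a loaded model.

```python
# A minimal sketch of a ComfyUI custom node that forces a model's parametrization
# to v_prediction or epsilon. Modeled on ComfyUI's built-in ModelSamplingDiscrete
# node; not the linked node's actual code.
import comfy.model_sampling


class ForceParametrization:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"model": ("MODEL",),
                             "parametrization": (["eps", "v_prediction"],)}}

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "patch"
    CATEGORY = "advanced/model"

    def patch(self, model, parametrization):
        m = model.clone()
        sampling_type = (comfy.model_sampling.V_PREDICTION
                         if parametrization == "v_prediction"
                         else comfy.model_sampling.EPS)

        # Combine the discrete sampling schedule with the chosen prediction type.
        class PatchedSampling(comfy.model_sampling.ModelSamplingDiscrete, sampling_type):
            pass

        m.add_object_patch("model_sampling", PatchedSampling(model.model.model_config))
        return (m,)


NODE_CLASS_MAPPINGS = {"ForceParametrization": ForceParametrization}
```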

----------------------------------------------------------------------------------------------------------------------

3) SDXL Quantized Model Repo Updates

Since the first upload, there have been several updates to the list of quantized models and clip encoders. These are the newly updated models:

juggernautXL_juggXIByRundiffusion

nnPony3DMix_v20

noobaiXLNAIXL_vPred10

ntrMIXIllustriousXL_xiii

suzannesXLMix_v70

waiNSFWIllustrious_v100

waiREALCN_v14

waiSHUFFLENOOB_vPred01

Repo Link: https://huggingface.co/Old-Fisherman/SDXL_Finetune_GGUF_Files

----------------------------------------------------------------------------------------------------------------------

4) Clip_G Quantization

I was able to create a conversion script to convert clip_g to an F16 GGUF file (courtesy of o3-mini). But I haven't modified lcpp.patch (city96's work), which is needed for quantization. (I ran out of time with the o3-mini free version.) At the moment, quantizing clip_g isn't the priority, as I'm more interested in making scripts that merge models and LoRAs exactly the way I want them merged. But I will get around to this and update when it is done.
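The core of an F16 conversion with the gguf-py package looks roughly like the sketch below. This is not the script mentioned above; the architecture string and any metadata that ComfyUI-GGUF expects are assumptions, and city96's convert.py plus lcpp.patch handle the real details once you go below F16.

```python
# A rough sketch of packing clip_g tensors into an F16 GGUF file with gguf-py
# (pip install gguf). Arch string and metadata are assumptions; real conversion
# for ComfyUI-GGUF involves more bookkeeping than this.
import torch
import gguf
from safetensors.torch import load_file

state_dict = load_file("clip_g.safetensors")        # placeholder path

writer = gguf.GGUFWriter("clip_g_f16.gguf", "clip_g")  # arch name is an assumption
for name, tensor in state_dict.items():
    writer.add_tensor(name, tensor.to(torch.float16).numpy())

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```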

----------------------------------------------------------------------------------------------------------------------

5) Some thoughts on working with Gemini, Qwen, DeepSeek, and o3-mini

Working with LLMs is akin to working with an autistic savant with a severe case of dyslexia and anterograde amnesia, so it takes a bit of getting used to. For example, if you ask Gemini to go and look at something, the chances are it won't. But if you approach it like this: "I want this done, but someone mentioned it can be done by changing a few lines of this script (URL link) using this guide (URL link). I don't know where in the script I need to start modifying." Then Gemini will not only look into the script but also look at the guide. In other words, an LLM will look at things if and only if it decides it needs them, regardless of what you think it needs.

By far, the knowledge base of Gemini is incredible. But it is also quite autistic and dyslexic, unable to cobble together everything it knows within the relevant context. Qwen is more stable in that regard but more forgetful. I have at least three different AIs open while working on something, simply because I need the one working on a problem to focus on it without distraction. So when I have a question or something I don't understand, I ask another AI about it. The third AI (in my case, usually Copilot) is for making short scripts to check or verify things in progress, things like: what is the script to mount Google Drive?
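For that last example, the answer is just Colab's standard drive mount, shown here for completeness:

```python
# Standard Google Colab drive mount, as referenced above.
from google.colab import drive

drive.mount("/content/drive")
```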

CoT-based reasoning models are really good, and their reasoning sequence offers invaluable information. For example, when I was working with o3-mini to modify convert.py to add the clip_g architecture, its reasoning sequence revealed that clip_g, being a text encoder, needed to be handled differently. But it gave me a script that added clip_g alongside the other image UNets and the way they were handled, because that was what I asked for. Having realized from its reasoning that clip_g needed different handling, I asked where in llama.cpp to find how text models were handled. After feeding the relevant modules to it, I asked it to make me a separate script that added the clip_g architecture using the methods in convert.py but handled the conversion as a text model instead.

I don't know how to code, and I still can't write even the simplest code, like mounting Google Drive. But I am getting the hang of how to work with these AIs. The key is to get your head wrapped around the problem by interacting with them. My go-to AIs are Gemini and Qwen because you simply can't grasp the issues without trial and error, and these models are good for that. For instance, I headbutted and problem-solved with Gemini for 5 hours, then went to DeepSeek to make it work; it took less than 30 seconds. However, I was only able to summarize the issue at hand and feed it all the relevant information precisely because I had worked with Gemini for 5 hours and had a pretty good idea of what was needed to get it done.

In the end, AI is like a mirror, and its output will reflect your strengths and shortcomings. While I was working with DeepSeek and o3-mini, they would sometimes make wrong assumptions, or their flow of thought would head in a direction different from my intention. When I thought about why that happened, I realized the faults were primarily mine: I didn't give the necessary information, made assumptions that might not translate to the AI, or my instructions weren't clear enough.

submitted by /u/OldFisherman8