Prs/lora reservations (reduce massive Lora reservations especially on Flux2) (#11069)

* mp: only count the offload cost of math once

The reservation previously bundled the combined weight storage and computation cost and charged that whole amount once per stream.
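
For a sense of scale, here is a minimal sketch comparing the two reservation formulas from the diff below. The sizes and the NUM_STREAMS value are made-up numbers for illustration only, not taken from the repository:

    GiB = 1024 ** 3
    module_mem = 1.0 * GiB          # hypothetical weight storage for one large module
    module_offload_mem = 2.5 * GiB  # hypothetical storage + compute cost (dequant / LoRA patch)
    offload_buffer = 1.0 * GiB
    NUM_STREAMS = 2                 # assumed stream count

    # Old: the combined storage + compute cost is charged for every offload stream plus the main one.
    old = max(offload_buffer, module_offload_mem * (NUM_STREAMS + 1))           # 7.5 GiB
    # New: the compute (math) cost is counted once; extra streams only reserve weight storage.
    new = max(offload_buffer, module_offload_mem + (NUM_STREAMS * module_mem))  # 4.5 GiB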

* ops: put all post async transfer compute on the main stream

Some models have massive weights that need either complex
dequantization or LoRA patching. Don't do this patching on the offload
stream; instead, do it on the main stream so the potentially large VRAM
spikes from this compute are synchronized with it. This avoids having
to assume a worst-case scenario of multiple offload streams all spiking
VRAM in parallel with whatever the main stream is doing.
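
The scheduling idea as a minimal PyTorch sketch (illustrative only, not the actual ops code; the function and tensor names are hypothetical, and the host tensor is assumed to be pinned so the copy is truly async):

    import torch

    def load_and_patch(weight_cpu, lora_delta, copy_stream):
        # Illustration: transfer on a side stream, heavy compute on the main stream.
        main_stream = torch.cuda.current_stream()

        with torch.cuda.stream(copy_stream):
            # Async host-to-device copy of the (possibly quantized) weight.
            weight_gpu = weight_cpu.to("cuda", non_blocking=True)

        # Make the main stream wait for the copy, then dequantize / apply the LoRA
        # there, so only one such VRAM spike is in flight at a time instead of one
        # per offload stream.
        main_stream.wait_stream(copy_stream)
        return weight_gpu + lora_delta  # stand-in for dequant + LoRA patching

This keeps the worst-case estimate at one compute spike on top of the main stream's work, rather than NUM_STREAMS spikes happening at once.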
Author: rattus
Date: 2025-12-03 17:28:45 +10:00
Committed by: GitHub
Parent commit: 861817d22d
Commit: 519c941165
2 changed files with 24 additions and 19 deletions


@@ -704,7 +704,7 @@ class ModelPatcher:
 lowvram_weight = False
-potential_offload = max(offload_buffer, module_offload_mem * (comfy.model_management.NUM_STREAMS + 1))
+potential_offload = max(offload_buffer, module_offload_mem + (comfy.model_management.NUM_STREAMS * module_mem))
 lowvram_fits = mem_counter + module_mem + potential_offload < lowvram_model_memory
 weight_key = "{}.weight".format(n)
@@ -883,7 +883,7 @@ class ModelPatcher:
 break
 module_offload_mem, module_mem, n, m, params = unload
-potential_offload = (comfy.model_management.NUM_STREAMS + 1) * module_offload_mem
+potential_offload = module_offload_mem + (comfy.model_management.NUM_STREAMS * module_mem)
 lowvram_possible = hasattr(m, "comfy_cast_weights")
 if hasattr(m, "comfy_patched_weights") and m.comfy_patched_weights == True: