Add custom nodes, Civitai loras (LFS), and vast.ai setup script

Includes 30 custom nodes committed directly, 7 Civitai-exclusive loras stored via Git LFS, and a setup script that installs all dependencies and downloads HuggingFace-hosted models on vast.ai.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 00:55:26 +00:00
parent 2b70ab9ad0
commit f09734b0ee
2274 changed files with 748556 additions and 3 deletions
--- a/custom_nodes/ComfyUI-Florence2/.gitignore
+++ b/custom_nodes/ComfyUI-Florence2/.gitignore
@@ -0,0 +1,9 @@
+.DS_Store
+*pyc
+.vscode
+__pycache__
+*.egg-info
+*.bak
+checkpoints
+results
+backup
--- a/custom_nodes/ComfyUI-Florence2/LICENSE
+++ b/custom_nodes/ComfyUI-Florence2/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Jukka Seppänen
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/custom_nodes/ComfyUI-Florence2/README.md
+++ b/custom_nodes/ComfyUI-Florence2/README.md
@@ -0,0 +1,70 @@
+# Florence2 in ComfyUI
+
+> Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. 
+Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. 
+It leverages our FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. 
+The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.
+
+## New Feature: Document Visual Question Answering (DocVQA)
+
+This fork includes support for Document Visual Question Answering (DocVQA) using the Florence2 model. DocVQA allows you to ask questions about the content of document images, and the model will provide answers based on the visual and textual information in the document. This feature is particularly useful for extracting information from scanned documents, forms, receipts, and other text-heavy images.
+
+## Installation:
+
+Clone this repository into the `ComfyUI/custom_nodes` folder.
+
+Install the dependencies listed in `requirements.txt`; `transformers` 4.38.0 or newer is required:
+
+`pip install -r requirements.txt`
+
+or, if you use the portable build, run this from the `ComfyUI_windows_portable` folder:
+
+`python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-Florence2\requirements.txt`
+
+![image](https://github.com/kijai/ComfyUI-Florence2/assets/40791699/4d537ac7-5490-470f-92f5-3007da7b9cc7)
+![image](https://github.com/kijai/ComfyUI-Florence2/assets/40791699/512357b7-39ee-43ee-bb63-7347b0a8d07d)
+
+Supports most Florence2 models, which the `DownloadAndLoadFlorence2Model` node can download automatically to `ComfyUI/models/LLM`:
+
+Official:
+
+https://huggingface.co/microsoft/Florence-2-base
+
+https://huggingface.co/microsoft/Florence-2-base-ft
+
+https://huggingface.co/microsoft/Florence-2-large
+
+https://huggingface.co/microsoft/Florence-2-large-ft
+
+https://huggingface.co/HuggingFaceM4/Florence-2-DocVQA
+
+Tested finetunes:
+
+https://huggingface.co/MiaoshouAI/Florence-2-base-PromptGen-v1.5
+
+https://huggingface.co/MiaoshouAI/Florence-2-large-PromptGen-v1.5
+
+https://huggingface.co/thwri/CogFlorence-2.2-Large
+
+https://huggingface.co/HuggingFaceM4/Florence-2-DocVQA
+
+https://huggingface.co/gokaygokay/Florence-2-SD3-Captioner
+
+https://huggingface.co/gokaygokay/Florence-2-Flux-Large
+
+https://huggingface.co/NikshepShetty/Florence-2-pixelprose
+
+## Using DocVQA
+
+To use the DocVQA feature:
+1. Load a document image into ComfyUI.
+2. Connect the image to the Florence2 DocVQA node.
+3. Input your question about the document.
+4. The node will output the answer based on the document's content.
+
+Example questions:
+- "What is the total amount on this receipt?"
+- "What is the date mentioned in this form?"
+- "Who is the sender of this letter?"
+
+Note: The accuracy of answers depends on the quality of the input image and the complexity of the question.
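The README above says models are downloaded to `ComfyUI/models/LLM`. As a minimal sketch of how a Hugging Face repo id could map to a folder there, the snippet below mirrors the `rsplit('/', 1)` logic in this commit's `nodes.py`. The helper name `local_model_dir` and the `models_dir` value are hypothetical, chosen only for illustration.

```python
import os


def local_model_dir(repo_id: str, models_dir: str) -> str:
    """Return the folder a repo id would be stored in under <models_dir>/LLM."""
    # Keep only the part after the last '/', e.g. "Florence-2-base".
    model_name = repo_id.rsplit("/", 1)[-1]
    return os.path.join(models_dir, "LLM", model_name)


# Hypothetical models directory, used only for this example.
path = local_model_dir("microsoft/Florence-2-base", "ComfyUI/models")
print(path)
```

Because only the last path component is kept, two repos with the same final name from different owners would resolve to the same folder under this scheme.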
--- a/custom_nodes/ComfyUI-Florence2/__init__.py
+++ b/custom_nodes/ComfyUI-Florence2/__init__.py
@@ -0,0 +1,3 @@
+from .nodes import NODE_CLASS_MAPPINGS, NODE_DISPLAY_NAME_MAPPINGS
+
+__all__ = ["NODE_CLASS_MAPPINGS", "NODE_DISPLAY_NAME_MAPPINGS"]
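The file above only re-exports two dictionaries from `nodes.py`. As a rough sketch of what such mappings look like and how a host could use them, the example below defines a hypothetical `EchoText` node (not part of ComfyUI-Florence2) with the same class attributes the Florence2 nodes in this commit use (`INPUT_TYPES`, `RETURN_TYPES`, `FUNCTION`, `CATEGORY`). The lookup at the end is an assumption about how a host dispatches, not ComfyUI's actual loader code.

```python
class EchoText:
    """Hypothetical node used only for illustration."""

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"text": ("STRING", {"default": ""})}}

    RETURN_TYPES = ("STRING",)
    FUNCTION = "run"  # name of the method the host should call
    CATEGORY = "Example"

    def run(self, text):
        # Nodes return a tuple whose entries line up with RETURN_TYPES.
        return (text,)


# Internal node name -> class, and internal node name -> label shown in the UI.
NODE_CLASS_MAPPINGS = {"EchoText": EchoText}
NODE_DISPLAY_NAME_MAPPINGS = {"EchoText": "Echo Text"}

# Assumed dispatch: look the class up by name, then call its FUNCTION method.
node_cls = NODE_CLASS_MAPPINGS["EchoText"]
result = getattr(node_cls(), node_cls.FUNCTION)(text="hello")
print(result)
```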
--- a/custom_nodes/ComfyUI-Florence2/configuration_florence2.py
+++ b/custom_nodes/ComfyUI-Florence2/configuration_florence2.py
@@ -0,0 +1,341 @@
+# coding=utf-8
+# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import warnings
+""" Florence-2 configuration"""
+
+from typing import Optional
+
+from transformers import AutoConfig
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+
+logger = logging.get_logger(__name__)
+
+class Florence2VisionConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`Florence2VisionModel`]. It is used to instantiate a Florence2VisionModel
+    according to the specified arguments, defining the model architecture. Instantiating a configuration with the 
+    defaults will yield a similar configuration to that of the Florence2VisionModel architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        drop_path_rate (`float`, *optional*, defaults to 0.1):
+            The dropout rate of the drop path layer.
+        patch_size (`List[int]`, *optional*, defaults to [7, 3, 3, 3]):
+            The patch size of the image.
+        patch_stride (`List[int]`, *optional*, defaults to [4, 2, 2, 2]):
+            The patch stride of the image.
+        patch_padding (`List[int]`, *optional*, defaults to [3, 1, 1, 1]):
+            The patch padding of the image.
+        patch_prenorm (`List[bool]`, *optional*, defaults to [False, True, True, True]):
+            Whether to apply layer normalization before the patch embedding layer.
+        enable_checkpoint (`bool`, *optional*, defaults to False):
+            Whether to enable checkpointing.
+        dim_embed (`List[int]`, *optional*, defaults to [256, 512, 1024, 2048]):
+            The dimension of the embedding layer.
+        num_heads (`List[int]`, *optional*, defaults to [8, 16, 32, 64]):
+            The number of attention heads.
+        num_groups (`List[int]`, *optional*, defaults to [8, 16, 32, 64]):
+            The number of groups.
+        depths (`List[int]`, *optional*, defaults to [1, 1, 9, 1]):
+            The depth of the model.
+        window_size (`int`, *optional*, defaults to 12):
+            The window size of the model.
+        projection_dim (`int`, *optional*, defaults to 1024):
+            The dimension of the projection layer.
+        visual_temporal_embedding (`dict`, *optional*):
+            The configuration of the visual temporal embedding.
+        image_pos_embed (`dict`, *optional*):
+            The configuration of the image position embedding.
+        image_feature_source (`List[str]`, *optional*, defaults to ["spatial_avg_pool", "temporal_avg_pool"]):
+            The source of the image feature.
+    Example:
+
+    ```python
+    >>> from transformers import Florence2VisionConfig, Florence2VisionModel
+
+    >>> # Initializing a Florence2 Vision style configuration
+    >>> configuration = Florence2VisionConfig()
+
+    >>> # Initializing a model (with random weights)
+    >>> model = Florence2VisionModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "florence2_vision"
+    keys_to_ignore_at_inference = ["past_key_values"]
+
+    def __init__(
+        self,
+        drop_path_rate=0.1,
+        patch_size=[7, 3, 3, 3],
+        patch_stride=[4, 2, 2, 2],
+        patch_padding=[3, 1, 1, 1],
+        patch_prenorm=[False, True, True, True],
+        enable_checkpoint=False,
+        dim_embed=[256, 512, 1024, 2048],
+        num_heads=[8, 16, 32, 64],
+        num_groups=[8, 16, 32, 64],
+        depths=[1, 1, 9, 1],
+        window_size=12,
+        projection_dim=1024,
+        visual_temporal_embedding=None,
+        image_pos_embed=None,
+        image_feature_source=["spatial_avg_pool", "temporal_avg_pool"],
+        **kwargs,
+    ):
+        self.drop_path_rate = drop_path_rate
+        self.patch_size = patch_size
+        self.patch_stride = patch_stride
+        self.patch_padding = patch_padding
+        self.patch_prenorm = patch_prenorm
+        self.enable_checkpoint = enable_checkpoint
+        self.dim_embed = dim_embed
+        self.num_heads = num_heads
+        self.num_groups = num_groups
+        self.depths = depths
+        self.window_size = window_size
+        self.projection_dim = projection_dim
+        self.visual_temporal_embedding = visual_temporal_embedding
+        self.image_pos_embed = image_pos_embed
+        self.image_feature_source = image_feature_source
+
+        super().__init__(**kwargs)
+
+
+
+class Florence2LanguageConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`Florence2LanguagePreTrainedModel`]. It is used to instantiate a BART
+    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the BART
+    [facebook/bart-large](https://huggingface.co/facebook/bart-large) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 51289):
+            Vocabulary size of the Florence2Language model. Defines the number of different tokens that can be represented by the
+            `inputs_ids` passed when calling [`Florence2LanguageModel`].
+        d_model (`int`, *optional*, defaults to 1024):
+            Dimensionality of the layers and the pooler layer.
+        encoder_layers (`int`, *optional*, defaults to 12):
+            Number of encoder layers.
+        decoder_layers (`int`, *optional*, defaults to 12):
+            Number of decoder layers.
+        encoder_attention_heads (`int`, *optional*, defaults to 16):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        decoder_attention_heads (`int`, *optional*, defaults to 16):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        decoder_ffn_dim (`int`, *optional*, defaults to 4096):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
+        encoder_ffn_dim (`int`, *optional*, defaults to 4096):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in encoder.
+        activation_function (`str` or `function`, *optional*, defaults to `"gelu"`):
+            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+            `"relu"`, `"silu"` and `"gelu_new"` are supported.
+        dropout (`float`, *optional*, defaults to 0.1):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        activation_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for activations inside the fully connected layer.
+        classifier_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for classifier.
+        max_position_embeddings (`int`, *optional*, defaults to 1024):
+            The maximum sequence length that this model might ever be used with. Typically set this to something large
+            just in case (e.g., 512 or 1024 or 2048).
+        init_std (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        encoder_layerdrop (`float`, *optional*, defaults to 0.0):
+            The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
+            for more details.
+        decoder_layerdrop (`float`, *optional*, defaults to 0.0):
+            The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
+            for more details.
+        scale_embedding (`bool`, *optional*, defaults to `False`):
+            Scale embeddings by multiplying by sqrt(d_model).
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models).
+        num_labels (`int`, *optional*, defaults to 3):
+            The number of labels to use in [`Florence2LanguageForSequenceClassification`].
+        forced_eos_token_id (`int`, *optional*, defaults to 2):
+            The id of the token to force as the last generated token when `max_length` is reached. Usually set to
+            `eos_token_id`.
+
+    Example:
+
+    ```python
+    >>> from transformers import Florence2LanguageConfig, Florence2LanguageModel
+
+    >>> # Initializing a Florence2 Language style configuration
+    >>> configuration = Florence2LanguageConfig()
+
+    >>> # Initializing a model (with random weights)
+    >>> model = Florence2LanguageModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "florence2_language"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    attribute_map = {"num_attention_heads": "encoder_attention_heads", "hidden_size": "d_model"}
+
+    def __init__(
+        self,
+        vocab_size=51289,
+        max_position_embeddings=1024,
+        encoder_layers=12,
+        encoder_ffn_dim=4096,
+        encoder_attention_heads=16,
+        decoder_layers=12,
+        decoder_ffn_dim=4096,
+        decoder_attention_heads=16,
+        encoder_layerdrop=0.0,
+        decoder_layerdrop=0.0,
+        activation_function="gelu",
+        d_model=1024,
+        dropout=0.1,
+        attention_dropout=0.0,
+        activation_dropout=0.0,
+        init_std=0.02,
+        classifier_dropout=0.0,
+        scale_embedding=False,
+        use_cache=True,
+        num_labels=3,
+        pad_token_id=1,
+        bos_token_id=0,
+        eos_token_id=2,
+        is_encoder_decoder=True,
+        decoder_start_token_id=2,
+        forced_eos_token_id=2,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.d_model = d_model
+        self.encoder_ffn_dim = encoder_ffn_dim
+        self.encoder_layers = encoder_layers
+        self.encoder_attention_heads = encoder_attention_heads
+        self.decoder_ffn_dim = decoder_ffn_dim
+        self.decoder_layers = decoder_layers
+        self.decoder_attention_heads = decoder_attention_heads
+        self.dropout = dropout
+        self.attention_dropout = attention_dropout
+        self.activation_dropout = activation_dropout
+        self.activation_function = activation_function
+        self.init_std = init_std
+        self.encoder_layerdrop = encoder_layerdrop
+        self.decoder_layerdrop = decoder_layerdrop
+        self.classifier_dropout = classifier_dropout
+        self.use_cache = use_cache
+        self.num_hidden_layers = encoder_layers
+        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True
+        self.forced_bos_token_id = bos_token_id
+
+        super().__init__(
+            num_labels=num_labels,
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            is_encoder_decoder=is_encoder_decoder,
+            decoder_start_token_id=decoder_start_token_id,
+            forced_eos_token_id=forced_eos_token_id,
+            **kwargs,
+        )
+
+        # ensure backward compatibility for BART CNN models
+        # if self.forced_bos_token_id is None and kwargs.get("force_bos_token_to_be_generated", False):
+        #     self.forced_bos_token_id = self.bos_token_id
+        #     warnings.warn(
+        #         f"Please make sure the config includes `forced_bos_token_id={self.bos_token_id}` in future versions. "
+        #         "The config can simply be saved and uploaded again to be fixed."
+        #     )
+
+class Florence2Config(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`Florence2ForConditionalGeneration`]. It is used to instantiate a
+    Florence-2 model according to the specified arguments, defining the model architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        vision_config (`Florence2VisionConfig`,  *optional*):
+            Custom vision config or dict
+        text_config (`Union[AutoConfig, dict]`, *optional*):
+            The config object of the text backbone. 
+        ignore_index (`int`, *optional*, defaults to -100):
+            The ignore index for the loss function.
+        vocab_size (`int`, *optional*, defaults to 51289):
+            Vocabulary size of the Florence2 model. Defines the number of different tokens that can be represented by the
+            `inputs_ids` passed when calling [`~Florence2ForConditionalGeneration`]
+        projection_dim (`int`, *optional*, defaults to 1024):
+            Dimension of the multimodal projection space.
+
+    Example:
+
+    ```python
+    >>> from transformers import Florence2ForConditionalGeneration, Florence2Config, CLIPVisionConfig, BartConfig
+
+    >>> # Initializing a clip-like vision config
+    >>> vision_config = CLIPVisionConfig()
+
+    >>> # Initializing a Bart config
+    >>> text_config = BartConfig()
+
+    >>> # Initializing a Florence-2 configuration
+    >>> configuration = Florence2Config(vision_config, text_config)
+
+    >>> # Initializing a model from the florence-2 configuration
+    >>> model = Florence2ForConditionalGeneration(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "florence2"
+    is_composition = False
+
+    def __init__(
+        self,
+        vision_config=None,
+        text_config=None,
+        ignore_index=-100,
+        vocab_size=51289,
+        projection_dim=1024,
+        **kwargs,
+    ):
+        self.ignore_index = ignore_index
+        self.vocab_size = vocab_size
+        self.projection_dim = projection_dim
+        if vision_config is not None:
+            vision_config = PretrainedConfig(**vision_config)
+        self.vision_config = vision_config
+        self.vocab_size = self.vocab_size
+
+        self.text_config = text_config
+        if text_config is not None:
+            self.text_config = Florence2LanguageConfig(**text_config)
+
+
+        super().__init__(**kwargs)
+
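The `Florence2VisionConfig` defaults above are per-stage lists. As a small worked check, the sketch below confirms that each list has one entry per stage and computes the per-head width for each stage. It assumes `dim_embed` is split evenly across `num_heads`, as is typical for multi-head attention; the modeling code is not shown in this diff, so treat that as an assumption.

```python
# Default values copied from Florence2VisionConfig.__init__ above.
defaults = {
    "patch_size": [7, 3, 3, 3],
    "patch_stride": [4, 2, 2, 2],
    "patch_padding": [3, 1, 1, 1],
    "patch_prenorm": [False, True, True, True],
    "dim_embed": [256, 512, 1024, 2048],
    "num_heads": [8, 16, 32, 64],
    "num_groups": [8, 16, 32, 64],
    "depths": [1, 1, 9, 1],
}

# Every per-stage list should have the same number of entries.
stage_counts = {name: len(values) for name, values in defaults.items()}
num_stages = len(defaults["depths"])
assert all(count == num_stages for count in stage_counts.values())

# Assumed: each head gets dim_embed / num_heads channels in its stage.
head_dims = [d // h for d, h in zip(defaults["dim_embed"], defaults["num_heads"])]
print(num_stages, head_dims)
```

Under that assumption the embedding width and head count double together from stage to stage, so the per-head width stays constant.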
--- a/custom_nodes/ComfyUI-Florence2/modeling_florence2.py
+++ b/custom_nodes/ComfyUI-Florence2/modeling_florence2.py
--- a/custom_nodes/ComfyUI-Florence2/nodes.py
+++ b/custom_nodes/ComfyUI-Florence2/nodes.py
@@ -0,0 +1,753 @@
+from collections.abc import Callable
+import torch
+import torchvision.transforms.functional as F
+import io
+import os
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import matplotlib.patches as patches
+from PIL import Image, ImageDraw, ImageColor, ImageFont
+import random
+import numpy as np
+import re
+from pathlib import Path
+
+from transformers.dynamic_module_utils import get_imports
+import transformers
+from packaging import version
+
+from safetensors.torch import save_file
+
+def load_model(model_path: str, attention: str, dtype: torch.dtype, offload_device: torch.device):
+    from .modeling_florence2 import Florence2ForConditionalGeneration, Florence2Config
+    from transformers import CLIPImageProcessor, BartTokenizerFast
+    from .processing_florence2 import Florence2Processor
+    from accelerate import init_empty_weights
+    from accelerate.utils import set_module_tensor_to_device
+
+    config = Florence2Config.from_pretrained(model_path)
+    config._attn_implementation = attention
+    with init_empty_weights():
+        model = Florence2ForConditionalGeneration(config)
+
+    checkpoint_path = os.path.join(model_path, "model.safetensors")
+    if not os.path.exists(checkpoint_path):
+        checkpoint_path = os.path.join(model_path, "pytorch_model.bin")
+    if os.path.exists(checkpoint_path):
+        state_dict = load_torch_file(checkpoint_path)
+    else:
+        raise FileNotFoundError(f"No model weights found at {model_path}")
+
+    key_mapping = {}
+    if "language_model.model.shared.weight" in state_dict:
+        key_mapping["language_model.model.encoder.embed_tokens.weight"] = "language_model.model.shared.weight"
+        key_mapping["language_model.model.decoder.embed_tokens.weight"] = "language_model.model.shared.weight"
+
+    for name, param in model.named_parameters():
+        # Check if we need to remap the key
+        actual_key = key_mapping.get(name, name)
+
+        if actual_key in state_dict:
+            set_module_tensor_to_device(model, name, offload_device, value=state_dict[actual_key].to(dtype))
+        else:
+            print(f"Parameter {name} not found in state_dict.")
+
+    # Tie embeddings
+    model.language_model.tie_weights()
+    model = model.eval().to(dtype).to(offload_device)
+
+    # Create image processor
+    image_processor = CLIPImageProcessor(
+        do_resize=True,
+        size={"height": 768, "width": 768},
+        resample=3,  # BICUBIC
+        do_center_crop=False,
+        do_rescale=True,
+        rescale_factor=1/255.0,
+        do_normalize=True,
+        image_mean=[0.485, 0.456, 0.406],
+        image_std=[0.229, 0.224, 0.225],
+    )
+    image_processor.image_seq_length = 577
+
+    # Create tokenizer - Florence2 uses BART tokenizer
+    tokenizer = BartTokenizerFast.from_pretrained(model_path)
+    processor = Florence2Processor(image_processor=image_processor, tokenizer=tokenizer)
+    return model, processor
+
+def fixed_get_imports(filename: str | os.PathLike) -> list[str]:
+    # Drop the optional flash_attn import so modeling_florence2.py loads without it.
+    imports = get_imports(filename)
+    if not str(filename).endswith("modeling_florence2.py"):
+        return imports
+    try:
+        imports.remove("flash_attn")
+    except ValueError:
+        print("No flash_attn import to remove")
+    return imports
+
+
+def create_path_dict(paths: list[str], predicate: Callable[[Path], bool] = lambda _: True) -> dict[str, str]:
+    """
+    Creates a flat dictionary of the contents of all given paths: ``{name: absolute_path}``.
+
+    Non-recursive.  Optionally takes a predicate to filter items.  Duplicate names overwrite (the last one wins).
+
+    Args:
+        paths (list[str]):
+            The paths to search for items.
+        predicate (Callable[[Path], bool]): 
+            (Optional) If provided, each path is tested against this filter.
+            Returns ``True`` to include a path.
+
+            Default: Include everything
+    """
+
+    flattened_paths = [item for path in paths if Path(path).exists() for item in Path(path).iterdir() if predicate(item)]
+
+    return {item.name: str(item.absolute()) for item in flattened_paths}
+
+
+import comfy.model_management as mm
+from comfy.utils import ProgressBar, load_torch_file
+
+device = mm.get_torch_device()
+offload_device = mm.unet_offload_device()
+
+import folder_paths
+
+script_directory = os.path.dirname(os.path.abspath(__file__))
+model_directory = os.path.join(folder_paths.models_dir, "LLM")
+os.makedirs(model_directory, exist_ok=True)
+
+# Ensure ComfyUI knows about the LLM model path
+folder_paths.add_model_folder_path("LLM", model_directory)
+
+from transformers import AutoProcessor, set_seed
+
+model_list = [
+            'microsoft/Florence-2-base',
+            'microsoft/Florence-2-base-ft',
+            'microsoft/Florence-2-large',
+            'microsoft/Florence-2-large-ft',
+            'HuggingFaceM4/Florence-2-DocVQA',
+            'thwri/CogFlorence-2.1-Large',
+            'thwri/CogFlorence-2.2-Large',
+            'gokaygokay/Florence-2-SD3-Captioner',
+            'gokaygokay/Florence-2-Flux-Large',
+            'MiaoshouAI/Florence-2-base-PromptGen-v1.5',
+            'MiaoshouAI/Florence-2-large-PromptGen-v1.5',
+            'MiaoshouAI/Florence-2-base-PromptGen-v2.0',
+            'MiaoshouAI/Florence-2-large-PromptGen-v2.0',
+            'PJMixers-Images/Florence-2-base-Castollux-v0.5'
+            ]
+
+class DownloadAndLoadFlorence2Model:
+    @classmethod
+    def INPUT_TYPES(s):
+        return {"required": {
+            "model": (model_list, {"default": 'microsoft/Florence-2-base'}),
+            "precision": ([ 'fp16','bf16','fp32'],
+                    {
+                    "default": 'fp16'
+                    }),
+            "attention": (
+                    [ 'flash_attention_2', 'sdpa', 'eager'],
+                    {
+                    "default": 'sdpa'
+                    }),
+            },
+            "optional": {
+                "lora": ("PEFTLORA",),
+                "convert_to_safetensors": ("BOOLEAN", {"default": False, "tooltip": "Some older model weights are not saved in .safetensors format, which seems to cause longer loading times. This option converts the .bin weights to .safetensors and deletes the original .bin file."}),
+            }
+        }
+
+    RETURN_TYPES = ("FL2MODEL",)
+    RETURN_NAMES = ("florence2_model",)
+    FUNCTION = "loadmodel"
+    CATEGORY = "Florence2"
+
+    def loadmodel(self, model, precision, attention, lora=None, convert_to_safetensors=False):
+        if model not in model_list:
+            raise ValueError(f"Model {model} is not in the supported model list.")
+
+        dtype = {"bf16": torch.bfloat16, "fp16": torch.float16, "fp32": torch.float32}[precision]
+
+        model_name = model.rsplit('/', 1)[-1]
+        model_path = os.path.join(model_directory, model_name)
+
+        if not os.path.exists(model_path):
+            print(f"Downloading Florence2 model to: {model_path}")
+            from huggingface_hub import snapshot_download
+            snapshot_download(repo_id=model,
+                            local_dir=model_path,
+                            local_dir_use_symlinks=False)
+
+        print(f"Florence2 using {attention} for attention")
+
+        if convert_to_safetensors:
+            model_weight_path = os.path.join(model_path, 'pytorch_model.bin')
+            if os.path.exists(model_weight_path):
+                safetensors_weight_path = os.path.join(model_path, 'model.safetensors')
+                print(f"Converting {model_weight_path} to {safetensors_weight_path}")
+                if not os.path.exists(safetensors_weight_path):
+                    sd = torch.load(model_weight_path, map_location=offload_device)
+                    sd_new = {}
+                    for k, v in sd.items():
+                        sd_new[k] = v.clone()
+                    save_file(sd_new, safetensors_weight_path)
+                    if os.path.exists(safetensors_weight_path):
+                        print(f"Conversion successful. Deleting original file: {model_weight_path}")
+                        os.remove(model_weight_path)
+                        print(f"Original {model_weight_path} file deleted.")
+
+        if version.parse(transformers.__version__) >= version.parse('5.0.0'):
+            model, processor = load_model(model_path, attention, dtype, offload_device)
+        else:
+            from .modeling_florence2 import Florence2ForConditionalGeneration
+            model = Florence2ForConditionalGeneration.from_pretrained(model_path, attn_implementation=attention, dtype=dtype).to(offload_device)
+            processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+        if lora is not None:
+            from peft import PeftModel
+            adapter_name = lora
+            model = PeftModel.from_pretrained(model, adapter_name, trust_remote_code=True)
+
+        florence2_model = {
+            'model': model,
+            'processor': processor,
+            'dtype': dtype
+            }
+
+        return (florence2_model,)
+
+class DownloadAndLoadFlorence2Lora:
+    @classmethod
+    def INPUT_TYPES(s):
+        return {"required": {
+            "model": (
+                    [ 
+                    'NikshepShetty/Florence-2-pixelprose',
+                    ],
+                  ),            
+            },
+          
+        }
+
+    RETURN_TYPES = ("PEFTLORA",)
+    RETURN_NAMES = ("lora",)
+    FUNCTION = "loadmodel"
+    CATEGORY = "Florence2"
+
+    def loadmodel(self, model):
+        if model not in ['NikshepShetty/Florence-2-pixelprose']:
+            raise ValueError(f"Lora Model {model} is not in the supported lora model list.")
+        model_name = model.rsplit('/', 1)[-1]
+        model_path = os.path.join(model_directory, model_name)
+        
+        if not os.path.exists(model_path):
+            print(f"Downloading Florence2 lora model to: {model_path}")
+            from huggingface_hub import snapshot_download
+            snapshot_download(repo_id=model,
+                            local_dir=model_path,
+                            local_dir_use_symlinks=False)
+        return (model_path,)
+    
+class Florence2ModelLoader:
+
+    @classmethod
+    def INPUT_TYPES(s):
+        all_llm_paths = folder_paths.get_folder_paths("LLM")
+        s.model_paths = create_path_dict(all_llm_paths, lambda x: x.is_dir())
+
+        return {"required": {
+            "model": ([*s.model_paths], {"tooltip": "models are expected to be in Comfyui/models/LLM folder"}),
+            "precision": (['fp16','bf16','fp32'],),
+            "attention": (
+                    [ 'flash_attention_2', 'sdpa', 'eager'],
+                    {
+                    "default": 'sdpa'
+                    }),
+            },
+            "optional": {
+                "lora": ("PEFTLORA",),
+                "convert_to_safetensors": ("BOOLEAN", {"default": False, "tooltip": "Some older model weights are not saved in .safetensors format, which seems to cause longer loading times. This option converts the .bin weights to .safetensors and deletes the original .bin file."}),
+            }
+        }
+
+    RETURN_TYPES = ("FL2MODEL",)
+    RETURN_NAMES = ("florence2_model",)
+    FUNCTION = "loadmodel"
+    CATEGORY = "Florence2"
+
+    def loadmodel(self, model, precision, attention, lora=None, convert_to_safetensors=False):
+        dtype = {"bf16": torch.bfloat16, "fp16": torch.float16, "fp32": torch.float32}[precision]
+        model_path = Florence2ModelLoader.model_paths.get(model)
+        print(f"Loading model from {model_path}")
+        print(f"Florence2 using {attention} for attention")
+        if convert_to_safetensors:
+            model_weight_path = os.path.join(model_path, 'pytorch_model.bin')
+            if os.path.exists(model_weight_path):
+                safetensors_weight_path = os.path.join(model_path, 'model.safetensors')
+                print(f"Converting {model_weight_path} to {safetensors_weight_path}")
+                if not os.path.exists(safetensors_weight_path):
+                    sd = torch.load(model_weight_path, map_location=offload_device)
+                    sd_new = {}
+                    for k, v in sd.items():
+                        sd_new[k] = v.clone()
+                    save_file(sd_new, safetensors_weight_path)
+                    if os.path.exists(safetensors_weight_path):
+                        print(f"Conversion successful. Deleting original file: {model_weight_path}")
+                        os.remove(model_weight_path)
+                        print(f"Original {model_weight_path} file deleted.")
+
+        if version.parse(transformers.__version__) >= version.parse('5.0.0'):
+            model, processor = load_model(model_path, attention, dtype, offload_device)
+        else:
+            from .modeling_florence2 import Florence2ForConditionalGeneration
+            model = Florence2ForConditionalGeneration.from_pretrained(model_path, attn_implementation=attention, dtype=dtype).to(offload_device)
+            processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+        if lora is not None:
+            from peft import PeftModel
+            adapter_name = lora
+            model = PeftModel.from_pretrained(model, adapter_name, trust_remote_code=True)
+
+        florence2_model = {
+            'model': model,
+            'processor': processor,
+            'dtype': dtype
+            }
+   
+        return (florence2_model,)
+    
+class Florence2Run:
+    @classmethod
+    def INPUT_TYPES(s):
+        return {
+            "required": {
+                "image": ("IMAGE", ),
+                "florence2_model": ("FL2MODEL", ),
+                "text_input": ("STRING", {"default": "", "multiline": True}),
+                "task": (
+                    [ 
+                    'region_caption',
+                    'dense_region_caption',
+                    'region_proposal',
+                    'caption',
+                    'detailed_caption',
+                    'more_detailed_caption',
+                    'caption_to_phrase_grounding',
+                    'referring_expression_segmentation',
+                    'ocr',
+                    'ocr_with_region',
+                    'docvqa',
+                    'prompt_gen_tags',
+                    'prompt_gen_mixed_caption',
+                    'prompt_gen_analyze',
+                    'prompt_gen_mixed_caption_plus',
+                    ],
+                   ),
+                "fill_mask": ("BOOLEAN", {"default": True}),
+            },
+            "optional": {
+                "keep_model_loaded": ("BOOLEAN", {"default": False}),
+                "max_new_tokens": ("INT", {"default": 1024, "min": 1, "max": 4096}),
+                "num_beams": ("INT", {"default": 3, "min": 1, "max": 64}),
+                "do_sample": ("BOOLEAN", {"default": True}),
+                "output_mask_select": ("STRING", {"default": ""}),
+                "seed": ("INT", {"default": 1, "min": 1, "max": 0xffffffffffffffff}),
+            }
+        }
+    
+    RETURN_TYPES = ("IMAGE", "MASK", "STRING", "JSON")
+    RETURN_NAMES = ("image", "mask", "caption", "data")
+    FUNCTION = "encode"
+    CATEGORY = "Florence2"
+
+    def hash_seed(self, seed):
+        import hashlib
+        # Convert the seed to a string and then to bytes
+        seed_bytes = str(seed).encode('utf-8')
+        # Create a SHA-256 hash of the seed bytes
+        hash_object = hashlib.sha256(seed_bytes)
+        # Convert the hash to an integer
+        hashed_seed = int(hash_object.hexdigest(), 16)
+        # Ensure the hashed seed is within the acceptable range for set_seed
+        return hashed_seed % (2**32)
+
+    def encode(self, image, text_input, florence2_model, task, fill_mask, keep_model_loaded=False, 
+            num_beams=3, max_new_tokens=1024, do_sample=True, output_mask_select="", seed=None):
+        _, height, width, _ = image.shape
+        annotated_image_tensor = None
+        mask_tensor = None
+        processor = florence2_model['processor']
+        model = florence2_model['model']
+        dtype = florence2_model['dtype']
+        model.to(device)
+
+        if seed:
+            set_seed(self.hash_seed(seed))
+
+        colormap = ['blue','orange','green','purple','brown','pink','olive','cyan','red',
+                    'lime','indigo','violet','aqua','magenta','gold','tan','skyblue']
+
+        prompts = {
+            'region_caption': '<OD>',
+            'dense_region_caption': '<DENSE_REGION_CAPTION>',
+            'region_proposal': '<REGION_PROPOSAL>',
+            'caption': '<CAPTION>',
+            'detailed_caption': '<DETAILED_CAPTION>',
+            'more_detailed_caption': '<MORE_DETAILED_CAPTION>',
+            'caption_to_phrase_grounding': '<CAPTION_TO_PHRASE_GROUNDING>',
+            'referring_expression_segmentation': '<REFERRING_EXPRESSION_SEGMENTATION>',
+            'ocr': '<OCR>',
+            'ocr_with_region': '<OCR_WITH_REGION>',
+            'docvqa': '<DocVQA>',
+            'prompt_gen_tags': '<GENERATE_TAGS>',
+            'prompt_gen_mixed_caption': '<MIXED_CAPTION>',
+            'prompt_gen_analyze': '<ANALYZE>',
+            'prompt_gen_mixed_caption_plus': '<MIXED_CAPTION_PLUS>',
+        }
+        task_prompt = prompts.get(task, '<OD>')
+
+        if (task not in ['referring_expression_segmentation', 'caption_to_phrase_grounding', 'docvqa']) and text_input:
+            raise ValueError("Text input (prompt) is only supported for 'referring_expression_segmentation', 'caption_to_phrase_grounding', and 'docvqa'")
+
+        if text_input != "":
+            prompt = task_prompt + " " + text_input
+        else:
+            prompt = task_prompt
+
+        image = image.permute(0, 3, 1, 2)
+
+        out = []
+        out_masks = []
+        out_results = []
+        out_data = []
+        pbar = ProgressBar(len(image))
+        for img in image:
+            image_pil = F.to_pil_image(img)
+            inputs = processor(text=prompt, images=image_pil, return_tensors="pt", do_rescale=False).to(dtype).to(device)
+
+            generated_ids = model.generate(
+                input_ids=inputs["input_ids"],
+                pixel_values=inputs["pixel_values"],
+                max_new_tokens=max_new_tokens,
+                do_sample=do_sample,
+                num_beams=num_beams,
+                use_cache=False,
+            )
+
+            results = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
+            print(results)
+            # cleanup the special tokens from the final list
+            if task == 'ocr_with_region':
+                clean_results = str(results)       
+                cleaned_string = re.sub(r'</?s>|<[^>]*>', '\n',  clean_results)
+                clean_results = re.sub(r'\n+', '\n', cleaned_string)
+            else:
+                clean_results = str(results)       
+                clean_results = clean_results.replace('</s>', '')
+                clean_results = clean_results.replace('<s>', '')
+
+            # Return a single string if there is only one image, for compatibility with nodes that can't handle string lists
+            if len(image) == 1:
+                out_results = clean_results
+            else:
+                out_results.append(clean_results)
+
+            W, H = image_pil.size
+            
+            parsed_answer = processor.post_process_generation(results, task=task_prompt, image_size=(W, H))
+
+            if task in ('region_caption', 'dense_region_caption', 'caption_to_phrase_grounding', 'region_proposal'):
+                fig, ax = plt.subplots(figsize=(W / 100, H / 100), dpi=100)
+                fig.subplots_adjust(left=0, right=1, top=1, bottom=0)
+                ax.imshow(image_pil)
+                bboxes = parsed_answer[task_prompt]['bboxes']
+                labels = parsed_answer[task_prompt]['labels']
+
+                mask_indexes = []
+                # Determine mask indexes outside the loop
+                if output_mask_select != "":
+                    mask_indexes = [n.strip() for n in output_mask_select.split(",")]  # strip so "0, 1" matches like "0,1"
+                    print(mask_indexes)
+                else:
+                    mask_indexes = [str(i) for i in range(len(bboxes))]
+
+                # Initialize mask_layer only if needed
+                if fill_mask:
+                    mask_layer = Image.new('RGB', image_pil.size, (0, 0, 0))
+                    mask_draw = ImageDraw.Draw(mask_layer)
+
+                for index, (bbox, label) in enumerate(zip(bboxes, labels)):
+                    # Modify the label to include the index
+                    indexed_label = f"{index}.{label}"
+                    
+                    if fill_mask:
+                        # Ensure y1 is greater than or equal to y0 for mask drawing
+                        x0, y0, x1, y1 = bbox[0], bbox[1], bbox[2], bbox[3]
+                        if y1 < y0:
+                            y0, y1 = y1, y0
+                        if x1 < x0:
+                            x0, x1 = x1, x0
+                            
+                        if str(index) in mask_indexes:
+                            print("match index:", str(index), "in mask_indexes:", mask_indexes)
+                            mask_draw.rectangle([x0, y0, x1, y1], fill=(255, 255, 255))
+                        if label in mask_indexes:
+                            print("match label")
+                            mask_draw.rectangle([x0, y0, x1, y1], fill=(255, 255, 255))
+
+                    # Create a Rectangle patch
+                    # Ensure y1 is greater than or equal to y0
+                    y0, y1 = bbox[1], bbox[3]
+                    if y1 < y0:
+                        y0, y1 = y1, y0
+                    
+                    rect = patches.Rectangle(
+                        (bbox[0], y0),  # (x,y) - lower left corner
+                        bbox[2] - bbox[0],   # Width
+                        y1 - y0,   # Height
+                        linewidth=1,
+                        edgecolor='r',
+                        facecolor='none',
+                        label=indexed_label
+                    )
+                    # Calculate text width with a rough estimation
+                    text_width = len(label) * 6  # Adjust multiplier based on your font size
+                    text_height = 12  # Adjust based on your font size
+
+                    # Get corrected coordinates
+                    x0, y0, x1, y1 = bbox[0], bbox[1], bbox[2], bbox[3]
+                    if y1 < y0:
+                        y0, y1 = y1, y0
+                    if x1 < x0:
+                        x0, x1 = x1, x0
+
+                    # Initial text position
+                    text_x = x0
+                    text_y = y0 - text_height  # Position text above the top-left of the bbox
+
+                    # Adjust text_x if text is going off the left or right edge
+                    if text_x < 0:
+                        text_x = 0
+                    elif text_x + text_width > W:
+                        text_x = W - text_width
+
+                    # Adjust text_y if text is going off the top edge
+                    if text_y < 0:
+                        text_y = y1  # Text would be cut off above the image, so place it at the bottom edge of the bbox instead
+
+                    # Add the rectangle to the plot
+                    ax.add_patch(rect)
+                    facecolor = random.choice(colormap) if len(image) == 1 else 'red'
+                    # Add the label
+                    plt.text(
+                        text_x,
+                        text_y,
+                        indexed_label,
+                        color='white',
+                        fontsize=12,
+                        bbox=dict(facecolor=facecolor, alpha=0.5)
+                    )
+                if fill_mask:             
+                    mask_tensor = F.to_tensor(mask_layer)
+                    mask_tensor = mask_tensor.unsqueeze(0).permute(0, 2, 3, 1).cpu().float()
+                    mask_tensor = mask_tensor.mean(dim=0, keepdim=True)
+                    mask_tensor = mask_tensor.repeat(1, 1, 1, 3)
+                    mask_tensor = mask_tensor[:, :, :, 0]
+                    out_masks.append(mask_tensor)           
+
+                # Remove axis and padding around the image
+                ax.axis('off')
+                ax.margins(0,0)
+                ax.get_xaxis().set_major_locator(plt.NullLocator())
+                ax.get_yaxis().set_major_locator(plt.NullLocator())
+                fig.canvas.draw() 
+                buf = io.BytesIO()
+                plt.savefig(buf, format='png', pad_inches=0)
+                buf.seek(0)
+                annotated_image_pil = Image.open(buf)
+
+                annotated_image_tensor = F.to_tensor(annotated_image_pil)
+                out_tensor = annotated_image_tensor[:3, :, :].unsqueeze(0).permute(0, 2, 3, 1).cpu().float()
+                out.append(out_tensor)
+               
+                if task == 'caption_to_phrase_grounding':
+                    out_data.append(parsed_answer[task_prompt])
+                else:
+                    out_data.append(bboxes)
+
+                
+                pbar.update(1)
+    
+                plt.close(fig)
+
+            elif task == 'referring_expression_segmentation':
+                # Create a new black image
+                mask_image = Image.new('RGB', (W, H), 'black')
+                mask_draw = ImageDraw.Draw(mask_image)
+  
+                predictions = parsed_answer[task_prompt]
+    
+                # Iterate over polygons and labels  
+                for polygons, label in zip(predictions['polygons'], predictions['labels']):
+                    color = random.choice(colormap)
+                    for _polygon in polygons:  
+                        _polygon = np.array(_polygon).reshape(-1, 2)
+                        # Clamp polygon points to image boundaries
+                        _polygon = np.clip(_polygon, [0, 0], [W - 1, H - 1])
+                        if len(_polygon) < 3:  
+                            print('Invalid polygon:', _polygon)
+                            continue  
+                        
+                        _polygon = _polygon.reshape(-1).tolist()
+                        
+                        # Draw the polygon
+                        if fill_mask:
+                            overlay = Image.new('RGBA', image_pil.size, (255, 255, 255, 0))
+                            image_pil = image_pil.convert('RGBA')
+                            draw = ImageDraw.Draw(overlay)
+                            color_with_opacity = ImageColor.getrgb(color) + (180,)
+                            draw.polygon(_polygon, outline=color, fill=color_with_opacity, width=3)
+                            image_pil = Image.alpha_composite(image_pil, overlay)
+                        else:
+                            draw = ImageDraw.Draw(image_pil)
+                            draw.polygon(_polygon, outline=color, width=3)
+
+                        # Draw the mask
+                        mask_draw.polygon(_polygon, outline="white", fill="white")
+                        
+                image_tensor = F.to_tensor(image_pil)
+                image_tensor = image_tensor[:3, :, :].unsqueeze(0).permute(0, 2, 3, 1).cpu().float() 
+                out.append(image_tensor)
+
+                mask_tensor = F.to_tensor(mask_image)
+                mask_tensor = mask_tensor.unsqueeze(0).permute(0, 2, 3, 1).cpu().float()
+                mask_tensor = mask_tensor.mean(dim=0, keepdim=True)
+                mask_tensor = mask_tensor.repeat(1, 1, 1, 3)
+                mask_tensor = mask_tensor[:, :, :, 0]
+                out_masks.append(mask_tensor)
+                pbar.update(1)
+
+            elif task == 'ocr_with_region':
+                try:
+                    font = ImageFont.load_default().font_variant(size=24)
+                except Exception:  # older Pillow default fonts may not support font_variant
+                    font = ImageFont.load_default()
+                predictions = parsed_answer[task_prompt]
+                scale = 1
+                image_pil = image_pil.convert('RGBA')
+                overlay = Image.new('RGBA', image_pil.size, (255, 255, 255, 0))
+                draw = ImageDraw.Draw(overlay)
+                bboxes, labels = predictions['quad_boxes'], predictions['labels']
+                
+                # Create a new black image for the mask
+                mask_image = Image.new('RGB', (W, H), 'black')
+                mask_draw = ImageDraw.Draw(mask_image)
+                
+                for box, label in zip(bboxes, labels):
+                    scaled_box = [v / (width if idx % 2 == 0 else height) for idx, v in enumerate(box)]
+                    out_data.append({"label": label, "box": scaled_box})
+                    
+                    color = random.choice(colormap)
+                    new_box = (np.array(box) * scale).tolist()
+                    
+                    # Ensure polygon coordinates are valid
+                    # For polygons, we need to make sure the points form a valid shape
+                    # This is a simple check to ensure the polygon has at least 3 points
+                    if len(new_box) >= 6:  # At least 3 points (x,y pairs)
+                        if fill_mask:
+                            color_with_opacity = ImageColor.getrgb(color) + (180,)
+                            draw.polygon(new_box, outline=color, fill=color_with_opacity, width=3)
+                        else:
+                            draw.polygon(new_box, outline=color, width=3)
+                        
+                        # Get the first point for text positioning
+                        text_x, text_y = new_box[0]+8, new_box[1]+2
+                        
+                        draw.text((text_x, text_y),
+                                  "{}".format(label),
+                                  align="right",
+                                  font=font,
+                                  fill=color)
+                        
+                        # Draw the mask
+                        mask_draw.polygon(new_box, outline="white", fill="white")
+                
+                image_pil = Image.alpha_composite(image_pil, overlay)
+                image_pil = image_pil.convert('RGB')
+                
+                image_tensor = F.to_tensor(image_pil)
+                image_tensor = image_tensor[:3, :, :].unsqueeze(0).permute(0, 2, 3, 1).cpu().float()
+                out.append(image_tensor)
+
+                # Process the mask
+                mask_tensor = F.to_tensor(mask_image)
+                mask_tensor = mask_tensor.unsqueeze(0).permute(0, 2, 3, 1).cpu().float()
+                mask_tensor = mask_tensor.mean(dim=0, keepdim=True)
+                mask_tensor = mask_tensor.repeat(1, 1, 1, 3)
+                mask_tensor = mask_tensor[:, :, :, 0]
+                out_masks.append(mask_tensor)
+
+                pbar.update(1)
+            
+            elif task == 'docvqa':
+                if text_input == "":
+                    raise ValueError("Text input (prompt) is required for 'docvqa'")
+                prompt = "<DocVQA> " + text_input
+
+                inputs = processor(text=prompt, images=image_pil, return_tensors="pt", do_rescale=False).to(dtype).to(device)
+                generated_ids = model.generate(
+                    input_ids=inputs["input_ids"],
+                    pixel_values=inputs["pixel_values"],
+                    max_new_tokens=max_new_tokens,
+                    do_sample=do_sample,
+                    num_beams=num_beams,
+                    use_cache=False,
+                )
+
+                results = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
+                clean_results = results.replace('</s>', '').replace('<s>', '')
+                
+                if len(image) == 1:
+                    out_results = clean_results
+                else:
+                    out_results[-1] = clean_results  # replace the entry appended earlier in this iteration instead of duplicating it
+                    
+                out.append(F.to_tensor(image_pil).unsqueeze(0).permute(0, 2, 3, 1).cpu().float())
+
+                pbar.update(1)
+            
+        if len(out) > 0:
+            out_tensor = torch.cat(out, dim=0)
+        else:
+            out_tensor = torch.zeros((1, 64,64, 3), dtype=torch.float32, device="cpu")
+        if len(out_masks) > 0:
+            out_mask_tensor = torch.cat(out_masks, dim=0)
+        else:
+            out_mask_tensor = torch.zeros((1,64,64), dtype=torch.float32, device="cpu")
+
+        if not keep_model_loaded:
+            print("Offloading model...")
+            model.to(offload_device)
+            mm.soft_empty_cache()
+        
+        return (out_tensor, out_mask_tensor, out_results, out_data)
+     
+NODE_CLASS_MAPPINGS = {
+    "DownloadAndLoadFlorence2Model": DownloadAndLoadFlorence2Model,
+    "DownloadAndLoadFlorence2Lora": DownloadAndLoadFlorence2Lora,
+    "Florence2ModelLoader": Florence2ModelLoader,
+    "Florence2Run": Florence2Run,
+}
+NODE_DISPLAY_NAME_MAPPINGS = {
+    "DownloadAndLoadFlorence2Model": "DownloadAndLoadFlorence2Model",
+    "DownloadAndLoadFlorence2Lora": "DownloadAndLoadFlorence2Lora",
+    "Florence2ModelLoader": "Florence2ModelLoader",
+    "Florence2Run": "Florence2Run",
+}
--- a/custom_nodes/ComfyUI-Florence2/processing_florence2.py
+++ b/custom_nodes/ComfyUI-Florence2/processing_florence2.py
--- a/custom_nodes/ComfyUI-Florence2/pyproject.toml
+++ b/custom_nodes/ComfyUI-Florence2/pyproject.toml
@@ -0,0 +1,15 @@
+[project]
+name = "comfyui-florence2"
+description = "Nodes to use Florence2 VLM for image vision tasks: object detection, captioning, segmentation and ocr"
+version = "1.0.8"
+license = "MIT"
+dependencies = ["transformers>=4.39.0,!=4.50.*"]
+
+[project.urls]
+Repository = "https://github.com/kijai/ComfyUI-Florence2"
+#  Used by Comfy Registry https://comfyregistry.org
+
+[tool.comfy]
+PublisherId = "kijai"
+DisplayName = "ComfyUI-Florence2"
+Icon = ""
--- a/custom_nodes/ComfyUI-Florence2/requirements.txt
+++ b/custom_nodes/ComfyUI-Florence2/requirements.txt
@@ -0,0 +1,6 @@
+transformers>=4.39.0,!=4.50.*
+matplotlib
+timm
+pillow>=10.2.0
+peft
+accelerate>=0.26.0