
My attempt to build a pipeline for visual chat

[Vision language models] are a type of generative model that takes image and text inputs and generates text outputs. (Vision Language Models Explained, 2024)

GPT-4V from OpenAI was one of the best vision language models (VLMs) around this month last year. So a research paper entitled Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V (referred to from here on as the “SoM paper”) was an interesting read back then. More concepts and tools, including the Chain-of-Spot (CoS) approach, LLaVA-Grounding, Grounded SAM, and BLIP image captioning, inspired me to try building a system that uses VLMs. The goal was to do my own implementation of the SoM paper, improved by combining it with the CoS approach and BLIP-generated captions for better visual chatting. It is unfinished, though, and I consider it a failure. I still want to share what I went through while working on it. Here’s the code repo.

The pipeline overview

The pipeline takes an input image. The first step generates a new image with numbers overlaid as marks on the original (an image for Set-of-Mark prompting, or what I will call a “SoM image”), and then generates a caption for the image. The SoM image is generated with the help of YOLOv8-seg; BLIP is used for captioning. The generated caption and the SoM image are then passed as input prompts to GPT-4V, which writes a detailed description with visual grounding. The SoM image and the written description are then used as context for the visual chat.

Pipeline schematic.
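To make the flow concrete, here is a minimal sketch of the orchestration with each step stubbed out by a placeholder. The function names (generate_som_image, generate_caption, write_description) are hypothetical; the real steps are detailed in the sections that follow.

```python
def generate_som_image(image_path):
    # placeholder for Step 1: YOLOv8-seg segmentation + numeric mark overlay
    return f"som_{image_path}"

def generate_caption(image_path):
    # placeholder for Step 2: BLIP captioning on the original image
    return "an orange cat sitting on a pile of sheets"

def write_description(som_image, caption):
    # placeholder for Step 3: GPT-4V writes a grounded description
    return f"Detailed, grounded description based on '{caption}'"

def run_pipeline(image_path):
    """Wire the three steps together; Steps 1 and 2 are independent."""
    som_image = generate_som_image(image_path)  # Step 1
    caption = generate_caption(image_path)      # Step 2 (can run in parallel)
    description = write_description(som_image, caption)  # Step 3
    return {"som_image": som_image, "description": description}
```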

What is a SoM image?

The SoM paper introduction states: “we present a new prompting mechanism called Set-of-Mark (SoM) prompting, i.e. simply adding a set of visual marks on top of image regions.” So what I call a “SoM image” is an image used for SoM prompting. A sample of this image is shown below:

Image from the SoM paper.

The SoM researchers tried various formats for visually marking the image, such as numbers, letters, masks, or boxes. They also noted that the best type of mark depends on the image content. For example, if an image is full of numbers, numeric marks should be avoided. The researchers favored alphanumeric marks since they do not take much space when overlaid and can be recognized by GPT-4V through its OCR capability. In my implementation, I opted for numeric marks since the images I’ll be processing are not text-heavy.

Going through the pipeline

This will be our input image to the pipeline:

A little blurry picture of our cat named "Linggoy" on a stack of laundry.

Step 1. Generate the SoM image

The image first needs to be segmented. The YOLOv8 segmentation model was selected for its speed. Notice it identified Linggoy and the laundry. For the laundry, it produced two large detached regions.

Resulting segmentation masks.
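For reference, the segmentation call itself is short with the ultralytics package. This is a sketch assuming the yolov8n-seg.pt weights (the exact model size I used may differ), and the import is done lazily since the weights are downloaded on first use:

```python
def segment(image_path, model_name="yolov8n-seg.pt"):
    """Run YOLOv8 segmentation on an image and return per-object masks.

    Sketch assuming the `ultralytics` package; returns the
    (num_objects, H, W) mask tensor used by set_marks() below.
    """
    from ultralytics import YOLO  # lazy import: weights download on first use

    model = YOLO(model_name)
    results = model(image_path)
    return results[0].masks.data
```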

Now that there are regions to work on, the numeric marks can be overlaid. The set_marks() method shown below is my implementation for setting the marks. Another important method called within it is _keep_largest_contour().

# Requires: Image, ImageDraw, ImageFont from PIL; cv2; numpy as np;
# and Results from ultralytics.engine.results.
def set_marks(self, current_image_detections: Results, font_size: int = 20):
    """Takes in a YOLO inference Result. Returns a PIL Image."""

    raw_image = Image.open(self.src_image)
    raw_image_draw = ImageDraw.Draw(raw_image)
    font = ImageFont.load_default(size=font_size)

    masks = current_image_detections.masks.data
    for i, mask in enumerate(masks):
        # convert the torch mask to a uint8 binary image for OpenCV
        src_mask = (mask.detach().cpu().numpy() * 255).astype("uint8")

        current_mask = self._keep_largest_contour(src_mask)

        # distance transform: each mask pixel's distance to the nearest
        # background pixel; points near the maximum lie deep inside the region
        mask_dt = cv2.distanceTransform(current_mask, cv2.DIST_L2, 0)
        mask_dt = mask_dt[1:-1, 1:-1]  # trim one pixel from each edge
        max_dist = np.max(mask_dt) * 0.8
        coords_y, coords_x = np.where(mask_dt >= max_dist)

        # calculate a center within image bounds
        height, width = mask_dt.shape
        center_x = coords_x[len(coords_x) // 2] + 2
        center_y = coords_y[len(coords_y) // 2] - 6
        center_x = max(0, min(center_x, width - 1))
        center_y = max(0, min(center_y, height - 1))

        # rescale center point coordinates to image size
        scaled_center = self._rescale_coordinates(
            width,
            height,
            raw_image.width,
            raw_image.height,
            center_x,
            center_y,
        )

        # add numerical label
        text = str(i + 1)
        text_width, text_height = raw_image_draw.textbbox((0, 0), text, font=font)[
            2:
        ]
        text_background = Image.new(
            "RGB", (text_width + 2, text_height + 2), "black"
        )
        draw = ImageDraw.Draw(text_background)
        draw.text((0, 0), text, fill="white", font=font)

        raw_image.paste(text_background, scaled_center)

    return raw_image

For cases where an object has multiple detached mask regions (e.g., the laundry), the region with the largest area is selected as the one where the mark will be set. Forgive my cv2 code; this was the first time I wrote much code involving OpenCV in Python. I thought of learning more about it at the time but did not push through. Here’s the implementation:

def _keep_largest_contour(self, binary_mask):
    """
    For mask with multiple parts, keep only the one with largest area for marking.
    """

    contours, hierarchy = cv2.findContours(
        binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )

    max_area = 0
    largest_contour = None
    for contour in contours:
        area = cv2.contourArea(contour)
        if area > max_area:
            max_area = area
            largest_contour = contour

    # Create a new mask for the largest contour
    largest_contour_mask = np.zeros_like(binary_mask)
    if largest_contour is not None:
        cv2.drawContours(
            largest_contour_mask, [largest_contour], -1, (255), thickness=cv2.FILLED
        )

    return largest_contour_mask

Finally, the SoM image:

The resulting SoM image

Step 2. Generate caption

This step is straightforward and can be done in parallel with generating the SoM image. The original image is used to generate the caption. The caption serves as initial context about the image when prompting GPT-4V to write a more detailed description.

The current commit in the main branch doesn’t show what caption_model is. I removed it for some reason I can no longer remember. But here’s the context for that:

from lavis.models import load_model_and_preprocess

# `device` here is the torch device ("cuda" or "cpu") defined elsewhere
caption_model, blip_vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

def write_caption(self):
    raw_image = Image.open(self.src_image).convert("RGB")
    image = blip_vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    caption = caption_model.generate({"image": image})

    return {"caption": caption[0]}

Step 3. GPT-4V to write a description

This is the part where there is no more code to show. The plan was to use the OpenAI API for calls to GPT-4V, but I had no money for it. I did test using ChatGPT with access to vision. My prompt went like:

[image attached]

Image caption: [caption generated by BLIP]

Using the image and its caption, write a detailed description of the image (with visual grounding).
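Had the budget allowed it, the API call for this step would have looked roughly like the sketch below. The model name is an assumption, and build_description_request is a hypothetical helper that only assembles the request payload; a local image is embedded as a base64 data URL, which the vision chat format accepts in place of a hosted URL.

```python
import base64

def build_description_request(image_path: str, blip_caption: str) -> dict:
    """Assemble a GPT-4V chat request payload (not sent here)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "gpt-4-vision-preview",  # assumed model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                    {
                        "type": "text",
                        "text": (
                            f"Image caption: {blip_caption}\n\n"
                            "Using the image and its caption, write a detailed "
                            "description of the image (with visual grounding)."
                        ),
                    },
                ],
            }
        ],
    }
```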

Now, with a detailed description and the SoM image, the plan was to place both as context at the beginning of a new chat session, and then chat all the way. Something like:

[
  {
    "role": "system",
    "content": "You are an assistant who is excellent in answering questions about a given image with its description."
  },
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Description: An orange cat sitting on a pile of sheets.\n\nThis is an image and its detailed description."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "linggoy-set-of-mark-image-attachment.jpg"
        }
      }
    ]
  }
]

An unsolved problem

What if the object in the image has holes, like a donut or a ring? I haven’t solved that:


Even if the segmenter does a good job punching a hole right there, it will be challenging to overlay the mark on the donut itself. The same probably goes for thin arcs.
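A first step toward a fix would at least be detecting that a mask has a hole at all. One stdlib-only way is to flood-fill the background from the image border: any background pixel the fill cannot reach is enclosed by the object. This is just a sketch of that check (has_hole is a name I'm making up; it is not wired into the pipeline):

```python
from collections import deque

def has_hole(mask):
    """Return True if a binary mask (list of lists of 0/1) encloses background.

    Flood-fills background pixels from the border with a 4-connected BFS;
    any background pixel left unvisited sits inside a hole.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    # seed the queue with every background pixel on the border
    q = deque(
        (y, x)
        for y in range(h)
        for x in range(w)
        if (y in (0, h - 1) or x in (0, w - 1)) and mask[y][x] == 0
    )
    for y, x in q:
        seen[y][x] = True
    while q:
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and mask[ny][nx] == 0:
                seen[ny][nx] = True
                q.append((ny, nx))
    # any unreached background pixel is enclosed by the object
    return any(mask[y][x] == 0 and not seen[y][x] for y in range(h) for x in range(w))
```

Deciding where to put the mark once a hole is found (e.g., somewhere on the annulus itself) is the part I never got to.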

More SoM images

Because they’re the only output of this work. (>_<)