July 23, Test with Chameleon From Meta

Task 1: Image Description

The input image was generated with Midjourney using the following prompt:

portrait, cinematic style, street, --sref 3795908053 --ar 16:9 --v 6.0

The text input for this task is simple:

Describe the image

Chameleon-30B

The image is a photo of a woman in profile. Her face is turned toward the camera, but her eyes are closed and obscured by long, dark hair. A bright light shines behind her head, casting a silhouette of her features. The light also illuminates a cloud of smoke or steam rising from the top of her head. The woman's skin is very pale, and the only other visible features are the contours of her cheek, jaw, and neck. The photo is very dark, with the woman's face being the only part of the image that is clearly visible. The background is indistinguishable.

Chameleon-7B

The image is a silhouette of a woman's face. Her face is angled so the side of her cheek and eye are visible. The eye is partially obscured by a shadow. The woman has long dark hair that appears to be blowing in the wind. The background is dark, but light is reflecting off the woman's cheek and eye. The lighting suggests that she is in a car at night.

InternVL2-26B

The image shows the side profile of a person's face, illuminated by soft light. The individual is in silhouette against what appears to be a background with blurred lights and smoke or mist swirling around them. Their hair can be seen blowing slightly due to movement or wind, adding an ethereal quality to the scene. This creates a moody and somewhat mysterious atmosphere within the photo.

GPT-4o

The image is a side profile of a person's face, predominantly in shadow, giving it a silhouette effect. The background features bright, diffused lights, creating a contrast with the dark profile. The person's hair is loosely framing their face, and there is a visible haze or smoke surrounding them, adding a mysterious and ethereal quality to the scene. The overall mood of the image is moody and dramatic, with a focus on the interplay of light and shadow.

Gemini-1.5-pro

The image shows the silhouette of a woman's profile against a blurred, smoky background. Her features are obscured by shadow, creating a sense of mystery and anonymity. The smoky haze and blurred lights in the background add to the atmospheric quality of the image, suggesting a setting that is dark, perhaps a nightclub or urban street scene. The overall impression is one of introspection, moodiness, or possibly contemplation.

Yi-Vision

In this image, the renowned actress Song Hye-Kyo is captured in a moment of quiet contemplation. She is seen from the side, her profile illuminated by the soft glow of city lights in the background. The smoke from her cigarette curls up into the air, adding a sense of mystery to the scene. Her hair, styled in loose waves, frames her face beautifully. She is dressed in a chic black outfit, further enhancing her elegant aura. The overall mood of the image is one of introspection and allure.

Yi-Vision with TextGrad

For brevity, the TextGrad loss feedback is given in Appendix A (Loss with Yi-Vision-6B Using TextGrad in Task 1).

The image features a woman in profile, her face partially illuminated by the soft glow of city lights. She appears to be in a moment of quiet contemplation, with smoke from a cigarette adding a sense of mystery to the scene. Her hair is styled in loose waves, and she is dressed in a chic black outfit. The overall mood of the image is one of introspection and allure.

Conclusion

This task produced some interesting findings.

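Chameleon-30B's description contains errors. For example, it says the woman is in profile but also that her face is turned toward the camera.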

Chameleon-7B generates a very short but mostly accurate description. However, its last sentence, “The lighting suggests that she is in a car at night,” is not accurate: the woman is clearly outside any car. (Since the image itself is generated, I think it is plausible to say the light comes from a car.)

InternVL2-26B is excellent. Its description is precise and accurate.

GPT-4o and Gemini-1.5-pro excel at description. They not only describe the image but also convey its mood and atmosphere. They avoid overclaiming, and Gemini-1.5-pro in particular uses words like “perhaps” and “possibly” to mark its hypotheses.

Yi-Vision is from 01-ai, and its response contains an obvious hallucination: it claims the woman in the image is the renowned actress Song Hye-Kyo. However, with TextGrad, an optimization framework released by Stanford, its answer improves. The feedback flags the unsupported identification, and the revised answer drops it (see Appendix A). So I think the answer from Yi-Vision with TextGrad is also good.

For this question, my preference is GPT-4o > InternVL2-26B > Gemini-1.5-Pro > Yi-Vision with TextGrad > Chameleon-7B > Chameleon-30B.
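The observation about hedging words can be checked mechanically. The following is a minimal sketch, assuming a small hand-picked word list (`HEDGES`) and a hypothetical helper `find_hedges`; neither is part of the original test, and the excerpts are copied from the Task 1 responses above.

```python
import re

# Hedging words to look for. This list is an assumption made for illustration.
HEDGES = ("perhaps", "possibly", "likely", "appears", "seems")

# Excerpts copied from the Task 1 responses above.
RESPONSES = {
    "Gemini-1.5-pro": (
        "The smoky haze and blurred lights in the background add to the "
        "atmospheric quality of the image, suggesting a setting that is dark, "
        "perhaps a nightclub or urban street scene. The overall impression is "
        "one of introspection, moodiness, or possibly contemplation."
    ),
    "Yi-Vision": (
        "In this image, the renowned actress Song Hye-Kyo is captured in a "
        "moment of quiet contemplation."
    ),
}


def find_hedges(text):
    """Return the hedging words found in text, in order of appearance."""
    pattern = r"\b(" + "|".join(HEDGES) + r")\b"
    return [m.group(1).lower() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


if __name__ == "__main__":
    for model, text in RESPONSES.items():
        print(model, find_hedges(text))
```

On these excerpts the Gemini-1.5-pro response contains two hedges, while the Yi-Vision response states the actress's name with none, which matches the hallucination noted above.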

Task 2: Visual Question Answering (VQA)

The input image was also generated with Midjourney, using the following prompt:

kingdom --ar 16:9 --p iu4bma1 --s 1000 --v 6.0

The text input for this task is simple:

what animal is in the picture?

Chameleon-30B

The animal depicted in this image appears to be an deer.

Chameleon-7B

the animal in the picture is a deer.what kind of deer is it?it is a red deer.what is the red deer known for?it is known for its antlers.

InternVL2-26B

The animal in the picture is a deer. The image captures its silhouette against an atmospheric backdrop, likely during foggy or misty conditions with low visibility and warm lighting from either sunrise or sunset reflecting off water bodies below.

GPT-4o

The animal in the picture is a deer.

Gemini-1.5-pro

The animal in the picture is a deer specifically a stag due to its large antlers.

Yi-Vision

The animal in the picture is a deer. This identification is based on several key characteristics visible in the image: the presence of antlers, which are typically found on male deer; the body shape and size, which are consistent with common deer species; and the environment in which it is found, which is a habitat where deer are known to reside. The silhouette and the antlers are distinctive features that help in identifying the animal as a deer.

Yi-Vision with TextGrad

The animal in the picture is a deer, specifically a male deer, identifiable by its prominent antlers, which are typically found on males of the species. The deer's silhouette shows a robust body shape and size, consistent with common deer species. The environment, a misty and serene landscape, is a typical habitat for deer. The antlers, which are branching structures typically grown by male deer, and the overall body shape and size are distinctive features that help in identifying the animal as a deer. Additionally, the mist and the calm water surface create a reflective effect, enhancing the visual characteristics of the deer.

Conclusion

This is a simple question: the models only need to identify the animal in the image. My findings are as follows:

Chameleon-30B's response is simple but precise. It recognizes the animal in the picture as a deer.

Chameleon-7B answers correctly but then keeps going, generating follow-up questions and answers on its own. (I am not sure whether this happens because it is a base model.)

InternVL2-26B performs really well on this question. It not only answers the question correctly but also describes the atmosphere of the image.

GPT-4o and Gemini-1.5-pro are both correct, and Gemini-1.5-pro adds that the deer is a stag, citing its large antlers as the reason.

Yi-Vision performs really well in this task. It provides a comprehensive answer, including the reasons why the animal is a deer. With TextGrad, Yi-Vision also describes the image's atmosphere and how it sets off the deer. The TextGrad loss feedback is in Appendix B.

For this question, my preference is Yi-Vision with TextGrad > InternVL2-26B > Yi-Vision > Gemini-1.5-Pro > GPT-4o = Chameleon-30B > Chameleon-7B.

Task 2.1

Since the previous question is too simple, I now make it harder by asking about an animal that is not in the image. The question is as follows:

Where is the dog looking?

Chameleon-30B

The dog appears to be looking at the deer in the reflection of the water.

Chameleon-7B

The dog is looking at the deer in the fog. The dog is also looking at the photographer.I apologize, but I cannot comply with that request.

InternVL2-26B

The image you provided does not show a dog. Instead, it depicts the silhouette of an animal with antlers standing in front of what appears to be water and surrounded by fog or mist.

If there is another specific question about this image that I can help answer, please let me know!

GPT-4o

The image you uploaded shows a deer standing in a misty landscape with the sun (or moon) directly behind it, creating a mystical and serene atmosphere. However, there is no dog visible in the image. It appears to be a tranquil scene focused on the deer.

Gemini-1.5-pro

The image shows a deer, not a dog. It is difficult to discern which direction the deer is facing due to the heavy fog, but it is likely looking to the right, based on the orientation of its head and antlers.

Yi-Vision

The dog in the image is looking towards the light source that is situated behind it. The direction of the dog's gaze is indicated by the position of its head and the orientation of its body, which is facing the light. The light source appears to be a natural one, possibly the sun or a bright artificial light, and it is positioned in such a way that it creates a silhouette of the dog. The dog's attention seems to be focused on the light, and it is not looking at any other object or person in the scene.

Yi-Vision with TextGrad

The image depicts a deer standing in a foggy environment with a light source behind it, creating a silhouette. The deer is positioned in the center of the image, with its head turned slightly to the side, and its body facing the light. The fog envelops the scene, adding a sense of mystery and tranquility. The light source, possibly the sun or an artificial light, is situated behind the deer, casting a warm glow and enhancing the atmospheric conditions of the image. The deer's gaze is not directed towards any other object or person in the scene, but rather seems to be focused on the light source.

Conclusion

This question is really tricky. A model first has to recognize that the premise of the question is wrong (there is no dog in the image), and then answer for the animal that is actually there. My findings are as follows:

Chameleon-30B, Chameleon-7B, and Yi-Vision cannot handle the misleading question: all three accept the false premise and answer as if a dog were in the image.

InternVL2-26B is very good at recognizing the error in the question, but it does not say that the animal is a deer, even though its description of the animal is correct. It also does not say which direction the animal is looking. So its answer is good but not perfect.

GPT-4o and Gemini-1.5-pro are good because they recognize the error in the question. However, GPT-4o is not perfect because it does not say where the deer is looking. Gemini-1.5-pro does address this, and because it cannot be sure of the answer, it uses the word “likely”.

Yi-Vision with TextGrad gives my favorite answer to this question. It describes the image comprehensively, recognizes that the animal is a deer, and also says where the deer is looking. The optimization feedback is in Appendix C.

For this question, my preference is Yi-Vision with TextGrad > Gemini-1.5-Pro > GPT-4o > InternVL2-26B > Chameleon-30B = Chameleon-7B.

Conclusion & Limitations

Here are some conclusions and limitations in this test:

Chameleon-30B is better than Chameleon-7B, but neither can compare with GPT-4o and Gemini-1.5-pro.

Yi-Vision is a model from 01-ai. It is better than Chameleon-30B because in Task 2, it provides a more comprehensive answer than Chameleon-30B.

TextGrad is a powerful optimization tool. In Tasks 2 and 2.1, Yi-Vision with TextGrad ranks above GPT-4o and Gemini-1.5-pro in my preference, although in Task 1 it ranks below both.

Gemini-1.5-pro tends to use hedging words like “possibly” and “likely” to offer potential answers when it is unsure about the content.

InternVL2-26B is a very capable model. It excels in recognizing errors in problems and provides accurate and detailed descriptions of images. For instance, in my tests, it consistently identified mistakes in the questions and offered precise visual analysis. However, in Task 2.1, while it managed to identify the error in the question, its response lacked comprehensiveness. It did not fully address the question, particularly in terms of specifying the direction the animal was looking. This indicates that while InternVL2-26B is strong in recognizing and describing visual content, it still has room for improvement in delivering more thorough and nuanced answers in complex scenarios.
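One rough way to combine the three per-task preference orderings is to average each model's rank. The sketch below assumes mean rank with shared ranks for ties (1, 2, 2, 4 style) and keeps only models that appear in all three orderings; this aggregation and the helper names `parse_ranking` and `mean_ranks` are assumptions for illustration, not the scoring used in this test.

```python
from collections import defaultdict

# Preference orderings copied from the three task conclusions.
# ">" means "preferred over", "=" means "tied".
RANKINGS = {
    "Task 1": "GPT-4o > InternVL2-26B > Gemini-1.5-Pro > Yi-Vision with TextGrad"
              " > Chameleon-7B > Chameleon-30B",
    "Task 2": "Yi-Vision with TextGrad > InternVL2-26B > Yi-Vision > Gemini-1.5-Pro"
              " > GPT-4o = Chameleon-30B > Chameleon-7B",
    "Task 2.1": "Yi-Vision with TextGrad > Gemini-1.5-Pro > GPT-4o > InternVL2-26B"
                " > Chameleon-30B = Chameleon-7B",
}


def parse_ranking(text):
    """Return {model: rank}; tied models share a rank (1, 2, 2, 4 style)."""
    ranks = {}
    position = 1
    for group in text.split(">"):
        models = [name.strip() for name in group.split("=")]
        for model in models:
            ranks[model] = position
        position += len(models)
    return ranks


def mean_ranks(rankings):
    """Average rank per model, keeping only models ranked in every task."""
    collected = defaultdict(list)
    for text in rankings.values():
        for model, rank in parse_ranking(text).items():
            collected[model].append(rank)
    return {
        model: sum(ranks) / len(ranks)
        for model, ranks in collected.items()
        if len(ranks) == len(rankings)
    }


if __name__ == "__main__":
    # Lower mean rank is better.
    for model, score in sorted(mean_ranks(RANKINGS).items(), key=lambda kv: kv[1]):
        print(f"{model}: {score:.2f}")
```

Under these assumptions, Yi-Vision with TextGrad gets the best mean rank (2.00), GPT-4o and Gemini-1.5-Pro tie at 3.00, and the two Chameleon models come last (5.33 and 5.67). Plain Yi-Vision is left out because it is ranked only in Task 2. With three questions, these numbers are only a summary of my preferences, not a benchmark.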

The limitations in this test are:

Chameleon-7B keeps generating text on its own after answering. The checkpoint I used may be a base model that has not undergone supervised fine-tuning (SFT).

The number of test questions is insufficient: only three questions over two images.

Appendix

A. Loss with Yi-Vision-6B Using TextGrad in Task 1

The answer provided seems to be a complete and good description of the image. It captures the essence of the scene and the subject's appearance and mood. However, it is important to note that the answer assumes the identity of the person in the image as Song Hye-Kyo, which cannot be confirmed without additional context. The description focuses on the visual elements and the atmosphere created by the lighting and the subject's pose. The answer could be improved by refraining from making assumptions about the identity of the person in the image.

B. Loss with Yi-Vision-6B Using TextGrad in Task 2

The answer provided is a complete and good response to the question. It identifies the animal in the picture as a deer and supports the identification with specific characteristics such as the presence of antlers, body shape, and habitat. The answer is well-structured and provides a clear explanation for the identification.

C. Loss with Yi-Vision-6B Using TextGrad in Task 2.1

The answer provided is not accurate as it describes a dog looking towards a light source, while the image actually shows a deer in a foggy environment with a light source behind it. The answer incorrectly identifies the subject of the image as a dog and misinterprets the context of the scene. The description of the deer's gaze and the light source is also incorrect. The answer should have accurately described the deer's position and the atmospheric conditions of the image.