HDIL: How Do I Look?

Carnegie Mellon University

Abstract

We present a real-time pipeline that rates the appropriateness of an attire for any custom occasion or event. The pipeline also provides the user with suggestions on how the attire can be improved to be more culturally sensitive, suitable for the season, and color coordinated. The pipeline uses a state-of-the-art image-to-text model, BLIP-2, to generate a detailed description of the person's attire in the image. Using this look description and the user-provided custom description of the event that they are dressed up for, we perform Prompt Engineering to develop the most impactful prompt to query GPT3.5-turbo through OpenAI API calls. The resulting text consists of an "Overall Rating", sub-category ratings, and possible suggestions for the attire and/or accessories. The user interface is intuitive and the overall latency is ~7.1 s.


An image of Dishani in this attire with input prompt: "for going to a snow resort"


An image of Dishani in this attire with input prompt: "for an Indian wedding"

Pipeline

The real-time pipeline consists of the following steps:

Step 1: The user uploads an image of themselves in an attire, along with a description of the occasion they are dressed up for.

Step 2: The image is passed through a state-of-the-art image-to-text model, BLIP-2, to generate a detailed description of the person's attire in the image.

Step 3: Using this look description and the user-provided custom description of the event that they are dressed up for, we perform Prompt Engineering to develop the most impactful prompt to query GPT3.5-turbo through OpenAI API calls. We decided on the following factors to construct the prompt for the query: Color coordination, Color appropriateness, Potential cultural sensitivity, Seasonal suitability, Accessories.

Step 4: The resulting text consists of an "Overall Rating", sub-category ratings, and possible suggestions for the attire and/or accessories.

An example prompt to the BLIP-2 model is ->
"Question: Explain the outfit in as much detail as possible. Answer:"

An example prompt to the GPT3.5-turbo model after Prompt Engineering is ->
"I am a male, wearing a navy blue and gold embroidered kurta pajama. Details of the occasion are as follows: attending a close friend's Indian wedding. Give one overall rating out of 'consider changing outfit', 'could be better', 'on point', and title it 'overall rating'. Consider: Color coordination, Color appropriateness, Potential cultural sensitivity, Seasonal suitability, accessories (for each of the categories give one succinct sentence)."
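A corresponding sketch of Steps 3 and 4, assuming the v1.x openai Python client. The split into gender, look description, and occasion slots is an assumption about how the prompt is assembled; the template string mirrors the example prompt above, and in the full pipeline the BLIP-2 description from Step 2 would fill the look_description slot.

```python
# Sketch of Steps 3 and 4: building the engineered prompt and querying
# GPT3.5-turbo through the OpenAI API. Template slots are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "I am a {gender}, wearing {look_description}. "
    "Details of the occasion are as follows: {occasion}. "
    "Give one overall rating out of 'consider changing outfit', "
    "'could be better', 'on point', and title it 'overall rating'. "
    "Consider: Color coordination, Color appropriateness, "
    "Potential cultural sensitivity, Seasonal suitability, accessories "
    "(for each of the categories give one succinct sentence)."
)

def rate_outfit(gender: str, look_description: str, occasion: str) -> str:
    """Return the overall rating, sub-category ratings, and suggestions."""
    prompt = PROMPT_TEMPLATE.format(
        gender=gender, look_description=look_description, occasion=occasion
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # keep the ratings reasonably deterministic
    )
    return response.choices[0].message.content
```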


Results

We obtained the best results when we used BLIP-2 as the image-to-text generation model along with GPT3.5-turbo as the LLM. We leverage the quality of the text generated by these large models, which are pre-trained on huge and diverse datasets.

The latency of the whole pipeline is only ~7.1 s.

We observe that more than 90% of this time is taken by the OpenAI API calls. This could be reduced by using a locally saved open-source model such as any version of LLaMA2, as sketched below.
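A hedged sketch of that swap, assuming the transformers text-generation pipeline and the gated meta-llama/Llama-2-7b-chat-hf checkpoint; we have not benchmarked this variant, so it illustrates the idea rather than a measured result.

```python
# Sketch of replacing the OpenAI call with a locally hosted LLaMA 2
# chat model. The meta-llama/Llama-2-7b-chat-hf checkpoint is gated on
# Hugging Face and is only one possible choice.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

def rate_outfit_local(prompt: str) -> str:
    """Answer the same engineered prompt with a local LLM instead of the API."""
    # LLaMA 2 chat models expect the [INST] ... [/INST] instruction format.
    output = generator(
        f"[INST] {prompt} [/INST]",
        max_new_tokens=256,
        do_sample=False,
        return_full_text=False,
    )
    return output[0]["generated_text"].strip()
```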

Below are some examples showcasing two different attires and occasions.

Ablation Studies

Why not use BLIP-2 directly?

Here is an example case which showcases that although BLIP-2 is great at describing and captioning the image, it is not reliable when answering indirect or chain-of-thought questions about the objects in the image.

This is because the text prompt that we provide is simply appended to the visual query embeddings and passed to the frozen LLM. So the quality of the answer is limited by the amount of Prompt Engineering that takes place and by the quality of the frozen LLM.
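One way to reproduce this kind of failure is to probe BLIP-2 with a descriptive prompt and an indirect, deduction-style prompt on the same image. The sketch below reuses the loading pattern from the pipeline section; the image path and the second prompt are illustrative.

```python
# Probing BLIP-2 with a descriptive prompt vs. an indirect, deduction-style
# prompt on the same image. In our experience the first kind is answered
# well, while the second kind is not reliable.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("outfit.jpg").convert("RGB")  # illustrative path
prompts = [
    "Question: Explain the outfit in as much detail as possible. Answer:",
    "Question: Is this outfit appropriate for an Indian wedding, and why? Answer:",
]
for prompt in prompts:
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=60)
    print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```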


Future Scope

Experiment with LLaMA2 as the frozen LLM:

The authors of BLIP-2 report improved generations with stronger image encoders and/or a stronger frozen LLM. Thus, using a higher-quality LLM that reliably answers deduction-based questions should boost the performance of the image-to-text model itself and hopefully eliminate the need for GPT3.5-turbo.

Real-time Image Analysis:

Implement a system that can analyze images in real time. This could be particularly useful for event planners, social media platforms, or e-commerce sites looking to ensure that shared content aligns with their guidelines.

Virtual Wardrobe:

Integrate a virtual wardrobe feature that suggests alternative outfits for images that may not be appropriate for a given context. Users can virtually try on different outfits generated by the AI.