We present a real-time pipeline that assesses the appropriateness of an attire for any custom occasion or event. The pipeline also provides the user with suggestions for making the attire more culturally sensitive, better suited to the season, and better color coordinated. The pipeline uses the state-of-the-art image-to-text model BLIP-2 to generate a detailed description of the person's attire in the image. Using this look description and the user-provided custom description of the event they are dressed up for, we perform prompt engineering to construct the most impactful prompt for querying GPT-3.5-turbo through OpenAI API calls. The resulting text consists of an "Overall Rating", sub-category ratings, and possible suggestions for attire and/or accessories. The user interface is intuitive and the overall latency is ~7.1 s.
The real-time pipeline consists of the following steps:
Step 1: The user uploads an image of themselves in their attire and provides the occasion they are dressed up for.
Step 2: The image is passed through a state-of-the-art image-to-text model, BLIP-2, to generate a detailed description of the person's attire in the image.
Step 3: Using this look description and the user-provided custom description of the event they are dressed up for, we perform prompt engineering to construct the most impactful prompt for querying GPT-3.5-turbo through OpenAI API calls. We decided on the following factors to construct the prompt for the query:
color coordination, color appropriateness, potential cultural sensitivity, seasonal suitability, and accessories.
Step 4: The resulting text consists of an "Overall Rating", sub-category ratings, and possible suggestions for attire and/or accessories (the end-to-end flow is sketched below).
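At a high level, the four steps can be written as a short driver function. This is only a structural sketch: the helper names describe_outfit, build_prompt, and query_llm are hypothetical, and possible implementations under stated assumptions are given after the example prompts that follow.

def rate_outfit(image_path: str, occasion: str) -> str:
    # Step 2: BLIP-2 turns the uploaded image into a detailed look description.
    look_description = describe_outfit(image_path)
    # Step 3: the look description and the occasion are combined into the engineered prompt,
    # which is sent to GPT-3.5-turbo through the OpenAI API.
    prompt = build_prompt(look_description, occasion)
    # Step 4: the returned text contains the overall rating, sub-category ratings, and suggestions.
    return query_llm(prompt)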
An example prompt to the BLIP-2 model is as follows:
"Question: Explain the outfit in as much detail as possible. Answer:"
An example prompt to the GPT-3.5-turbo model after prompt engineering is as follows:
"I am a male, wearing a navy blue and gold embroidered kurta pajama. Details of the occasion are as follows: attending a close friend's Indian wedding. Give one overall rating out of 'consider changing outfit', 'could be better', 'on point', and title it 'overall rating'. Consider: Color coordination, Color appropriateness, Potential cultural sensitivity, Seasonal suitability, accessories (for each of the categories give one succinct sentence)."
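A sketch of the prompt construction and the OpenAI API call is shown below. It assumes the openai>=1.0 Python client and an OPENAI_API_KEY environment variable; our original calls may have used an older client interface, and the exact prompt template is reproduced from the example above.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(look_description: str, occasion: str) -> str:
    # Combine the BLIP-2 look description with the user's occasion into the engineered prompt.
    return (
        f"{look_description}. Details of the occasion are as follows: {occasion}. "
        "Give one overall rating out of 'consider changing outfit', 'could be better', "
        "'on point', and title it 'overall rating'. Consider: Color coordination, "
        "Color appropriateness, Potential cultural sensitivity, Seasonal suitability, "
        "accessories (for each of the categories give one succinct sentence)."
    )

def query_llm(prompt: str) -> str:
    # Query GPT-3.5-turbo and return the rating text.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

rating = query_llm(build_prompt(
    "I am a male, wearing a navy blue and gold embroidered kurta pajama",
    "attending a close friend's Indian wedding",
))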
We obtained the best results when we used BLIP-2 as the image-to-text generation model along with GPT-3.5-turbo as the LLM. We leverage the quality of the text generated by these large models, which are pre-trained on huge and diverse datasets.
The latency of the whole pipeline is only ~7.1 s.
We observe that more than 90% of this time is taken by the OpenAI API calls. This latency could be reduced by replacing the API calls with locally hosted open-source models such as any version of LLaMA-2, as sketched below.
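A hypothetical drop-in replacement for the OpenAI call using a local LLaMA-2 chat model is sketched below. The checkpoint choice is an assumption, meta-llama checkpoints are gated behind a license acceptance on the Hugging Face Hub, and we have not benchmarked this variant's latency.

from transformers import pipeline

# Load a local chat model once at startup instead of calling the OpenAI API per request.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
)

def query_llm_local(prompt: str) -> str:
    # Generate the rating text locally; return only the continuation, not the echoed prompt.
    out = generator(prompt, max_new_tokens=256, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()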
Below are some examples showcasing two different attires and occasions.
Here is an example case which shows that although BLIP-2 is great at describing and captioning the image, it is not reliable when answering indirect or chain-of-thought questions about the objects in the image. This is because the text prompt that we provide is simply appended to the input of the frozen LLM when we query it, so its performance is limited by the quality of the prompt engineering and by the capabilities of the underlying LLM.
Future work includes implementing a system that can analyze images in real time, which could be particularly useful for event planners, social media platforms, or e-commerce sites looking to ensure that shared content aligns with their guidelines. We also plan to integrate a virtual wardrobe feature that suggests alternative outfits for images that may not be appropriate for a given context, letting users virtually try on different outfits generated by the AI.