By Yassine El Yacoubi, Founder and CEO of musAIm.ai
As our interactions with AI become increasingly commoditized, bringing "humanness" to these intimate exchanges with bits and bytes is essential to earning your customers' trust, engagement, and fandom. A critical element of building these immersive AI solutions is making them engage multiple senses.
In this guide, I share a step-by-step tutorial illustrating how you can integrate different modalities into your LLM solution.
I will walk you through the basics, provide code snippets, and link to the best resources for deeper exploration. I hope you find it helpful. For any further questions or notes, don't hesitate to leave a comment or email me at yassine@musaim.ai.
What Is Multi-Modal AI?
Multi-modal refers to solutions that can process inputs and generate outputs across multiple types of data, notably text, images, video, and audio. For example, a solution that can:
- Analyze a photo and describe its contents in natural language.
- Respond to spoken commands while analyzing real-time video input.
- Generate captions for videos with contextual text and voiceover.
Why Multi-Modal AI Matters
We understand the world and react to it through multiple senses. Integrating multiple data types allows AI to better understand the real world, where information rarely exists in just one form. It also makes the interaction itself multi-sensory, and multi-sensory interactions are powerful, fun, and captivating. Building multi-modal AI solutions is therefore an essential way to delight customers by engaging more than one sense.
Imagine, for example, the power of the following experiences:
- Virtual assistants that respond to voice commands and recognize objects in your environment.
- Educational tools that combine visual aids with spoken explanations.
- Enhanced customer support systems that analyze and respond to visual and auditory cues.
Step 1: Set Up Your Environment
Let's start by getting you set up:
Install Required Libraries
Install the following packages:
pip install transformers torch torchvision speechrecognition opencv-python
Learn More About the Installed Libraries
- Hugging Face Transformers: For working with LLMs. Documentation
- PyTorch: For building and training models. Documentation
- OpenCV: For image and video processing. Documentation
- SpeechRecognition: For processing audio and converting speech to text. Documentation
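Before going further, it can help to confirm that everything imports cleanly. Note that microphone capture with SpeechRecognition typically also requires PyAudio (pip install pyaudio). The following is a minimal sanity-check sketch of my own, not part of the tutorial's required code:

# sanity_check.py - confirm the key libraries import and report their versions
import torch
import transformers
import cv2
import speech_recognition as sr

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("opencv-python:", cv2.__version__)
print("SpeechRecognition:", sr.__version__)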
Step 2: Build a Multi-Modal Pipeline
1. Text + Vision
Let’s start by combining text and images. I’ll be using BLIP from Hugging Face.
As an illustration, let's write some code that takes an image as input and outputs a text description of that image:
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the model and processor once to improve efficiency
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def generate_image_description(image_path):
    """
    Generates a description of the given image using BLIP.

    Args:
        image_path (str): The path to the image.

    Returns:
        str: The generated description of the image.
    """
    # Load the image
    image = Image.open(image_path)

    # Process the image
    inputs = processor(image, return_tensors="pt")

    # Generate the description
    outputs = model.generate(**inputs)
    caption = processor.decode(outputs[0], skip_special_tokens=True)
    return caption
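A quick usage sketch, assuming you have a local photo (the file name example_image.jpg is just a placeholder):

# Example usage (replace the path with your own image)
description = generate_image_description("example_image.jpg")
print("Caption:", description)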
2. Voice
Now, let’s add voice input recognition via the SpeechRecognition library.
import speech_recognition as sr

def speech_to_text():
    """
    Captures audio from the microphone and converts it to text using Google Speech Recognition.

    Returns:
        str: The recognized text from speech, or an error message if recognition fails.
    """
    # Initialize the recognizer
    recognizer = sr.Recognizer()

    # Capture audio from the microphone
    with sr.Microphone() as source:
        print("Speak something...")
        audio = recognizer.listen(source)

    # Convert speech to text
    try:
        command = recognizer.recognize_google(audio)
        return command
    except sr.UnknownValueError:
        return "Sorry, could not understand the audio."
    except sr.RequestError as e:
        return f"Request error: {e}"
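If you want to test without a microphone, SpeechRecognition can also transcribe a pre-recorded audio file via sr.AudioFile. A minimal sketch, assuming you have a WAV file; the name command.wav is just an example:

import speech_recognition as sr

def speech_from_file(audio_path):
    """Transcribes a pre-recorded audio file (e.g. WAV) using Google Speech Recognition."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)  # read the entire file
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return "Sorry, could not understand the audio."
    except sr.RequestError as e:
        return f"Request error: {e}"

print(speech_from_file("command.wav"))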
3. Text + Voice + Vision = Magic!
Finally, let’s create a pipeline that accepts an image, processes voice commands, and uses the LLM to generate a response.
from transformers import pipeline

# Assuming generate_image_description and speech_to_text functions are defined
from your_module import generate_image_description, speech_to_text  # Replace `your_module` with the actual module name

# Load an LLM pipeline. Note: gpt-3.5-turbo is an OpenAI API model and cannot be loaded
# through Hugging Face Transformers; use an open model such as gpt2, or swap in a stronger
# open model you have access to.
text_generator = pipeline("text-generation", model="gpt2")

# Get the image description dynamically
image_path = "example_image.jpg"  # Replace with the actual image path
image_description = generate_image_description(image_path)

# Get the spoken input dynamically
spoken_input = speech_to_text()

# Combine inputs and generate output
prompt = f"The image shows: {image_description}. The user said: {spoken_input}. Provide a response:"
response = text_generator(prompt, max_new_tokens=100, num_return_sequences=1)
print("AI Response:", response[0]['generated_text'])
This creates a system where the AI interprets images and voice input, and then responds with contextually rich text.
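Before wiring this into a web app, it can be convenient to wrap the three steps in a single reusable function. This is my own helper (the name multimodal_response is not from any library), reusing the functions and pipeline defined above:

def multimodal_response(image_path, text_generator):
    """Describes the image, captures a spoken command, and generates a combined reply."""
    image_description = generate_image_description(image_path)
    spoken_input = speech_to_text()
    prompt = (
        f"The image shows: {image_description}. "
        f"The user said: {spoken_input}. Provide a response:"
    )
    result = text_generator(prompt, max_new_tokens=100, num_return_sequences=1)
    return result[0]["generated_text"]

# Example usage
print("AI Response:", multimodal_response("example_image.jpg", text_generator))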
Step 3: Build the Front End of Your Application
We'll be using Flask, a lightweight web framework, to build the front end of this application. In this sample, the user uploads an image via the form on the main page (index.html). Flask processes the image and captures the voice input through the backend functions listed above. The AI response is displayed on a results page (result.html).
from flask import Flask, request, render_template
from transformers import pipeline
from your_module import generate_image_description, speech_to_text  # Replace with the actual module name

app = Flask(__name__)

# Load the text generation pipeline (use an open Hugging Face model; gpt-3.5-turbo is API-only)
text_generator = pipeline("text-generation", model="gpt2")

@app.route("/")
def index():
    return render_template("index.html")  # HTML form for image upload and voice input

@app.route("/process", methods=["POST"])
def process():
    # Handle image upload
    uploaded_image = request.files["image"]
    image_path = "uploaded_image.jpg"
    uploaded_image.save(image_path)

    # Handle voice input (assumes microphone input is captured separately)
    spoken_input = speech_to_text()

    # Generate image description
    image_description = generate_image_description(image_path)

    # Generate response
    prompt = f"The image shows: {image_description}. The user said: {spoken_input}. Provide a response:"
    response = text_generator(prompt, max_new_tokens=100, num_return_sequences=1)

    return render_template(
        "result.html",
        image_description=image_description,
        spoken_input=spoken_input,
        ai_response=response[0]['generated_text']
    )

if __name__ == "__main__":
    app.run(debug=True)
Below are the HTML templates used in the sample above. Place them in a templates/ folder so Flask's render_template can find them:
index.html
<!DOCTYPE html>
<html>
<head>
    <title>Multi-Modal AI Interface</title>
</head>
<body>
    <h1>Multi-Modal AI Interface</h1>
    <form action="/process" method="post" enctype="multipart/form-data">
        <label for="image">Upload an Image:</label>
        <input type="file" name="image" id="image" required><br><br>
        <button type="submit">Submit</button>
    </form>
</body>
</html>
result.html
<!DOCTYPE html>
<html>
<head>
    <title>AI Response</title>
</head>
<body>
    <h1>AI Response</h1>
    <p><strong>Image Description:</strong> {{ image_description }}</p>
    <p><strong>Spoken Input:</strong> {{ spoken_input }}</p>
    <p><strong>Generated Response:</strong> {{ ai_response }}</p>
    <a href="/">Go Back</a>
</body>
</html>
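If you want to smoke-test the /process route without opening a browser, Flask's built-in test client can post a multipart upload directly. A minimal sketch, assuming the Flask app above is saved as app.py, the templates live in a templates/ folder, and example_image.jpg exists locally (speech_to_text() will still prompt you to speak when the route runs):

# smoke_test.py - exercise the /process route with Flask's test client
from app import app

with app.test_client() as client:
    with open("example_image.jpg", "rb") as img:
        response = client.post(
            "/process",
            data={"image": (img, "example_image.jpg")},
            content_type="multipart/form-data",
        )
print(response.status_code)            # expect 200
print(response.data[:500].decode())    # start of the rendered result.html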
Further Reading
- Hugging Face Transformers Documentation
- CLIP Model Paper
- PyTorch Tutorials
- SpeechRecognition Documentation
- OpenCV Python Docs
- Flask
Building multi-modal AI systems is an essential step in many solutions given how diverse and interconnected our world is. By combining vision, voice, and text, you’re not just building a technical tool—you’re creating an experience that mirrors human interaction. Whether you’re creating a virtual assistant or reimagining how users interact with content, the possibilities are endless. Dive in, experiment, and let your curiosity lead the way!