Expanding Beyond Text – A Practical Guide to Building Multi-Modal LLM Solutions

By Yassine El Yacoubi, Founder and CEO of musAIm.ai

As our interactions with AI become increasingly commoditized, bringing "humanness" to these intimate exchanges with bits and bytes is essential to earning the trust, engagement, and loyalty of your customers. A critical element of building these immersive AI solutions is making them engage multiple senses.

In this guide, I share a step-by-step tutorial that illustrates how you can integrate different modalities into your LLM solution.

I will walk you through the basics, provide code snippets, and link to the best resources for deeper exploration. I hope you find it helpful. If you have any questions or notes, don't hesitate to leave a comment or email me at yassine@musaim.ai.


What Is Multi-Modal AI?

Multi-modal refers to solutions that can process inputs and generate outputs across multiple types of data, notably text, images, video, and audio. For example, a solution that can:

  • Analyze a photo and describe its contents in natural language.
  • Respond to spoken commands while analyzing real-time video input.
  • Generate captions for videos with contextual text and voiceover.

Why Multi-Modal AI Matters

We understand the world and react to it through multiple senses. Integrating multiple data types allows AI to better understand the real world, where information rarely exists in just one form. It also makes the interaction itself multi-sensory, and multi-sensory interactions are powerful, fun, and captivating. Building multi-modal AI solutions is therefore an essential way to deliver customer delight.

Imagine, for example, the power of the following experiences:

  • Virtual assistants that respond to voice commands and recognize objects in your environment.
  • Educational tools that combine visual aids with spoken explanations.
  • Enhanced customer support systems that analyze and respond to visual and auditory cues.

Step 1: Set Up Your Environment

Let's start by getting you set up:

Install Required Libraries

Install the following packages. Pillow is used for image loading, and PyAudio is required by SpeechRecognition for microphone input:

pip install transformers torch torchvision speechrecognition opencv-python pillow pyaudio
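
Once the installation finishes, a quick sanity check like the sketch below confirms that the core libraries import cleanly (the exact versions printed will depend on your environment):

import transformers, torch, torchvision, speech_recognition, cv2, PIL

# Print each library's version to confirm the installation worked
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("SpeechRecognition:", speech_recognition.__version__)
print("opencv-python:", cv2.__version__)
print("Pillow:", PIL.__version__)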



Step 2: Build a Multi-Modal Pipeline

1. Text + Vision

Let’s start by combining text and images. I’ll be using BLIP from Hugging Face.

As an illustration, let's write some code that takes an image as input and outputs a text description of that image:

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the model and processor once to improve efficiency
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

def generate_image_description(image_path):
    """
    Generates a description of the given image using BLIP.

    Args:
        image_path (str): The path to the image.

    Returns:
        str: The generated description of the image.
    """
    # Load the image and convert it to RGB so the processor gets a consistent format
    image = Image.open(image_path).convert("RGB")

    # Process the image
    inputs = processor(image, return_tensors="pt")

    # Generate the description
    outputs = model.generate(**inputs)
    caption = processor.decode(outputs[0], skip_special_tokens=True)

    return caption
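
As a quick usage example (example_image.jpg is a hypothetical path; point it at any image on your machine), you can call the function directly:

# Hypothetical image path -- replace with a real file
caption = generate_image_description("example_image.jpg")
print("Caption:", caption)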

2. Voice

Now, let’s add voice input recognition via the SpeechRecognition library.

import speech_recognition as sr

def speech_to_text():
    """
    Captures audio from the microphone and converts it to text using Google Speech Recognition.

    Returns:
        str: The recognized text from speech, or an error message if recognition fails.
    """
    # Initialize the recognizer
    recognizer = sr.Recognizer()

    # Capture audio from the microphone
    with sr.Microphone() as source:
        print("Speak something...")
        audio = recognizer.listen(source)

    # Convert speech to text
    try:
        command = recognizer.recognize_google(audio)
        return command
    except sr.UnknownValueError:
        return "Sorry, could not understand the audio."
    except sr.RequestError as e:
        return f"Request error: {e}"

3. Text + Voice + Vision = Magic!

Finally, let’s create a pipeline that accepts an image, processes voice commands, and uses the LLM to generate a response.

from transformers import pipeline

# Assuming generate_image_description and speech_to_text functions are defined
from your_module import generate_image_description, speech_to_text  # Replace `your_module` with the actual module name

# Load an LLM pipeline (gpt-3.5-turbo is an OpenAI API model and cannot be loaded through
# transformers.pipeline, so this example uses the openly available gpt2 checkpoint instead)
text_generator = pipeline("text-generation", model="gpt2")

# Get the image description dynamically
image_path = "example_image.jpg"  # Replace with the actual image path
image_description = generate_image_description(image_path)

# Get the spoken input dynamically
spoken_input = speech_to_text()

# Combine inputs and generate output
prompt = f"The image shows: {image_description}. The user said: {spoken_input}. Provide a response:"
response = text_generator(prompt, max_new_tokens=100, num_return_sequences=1)

print("AI Response:", response[0]['generated_text'])

This creates a system where the AI interprets images and voice input, and then responds with contextually rich text.
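
If you plan to reuse this flow, for example in the web app below, one option is to wrap it in a small helper. This is just a sketch of my own (the name multimodal_respond is not from any library) and assumes generate_image_description, speech_to_text, and text_generator are defined as above:

def multimodal_respond(image_path):
    """Caption an image, capture a spoken command, and generate a combined response."""
    # Describe the image and capture the spoken command
    image_description = generate_image_description(image_path)
    spoken_input = speech_to_text()

    # Build a single prompt from both modalities and let the LLM respond
    prompt = f"The image shows: {image_description}. The user said: {spoken_input}. Provide a response:"
    result = text_generator(prompt, max_new_tokens=100, num_return_sequences=1)
    return result[0]["generated_text"]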


Step 3: Build the Front End of Your Application

We'll be using Flask, a lightweight web framework, to build the front end of this application. In this sample, the user uploads an image via the form on the main page (index.html), Flask processes the image and captures the voice input through the backend functions defined above, and the AI response is displayed on a results page (result.html).

from flask import Flask, request, render_template
from transformers import pipeline
from your_module import generate_image_description, speech_to_text  # Replace with the actual module name

app = Flask(__name__)

# Load the text generation pipeline (gpt2 stands in for gpt-3.5-turbo, which is only available via the OpenAI API)
text_generator = pipeline("text-generation", model="gpt2")

@app.route("/")
def index():
    return render_template("index.html")  # HTML form for image upload and voice input

@app.route("/process", methods=["POST"])
def process():
    # Handle image upload
    uploaded_image = request.files["image"]
    image_path = "uploaded_image.jpg"
    uploaded_image.save(image_path)

    # Handle voice input (assumes microphone input is captured separately)
    spoken_input = speech_to_text()

    # Generate image description
    image_description = generate_image_description(image_path)

    # Generate response
    prompt = f"The image shows: {image_description}. The user said: {spoken_input}. Provide a response:"
    response = text_generator(prompt, max_new_tokens=100, num_return_sequences=1)

    return render_template(
        "result.html",
        image_description=image_description,
        spoken_input=spoken_input,
        ai_response=response[0]['generated_text']
    )

if __name__ == "__main__":
    app.run(debug=True)

Below are the HTML templates used in the sample above:

index.html

<!DOCTYPE html>
<html>
<head>
    <title>Multi-Modal AI Interface</title>
</head>
<body>
    <h1>Multi-Modal AI Interface</h1>
    <form action="/process" method="post" enctype="multipart/form-data">
        <label for="image">Upload an Image:</label>
        <input type="file" name="image" id="image" required><br><br>
        <button type="submit">Submit</button>
    </form>
</body>
</html>

result.html

<!DOCTYPE html>
<html>
<head>
    <title>AI Response</title>
</head>
<body>
    <h1>AI Response</h1>
    <p><strong>Image Description:</strong> {{ image_description }}</p>
    <p><strong>Spoken Input:</strong> {{ spoken_input }}</p>
    <p><strong>Generated Response:</strong> {{ ai_response }}</p>
    <a href="/">Go Back</a>
</body>
</html>
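
One practical note: Flask's render_template looks for these files in a templates/ directory next to the application script, so a minimal project layout (the file and folder names here are only suggestions) would be:

multimodal_app/
    app.py              # the Flask application shown above
    templates/
        index.html      # upload form
        result.html     # AI response page

Run the app with python app.py and open http://127.0.0.1:5000/ in your browser (5000 is Flask's default development port).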

Building multi-modal AI systems is an essential step for many solutions, given how diverse and interconnected our world is. By combining vision, voice, and text, you’re not just building a technical tool; you’re creating an experience that mirrors human interaction. Whether you’re creating a virtual assistant or reimagining how users interact with content, the possibilities are endless. Dive in, experiment, and let your curiosity lead the way!
