
Multimodal AI. New uses of artificial intelligence in business | AI in business #21

What is multimodal AI?

Multimodal AI is a highly advanced form of AI that mimics the human ability to interpret the world using input from different senses. Just as humans understand text, images and sounds, multimodal AI integrates these different types of data to grasp the context and complex meaning contained in information. In business, for example, it can enable a better understanding of customer opinions by analyzing both what customers say and how they express it through tone of voice or facial expression.

Traditional AI systems are typically unimodal, meaning they specialize in one type of data, such as text or images. They can process large amounts of data quickly and spot patterns that human intelligence cannot pick up. However, they have serious limitations. They are insensitive to context and less adept at dealing with unusual and ambiguous situations.

This is why multimodal AI goes a step further, integrating modalities. This allows for deeper understanding and much more interesting interactions between humans and AI.

What can multimodal AI do?

Artificial intelligence models developed today employ the following pairs of modalities:

  • from text to image – such multimodal AI can create images based on textual prompts. This is a core capability of the famous Midjourney, of OpenAI’s DALL-E 3 (available in the browser as Bing Image Creator), of the advanced Stable Diffusion, and of the youngest tool in the family, Ideogram, which not only understands textual prompts but can also place text on an image:
    Source: Ideogram (https://ideogram.ai)

    Multimodal AI models can also follow a textual prompt and the image that “inspires” them simultaneously, which yields more interesting, more precisely defined results and variations of created images. This is very helpful if you just want a slightly different graphic or banner, or want to add or remove a single element, such as a coffee mug:

    Source: Ideogram (https://ideogram.ai)

  • from image to text – artificial intelligence can do much more than recognize and translate text seen in an image or find a similar product. It can also describe an image in words, as Midjourney does when you type the /describe command, and as Google Bard and the Salesforce model do (the latter is used mainly to create automated product and image descriptions on e-commerce sites; a minimal code sketch follows this list):
    Source: HuggingFace.co (https://huggingface.co/tasks/image-to-text)

  • from voice to text – multimodal AI also powers voice commands in Google Bard, though Bing Chat handles them best, as does ChatGPT thanks to its excellent Whisper API, which recognizes and transcribes speech, including punctuation, in multiple languages. This can, among other things, greatly ease the work of international customer service centers, and make it possible to quickly transcribe meetings and translate business conversations into other languages in real time,
  • from text to voice – ElevenLabs’ tool allows us to convert any text into a realistic-sounding utterance, and even offers “voice cloning,” whereby we can teach the AI the sound and expression of a voice so it can deliver any text in a foreign language, for example for marketing materials or presentations to foreign investors,
  • from text to video – converting text into a video featuring a talking avatar is possible in D-ID, Colossyan and Synthesia, among other tools,
  • from image to video – generating videos, including music videos, from images and textual prompts is already possible today with Kaiber, and Meta has announced that its Make-A-Video tool will be released soon,
  • from image to 3D model – this is a particularly promising area of multimodal AI, targeted by Meta and Nvidia, which enables the creation of realistic avatars from photos, as well as the building of 3D models of objects and products with tools such as Masterpiece Studio (https://masterpiecestudio.com/masterpiece-studio-pro), NeROIC (https://zfkuang.github.io/NeROIC/) and 3DFY (https://3dfy.ai/). With these, for example, a two-dimensional product prototype can be turned to show the camera a different side, and a quick 3D visualization can be created from a sketch of a piece of furniture, or even from a textual description:
    Source: NeROIC (https://zfkuang.github.io/NeROIC/resources/material.png)

  • from image to movement in space – this modality takes multimodal AI beyond screens, into the Internet of Things (IoT), autonomous vehicles and robotics, where devices can perform precise actions thanks to advanced image recognition and the ability to respond to changes in the environment.
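
To make the image-to-text pair more concrete, here is a minimal sketch using the open-source transformers library and one of the captioning checkpoints listed on the Hugging Face image-to-text page cited above. The checkpoint name and image URL are illustrative examples, not recommendations:

```python
# A minimal image-to-text sketch using the Hugging Face transformers library.
# Assumes: pip install transformers torch pillow. The checkpoint and image URL
# below are illustrative examples.
from transformers import pipeline

# Salesforce's BLIP captioning model, one of the checkpoints listed at
# https://huggingface.co/tasks/image-to-text
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# The pipeline accepts a local file path or a public image URL.
result = captioner("https://example.com/product-photo.jpg")
print(result[0]["generated_text"])  # e.g. "a white coffee mug on a wooden table"
```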

There are also experiments with multimodal AI translating music into images (see, for example, https://huggingface.co/spaces/fffiloni/Music-To-Image), but let’s take a closer look at the business applications of multimodal AI. So how does multimodality play out in the most popular AI-based chatbots: ChatGPT, Bing Chat and Google Bard?

Multimodality in Google Bard, Bing Chat and ChatGPT

Google Bard can describe simple images and has been equipped with voice communication since July 2023, when it appeared in Europe. Despite the variable quality of its image recognition results, this has so far been one of the strengths that differentiate Google’s solution from ChatGPT.

Bing Chat, thanks to its use of DALL-E 3, can generate images based on text or voice prompts. While it cannot describe in words the images attached by the user, it can modify them or use them as inspiration to create new images.

As of October 2023, OpenAI also began rolling out new voice and image features to ChatGPT Plus, the paid version of the tool. These make it possible to have a voice conversation with ChatGPT or to show it an image, so it knows what you’re asking about without your having to describe it in exact words.

For example, you can take a photo of a monument while traveling and have a live conversation about what’s interesting about it. Or take a picture of the inside of your refrigerator to find out what you can prepare for dinner with the available ingredients and ask for a step-by-step recipe.
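
For those curious what this looks like under the hood, the fridge example could be reproduced through OpenAI’s API with a vision-capable model. This is a minimal sketch, assuming the openai Python package (v1+), an API key in the environment, and a placeholder image URL; the model name reflects the vision preview available at the time of writing:

```python
# A sketch of the "photo of your fridge" example via OpenAI's chat API.
# Assumptions: openai Python package v1+, OPENAI_API_KEY set in the
# environment, and a publicly reachable image URL (the one below is a
# placeholder).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable model at the time of writing
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What can I cook for dinner with these ingredients? "
                         "Give me a step-by-step recipe."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/fridge-photo.jpg"}},
            ],
        }
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)
```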

3 applications of multimodal AI in business

Describing images can help, for example, with preparing a goods inventory based on CCTV camera data or identifying missing products on store shelves. Object manipulation can then be used to replenish the missing goods identified in the previous step. But how can multimodal chatbots be used in business? Here are three examples:

  1. Customer service: A multimodal chatbot implemented in an online store can serve as an advanced customer service assistant that not only answers text questions but also understands images and questions asked by voice. For example, a customer can take a picture of a damaged product and send it to the chatbot, which will help identify the problem and offer an appropriate solution (see the sketch after this list).
  2. Social media analysis: Multimodal artificial intelligence can analyze social media posts, which combine text, images and even videos, to understand what customers are saying about a company and its products. This can help a company better understand customer feedback and respond more quickly to customer needs.
  3. Training and development: ChatGPT can be used to train employees. For example, it can conduct interactive training sessions that combine text and images to help employees better understand complex concepts.
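
The customer service example can be made concrete by chaining the two capabilities described earlier: the Whisper API transcribes the customer’s voice message, and a vision-capable model receives the transcript together with the photo of the damaged product. This is only a sketch under assumed file names, URLs and prompts, not a production design:

```python
# A sketch of the customer-service example: a voice question plus a product
# photo. Uses OpenAI's Whisper API for speech-to-text and a vision-capable
# chat model. The file name, image URL and prompts are illustrative
# assumptions.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the customer's spoken question (Whisper handles many
#    languages and adds punctuation).
with open("customer_voice_message.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2. Pass the transcript and the photo of the damaged product to the model.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {"role": "system",
         "content": "You are a customer service assistant for an online store. "
                    "Identify the problem shown in the photo and propose a solution."},
        {"role": "user",
         "content": [
             {"type": "text", "text": transcript.text},
             {"type": "image_url",
              "image_url": {"url": "https://example.com/damaged-product.jpg"}},
         ]},
    ],
    max_tokens=400,
)

print(response.choices[0].message.content)
```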

The future of multimodal AI in business

A great example of forward-looking multimodal AI is the optimization of a company’s business processes. For example, an AI system could analyze data from various sources, such as sales data, customer data and social media data, to identify areas that need improvement and suggest possible solutions.

Another example is employing multimodal AI to organize logistics: combining GPS data, warehouse status read from cameras and delivery data can optimize logistics processes and reduce business costs.

Many of these functionalities are already applied today in complex systems such as autonomous cars and smart cities. However, they have not yet been deployed at this scale in smaller business contexts.

Summary

Multimodality, or the ability to process multiple types of data, such as text, images and audio, promotes deeper contextual understanding and better interaction between humans and AI systems.

An open question remains: what new combinations of modalities might emerge in the near future? For example, will it be possible to combine text analysis with body language, so that AI can anticipate customer needs by analyzing facial expressions and gestures? This type of innovation opens up new horizons for business, helping to meet ever-changing customer expectations.

If you like our content, join our busy bees community on Facebook, Twitter, LinkedIn, Instagram, YouTube, Pinterest, TikTok.

Author: Robert Whitney

JavaScript expert and instructor who coaches IT departments. His main goal is to up-level team productivity by teaching others how to effectively cooperate while coding.
