Have you ever wondered how you can listen to music, read a book and recognize the smell of freshly brewed coffee all at once? It’s thanks to the human ability to process multiple types of data at the same time: we are multimodal beings. Bard, the intelligent chatbot from Google, has been multimodal since July 2023, and since October 2023 ChatGPT has also been able to understand multiple types of information. Both can not only understand text but also read and visualize data, hold voice conversations and recognize images. Multimodal AI is thus gaining even more potential to revolutionize the business world. Let’s take a closer look at the vast possibilities it offers.

What is multimodal AI?

Multimodal AI is a highly advanced form of AI that mimics the human ability to interpret the world using content and data from different senses. Just as humans understand text, images and sounds, multimodal AI integrates these different types of data to understand the context and complex meaning contained in information. In business, for example, it can enable a better understanding of customer opinions by analyzing both what they say and how they express it through tone of voice or facial expression.

Traditional AI systems are typically unimodal, meaning they specialize in one type of data, such as text or images. They can process large amounts of data quickly and spot patterns that human intelligence cannot pick up. However, they have serious limitations. They are insensitive to context and less adept at dealing with unusual and ambiguous situations.

This is why multimodal AI goes a step further and integrates several modalities at once. This allows for deeper understanding and much richer interactions between humans and AI.

What can multimodal AI do?

Artificial intelligence models developed today employ the following pairs of modalities:

  • from text to image – such multimodal AI creates images from textual prompts. This is the core capability of the well-known Midjourney, of the OpenAI-developed DALL-E 3 (available in the browser as Bing Image Creator), of Stable Diffusion and of the youngest tool in the family, Ideogram, which not only understands textual prompts but can also place text on an image.

    Source: Ideogram (https://ideogram.ai)

    Multimodal AI models can also follow a textual prompt and an “inspiration” image at the same time. This yields more interesting, more precisely defined results and variations of the created images, which is very helpful if you just want a slightly different graphic or banner, or want to add or remove a single element, such as a coffee mug.

    Source: Ideogram (https://ideogram.ai)

  • from image to text – artificial intelligence can do much more than recognize and translate text seen in an image or find a similar product. It can also describe an image in words, as Midjourney does when you type the /describe command, and as Google Bard and the Salesforce model do (the latter is used mainly to create automated product and image descriptions on e-commerce sites),

    Source: HuggingFace.co (https://huggingface.co/tasks/image-to-text)

  • from voice to text – multimodal AI also powers voice commands in Google Bard, but Bing Chat handles them best, as does ChatGPT thanks to the excellent Whisper API. Whisper recognizes and transcribes speech, including punctuation, in multiple languages. Among other things, this can greatly ease the work of international customer service centers and enable quick meeting transcripts and real-time translation of business conversations into other languages,
  • from text to voice – ElevenLabs’ tool converts any text into realistic-sounding speech and even offers “voice cloning”: we can teach the AI the sound and expression of a voice and then generate a recording of any text in a foreign language, for example for marketing or for presentations to foreign investors,
  • from text to video – converting text to video with a talking avatar is possible in D-ID, Colossyan and Synthesia tools, among others,
  • from image to video – generating videos, including music videos, from images and textual cues is already possible with Kaiber, and Meta has announced that it will release its Make-A-Video tool soon,
  • from image to 3D model – this is a particularly promising area of multimodal AI, targeted by Meta and Nvidia. It enables the creation of realistic avatars from photos, as well as 3D models of objects and products, using tools such as Masterpiece Studio (https://masterpiecestudio.com/masterpiece-studio-pro), NeROIC (https://zfkuang.github.io/NeROIC/) and 3DFY (https://3dfy.ai/). With them, for example, a product prototyped in two dimensions can be rotated to show the camera a different side, and a quick 3D visualization can be created from a sketch of a piece of furniture or even from a textual description.

    Source: NeROIC (https://zfkuang.github.io/NeROIC/resources/material.png)

  • from image to movement in space – this modality takes multimodal AI beyond screens into the Internet of Things (IoT), autonomous vehicles and robotics, where devices can perform precise actions thanks to advanced image recognition and the ability to respond to changes in the environment.

There are also experiments with multimodal AI that translates music into images, for example (https://huggingface.co/spaces/fffiloni/Music-To-Image), but let’s focus on the business applications of multimodal AI. So how does multimodality play out in the most popular AI-based chatbots: Google Bard, Bing Chat and ChatGPT?

Multimodality in Google Bard, Bing Chat and ChatGPT

Google Bard can describe simple images and has been equipped with voice communication since July 2023, when it appeared in Europe. Despite the variable quality of the image recognition results, this has so far been one of the strengths that differentiates Google’s solution from ChatGPT.

Bing Chat, thanks to its use of DALL-E 3, can generate images based on text or voice prompts. While it cannot describe in words the images attached by the user, it can modify them or use them as inspiration to create new images.

In October 2023, OpenAI also began introducing new voice and image features in ChatGPT Plus, the paid version of the tool. They make it possible to hold a voice conversation or show ChatGPT an image, so it knows what you’re asking about without your having to describe it in exact words.

For example, you can take a photo of a monument while traveling and have a live conversation about what’s interesting about it. Or take a picture of the inside of your refrigerator to find out what you can prepare for dinner with the available ingredients and ask for a step-by-step recipe.

3 applications of multimodal AI in business

Describing images can help, for example, to take inventory of goods based on CCTV camera footage or to identify missing products on store shelves. Object manipulation can then be used to replenish the missing goods. But how can multimodal chatbots be used in business? Here are three examples:

  1. Customer service: A multimodal chatbot implemented in an online store can serve as an advanced customer service assistant that not only answers text questions but also understands images and questions asked by voice. For example, a customer can take a picture of a damaged product and send it to the chatbot, which will help identify the problem and offer an appropriate solution.
  2. Social media analysis: Multimodal artificial intelligence can analyze social media posts, which include text, images and even videos, to understand what customers are saying about a company and its products. This can help a company better understand customer feedback and respond more quickly to customers’ needs.
  3. Training and development: ChatGPT can be used to train employees. For example, it can conduct interactive training sessions that include both text and images to help employees better understand complex concepts.
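
For readers who like to see the idea in code, the customer service example can be sketched roughly as follows. This is only an illustration under stated assumptions: the names CustomerMessage, detect_modalities and plan_processing are hypothetical, and the sketch does not call any real chatbot or AI service. It only shows how an assistant might notice which modalities a message contains and decide what to do with each.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CustomerMessage:
    """One incoming customer message (hypothetical type); any field may be missing."""
    text: Optional[str] = None
    image_path: Optional[str] = None  # e.g. a photo of a damaged product
    audio_path: Optional[str] = None  # e.g. a recorded voice question


def detect_modalities(message: CustomerMessage) -> list[str]:
    """Return the modalities present in the message, in a fixed order."""
    modalities = []
    if message.text:
        modalities.append("text")
    if message.image_path:
        modalities.append("image")
    if message.audio_path:
        modalities.append("audio")
    return modalities


def plan_processing(message: CustomerMessage) -> list[str]:
    """List the illustrative steps an assistant might run for this message."""
    steps = {
        "text": "read the written question",
        "image": "describe the attached photo",
        "audio": "transcribe the voice recording",
    }
    plan = [steps[m] for m in detect_modalities(message)]
    if plan:
        # Only combine when there is at least one input to work with.
        plan.append("combine the results into one answer")
    return plan


if __name__ == "__main__":
    complaint = CustomerMessage(text="My mug arrived cracked.", image_path="mug.jpg")
    for step in plan_processing(complaint):
        print(step)
```

In a real deployment, each step would be handed to an actual model or service. The point here is only that a multimodal assistant treats the photo and the written question as parts of the same request.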

The future of multimodal AI in business

A great example of forward-looking multimodal AI is the optimization of a company’s business processes. For example, an AI system could analyze data from various sources, such as sales data, customer data and social media data, to identify areas that need improvement and suggest possible solutions.

Another example is using multimodal AI to organize logistics: combining GPS data, warehouse status read from a camera and delivery data makes it possible to optimize logistics processes and reduce business costs.

Many of these capabilities are already applied today in complex systems such as autonomous cars and smart cities. However, they have not yet been deployed at this scale in smaller business contexts.


Multimodality, or the ability to process multiple types of data, such as text, images and audio, promotes deeper contextual understanding and better interaction between humans and AI systems.

An open question remains: what new combinations of modalities might emerge in the near future? For example, will it be possible to combine text analysis with body language, so that AI can anticipate customer needs by analyzing their facial expressions and gestures? This type of innovation opens up new horizons for business, helping to meet ever-changing customer expectations.

Multimodal AI. New uses of artificial intelligence in business | AI in business #21

Author: Robert Whitney

JavaScript expert and instructor who coaches IT departments. His main goal is to up-level team productivity by teaching others how to effectively cooperate while coding.
