

Multimodal Learning - The Future Of AI

AI, Multimodal Learning, Machine Learning

2 min read

Purvanshi Mehta

Meet Purvanshi Mehta

AI Scientist Intern @ Amazon

Rochester, New York

Purvanshi is a graduate student at the University of Rochester majoring in statistical Machine Learning.

Her research interests are Multimodal Learning and Natural Language Processing.

Prior to joining UoR, she worked as a student researcher in the Machine Learning lab at TU Kaiserslautern, Germany, under Dr. Marius Kloft, where she explored regularization techniques for Multimodal Fusion (poster presented at DLRL).

She has previously worked on math word problem solving (published in IJCNLP) and relation extraction from text.

How did you learn about Artificial Intelligence?

I came to know about AI in the sophomore year of my undergraduate studies. I read the book ‘On Intelligence’ by Jeff Hawkins and was completely enthralled by its ideas.

I started by reading books such as Pattern Recognition by Christopher Bishop, which helped me form strong foundations.

How do you define artificial intelligence (AI), and what could its impact on society be?

Artificial Intelligence is the ability to create mental models of our environment and the causal relationships within it, and to manipulate them to our advantage.

What ‘AI’ is today is mere ‘pattern recognition’, but its impact will depend on future advancements in the area.

I also agree that there are some domains where Machine Learning can have a negative impact on society, for example language generation for fake news.

How do you define multimodal learning, and can you provide an example?

Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities.

In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together.

For example, images are usually associated with tags and text explanations; texts contain images to more clearly express the main idea of the article.

Different modalities are characterized by very different statistical properties.

Are you working on any ‘AI’ projects, and would you like to share your experience with us?

I am currently working on Multimodal fusion techniques.

In the past, I have also worked on arithmetic word problem solving, which involves solving mathematical problems posed in natural language.

Deep Neural Networks are especially bad at reasoning and deduction.

Arithmetic word problem solving involves language processing and step by step reasoning.

For example: "A grocery sells a bag of ice for $1.25 and makes 20% profit. If it sells 500 bags of ice, how much total profit does it make?"
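To make the step-by-step reasoning concrete, here is a minimal Python sketch of the arithmetic behind this example. The problem statement does not say whether the 20% is measured against the selling price or against the cost, so the sketch works out both readings; the function names and the two interpretations are assumptions added for illustration, not part of the original problem or its published answer.

```python
from fractions import Fraction

# Quantities taken from the word problem.
PRICE = Fraction(125, 100)  # selling price per bag, in dollars
RATE = Fraction(20, 100)    # the stated 20% profit
BAGS = 500                  # number of bags sold


def profit_if_margin_on_price(price, rate, bags):
    """Assumed reading 1: profit is 20% of the selling price."""
    return price * rate * bags


def profit_if_markup_on_cost(price, rate, bags):
    """Assumed reading 2: profit is a 20% markup on cost,
    so price = cost * (1 + rate)."""
    cost = price / (1 + rate)
    return (price - cost) * bags


margin_total = profit_if_margin_on_price(PRICE, RATE, BAGS)
markup_total = profit_if_markup_on_cost(PRICE, RATE, BAGS)

print(float(margin_total))            # 125.0
print(round(float(markup_total), 2))  # 104.17
```

Even this short problem requires choosing an interpretation before any arithmetic can happen, which is part of what makes word problems a language task as much as a numerical one.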

In my personal experience in ML, one should work on important, core problems and publish work that is not purely engineering.

According to you, what are the main multimodal problems to work on?

Multimodal learning can be divided into 5 main categories:

Representation Learning.

Translation (e.g. generating captions from images).

Alignment (e.g. aligning the steps in a recipe to a video showing the dish being made).

Fusion (e.g. using the video and audio of a YouTube clip to determine its sentiment).

Co-learning (how learning from one modality can help a computational model trained on a different modality).
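As an illustration of the fusion category, the sketch below contrasts two common strategies in plain Python: early fusion, which concatenates per-modality feature vectors before a single model sees them, and late fusion, which combines the predictions of separate per-modality models. The function names, labels, weights and probability values are all hypothetical and chosen only for the example; this is not the technique from the research mentioned in the interview.

```python
def early_fusion(video_features, audio_features):
    """Early fusion: concatenate the feature vectors of both modalities
    into one joint vector for a single downstream model."""
    return list(video_features) + list(audio_features)


def late_fusion(video_probs, audio_probs, video_weight=0.5):
    """Late fusion: weighted average of the class probabilities
    predicted independently from each modality."""
    if len(video_probs) != len(audio_probs):
        raise ValueError("both modalities must score the same classes")
    audio_weight = 1.0 - video_weight
    return [
        video_weight * v + audio_weight * a
        for v, a in zip(video_probs, audio_probs)
    ]


# Hypothetical sentiment scores from two separate models.
LABELS = ["negative", "neutral", "positive"]
video_probs = [0.2, 0.3, 0.5]
audio_probs = [0.1, 0.2, 0.7]

fused = late_fusion(video_probs, audio_probs)
prediction = LABELS[max(range(len(fused)), key=fused.__getitem__)]
print(prediction)  # positive
```

Late fusion keeps each modality's model independent, which is simple but ignores interactions between modalities; early fusion lets one model see both at once, at the cost of mixing features with very different statistical properties.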

Human perception is multimodal; therefore, to build a machine with similar capabilities, we need to 'Go Multimodal'.