Language models began, as the name suggests, by modeling language with deep learning techniques. The most popular generative models, Large Language Models (LLMs), build on this understanding of language to create new content from some form of input. Primarily by scaling up model size (see "Scaling Laws for Neural Language Models", Kaplan et al., 2020), the ability to complete text from an input has proven remarkably useful, leading to emergent reasoning, back-and-forth chat, and near-human contextual understanding of text.
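To make text completion concrete, here is a minimal sketch using the Hugging Face `transformers` text-generation pipeline. GPT-2 is an illustrative open-source choice here, not a model the rest of this notebook depends on; any causal language model on the Hub could be swapped in.

```python
# Minimal text-completion sketch: give a model a prompt, get a continuation.
from transformers import pipeline

# GPT-2 is used purely as a small, widely available example model.
generator = pipeline("text-generation", model="gpt2")

prompt = "Large language models are useful because"
outputs = generator(prompt, max_new_tokens=30, num_return_sequences=1)
print(outputs[0]["generated_text"])
```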
While text understanding is incredibly useful, we live in a multimodal world where text is only a portion of how we as humans process information. Within the digital realm, we also rely on visual and auditory understanding, taking the form of images (photographs, or even just your screen), videos, and sound (both speech and non-speech). As large language models work towards emulating human cognition in a digital world, researchers have been actively extending them with these additional modalities.
This notebook gives a high-level overview of some of the techniques used to teach language models image, video, and audio understanding, with corresponding open-source model examples.


