
December 5, 2024 | Stijn Tonk

Computer vision is having another ImageNet moment

In 2012, the now-famous AlexNet decisively outperformed other models in the ImageNet competition, ushering in the widespread adoption of Convolutional Neural Networks (CNNs). Practically overnight, this specific flavor of deep learning models halved error rates compared to other state-of-the-art computer vision techniques. It marked the beginning of a dramatic improvement in the performance of computer vision models, rapidly approaching human-level accuracy.

And now, in 2024, we find ourselves amid another computer vision revolution. While the 2012 revolution marked a significant step up in performance, this new revolution promises a major leap in accessibility to generalist computer vision models that are capable of solving a wide range of tasks. Similar to the advancements in natural language processing, this revolution is driven by the powerful Transformer architecture, the same model underpinning Large Language Models and ChatGPT.

The way we make computers see is about to change

The current paradigm of computer vision involves collecting vast amounts of images, labeling them extensively, and training a specialized model for each specific task. Once trained, the models are validated on held-out data, in the hope that they are robust enough to handle the real world, which is filled with edge cases. This process is quite cumbersome, and it's no surprise that it has led to the emergence of many specialized computer vision companies focused on helping computers solve tasks, some of which are quite mundane for humans.

But with the arrival of Multimodal LLMs, this is all about to change, and the shift is still going largely unnoticed, as also mentioned in this recent tweet by Ethan Mollick:


Ethan Mollick remarks on the limited attention given to exploring the true power of AI vision.

When you start experimenting with multimodal models like OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, Google's PaliGemma, or Tencent's YOLO-World, you can sense that we are on the cusp of a transformative phase. We are transitioning from a world where models were only accessible to experts to a world where you can simply tell (prompt) a generalist model what vision task it needs to solve, and it will just do it!


Look, I trained—uhhh, prompted a pothole detector
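To make the shift concrete, here is a minimal sketch of what "prompting a pothole detector" can look like against a chat-completions-style API such as OpenAI's, where images are passed inline as base64 data URLs. The function name and the placeholder image bytes are illustrative, not from any particular codebase:

```python
import base64


def build_vision_request(image_bytes: bytes, task: str, model: str = "gpt-4o") -> dict:
    """Build a chat-completions payload that asks a multimodal LLM
    to solve a vision task described in plain language."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": task},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    }


# A "pothole detector" is now just a prompt plus an image.
payload = build_vision_request(
    image_bytes=b"<jpeg bytes of a road photo>",  # placeholder, not a real image
    task="Does this road image contain a pothole? Answer 'yes' or 'no'.",
)
```

With the official `openai` client you would then send this as `client.chat.completions.create(**payload)`; the point is that the "model training" step has collapsed into writing the task description.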

It is still early days, so do not expect perfection

Today, you can already feel the magic of multimodal LLMs solving various computer vision tasks, but it's important to remember that these systems still have their issues. So, should you consider adding Multimodal LLMs to your computer vision solutions? Well, as with everything in life, "it depends".

If you are solving complex vision tasks where humans typically struggle and accuracy is crucial, then adding an LLM to your stack may not (yet) be the best idea. However, if you are working on solutions with more straightforward vision tasks where occasional errors are acceptable, then experimenting with Multimodal LLMs is worth considering—especially during the prototyping or scale-up phase.

Once your solution proves useful, you can always switch to training your own computer vision models, which might be cheaper and faster. Alternatively, you can distill the power of these large models into smaller ones that can run, for example, on edge devices, as shown in this blog post: LLM Knowledge Distillation: GPT-4o.

Evaluation remains a crucial factor for success

Using these new LLM systems relieves you from the challenge of having to train your models, as this has already been done by companies like OpenAI. However, that does not mean you are relieved of the need to collect data and perform proper evaluation.

In the current computer vision paradigm, you collect a lot of images for training your model and use part of that data for evaluation. In the new paradigm, you do not have to train a model, but you still need to optimize your prompts and validate the performance of your system. So, evaluation remains as crucial as ever before! And to conduct proper evaluation, you will still need data, but fortunately, not as much as you would need to train a model from scratch.
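That evaluation loop can stay very simple: collect a small labeled set, run your prompted model over it, and score the answers. A minimal sketch (the function name and the example answers are made up for illustration; real predictions would come from the model):

```python
def evaluate(predictions: list[str], labels: list[str]) -> dict:
    """Score a prompted model's answers against a small labeled eval set."""
    assert len(predictions) == len(labels), "each prediction needs a label"
    # Normalize casing/whitespace so "Yes " still matches "yes".
    correct = sum(
        p.strip().lower() == l.strip().lower()
        for p, l in zip(predictions, labels)
    )
    return {"n": len(labels), "accuracy": correct / len(labels)}


# e.g. answers from a prompted pothole detector vs. human labels
preds = ["yes", "no", "yes", "yes"]
truth = ["yes", "no", "no", "yes"]
metrics = evaluate(preds, truth)  # accuracy = 3/4 = 0.75
```

Re-running this same scorer after every prompt tweak, or whenever a new model version ships, is what lets you change prompts and models with confidence.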

With the current generation of LLMs, there are still annoying failure modes; this is nothing new. Whether you are using LLMs for Retrieval-Augmented Generation (RAG), generating text, or, in this case, solving computer vision tasks, good evaluations are essential to achieving success. See, for example, one of our earlier blog posts on Pushing a RAG Prototype to Production.

Do not underestimate the potential upside of initial success

Have you already managed to solve your computer vision problem sufficiently using a multimodal LLM? Congratulations, you are now in a great position! Why? You are now set up to ride the wave of where the current LLM market is rapidly heading: better models at lower prices (see also our previous blog on this topic: The Era of Choice in AI).

With minimal effort, your solution will keep improving and become cheaper to run. The only thing you need to do is keep an eye on your evaluations and align your prompts with newer models. Who can say no to that?

As we stand on the brink of this new era in computer vision, one thing is crystal clear: the future is no longer just about improving performance—it's about making computer vision simpler and more accessible for everyone!

Stijn Tonk

Author

Expert in data pipelines and infrastructure for AI applications.

Keep up with the latest

Sign up for our newsletter to get our views on all the latest in data & AI.

Start your journey to People Positive AI.

Get in contact with our team on contact@mozaik.ai or use the form below. We'll be in touch soon!