
How Have Foundation Models Redefined Computer Vision Using AI?

April 30, 2024 | 8 mins

Foundation models have markedly advanced computer vision, a field that has transitioned from simple pattern recognition to sophisticated systems capable of complex visual analysis. Advances in neural networks, particularly deep learning, have accelerated this evolution by improving the ability of applications to interpret and interact with their visual surroundings.

With the emergence of foundation models—large-scale AI models trained on extensive datasets—there is a shift towards more adaptable and scalable solutions in computer vision. These models, like OpenAI's CLIP, are already trained to recognize many visual patterns. They can do various tasks, like image classification, object detection, and image captioning, with minimal additional training.
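As a rough illustration of that zero-shot flexibility, the sketch below classifies a local image against arbitrary text labels using CLIP through the Hugging Face transformers library. The checkpoint, file name, and label set are placeholders; any image and label list will work.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # placeholder: any local image
labels = ["a photo of a car", "a photo of a bicycle", "a photo of a pedestrian"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# CLIP scores the image against each text prompt; softmax turns scores into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")
```

Because no task-specific training happens here, the same few lines cover classification over any label set you can describe in text.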

Foundation models are changing how AI is developed because they are flexible and efficient. A single pre-trained model can handle multiple tasks, saving developers time and money. This approach streamlines development and improves performance across tasks, setting the stage for further advances in computer vision.

This article will explore the impact of foundation models in computer vision. We will examine their architectures, trace their evolution, and showcase their application through case studies in image classification, object detection, and image captioning. We'll also discuss their broader impact on the field and look ahead to the future of foundation models in AI.


What are Foundation Models?

Foundation models represent a major shift in AI. They move away from specialized systems toward more generalist frameworks that learn from huge, diverse, and largely unlabeled datasets and apply that knowledge to many different tasks with minimal additional training.

Pre-trained models like GPT-3, BERT, and DALL-E have absorbed wide-ranging knowledge from huge datasets, enabling them to understand broad aspects of the world. This preliminary training allows these models to be fine-tuned for specific applications, avoiding the need to build new models from scratch for each task.

The Transformer architecture, commonly associated with these models, excels at processing data sequences through attention mechanisms that dynamically evaluate the importance of different inputs. This design enables the models to generate coherent and contextually relevant outputs across various data types, including text and images.
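To make that attention mechanism concrete, here is a minimal single-head scaled dot-product attention sketch in PyTorch. It is a toy version: real transformers add multiple heads, masking, and learned projection layers.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Weigh each value by how strongly its key matches the query."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # similarity of every query to every key
    weights = F.softmax(scores, dim=-1)                      # attention weights sum to 1 per query
    return weights @ v                                       # weighted mix of values

# Toy example: one sequence of 5 tokens with 64-dimensional embeddings.
q = k = v = torch.randn(1, 5, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 5, 64])
```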

Foundation models are designed to be a common starting point that can be customized to perform well on a wide range of downstream tasks, making them a strong base for modern AI systems.

Key Examples of Foundation Models in AI

Transformer-based Large Language Models (LLMs):

Transformer-based LLMs, such as GPT-3 and BERT, have significantly advanced the capabilities of AI in natural language processing. These models use a transformer architecture that allows for highly effective parallel processing and handling of sequential data. They are pivotal because they learn from vast amounts of data and generalize across various tasks without task-specific tuning, dramatically enhancing efficiency and flexibility in AI applications.

Transformer Architecture

CLIP (Contrastive Language–Image Pre-training):

CLIP by OpenAI is another foundation model designed to understand images in conjunction with textual descriptions. This multimodal model can perform tasks that require linking images with relevant text, making it exceptionally useful in applications that span both visual and textual data. Its ability to generalize from natural language to visual concepts without direct training on specific visual tasks marks a significant advancement in AI's capabilities.

CLIP Training


BERT (Bidirectional Encoder Representations from Transformers):

BERT is revolutionary in the NLP domain. Developed by Google, BERT's bidirectional training mechanism allows it to understand the context of a word based on all surrounding words, unlike previous models, which processed text linearly. 

This capability has set new standards for NLP tasks, including question-answering and language translation. BERT's effectiveness is further enhanced by techniques like masked language modeling, which involves predicting randomly masked words in a sentence, providing a robust way to learn deep contextual relationships within the text. The model's flexibility is evident from its various adaptations, such as RoBERTa and DistilBERT, which adjust its architecture for optimized performance or efficiency.
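A quick way to see masked language modeling in action is the Hugging Face fill-mask pipeline with a BERT checkpoint; the example sentence is arbitrary.

```python
from transformers import pipeline

# BERT predicts the token hidden behind [MASK] from its bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Foundation models are trained on [MASK] datasets."):
    print(prediction["token_str"], round(prediction["score"], 3))
```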

Comparison of BERT Architectures

Architectural Evolution of Foundation Models

Dual-Encoder Architecture

Dual-encoder architectures employ two separate encoders, each handling a different type of input, such as text, images, or different languages. Each encoder processes its input independently, and the outputs are aligned using a contrastive loss function that synchronizes the embeddings from both encoders. This method is invaluable for tasks like image-text retrieval and multilingual information retrieval, where distinct processing pathways are necessary for each modality or language.

Fusion Architecture

Fusion architectures go a step further by integrating the outputs of individual encoders into a single, cohesive representation. This approach allows for more intricate interactions between modalities, leading to improved performance on tasks that demand a nuanced understanding of the combined data, such as visual question-answering and multimodal sentiment analysis.

Encoder-Decoder Architecture

Encoder-decoder architectures are traditionally used for sequence-to-sequence tasks and have been adapted for vision-language applications. These models encode the input into a latent representation, which the decoder then uses to generate an output sequence. 

Approaches like cross-modal attention mechanisms have been introduced to help the model focus on salient parts of the input, improving the relevance and coherence of the generated text.

 

Adapted Large Language Models (LLMs)

Adapted LLMs involve modifying pre-existing language models to accommodate new modalities or tasks by incorporating new encoders, such as visual encoders. This adaptation allows models like GPT and BERT to handle visual content understanding and generation, bridging NLP and computer vision applications.

Comparison of different encoder-decoder architectures

The evolution of foundation model architectures has significantly expanded the capabilities of AI systems in handling vision-language tasks. Each architectural type offers unique advantages and caters to different application requirements, pushing the boundaries of what is achievable with multimodal AI.

 

Training Objectives and Methodologies in Foundation Models

Foundation models utilize diverse training objectives and methodologies, primarily focusing on contrastive and generative objectives. Each plays a critical role in guiding the development and effectiveness of these models across various applications.

Contrastive Objectives

Contrastive objectives aim to teach models to distinguish between similar and dissimilar examples. For instance, a contrastive image-text model might be trained to maximize the similarity between an image and a matching caption while minimizing the similarity between that image and unrelated captions. This teaches the model to create meaningful representations of both visual and textual data.
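A minimal PyTorch sketch of a CLIP-style symmetric contrastive loss is shown below. Random tensors stand in for real encoder outputs, and the temperature value is a typical choice rather than a prescribed one.

```python
import torch
import torch.nn.functional as F

# Stand-ins for image and text embeddings produced by the two encoders.
batch, dim = 8, 512
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)

# Cosine-similarity matrix; matching image-caption pairs lie on the diagonal.
logits = image_emb @ text_emb.t() / 0.07      # 0.07 is a commonly used temperature
targets = torch.arange(batch)                 # the i-th image matches the i-th caption

# Symmetric cross-entropy pulls matched pairs together and pushes mismatched pairs apart.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```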

Here are the methodologies used in this training objective:

  • Contrastive Learning: This approach is essential for learning high-quality representations by maximizing the similarity between related pairs and minimizing it between unrelated pairs. It's extensively used in models like CoCa, which uses a dual-encoder system to align text and image representations.
  • Unlabeled Data Utilization: Contrastive learning is particularly valuable for using abundant unlabeled data, which is crucial given the high cost and effort required to curate large-scale labeled datasets.
  • Across Domains: Contrastive learning lets foundation models adapt to tasks in new domains without requiring labeled data from those domains.

Generative Objectives

These objectives focus on having the model create new data based on its understanding.  For example, an image captioning model might have a decoder that takes the encoded representation of an image and generates a textual description, word by word.
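To see this encode-then-decode pattern in practice, the sketch below generates a caption with BLIP through the Hugging Face transformers library; BLIP is used here purely as an accessible illustration, and the checkpoint and file name are placeholders.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("beach.jpg")  # placeholder: any local image
inputs = processor(images=image, return_tensors="pt")

# The visual encoder processes the image; the text decoder generates the caption token by token.
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```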

Here are some examples:

  • Encoder-Decoder Architectures: These architectures generate new data based on learned representations. The CoCa model, for example, uses an encoder to process images and a decoder to generate text, facilitating detailed image captioning and comprehensive vision-language understanding.
  • Fine-Grained Representations: Generative objectives are crucial for managing detailed representations for tasks that require a deep understanding of content, such as intricate image descriptions or detailed text generation.

CoCa Model

Integrated Approaches

Modern foundation models often combine contrastive and generative objectives. This allows them to learn both to discriminate between matched and mismatched examples and to generate realistic, contextually appropriate outputs.

Here are some examples of the methods:

  • Combining Objectives: Modern models often blend contrastive and generative objectives to leverage their strengths. This hybrid strategy enables training models that distinguish between data types and generate coherent, contextually accurate outputs.
  • CoCa Model: The CoCa model is an example of this unified approach. It has a decoupled decoder design that separately improves contrastive and generative goals. This makes the model better at both alignment and generation tasks.
  • Subsuming Capabilities: This approach lets models like CoCa combine the strengths of models that excel at zero-shot learning tasks (e.g., CLIP) with those of models that excel at multimodal image-text generation (e.g., SimVLM) in a single model, as sketched below.
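The sketch below combines the contrastive term from earlier with a captioning (next-token cross-entropy) term, roughly in the spirit of CoCa. Shapes, tensors, and the loss weighting are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

batch, dim, seq_len, vocab = 8, 512, 16, 32000
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)       # image encoder output
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)        # unimodal text representation
caption_logits = torch.randn(batch, seq_len, vocab)            # multimodal decoder output
caption_targets = torch.randint(0, vocab, (batch, seq_len))    # ground-truth caption tokens

# Contrastive term: align matched image-text pairs (diagonal of the similarity matrix).
logits = image_emb @ text_emb.t() / 0.07
targets = torch.arange(batch)
contrastive = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Generative term: predict each caption token given the image.
captioning = F.cross_entropy(caption_logits.reshape(-1, vocab), caption_targets.reshape(-1))

total_loss = contrastive + 2.0 * captioning   # the relative weighting is a tunable hyperparameter
```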

Foundation models, through their diverse training objectives and methodologies, are pivotal in developing general AI. Due to their adaptability and effectiveness in addressing diverse and challenging AI problems, they excel in various applications, from simple classification tasks to complex multimodal interactions.

Foundation Models in Action: Transforming Computer Vision Tasks

Foundation models have significantly influenced a range of computer vision tasks, leveraging their extensive pre-trained knowledge to enhance performance across various applications. Here are some notable case studies:

Scene Change Detection in Videos

CLIP, a foundation model from OpenAI, has been utilized to detect video scene changes, such as differentiating between game and advertisement segments during sports broadcasts. This is achieved by evaluating the similarity between consecutive frames.
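One simple way to implement this is to embed sampled frames with CLIP's image encoder and flag places where the similarity between consecutive frames drops sharply. The frame paths and threshold below are assumptions to tune per broadcast.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical frames sampled from a broadcast, e.g. one per second.
frame_paths = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]
images = [Image.open(p) for p in frame_paths]

with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    embeddings = F.normalize(model.get_image_features(**inputs), dim=-1)

# A sharp drop in cosine similarity between consecutive frames suggests a scene change.
for i in range(len(frame_paths) - 1):
    similarity = (embeddings[i] @ embeddings[i + 1]).item()
    if similarity < 0.7:   # assumed threshold
        print(f"Possible scene change between frames {i} and {i + 1} (similarity {similarity:.2f})")
```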

Object Detection and Classification

YOLO-NAS, developed by Deci, is a foundation model that achieves state-of-the-art performance in real-time object detection, effectively balancing accuracy and speed. It is suitable for applications like traffic monitoring and automated retail systems.
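A minimal inference sketch, assuming Deci's super-gradients package is installed and using a COCO-pretrained small variant; the image path and confidence threshold are placeholders.

```python
# pip install super-gradients
from super_gradients.training import models

# Load the small COCO-pretrained YOLO-NAS variant.
model = models.get("yolo_nas_s", pretrained_weights="coco")

# Run detection on a local image; conf filters out low-confidence boxes.
predictions = model.predict("intersection.jpg", conf=0.5)
predictions.show()   # draw the boxes; predictions.save("out.jpg") writes them to disk
```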

Medical Imaging

EfficientNet, another foundation model, has been successfully applied in the healthcare sector, particularly in medical image analysis. Its ability to maintain high accuracy while managing computational demands makes it an invaluable tool for diagnosing diseases from medical imaging data such as X-rays and MRIs.
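A common pattern here is to start from an ImageNet-pretrained EfficientNet and swap the classification head for the medical task at hand. The sketch below assumes a hypothetical two-class chest X-ray problem and uses torchvision's EfficientNet-B0.

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained EfficientNet-B0 and replace its classifier head
# for a hypothetical two-class task (e.g. normal vs. pneumonia X-rays).
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
num_features = model.classifier[1].in_features   # 1280 for the B0 variant
model.classifier[1] = nn.Linear(num_features, 2)

# Optionally freeze the backbone at first so only the new head is trained.
for param in model.features.parameters():
    param.requires_grad = False
```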

Retail and E-Commerce

The BLIP-2 vision language model facilitates automatic product tagging and image indexing, which is crucial for e-commerce platforms. This function automatically generates product tags and descriptions based on their images, enhancing searchability and catalogue management.
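A rough sketch of prompted tagging with BLIP-2 via Hugging Face transformers. The checkpoint shown is large (several billion parameters), and the prompt, image path, and generation length are illustrative assumptions.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("product_photo.jpg")  # placeholder: a product image
prompt = "Question: what product is shown and what are its key attributes? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(generated[0], skip_special_tokens=True).strip())
```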


Content Analysis in Media and Entertainment

The OWL-ViT model is employed for content analysis tasks in the media and entertainment industry. It supports open-vocabulary object detection, aiding video summarization, scene recognition, and content moderation. It ensures that digital platforms can efficiently categorize and manage a vast array of visual content.
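The sketch below runs OWL-ViT on a single keyframe with free-form text queries, using the Hugging Face transformers API (recent versions expose post_process_object_detection on the processor). The queries, image path, and threshold are placeholders.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("keyframe.jpg")                          # e.g. a sampled video keyframe
queries = [["a company logo", "a weapon", "a car chase"]]   # free-form text queries

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])             # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.2
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(queries[0][label], round(score.item(), 2), [round(c, 1) for c in box.tolist()])
```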

These examples illustrate how foundation models are integrated into real-world applications, revolutionizing how machines understand and interact with visual data across various industries.

 

Innovations in Model Architecture: Transforming Computer Vision


Computer vision has improved greatly due to the development of model architectures such as YOLO-NAS, Mask2Former, DETR, ConvNeXt, and GroundingDINO, which perform well across a variety of vision tasks.

YOLO-NAS

YOLO-NAS, developed by Deci AI, upped the game for object detection tasks by outperforming other YOLO models. It uses neural architecture search (NAS) to optimize the trade-off between accuracy and latency. It has enhanced quantization support, making it suitable for real-time edge-device applications.

YOLO-NAS has shown superior performance in detecting small objects and improving localization accuracy, which is crucial for autonomous driving and real-time surveillance applications.

YOLO-NAS by Deci AI

 

Mask2Former

Mask2Former is a versatile transformer-based architecture capable of addressing various image segmentation tasks, including panoptic, instance, and semantic segmentation.

Its key innovation is masked attention, which extracts localized features within predicted mask regions. This model simplifies the research effort by handling multiple segmentation tasks and outperforms specialized architectures on several datasets.

Mask2Former Architecture

DETR

DETR (DEtection TRansformer) simplifies the object detection pipeline by treating detection as a direct set prediction problem, which removes the need for many hand-designed components such as non-maximum suppression.

It uses a transformer encoder-decoder architecture and matches the accuracy and runtime of the well-known Faster R-CNN baseline on the COCO dataset.
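A short inference sketch with the Hugging Face DETR checkpoint; the image path and confidence threshold are placeholders.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("street.jpg")  # placeholder: any local image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the set predictions into boxes and labels; no non-maximum suppression is required.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), [round(c, 1) for c in box.tolist()])
```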

DETR Architecture

 

ConvNeXt

ConvNeXt modernizes traditional convolutional neural network (CNN) designs by incorporating strategies from transformers, significantly boosting performance and scalability. 

This model overcomes the constraints of previous CNNs by integrating features such as larger kernel sizes and LayerScale, which stabilize training and enhance the network's capacity for representation.

ConvNeXt Architecture

GroundingDINO

GroundingDINO combines the DINO detection transformer with grounded language pre-training to enable open-set object detection. Given free-form text prompts, it can identify and localize objects in an image, including categories it was never explicitly trained to detect.

This capability makes it possible to find and annotate objects in extensive, unlabeled image collections using natural language, significantly increasing the efficiency of building training data for vision models.

GroundingDINO Architecture

 

Achievements in Accuracy, Efficiency, and Versatility of Foundation Models in Computer Vision

Achievements in Accuracy

Foundation models like EfficientNet have set new benchmarks in image classification accuracy. EfficientNet-B7, for instance, achieved state-of-the-art accuracy on ImageNet at its release while being considerably smaller and faster than previous models.

Vision Transformers (ViTs) have also demonstrated exceptional performance, often surpassing traditional CNNs in extensive image recognition tasks. These models have been pivotal in advancing the accuracy of computer vision systems, enabling them to perform high-quality image analysis across various domains.

Achievements in Efficiency

Hardware optimization has greatly enhanced the efficiency of foundation models. Deci's foundation models, for example, are optimized for specific hardware, ensuring efficient performance and resource utilization. This optimization is crucial for real-time applications that require low latency, such as object detection in video surveillance, where models like YOLO-NAS provide state-of-the-art performance.

Achievements in Versatility

Foundation models have shown remarkable versatility across a range of computer vision tasks. Mask2Former handles panoptic, instance, and semantic segmentation without task-specific modifications, while OWL-ViT performs open-vocabulary object detection from text queries, showcasing their adaptability.

Additionally, the CLIP model by OpenAI has demonstrated its ability to understand and align visual and textual representations for versatile applications such as image-text retrieval and open-ended object detection.


Empowering New Capabilities in Computer Vision

The integration of foundation models has opened up numerous new capabilities in computer vision:

  1. Enhanced Multimodal Understanding: Models like CLIP have significantly improved the understanding of relationships between different data types, aiding tasks such as image-text retrieval and open-ended object detection.
  2. Active Learning and Few-Shot Learning: Foundation models make active learning strategies more effective by using pre-trained embeddings to selectively label the most informative samples. This is especially useful when annotation resources are limited (a minimal sketch follows this list).
  3. Generative Applications: Generative models like DALL-E-3 have expanded the limits of image synthesis, creating detailed and contextually appropriate images from text descriptions, thus opening new avenues for both creative and practical applications.
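Here is a minimal sketch of diversity-based sample selection using pre-computed embeddings (for example, CLIP image features). The embedding shapes and selection budget are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

# Stand-ins for pre-computed, L2-normalized embeddings of a small labeled set
# and a large unlabeled pool.
labeled_emb = F.normalize(torch.randn(200, 512), dim=-1)
unlabeled_emb = F.normalize(torch.randn(10_000, 512), dim=-1)

# For each unlabeled sample, find its closest labeled neighbor; samples that are
# far from everything already labeled are the most informative to annotate next.
similarity_to_labeled = unlabeled_emb @ labeled_emb.t()        # (10000, 200)
closest_match = similarity_to_labeled.max(dim=1).values
to_annotate = torch.topk(-closest_match, k=50).indices         # 50 most novel samples
print(to_annotate[:10])
```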

 

The Future of Foundation Models in AI

Developments in model architectures and training objectives are expected to improve the capabilities of foundation models to make them more adaptable and effective across various domains. Here's a detailed look at the potential future advancements and the key challenges that need to be addressed:

  • Enhanced Model Architectures and Training Methods: Ongoing improvements in model architectures, such as transformer-based designs and more sophisticated training methods, will likely lead to more powerful and efficient foundation models.
  • Multimodal Capabilities: There is an increasing focus on developing foundation models that can handle various data types beyond text and images, such as audio and video. This will improve their applicability for more complex, multimodal tasks.
  • Efficient Training Processes: Advances in training processes are expected to improve the efficiency of foundation models, enabling them to utilize broader data sets more effectively and adapt more quickly to new tasks. Meta’s recent Llama 3 release is an example.
  • Generative AI for Complex Tasks: The application of generative AI in tasks like video generation highlights a shift towards more dynamic AI systems capable of creating high-quality, diverse outputs.
  • Open-Source Development and Collaboration: Collaborative efforts and open-source development are crucial for driving innovation in foundation model technology and helping to democratize access to advanced AI tools.


Foundation Models in AI: Key Takeaways

Foundation models have significantly transformed the computer vision field, enhancing accuracy, efficiency, and versatility. They have introduced new capabilities such as sophisticated image and video generation, advanced object detection, and improvements in real-time processing. The integration of foundation models is projected to broaden and deepen across various technological ecosystems, with profound impacts anticipated in sectors like healthcare, legal, and education. These developments indicate a future where AI will support and drive innovation and operational efficiencies across industries, leaving an indelible mark on technology and society.

Written by Stephen Oladele

Frequently asked questions
  • Foundation models are large-scale AI models pre-trained on extensive datasets to capture a broad understanding of data across various domains. They serve as a base for developing more specialized models through further fine-tuning, enabling diverse applications without the need to train from scratch for each new task.

  • Foundation models transform computer vision by significantly improving the accuracy and efficiency of tasks like image classification, object detection, and image generation. By leveraging pre-trained data, these models adapt quickly to new tasks with minimal additional training, driving innovations in automated systems and AI-driven analysis.

  • Adopting AI through foundation models enhances computational efficiency and versatility in application and enables multimodal capabilities. This means they can understand and process various data types beyond their initial training, leading to more robust and adaptable AI systems that can perform a wider range of tasks more effectively.

  • Examples of foundation models in computer vision include CLIP and DALL-E from OpenAI, which handle tasks ranging from object recognition to generating images from textual descriptions. These models demonstrate foundation models' versatility and capability to recognize and creatively interpret visual data.

  • Architectural innovations such as transformer models, which use mechanisms like attention to process inputs, support the scalability and effectiveness of foundation models. These architectures are crucial for handling the large and diverse datasets on which foundation models are trained, allowing for efficient learning and adaptation across different tasks and modalities.

  • The prospects for foundation models include their potential to revolutionize fields like healthcare, law, and education by providing advanced AI tools capable of understanding complex, multimodal data. Challenges include ensuring ethical use, avoiding biases, improving model transparency, and managing the environmental impact of training large-scale models.