Report

CLIP Chronicles: Bridging Pixels and Prose in Deep Learning

Author

Chinmay Shah, M.S., Joshua J. Cook, M.S., ACRP-PM, CCRC, Praya Cheekapara, B.S., Moise Brutus, B.S., Shusen Pu, Ph.D.

Published

March 11, 2024

2 Methodology

The CLIP (Contrastive Language-Image Pre-training) model is founded on the concept of acquiring perceptual understanding through natural language supervision.

Figure 2. Summary of the CLIP architecture. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes. (Radford et al. 2021)

The fundamental concept behind CLIP is to use captioned images collected from the internet to build a model capable of predicting which text is associated with an image. CLIP achieves this by pretraining on a set of images and captions, encoding each with an image encoder and a text encoder respectively, and comparing the resulting embeddings. Image-text pairs that match are assigned high similarity scores, whereas pairs that do not match receive low scores. In this way the model learns to place matching pairs close together in the embedding space and non-matching pairs farther apart. This approach, known as “contrastive learning,” trains the model to discern whether elements belong together or not.

2.1 Setting up a Large Dataset

Establishing a comprehensive dataset presented a significant challenge in training this model. Previous efforts primarily relied on three datasets—MS-COCO, Visual Genome, and YFCC100M. MS-COCO and Visual Genome, each comprising approximately 100,000 crowd-labeled images, proved insufficient. Although YFCC100M boasted over 100 million images, the authors narrowed down the selection by filtering for quality and English-language descriptions, resulting in 15 million images comparable to those in ImageNet. The abundance of publicly available data served as a primary motivation for incorporating natural language supervision. The authors curated a dataset consisting of 400 million image-text pairs, averaging around 20,000 pairs per query.

2.2 Components of the CLIP Model

CLIP combines a variety of subcomponents to achieve this association between images and text. At its core, CLIP consists of a dual-encoder system, with separate encoders for processing text and images. The image encoder is typically based on a convolutional neural network (CNN), capable of extracting hierarchical visual features from input images, while the text encoder leverages a transformer-based model to encode textual descriptions or prompts. These encoders learn to represent images and text in a shared embedding space, where semantically similar concepts are positioned closer to each other. Additionally, CLIP uses a contrastive learning objective: the model is trained to maximize the similarity between matching pairs of text and images while minimizing the similarity between non-matching pairs.
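To make this dual-encoder design concrete, the following is a minimal sketch of our own (not the authors' implementation): a ResNet-50 image tower and a small Transformer text tower projected into a shared, normalized embedding space. The encoder choices, projection size, and names such as `embed_dim` are assumptions for illustration.

```python
# Minimal dual-encoder sketch in PyTorch (illustrative only, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class DualEncoder(nn.Module):
    def __init__(self, vocab_size=49408, embed_dim=512, max_len=77):
        super().__init__()
        # Image tower: ResNet-50 with its classification head replaced by a projection.
        self.image_tower = resnet50(weights=None)
        self.image_tower.fc = nn.Linear(2048, embed_dim)
        # Text tower: token embedding + Transformer encoder + projection.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.text_tower = nn.TransformerEncoder(layer, num_layers=4)
        self.text_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, images, token_ids):
        img = self.image_tower(images)                        # (B, embed_dim)
        txt = self.token_embed(token_ids) + self.pos_embed[: token_ids.size(1)]
        txt = self.text_tower(txt).mean(dim=1)                # mean-pool over tokens
        txt = self.text_proj(txt)                             # (B, embed_dim)
        # L2-normalize so that dot products are cosine similarities.
        return F.normalize(img, dim=-1), F.normalize(txt, dim=-1)

model = DualEncoder()
images = torch.randn(4, 3, 224, 224)
tokens = torch.randint(0, 49408, (4, 77))
img_emb, txt_emb = model(images, tokens)
print(img_emb.shape, txt_emb.shape)  # torch.Size([4, 512]) torch.Size([4, 512])
```

Both embeddings land in the same space, so a simple dot product between an image embedding and a text embedding measures how well the pair matches.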

2.2.1 What is a Text Encoder?

Figure 3. The above figure describes the role of a text encoder. (Radford et al. 2021)

The text encoder within CLIP is a standard transformer encoder. A transformer can be thought of as a system that takes an entire input sequence of words, then represents and compares those words to create an abstract, contextualized representation of the whole input. The self-attention mechanism within a transformer is the main mechanism that creates this contextualized representation.

2.2.2 Transformers

Transformers are a powerful class of deep learning models that have revolutionized natural language processing (NLP) tasks. Unlike traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), which process data sequentially or locally, transformers leverage attention mechanisms to capture long-range dependencies and relationships within input sequences. At the heart of a transformer architecture are self-attention mechanisms, which enable the model to weigh the importance of each token in the input sequence with respect to every other token. By attending to relevant tokens and aggregating information across the entire sequence, transformers can effectively capture global context and semantic relationships, making them highly effective for tasks such as language translation, text generation, and sentiment analysis. Additionally, transformers consist of multiple layers of self-attention and feedforward neural networks, allowing them to learn hierarchical representations of input sequences.
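To make the self-attention mechanism concrete, here is a minimal sketch of our own of single-head scaled dot-product self-attention over a sequence of token embeddings (the dimensions and random projection matrices are illustrative assumptions, not CLIP's configuration):

```python
# Minimal single-head scaled dot-product self-attention (illustrative sketch).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_model) projection matrices."""
    q = x @ w_q                      # queries
    k = x @ w_k                      # keys
    v = x @ w_v                      # values
    d_k = q.size(-1)
    # Each token attends to every other token; the scores weigh their relevance.
    scores = q @ k.T / d_k ** 0.5    # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ v               # contextualized representations, (seq_len, d_model)

d_model, seq_len = 64, 10
x = torch.randn(seq_len, d_model)                  # e.g., embedded tokens of a caption
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([10, 64])
```

In a full transformer, many such attention heads are run in parallel and interleaved with feedforward layers, which is what lets the model build the global context described above.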

2.2.3 What is an Image Encoder?

Figure 4. The above figure describes the role of an image encoder. (Radford et al. 2021)

The image encoder converts an image into a vector (a list of numbers) that represents the image’s meaning. In the CLIP paper, the authors employ a ResNet-50 architecture as one of their image encoders.
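As a small illustration of what an image encoder does, the sketch below uses torchvision’s ResNet-50 with its classification head removed to turn an image into a 2048-dimensional feature vector. This is only an assumption-laden stand-in for CLIP’s actual image tower and preprocessing, which differ in detail.

```python
# Sketch: using a ResNet-50 backbone as an image encoder (not CLIP's exact pipeline).
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT   # downloads pretrained ImageNet weights
backbone = resnet50(weights=weights)
backbone.fc = nn.Identity()          # drop the classification head, keep 2048-d features
backbone.eval()

preprocess = weights.transforms()    # resize, crop, and normalize as the backbone expects

image = torch.rand(3, 224, 224)      # stand-in for a real image tensor
with torch.no_grad():
    features = backbone(preprocess(image).unsqueeze(0))
print(features.shape)                # torch.Size([1, 2048]) -- the image's embedding vector
```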

Figure 5. A simple convolution operation in which a horizontal line detection kernel is used to detect horizontal line features in the input image.

2.2.4 What is Convolution?

Convolutional layers, the building blocks of CNNs, operate by applying filters (also known as kernels) to input images. These filters slide across the input image, computing a dot product between the filter weights and the pixel values within each receptive field. Figure 5 above illustrates a simple convolution in which a horizontal line detection kernel is used to detect horizontal line features in the input image. By stacking such convolutional kernels together, we build a CNN. The following equation (eqn 1) defines the convolution operation.
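The rendered equation does not survive in this text, so for reference we reproduce a standard form of the discrete 2D convolution of an image I with a kernel K (most deep learning libraries actually implement the closely related cross-correlation):

```latex
% Eqn 1 (reconstructed): discrete 2D convolution of an image I with a kernel K
(I * K)(i, j) \;=\; \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n)
```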

2.2.5 Convolutional Neural Network

Convolutional neural networks (CNNs) are a class of deep learning models specifically designed for processing structured grid-like data, such as images. CNNs are characterized by their ability to automatically learn hierarchical representations of visual features directly from raw pixel data, leading to robust performance in tasks such as image classification, object detection, and image segmentation. The core idea behind a convolutional network is that, through a combination of convolutions and downsampling of an image, the network extracts increasingly subtle feature representations that can be turned into some final output. Figure 6 shows a classic CNN architecture from YOLO, a well-known object detection model (Redmon et al. 2016). In the figure, each box describes an input’s horizontal and vertical size, as well as its “depth” in terms of the number of features. The input image is an RGB image, so it has a depth of 3. By extracting more and more features and downsampling the image via max pooling, the network distills the image into an abstract representation that is trained to hold some meaning about the image.

Figure 6. A classic convolutional architecture from the YOLO paper, a landmark object detection model. (Redmon et al. 2016)
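To make the convolution-plus-downsampling pattern concrete, here is a toy sketch of our own (not the YOLO or CLIP architecture): a fixed horizontal line detection kernel applied with a convolution, followed by max pooling to downsample the feature map.

```python
# Toy example: one convolution with a fixed horizontal-line kernel, then max pooling.
import torch
import torch.nn.functional as F

# A 3x3 kernel that responds strongly to horizontal edges (a classic hand-crafted filter).
kernel = torch.tensor([[-1., -1., -1.],
                       [ 2.,  2.,  2.],
                       [-1., -1., -1.]]).reshape(1, 1, 3, 3)

# A 1-channel 8x8 "image" with a bright horizontal stripe in the middle.
image = torch.zeros(1, 1, 8, 8)
image[:, :, 4, :] = 1.0

feature_map = F.conv2d(image, kernel, padding=1)   # detect horizontal lines
pooled = F.max_pool2d(feature_map, kernel_size=2)  # downsample by a factor of 2

print(feature_map.shape, pooled.shape)  # torch.Size([1, 1, 8, 8]) torch.Size([1, 1, 4, 4])
# In a real CNN, many such kernels are learned rather than hand-crafted, and the pooled
# feature maps are fed into further convolutional layers to build more abstract features.
```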

2.2.6 Comparing the Cosine Similarity

Cosine similarity is a measure used to determine the similarity between two vectors in a high-dimensional space by computing the cosine of the angle between them. Widely used in machine learning and natural language processing, it compares vectors’ directions rather than their magnitudes, making it valuable for comparing documents, sentences, or words represented as vectors and robust for tasks like document retrieval, text classification, and recommendation systems. Equation 2 below describes how the cosine similarity is calculated; in CLIP, this cosine similarity is used to compare image and text embeddings.
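The rendered equation is likewise not preserved in this text; the standard definition of cosine similarity between two vectors u and v is:

```latex
% Eqn 2 (reconstructed): cosine similarity between embedding vectors u and v
\text{cosine\_similarity}(\mathbf{u}, \mathbf{v})
  \;=\; \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  \;=\; \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \, \sqrt{\sum_{i=1}^{n} v_i^{2}}}
```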

Figure 7. The figure focuses on putting the first and second part together and calculating the contrastive loss. (Radford et al. 2021)

In the CLIP model, as seen in Figure 7, cosine similarity is employed to calculate the contrastive loss. This involves computing the cosine similarity between each normalized text embedding and each normalized image embedding, which quantifies the semantic similarity between the given text and image pair. By maximizing the cosine similarity for positive pairs (matching text and image) and minimizing it for negative pairs (mismatched text and image), the model learns to effectively align semantic representations across the two modalities.
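The following is a hedged PyTorch rendering of this symmetric contrastive loss over a batch of image-text pairs, written in the spirit of the paper's pseudocode. Names such as `logit_scale` are our own, and details such as the learned temperature differ from the released implementation.

```python
# Sketch of a CLIP-style symmetric contrastive loss over a batch of N (image, text) pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale=100.0):
    """image_emb, text_emb: (N, d) embeddings produced by the two encoders."""
    # Normalize so dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) matrix of scaled cosine similarities: entry (i, j) compares image i with text j.
    logits_per_image = logit_scale * image_emb @ text_emb.T
    logits_per_text = logits_per_image.T

    # The matching pairs lie on the diagonal; treat them as the "correct class".
    targets = torch.arange(image_emb.size(0))
    loss_i = F.cross_entropy(logits_per_image, targets)  # image -> correct text
    loss_t = F.cross_entropy(logits_per_text, targets)   # text -> correct image
    return (loss_i + loss_t) / 2

# Example with random embeddings standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```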

2.3 Zero Shot Classification

Zero-shot classification in the CLIP model enables it to recognize objects or concepts that it hasn’t been explicitly trained on. By leveraging a large pre-trained language and vision model, CLIP learns to associate text prompts with corresponding images during training, allowing it to generalize to unseen classes at inference time. This capability is achieved through the model’s ability to understand natural language descriptions and relate them to visual content, enabling versatile and adaptive classification without the need for fine-tuning on specific classes.
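As a rough sketch of how zero-shot classification looks in practice with the open-source CLIP package from the authors' GitHub repository (installable via `pip install git+https://github.com/openai/CLIP.git`); the image path, class names, and prompt template below are our own illustrative choices:

```python
# Zero-shot classification sketch using the openai/CLIP package.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Build a zero-shot "classifier" purely from class names rendered as text prompts.
class_names = ["airplane", "car", "bird", "cat", "dog"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarities turned into class probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

Because the "classifier" is just a set of embedded prompts, swapping in new class names requires no retraining, which is exactly the flexibility described above.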

3 Impact and Critical Assessment

Impact - Since CLIP allows people to design their own classifiers, how classes are designed influences model performance and model biases. There are also discrepancies across gender and race for people categorized into “crime” and “non-human” categories, even when care is taken with thoughtful class design. Since CLIP does not need task-specific training data, it can unlock certain tasks with greater ease; this also means it may raise privacy- or surveillance-related risks (Radford et al. 2021).

Table 1. Comparison of visual n-Grams to CLIP. (Radford et al. 2021)

Accuracy - The best CLIP model improves zero-shot accuracy on ImageNet from the 11.5% proof of concept of Visual N-Grams to 76.2%, matching the performance of the original ResNet-50 despite using none of the crowd-labeled training examples (Radford et al. 2021).

Results - Figure 8 shows the performance of CLIP’s zero-shot classifiers compared to a simple off-the-shelf baseline: fitting a logistic regression classifier on the features of the canonical ResNet-50 (Radford et al. 2021). This comparison spans 27 datasets, and zero-shot CLIP outperforms the baseline slightly on 16 of them. CLIP underperforms on complex or abstract tasks such as counting objects in synthetic scenes (CLEVR Counts), which illustrates the weaker capabilities of zero-shot CLIP on more complex tasks. Figure 9 summarizes the findings on the models’ representation learning capabilities, measured on the dataset evaluation suite from Kornblith et al. (Chen et al. 2020). Models trained with CLIP scale well, and the largest model outperforms the best existing model (a Noisy Student EfficientNet-L2) in compute efficiency. The authors also found that CLIP vision transformers are about three times more compute-efficient than CLIP ResNets, allowing higher overall performance within the compute budget. These results replicate previous findings that vision transformers are more compute-efficient than convnets when trained on large datasets (Radford et al. 2021).

Figure 8. Zero-shot CLIP is competitive with a fully supervised ResNet-50 baseline across the 27 datasets. (Radford et al. 2021)

Figure 9. Linear probe performance of CLIP models in comparison with state-of-the-art computer vision models. (Radford et al. 2021)

Scalability - There is a concern that pre-training on a large internet dataset could have unintentional overlap with downstream evaluations. De-duplication analysis was performed to detect this. Although a small amount of overlap was found, overall accuracy was rarely shifted by more than 0.1%, with only seven datasets above this threshold. This is similar to findings from duplicate analyses in previous work on large-scale pre-training, where minimal changes in overall performance were observed. Future research would focus on characterizing the capabilities, faults, and biases of such models (Radford et al. 2021).

4 Replication of Results

4.1 Dataset

The CIFAR-10 dataset is a collection of 60,000 32x32-pixel images divided into two subsets. The main subset consists of 50,000 images, referred to as the training set; the secondary subset consists of 10,000 images, referred to as the testing set. The images are divided into the following 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each class comprises 6,000 images, further broken down into 5,000 in the training set and 1,000 in the testing set. CIFAR-10 is one of the most widely used datasets for training computer vision models. We gained access to the dataset using the built-in Torchvision datasets library.
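For reference, the dataset can be loaded in a few lines through Torchvision; in this sketch we pass CLIP's own `preprocess` transform so the 32x32 images are resized and normalized for the model (the paths and batch sizes are illustrative, not our exact configuration):

```python
# Loading CIFAR-10 with Torchvision; CLIP's `preprocess` transform handles resizing/normalization.
import clip
import torch
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

model, preprocess = clip.load("ViT-B/32", device="cpu")

train_set = CIFAR10(root="./data", train=True, download=True, transform=preprocess)
test_set = CIFAR10(root="./data", train=False, download=True, transform=preprocess)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)   # torch.Size([64, 3, 224, 224]) torch.Size([64])
print(train_set.classes)            # the 10 class names, e.g. 'airplane', 'automobile', ...
```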

To test the model on a broader dataset, we also explored the Caltech 256 dataset. Caltech 256 is a widely used benchmark dataset in computer vision, consisting of 30,607 images across 256 object categories. The dataset covers a diverse set of object classes, including animals, vehicles, household items, and natural scenes. This rich diversity makes the dataset challenging and suitable for evaluating the performance of computer vision algorithms, particularly in tasks such as object recognition, classification, and scene understanding. We broke the dataset down into smaller sections, selecting roughly 5,000 images for training and 1,000 images for validation. The dataset is available online and free to download.
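A comparable sketch for Caltech 256 is shown below; torchvision also ships a `Caltech256` dataset class, and the split sizes and seed here are illustrative of how a smaller training/validation subset can be carved out rather than our exact procedure.

```python
# Sketch: carving a smaller training/validation subset out of Caltech 256.
import clip
import torch
from torchvision.datasets import Caltech256
from torch.utils.data import Subset

_, preprocess = clip.load("ViT-B/32", device="cpu")

# Some Caltech 256 images are grayscale, so convert everything to RGB before preprocessing.
dataset = Caltech256(root="./data", download=True,
                     transform=lambda img: preprocess(img.convert("RGB")))

generator = torch.Generator().manual_seed(0)
perm = torch.randperm(len(dataset), generator=generator)
train_subset = Subset(dataset, perm[:5000].tolist())    # ~5,000 images for training
val_subset = Subset(dataset, perm[5000:6000].tolist())  # ~1,000 images for validation

print(len(train_subset), len(val_subset))  # 5000 1000
```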

4.2 Model Training

The pretrained CLIP model available on GitHub offers a selection of five different models, each varying in architecture size and complexity. For this project, we opted to use the “ViT-B/32” model, notable for its extensive parameter count of over 155 million, making it capable of capturing rich visual and textual features during training. With our chosen model in hand, we embarked on the training process using the dataset described in the previous section.

The training loop unfolds in several key steps to iteratively refine the model’s performance (a condensed code sketch follows the list):

  • Within each iteration, a batch of images along with their corresponding captions is loaded into memory.

  • The data is then fed through the CLIP model, which generates predictions based on the learned associations between textual and visual representations.

  • These predictions are compared against the ground truth annotations to compute the loss, which quantifies the disparity between the predicted and actual labels.

  • Through backpropagation, the computed loss is propagated back through the network, enabling the model to adjust its parameters in the direction that minimizes the loss, thereby improving its performance.

  • This fine-tuning process continues for a predetermined number of epochs, gradually refining the model’s understanding of the intricate relationship between our specific set of images and their corresponding textual descriptions.
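The sketch below condenses these steps into a fine-tuning loop using the released CLIP package. The learning rate, caption template, and helper names are our own assumptions, and details such as mixed-precision handling (the released model uses fp16 weights on GPU) are omitted for brevity; treat it as an outline rather than our exact training code.

```python
# Sketch of a CLIP fine-tuning loop on (image, label) batches; illustrative only.
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)

def train_one_epoch(loader, class_names):
    model.train()
    for images, labels in loader:
        # 1. Load a batch of images and build captions from their class labels.
        images = images.to(device)
        captions = clip.tokenize(
            [f"a photo of a {class_names[l]}" for l in labels.tolist()]).to(device)

        # 2. Forward pass: the model returns image-vs-text similarity logits.
        logits_per_image, logits_per_text = model(images, captions)

        # 3. Compare predictions against the ground-truth pairing (the diagonal).
        targets = torch.arange(images.size(0), device=device)
        loss = (F.cross_entropy(logits_per_image, targets)
                + F.cross_entropy(logits_per_text, targets)) / 2

        # 4. Backpropagate and update the parameters to reduce the loss.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```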

Figure 10 presents the results we obtained during training, plotting validation accuracy and loss against the number of epochs.

Figure 10. Training loss and validation accuracy versus the number of epochs.

As the plot shows, considerably more data, more epochs, and additional tuning would be required to drive the loss toward zero and further increase the accuracy. Due to equipment limitations, we had to train on a smaller subset of the data for a smaller number of epochs because of the sheer size of the model. The model contains over 155 million parameters, and the large dataset posed a significant challenge for our personal computers.

See the Code tab for a full code listing.

4.3 Classification Examples

4.3.1 Principle of Similarity Matrix

In the context of the CLIP model, a similarity matrix serves as a vital tool for quantifying the semantic relationships between textual descriptions and associated images. Represented as a square matrix, it captures pairwise similarities between elements in the dataset, aiding tasks such as image retrieval and clustering. By computing similarity scores between image-caption pairs, the matrix facilitates the exploration of underlying patterns and structures within multimodal data. This matrix plays a crucial role in enhancing our understanding of the intricate connections between textual and visual information, driving advancements in multimodal learning within the CLIP framework.

Figure 11. A similarity matrix, and how the CLIP model is structured to find the similarity between text and images. (Radford et al. 2021)

4.3.2 Cosine Similarity Between Text and Image Features

In the illustration below, we present a straightforward classification example for the CLIP model. The image depicts a cosine similarity matrix comparing five images with five distinct captions. The diagonal of the matrix represents the similarity index, showcasing how each image aligns with its corresponding caption. This visualization offers a clear depiction of the model’s ability to discern semantic relationships between images and textual descriptions through cosine similarity computations.

Figure 12. The above image illustrates the working of the CLIP model where we see the highest similarity between images and respective captions. (Radford et al. 2021)
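A sketch of how such a 5x5 similarity matrix can be computed with the released model is shown below; the filenames and captions are placeholders of our own, not the examples used in the figure.

```python
# Sketch: cosine similarity matrix between 5 images and 5 captions using CLIP.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_files = ["dog.jpg", "cat.jpg", "car.jpg", "plane.jpg", "horse.jpg"]   # placeholder paths
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car",
            "a photo of a plane", "a photo of a horse"]

images = torch.stack([preprocess(Image.open(f)) for f in image_files]).to(device)
texts = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T   # (5, 5) cosine similarity matrix

print(similarity)  # the diagonal should carry the highest values when pairs match
```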

4.4 Possible Real-World Applications

4.4.1 Robotics Application

In today’s rapidly advancing world, there is a growing demand for robots designed to perform intricate tasks aimed at enhancing our daily lives. Among the recent trends, there’s a notable surge in the development of autonomous robots. In this context, we advocate for the integration of vision and language-based models into the field of robotics. These models enable robots to leverage camera input, identify various objects within their view, and generate tailored prompts for user interaction.

For instance, imagine a humanoid robot equipped with vision capabilities observing objects on a table. The robot can then provide users with a range of options, such as ‘Pick up the bottle’ or ‘Dispose of the trashcan.’ This approach empowers users to receive prompts that outline the robot’s potential actions, allowing them to select the most appropriate response.

Figure 13. Humanoid Robotics Application. The image on the right shows the prompts that the user will see as possible actions.

4.4.2 Healthcare Application

There are numerous examples throughout the healthcare field where deep learning models such as CLIP may open the door to multimodal data integration, which could enhance our understanding of individual health (i.e., personalized medicine) as well as population-level epidemiology (e.g., pandemic surveillance). A recent article by Acosta et al. highlighted seven opportunities for general multimodal biomedical AI, as shown in Figure 14 (Acosta et al. 2022).

Figure 14. Data modalities and their applications in healthcare (Acosta et al. 2022). CLIP specifically has applications with EHR data that is uploaded as images, such as radiological scans and environmental monitoring.

Healthcare would specifically benefit from the CLIP model where images and scans are utilized, which is common in specialties such as oncology, neurology, radiology, and orthopedics. Some CNN models are already being implemented to separate “noise” from the true image produced by scans in these specialties, as shown in Figure 15 (Liu et al. 2021). As explained by Acosta et al., models like CLIP could play a crucial role not just in separating noise, but also in extracting and interpreting information from scans and other data uploaded as images, which is common in Electronic Health Record (EHR) systems (Acosta et al. 2022). Taking this a step further, once analyzed by CLIP, this data could be integrated into the existing collection of personalized omics data maintained in EHRs, providing a much more robust assessment of individual health.

Figure 15. An example of radiological “cleaning” with CNNs (Liu et al. 2021). Low-resolution and low-count PET scans of the abdomen and brain were cleaned with a trained CNN to produce a deblurred or denoised image that resembled the full-count and high-resolution images taken on a higher quality and more expensive machine (Liu et al. 2021).

Environmental monitoring is an area of medicine that is frequently overlooked but could potentially benefit from models like CLIP. Acosta et al. briefly mention that these models could be used in conjunction with ambient sensors to improve remote care systems both at home and in assisted living facilities (ALFs) (Acosta et al. 2022). The CLIP model could be trained to recognize high-risk situations, such as stroke symptoms (which include physical changes) and accidents such as falls. CLIP may even be able to distinguish activities of mental stimulation from boredom, which would support the goals of ALFs to provide physical as well as mental support to the aging population.

4.4.3 Assistive Device Application

Another real-world application of the CLIP model is to employ it as an assistive technology for people with visual impairments. Historically, the majority of blind people have had to rely on passive devices such as a walking cane to navigate their surroundings, using the cane to detect objects in their path. A small number use guide dogs to navigate around obstacles. But these methods are not ideal; as an example, a guide dog is unable to interpret street signs. We imagine a CLIP model combined with a large language model and deployed to a device that can recognize and identify objects for the blind and for people with visual impairments. The advantage of the CLIP model is that the output object identification is not limited to a one-word response: because the model is trained on captions and textual descriptions, the output has the potential to provide a vivid sentence describing the object as it relates to its surroundings. This would allow the visually impaired person to have genuine awareness of their environment.

5 Limitations and Conclusion

5.1 Limitations

CLIP is geared towards a contrastive objective rather than a predictive objective. While this approach requires significantly less computational resources, it comes at a cost to accuracy. A contrastive objective learns to predict the relationship between two objects, in this case whether the text and the image belong together, rather than predicting the text itself. As such, the model tends to have difficulty distinguishing between objects within a specific class; for example, it struggles to correctly label different breeds of dogs, while having more success simply classifying the image as a dog. CLIP is a generalized model, meaning that it was created to handle a wide variety of tasks by relating images to text in order to improve zero-shot performance. However, that generality underperforms on several specialized tasks such as medical imaging, self-driving tasks, and satellite image classification, to name a few (Radford et al. 2021).

CLIP is a new model, having been introduced in 2021, and naturally it has yet to reach the zero-shot accuracy of the state-of-the-art models available today. The accuracy margin widens even further when CLIP is compared to specialty models trained for a specific task, such as celebrity identification. Achieving such accuracy would require roughly a 1000x increase in compute, which is infeasible given current hardware limitations (Radford et al. 2021), with GPU capacity, memory requirements, and energy consumption at the forefront of those limitations. These hardware constraints could be one of the reasons the researchers at OpenAI used the contrastive approach mentioned in the previous paragraph.

5.2 Conclusion

Despite these limitations, the CLIP model advances image classification by combining images and text to learn. It accomplishes this through its pre-training task, which uses image-text pairs to jointly train an image encoder and a text encoder. CLIP shows great promise in its efficiency relative to its performance: of the models presented in the paper, the largest ResNet took 18 days to train on 592 V100 GPUs and the largest Vision Transformer took 12 days to train on 256 V100 GPUs (Radford et al. 2021). Compare that with some of the larger image classification models, which take months and in some cases more than a year to train.

Furthermore, the CLIP model is scalable given the number of captioned images that can be found on the internet, for example on social media sites, making it ideal for self-supervised learning. Self-supervised in the sense that the training data can be scraped directly from the internet instead of relying on people or machines to annotate and label the data; this step alone significantly reduces the pre-training effort. Just as it took about a decade for the leap from LeNet to AlexNet to materialize, in part due to hardware limitations (chiefly the storage capacity of computers and the advancement of the GPU), we predict that with continuing research and the next generation of hardware, the CLIP model will play a pivotal role in the future of classification problems.

6 References

Acosta, J. N., Falcone, G. J., Rajpurkar, P., and Topol, E. J. (2022), “Multimodal biomedical AI,” Nature Medicine, 28, 1773–1784. https://doi.org/10.1038/s41591-022-01981-2.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. (2020), “Big Self-Supervised Models are Strong Semi-Supervised Learners,” arXiv. https://doi.org/10.48550/arXiv.2006.10029.
Fukushima, K. (1980), “Neocognitron: A self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, 36, 193–202. https://doi.org/10.1007/BF00344251.
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012), “ImageNet Classification with Deep Convolutional Neural Networks,” Neural Information Processing Systems, 25. https://doi.org/10.1145/3065386.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998), “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 86, 2278–2324. https://doi.org/10.1109/5.726791.
Liu, J., Malekzadeh, M., Mirian, N., Song, T.-A., Liu, C., and Dutta, J. (2021), “Artificial Intelligence-Based Image Enhancement in PET Imaging,” PET Clinics, 16, 553–576. https://doi.org/10.1016/j.cpet.2021.06.005.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021), “Learning Transferable Visual Models From Natural Language Supervision,” arXiv. https://doi.org/10.48550/arXiv.2103.00020.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016), “You Only Look Once: Unified, Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788. https://doi.org/10.1109/CVPR.2016.91.
Sabour, S., Frosst, N., and Hinton, G. E. (2017), “Dynamic Routing Between Capsules,” arXiv. https://doi.org/10.48550/arXiv.1710.09829.
Simonyan, K., and Zisserman, A. (2014), “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv. https://doi.org/10.48550/arXiv.1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015), “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9. https://doi.org/10.1109/CVPR.2015.7298594.