Final Project: Unsupervised Continual Learning

Jones Lin, Grant Ovsepyan, Christopher Gottwaldt, Michael Berkey

Project Presentation

You can see us present our project in this presentation video.

The following is a link to our Google Slides presentation that was used as the visual aspect of the video.

Motivation

We wanted to create an unsupervised continual learning model that maintains its image classification performance on prior datasets even after being trained on new ones. Deep learning models have achieved strong performance on tasks such as image classification, detection [[24](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)], and segmentation [[9](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)], with applications in high-impact fields such as medicine [[1](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)]. Despite this progress, most of these models still require static training, unlike human cognition [[11](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)], which learns in a more dynamic manner. When new data arrives, retraining a network on the entire combined dataset -- both new and old data -- is expensive and often infeasible due to limited computational resources and data privacy concerns, which makes the static training requirement a serious limitation.

Under the static training strategy, a model must be trained on the whole dataset at once; otherwise, it abruptly "forgets" what it has learned as soon as it is given new data. This phenomenon is termed "catastrophic forgetting" [[5](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] [[17](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)]. For example, if we train a ResNet18 [[7](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] model, call it A, on the ImageNet dataset [[4](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] and then feed it novel data with new labeled categories to obtain another model B, we find that model B forgets the knowledge gleaned from ImageNet: its performance on the ImageNet dataset drops significantly.

Previous works have tried to address catastrophic forgetting by replaying previous data [[22](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] [[23](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)], adding a regularization term to the loss function [[12](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] [[10](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)], or isolating important parameters [[16](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] [[15](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)]. These methods all aim to prevent catastrophic forgetting during supervised learning. In our project, we consider a stricter setting in which we want to achieve continual learning under unsupervised learning.

Unsupervised learning trains a model without labeled data. Supervised learning can be very accurate with less training data, but the downside is that the data must be labeled accurately, a notoriously tedious process that unsupervised learning largely avoids.

Although supervised learning is useful, it also has trouble competing with unsupervised models in the continual setting. SimSiam models [[3](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)], which are trained with an unsupervised (self-supervised) objective, have demonstrated themselves to be very effective in classification tasks. SimSiam takes two randomly augmented views of the same image, processes them through an encoder network, and then maximizes the similarity between the outputs for the two augmentations [[2](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)].
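To make this objective concrete, here is a minimal PyTorch-style sketch of the SimSiam loss described above. It is not the reference implementation; `backbone`, `projector`, and `predictor` are placeholder modules we assume the reader supplies.

```python
# Minimal sketch of the SimSiam objective (not the authors' code).
# `backbone`, `projector`, and `predictor` are assumed user-defined nn.Modules.
import torch
import torch.nn.functional as F

def simsiam_loss(backbone, projector, predictor, view1, view2):
    """Negative cosine similarity between each view's prediction and the
    other view's (stop-gradient) projection, averaged over both directions."""
    z1 = projector(backbone(view1))   # projection of the first augmented view
    z2 = projector(backbone(view2))   # projection of the second augmented view
    p1, p2 = predictor(z1), predictor(z2)

    def neg_cos(p, z):
        # stop-gradient on z helps prevent representational collapse
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```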

Even one of the most naïve unsupervised continual learners, FINETUNE with SimSiam, outperforms virtually every supervised continual learning model except DER when benchmarked on Split CIFAR-10, Split CIFAR-100, and Split Tiny-ImageNet. In [this](<https://openreview.net/pdf?id=9Hrka5PA7LW>) paper, the authors list several metrics on which unsupervised continual learning models outperform their supervised counterparts [[14](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)]. Section 5.3 presents the notable findings: unsupervised continual learners are less prone to catastrophic forgetting than their supervised counterparts, they appear more robust to forgetting in deeper layers (demonstrated by higher CKA feature similarities and lower l2 distances), and their task loss has a flatter and smoother landscape.
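For reference, the CKA feature similarity mentioned above can be computed with linear CKA between a layer's activations before and after training on a new task. The sketch below is a generic implementation of that measure under our assumptions, not code from the cited paper.

```python
# Hedged sketch of linear CKA, a layer-wise representation similarity measure.
import torch

def linear_cka(X, Y):
    """X, Y: (n_samples, n_features) activations of the same layer on the same
    inputs, e.g. before vs. after learning a new task. Returns a scalar in [0, 1]."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = torch.norm(Y.T @ X, p='fro') ** 2
    return hsic / (torch.norm(X.T @ X, p='fro') * torch.norm(Y.T @ Y, p='fro'))
```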

We want to solve this problem because we would like a model that can learn from a variety of datasets without losing performance on previous ones. This brings us closer to models that are useful beyond a single specialized use case while minimizing both manual intervention and computational cost.

Approach

In our project, we wanted to mitigate catastrophic forgetting in self-supervised continual learning using our own, novel approach; our hope is that deep learning models can remember what they have learned from training on previous datasets. We are not the first to consider continual learning in an unsupervised setting [[20](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] [[6](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] [[18](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)], but previous attempts either use pseudo-labels to train their models or rely on contrastive learning as a pretext task. Since previous methods focus on achieving self-supervised learning via contrastive learning, our methodology differs in that we use image reconstruction as the pretext task. More specifically, we use Masked Autoencoders (MAE) [[8](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] as our baseline and explore whether existing continual learning techniques, such as Semantic Drift Compensation [[26](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] and Nearest-Mean-of-Exemplars classification [[22](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)], are compatible with it. We hope this approach will enable training models that adapt to new tasks without forgetting their prior knowledge.
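To illustrate the reconstruction pretext task, here is a rough PyTorch sketch of an MAE-style training loss. It is not the official MAE implementation: the `encoder` and `decoder` modules, and in particular the decoder's signature, are assumptions made for the sake of the example.

```python
# Rough sketch of the MAE pretext objective: mask most patches, reconstruct,
# and compute the loss only on the masked patches.
import torch

def mae_loss(encoder, decoder, patches, mask_ratio=0.75):
    """patches: (batch, num_patches, patch_dim) flattened image patches."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Random per-image shuffle; keep the first `num_keep` patches visible.
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Encode visible patches only, then decode to predict all patches.
    latent = encoder(visible)              # (B, num_keep, embed_dim), assumed
    pred = decoder(latent, ids_shuffle)    # (B, N, patch_dim), assumed signature

    # Reconstruction loss restricted to the masked patches.
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)        # 0 = visible, 1 = masked
    loss = ((pred - patches) ** 2).mean(dim=-1)
    return (loss * mask).sum() / mask.sum()
```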

Before investigating why the model forgets, we needed to establish the bounds. To do this, we trained on the whole dataset jointly with both the SimSiam [[3](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] and MAE [[8](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] models. We then split the benchmark into multiple sub-datasets: the 101 categories of the Caltech101 dataset were divided into 10 small datasets containing 10 or 11 classes each. Next, we trained one model with MAE and another with SimSiam on these ten sub-datasets sequentially, evaluating their performance on the previous sub-datasets to determine how much forgetting takes place. We compared their performance to conclude whether MAE is more effective at achieving continual learning. Finally, we combined the higher-performing method with existing continual learning techniques such as Semantic Drift Compensation [[26](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)] and Nearest-Mean-of-Exemplars classification [[22](<https://christophergottwaldt.notion.site/CS-639-Final-Project-Presentation-Masked-Autoencoders-in-Continual-Learning-87c76186be404077b110a034f55aadb5>)].
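The sketch below illustrates this sequential protocol under stated assumptions; `make_task_loader`, `train_unsupervised`, and `evaluate_knn` are hypothetical helpers standing in for our data loading, self-supervised training, and frozen-feature evaluation code.

```python
# Hedged sketch of the sequential protocol: split Caltech101's 101 classes into
# 10 tasks and, after each task, evaluate on all earlier tasks.
import numpy as np

def run_continual_protocol(model, all_classes, num_tasks=10):
    splits = np.array_split(np.array(all_classes), num_tasks)  # 10 or 11 classes each
    acc = np.zeros((num_tasks, num_tasks))  # acc[i, j]: accuracy on task j after training task i

    for i, task_classes in enumerate(splits):
        loader = make_task_loader(task_classes, train=True)     # hypothetical helper
        train_unsupervised(model, loader)                       # MAE or SimSiam objective

        for j in range(i + 1):                                  # evaluate every task seen so far
            test_loader = make_task_loader(splits[j], train=False)
            acc[i, j] = evaluate_knn(model, test_loader)        # frozen features + k-NN probe
    return acc
```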

As previously explained, SimSiam-based unsupervised continual learning has shown a lot of promise, so we tested it alongside MAE in two experiments: using the base ViT model without pretraining, we evaluated the average accuracy and the forgetting of each model. The following two figures show that MAE is better both at classifying accurately (the first graph below) and at preventing forgetting (the second graph below).
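The two quantities plotted below can be computed from the accuracy matrix produced by the protocol sketched above. This uses the common definitions of average accuracy and forgetting, which may differ in minor details from our exact evaluation code.

```python
# Sketch of the reported metrics, given acc[i, j] = accuracy on task j after task i.
import numpy as np

def average_accuracy(acc):
    T = acc.shape[0]
    return acc[T - 1].mean()                              # final accuracy averaged over all tasks

def average_forgetting(acc):
    T = acc.shape[0]
    drops = [acc[:T - 1, j].max() - acc[T - 1, j] for j in range(T - 1)]
    return float(np.mean(drops))                          # how much earlier tasks degraded
```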


Image Classification Accuracy of SimSiam vs MAE Methods


Image Classification Forgetting Rate of SimSiam vs MAE Methods