Fig. 6. Classification of the different techniques able to deal with temporal heterogeneity in the behaviour of the data samples.
to establish relations between the different strategies' results. In this case, nonetheless, most of the works just presented compare their methods with EWC [143], using it as a baseline. In [143], the MNIST dataset is employed to simulate different input data distributions: a random permutation of the pixels is drawn and applied to all of the images to create each input domain. With this strategy, a single dataset, denoted Permuted MNIST, suffices to simulate any number of domains.
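As a concrete illustration, the following is a minimal sketch of this construction (the function name and seeding scheme are our own, not taken from [143]):

```python
import numpy as np

def make_permuted_domain(images, seed):
    """Build one Permuted MNIST domain: draw a fixed random pixel
    permutation from `seed` and apply it to every image."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(28 * 28)          # one permutation per domain
    flat = images.reshape(len(images), -1)   # (N, 784)
    return flat[:, perm].reshape(images.shape)

# Each seed yields a different input domain from the same dataset:
# domains = [make_permuted_domain(mnist_images, seed=s) for s in range(10)]
```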
In [144], the authors compare the accuracy obtained with their learning method (P&C) in each domain, as well as the mean accuracy, against EWC, showing that P&C achieves slightly better results while avoiding catastrophic forgetting. The same happens in [139], where the CLEAR method is compared with both EWC and P&C: CLEAR outperforms EWC in most situations and attains results very similar to those of P&C. The works [140,145] also use the Permuted MNIST dataset and improve on the mean accuracy of EWC and of the other methods they compare with. Lastly, generative strategies such as [141] prove to be effective at preventing catastrophic forgetting; they use the MNIST and SVHN datasets to show that performance is barely affected when the input dataset changes.
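Since EWC serves as the common baseline in all these comparisons, the following sketch shows its well-known quadratic penalty (a standard formulation, not code from any of the cited works; `fisher` and `old_params` are assumed to be dictionaries keyed by parameter name):

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1.0):
    """EWC regularizer: penalize each parameter's deviation from its value
    after the previous task, weighted by its estimated Fisher information."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# total_loss = task_loss + ewc_penalty(model, fisher, old_params, lam=400.0)
```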
5.2. Real concept drifts

In this kind of situation, users are allowed to change the task they are performing during the training process. In Section 3.3.2 we highlighted the need for a model designed in such a way that different clients can have distinct outputs even though they own similar inputs. Our interest now, however, is a model able to switch from one output to another for the same client at different timestamps. Two kinds of strategies address this challenge: Contextual Information methods on the one hand, and Architecture-based methods on the other (see Fig. 6).
The approach of Contextual Information methods is closely related to the one discussed in Section 3.3.2 [108]. That work considered the possibility of adding a piece of information 𝑧 to the original data inputs, such as a task identifier or a domain identifier. This information 𝑧 can be designed to address both the multi-domain and the multi-task issues in a variety of situations beyond the one presented in the article. Further research articles in this line are [146–148]. The authors of [146] claim that the sought tasks affect the training process and propose using a task identifier to achieve personalization. The other approaches [147,148] also rely on some kind of contextual information to determine the task corresponding to each data sample and, on the whole, act as if they had a specific network for each of the tasks: when a new task appears during training, each layer of the network is expanded, and the new neurons are used for the current task but not for previous ones.
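A minimal sketch of this idea, conditioning a shared network on a task identifier 𝑧 (the class and its dimensions are illustrative, not taken from [146–148]):

```python
import torch
import torch.nn as nn

class ContextConditionedNet(nn.Module):
    """Shared network conditioned on a task identifier z: the embedding of
    z is concatenated to the input, so the same x can map to different
    outputs depending on the task."""
    def __init__(self, in_dim, n_tasks, z_dim=8, hidden=128, n_classes=10):
        super().__init__()
        self.z_embed = nn.Embedding(n_tasks, z_dim)  # learned task code z
        self.net = nn.Sequential(
            nn.Linear(in_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x, task_id):
        z = self.z_embed(task_id)                    # (batch, z_dim)
        return self.net(torch.cat([x, z], dim=-1))
```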
Table 6
Summary of the datasets employed in the works presented in Section 5.2. Asterisks indicate that the datasets have been modified in particular ways. Some of the datasets mentioned were not referenced so far: Atari games [153].

Article   Datasets used in experiments
[146]     MNIST*; CIFAR-10*
[147]     MNIST*; CIFAR-100
[148]     MNIST*
[149]     ImageNet; CUBS; Oxford102Flowers
[150]     ImageNet; CUBS; Oxford102Flowers
[151]     ImageNet; CUBS; Oxford102Flowers
[152]     Atari games
The other kind of strategies, Architecture-based methods, focuses on deep incremental multi-task learning techniques that modify the neural network depending on the task being performed, without any forgetting of previous tasks. The most studied way of accomplishing this is to apply some kind of mask to the neural network. Several works have explored this alternative: the authors of [149], for instance, study the case of a neural network already trained for one task and make use of a weight-level binary mask to cancel some of the weights, so that the resulting network can solve another previously-defined task. This process can be repeated for several tasks without any forgetting of the original one, since the weights themselves are never changed.
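A sketch of a weight-level binary mask in this spirit (the straight-through training trick and all names below are our own assumptions, not the exact mechanism of [149]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryMaskedLinear(nn.Module):
    """A frozen pretrained linear layer whose weights are gated by a
    per-task binary mask; only the real-valued mask scores are trained."""
    def __init__(self, pretrained: nn.Linear, threshold=0.0):
        super().__init__()
        self.register_buffer("weight", pretrained.weight.detach().clone())
        self.register_buffer("bias", pretrained.bias.detach().clone())
        self.scores = nn.Parameter(torch.zeros_like(self.weight))
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()
        # straight-through estimator: use the hard mask in the forward
        # pass, but let gradients flow to the underlying scores
        mask = hard + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)
```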
A slightly different approach, proposed by the same authors [150], consists of starting with a neural network trained for one task, setting some of its weights to zero, i.e., eliminating some neural connections, and retraining the model for a few epochs on the initial task. After that, when learning a new task, the already-assigned weights are kept fixed, while the eliminated connections are re-established and trained. This method is not scalable to arbitrarily many tasks, as the number of neural connections is limited.
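A rough sketch of the pruning step this describes, assuming magnitude-based selection over the weights not yet claimed by earlier tasks (the function and its parameters are illustrative, not the exact procedure of [150]):

```python
import torch

def prune_free_weights(weight, free_mask, keep_ratio=0.5):
    """Among the weights not yet assigned to an earlier task (`free_mask`),
    keep the largest-magnitude fraction for the current task and zero the
    rest, leaving those connections available for future tasks."""
    free_vals = weight[free_mask].abs()
    k = max(1, int(keep_ratio * free_vals.numel()))
    threshold = free_vals.kthvalue(free_vals.numel() - k + 1).values
    keep = free_mask & (weight.abs() >= threshold)  # this task's weights
    weight.data[free_mask & ~keep] = 0.0            # pruned, reusable later
    return keep                                     # freeze these afterwards
```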
Another strategy based on the same ideas uses a ternary neuron-level mask during training [151]. The reasoning behind a ternary mask is that some neurons may be useful both for a new task and for a previously learned one, so three possible states are considered for each neuron with respect to each task: unused, used but not trainable, or trainable. This paper also faces the scalability problem by allowing the network to grow when necessary, setting the new neurons as unused for previous tasks so as not to modify their accuracy.
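One plausible way to realize these three states (a loose interpretation of [151], not its actual implementation) is to zero unused neurons and block gradients through frozen ones:

```python
import torch

# Three per-neuron states with respect to each task
UNUSED, USED_FROZEN, TRAINABLE = 0, 1, 2

def apply_ternary_mask(activations, states):
    """Zero the unused neurons; let frozen neurons contribute to the
    forward pass while detaching them so the weights producing them
    receive no gradient; leave trainable neurons untouched."""
    frozen = (states == USED_FROZEN).float()
    trainable = (states == TRAINABLE).float()
    return activations * trainable + (activations * frozen).detach()
```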
The authors of [152], in contrast, propose a completely different technique. They start with a deep neural network for the first task and, when interested in learning a new one, instantiate a new neural network and create connections from the already existing networks to each layer of the new one, in order to leverage the knowledge from previous tasks. This strategy is really useful when dealing with related tasks, but tasks that interfere with each other might harm the resulting model.
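A minimal sketch of one such column with a lateral connection from the previous task's (frozen) network, assuming a simple two-layer architecture of our own choosing:

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """A new column for a new task; it receives a lateral connection from
    the frozen hidden layer of the previous task's column."""
    def __init__(self, in_dim, hidden, out_dim, prev_hidden=None):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden)
        self.l2 = nn.Linear(hidden, out_dim)
        self.lateral = (nn.Linear(prev_hidden, hidden)
                        if prev_hidden is not None else None)

    def forward(self, x, prev_h1=None):
        h1 = torch.relu(self.l1(x))
        if self.lateral is not None and prev_h1 is not None:
            # reuse the earlier column's features without updating it
            h1 = h1 + torch.relu(self.lateral(prev_h1.detach()))
        return self.l2(h1)
```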
These kinds of methods differ considerably in how they carry out their experimental evaluations. Some pay attention to the accuracy obtained, others are more concerned with the error committed, and yet others concentrate on the level of forgetting incurred. For instance, in [146] the authors employ the MNIST dataset with pixel permutation, as in [143], and additionally exchange some class labels in parts of the dataset to simulate the different behaviours. They compare their results with EWC and LwF [154], improving on both. However, they emphasize that this way of simulating the different tasks and behaviours is quite unrealistic.
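For concreteness, the label exchange can be simulated roughly as follows (our own sketch of the idea, not the exact protocol of [146]):

```python
import numpy as np

def swap_labels(labels, pairs, fraction=0.5, seed=0):
    """Exchange the class labels in `pairs` (e.g. (0, 1)) on a random
    `fraction` of the samples, so that identical inputs demand different
    outputs: a simple simulation of a real concept drift."""
    rng = np.random.default_rng(seed)
    out = labels.copy()
    chosen = rng.random(len(labels)) < fraction
    for a, b in pairs:
        sel_a = chosen & (labels == a)
        sel_b = chosen & (labels == b)
        out[sel_a], out[sel_b] = b, a
    return out
```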
Surprisingly, [149–151] employ the same datasets to perform their experiments: the ImageNet dataset [104], used for the pretrained ImageNet-VGG-16 neural network; the CUBS dataset [155]; and the Oxford102Flowers dataset [156] (see Table 6). This is, as we have seen in this paper, very rare.