Definition 3. Given a time period $[0, t]$ and a set of samples $S_{0,t} = \{(x_j, y_j)\}_{j=0}^{t}$ drawn from a probability distribution $D_t(x, y)$, where $x_j$ is a feature vector and $y_j$ is its corresponding output, we say a Concept Drift occurs at timestamp $t$ if there is a significant difference between $D_t(x, y)$ and $D_{t+1}(x, y)$: $\exists\, t : D_t(x, y) \nsim D_{t+1}(x, y)$.

Note that concept drift is a complicated issue, and it becomes even worse in a federated environment. If the problem we are trying to deal with is by nature a federated problem that evolves in time, each client might experience a drift at a different moment. Moreover, a local concept drift does not necessarily have an impact on the global distribution: a local drift on client $i$ may result in a change in the distribution of $i$, but not in the joint distribution, $D_t^G(x, y)$. In a situation like this, that client should implement some kind of personalization to adapt the global model to its particularities. This is an example of why concept drifts are potentially dangerous for model performance, and hence must be detected and counteracted.
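To make Definition 3 concrete, the following minimal sketch flags a drift between two consecutive sample windows using a two-sample Kolmogorov–Smirnov test. The one-dimensional features, the window sizes, and the significance level are illustrative assumptions on our part, not part of the definition itself.

```python
# Minimal sketch of Definition 3: flag a drift at timestamp t when samples
# observed up to t and samples observed at t+1 are unlikely to come from
# the same distribution. The KS test, window sizes, and alpha are our own
# illustrative choices.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(window_t: np.ndarray, window_t1: np.ndarray,
                   alpha: float = 0.01) -> bool:
    """Return True if D_t and D_{t+1} differ significantly (1-D features)."""
    _, p_value = ks_2samp(window_t, window_t1)
    return p_value < alpha

rng = np.random.default_rng(0)
before = rng.normal(loc=0.0, scale=1.0, size=500)  # samples under D_t
after = rng.normal(loc=1.5, scale=1.0, size=500)   # samples under D_{t+1}
print(drift_detected(before, after))               # True: the mean has shifted
```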
When trying to deal with concept drifts, one should notice that not all of them are alike, as data can evolve in multiple ways. Similar to what occurred with non-IID data in FL, it is important to characterize concept drifts to distinguish them. However, in the case of concept drifts, most of the existing works share a common ground, and base their classification on which factor of the decomposition $P(x, y) = P(x) \cdot P(y|x)$ is altered. According to this criterion, we distinguish three types of shift [116,122]: (1) virtual, (2) real, and (3) total (see Fig. 4; a small synthetic sketch illustrating these cases follows the list):
(i) Virtual Concept Drift refers to variations in just the marginal probability density: $P_t(x) \neq P_{t+k}(x)$ and $P_t(y|x) = P_{t+k}(y|x)$. Returning to the example used in Section 3.1 of training an autonomous car, this situation happens, for instance, when clients move into places or regions previously unseen by them.
(ii) Real Concept Drift, related to differences in the conditional probabilities, $P_t(x) = P_{t+k}(x)$ and $P_t(y|x) \neq P_{t+k}(y|x)$, is caused by a change in the conditional probability of the classes with respect to the input features, i.e., similar input data samples that have unequal labels. An example of this would be, again, a yellow traffic light: sometimes a client stops the car when encountering a yellow traffic light, while at other times it continues driving.
(iii) Total Concept Drift is the mixture of the two other drifts, $P_t(x) \neq P_{t+k}(x)$ and $P_t(y|x) \neq P_{t+k}(y|x)$, and it is the result of both probabilities evolving significantly over time.
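The taxonomy can be made concrete with a small synthetic generator. The Gaussian inputs and the linear labeling rule below are our own illustrative assumptions, chosen only to make the two factors of $P(x, y)$ explicit.

```python
# Illustrative synthetic generators for the drift taxonomy above. The
# concrete distributions (Gaussian inputs, a threshold labeling rule) are
# our own assumptions, not taken from the cited works.
import numpy as np

rng = np.random.default_rng(0)

def sample(n: int, loc: float, flip_labels: bool):
    """Draw n samples: x ~ N(loc, 1), y = 1[x > 0], optionally inverted."""
    x = rng.normal(loc=loc, scale=1.0, size=n)
    y = (x > 0).astype(int)
    if flip_labels:  # changes P(y|x) while leaving P(x) untouched
        y = 1 - y
    return x, y

x_t, y_t = sample(1000, loc=0.0, flip_labels=False)      # reference concept
# (i) Virtual drift: P(x) changes, P(y|x) is preserved.
x_v, y_v = sample(1000, loc=2.0, flip_labels=False)
# (ii) Real drift: P(x) is preserved, P(y|x) changes.
x_r, y_r = sample(1000, loc=0.0, flip_labels=True)
# (iii) Total drift: both factors of P(x, y) = P(x) * P(y|x) change.
x_tot, y_tot = sample(1000, loc=2.0, flip_labels=True)
```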
This classification is analogous to the one we proposed in Section 3 for the data heterogeneity across clients. Apart from the cases we just discussed, concept drift also takes place when the task itself changes, since that also modifies $D_t(x, y)$. This scenario is closely related to multi-task learning [48,123,124]. Nevertheless, in the scenario we consider the task remains unchanged. It is therefore a single-incremental-task scenario [110].
4.2. Concept drift detection

Having settled what we understand by concept drift, we can discuss the methods developed to deal with it. Those methods typically consist of three parts. First, they need to detect modifications in the distributions. Then, they have to react to the detected changes, so that the resulting model is adjusted to the current scenario. Finally, it is important to explain the drift and understand its implications for future training. In this Section we only examine the detection strategies. The algorithms that implement a response to these drifts will be reviewed in Section 5.
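This three-part pipeline can be summarized by the following purely illustrative skeleton; the interfaces are hypothetical, and concrete detectors are sketched later in this section.

```python
# Abstract skeleton of the detect / adapt / explain pipeline described
# above. The interface names are hypothetical, not from the cited works.
from typing import Any, Protocol

class DriftHandler(Protocol):
    def detect(self, batch: Any) -> bool: ...  # 1. spot distribution changes
    def adapt(self, batch: Any) -> None: ...   # 2. adjust the model
    def explain(self) -> str: ...              # 3. characterize the drift

def process_stream(handler: DriftHandler, stream) -> None:
    for batch in stream:
        if handler.detect(batch):
            handler.adapt(batch)      # react so the model fits the new concept
            print(handler.explain())  # log insights for future training
```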
Many concept drift detection strategies have been proposed to tackle the situations of virtual and real drifts [8,125–133]. These approaches are often classified as Data Distribution-based or Error Rate-based methods, respectively. They use different statistical properties of the input and output distributions to identify their breaking points, corresponding to drifts. Most of these concept drift detection strategies consider a situation where data is centralized on one single machine. To the best of our knowledge, the only works that present a concept drift detection strategy in federated settings are [8,130]. Nonetheless, the strategies we highlight are, from our point of view, easily adaptable to the FL framework.
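For the Error Rate-based family, a compact detector can be sketched in the spirit of the classical Drift Detection Method (DDM). The warm-up length and the three-sigma threshold are the usual heuristics, and the streaming interface is our own assumption rather than a detail of the cited works.

```python
# Compact Error Rate-based detector in the spirit of the classical DDM.
# It monitors the running error rate p and its standard deviation s; a
# drift is signaled when p + s exceeds the best observed p_min + 3*s_min.
import math

class ErrorRateDetector:
    def __init__(self):
        self.n = 0
        self.p = 1.0                 # running error rate
        self.p_min = float("inf")    # best (lowest) error rate seen so far
        self.s_min = float("inf")    # its standard deviation

    def update(self, error: bool) -> bool:
        """Feed one prediction outcome; return True when a drift is signaled."""
        self.n += 1
        self.p += (float(error) - self.p) / self.n  # incremental mean
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.p + s < self.p_min + self.s_min:    # new best state
            self.p_min, self.s_min = self.p, s
        # 30-sample warm-up and 3-sigma threshold are the usual heuristics
        return self.n > 30 and self.p + s > self.p_min + 3.0 * self.s_min
```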
Data Distribution-based methods aim to detect virtual concept drift. When trying to detect this kind of drift, the only required information is the input pattern of the data samples, $\{x_i\}_{i=1}^{M}$, or some transformation of it. For instance, the strategy developed in [125] works directly with the input data, and it consists of measuring the similarities among the features, grouping them in clusters, and evaluating the number of features from the new data sample in each cluster to identify a drift. One possible way of adapting this to a federated environment would be for each client to calculate its own clusters and detect local drifts independently of what occurs on other clients. These drifts would be communicated to the central server, which would be responsible for taking that information into account.
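A minimal sketch of this federated adaptation could look as follows, assuming k-means clustering and a chi-square comparison of cluster occupancies; both choices, as well as the reporting interface, are our own illustrative assumptions rather than details of [125].

```python
# Hedged sketch of the per-client adaptation suggested above: each client
# clusters its local reference data and flags a drift when new samples
# occupy the clusters in significantly different proportions. The use of
# k-means and a chi-square test are our assumptions, not details of [125].
import numpy as np
from scipy.stats import chisquare
from sklearn.cluster import KMeans

class ClientDriftMonitor:
    def __init__(self, reference: np.ndarray, n_clusters: int = 5):
        self.model = KMeans(n_clusters=n_clusters, n_init=10).fit(reference)
        counts = np.bincount(self.model.labels_, minlength=n_clusters)
        self.proportions = counts / counts.sum()  # reference cluster occupancy

    def check(self, new_batch: np.ndarray, alpha: float = 0.01) -> bool:
        """True when the new batch's cluster occupancy deviates significantly."""
        labels = self.model.predict(new_batch)
        observed = np.bincount(labels, minlength=len(self.proportions))
        expected = self.proportions * len(new_batch)
        _, p_value = chisquare(observed, f_exp=expected)
        return p_value < alpha  # True -> report a local drift to the server
```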
On the other hand, [126] works with an alteration of the input data. They determine a mapping that relies on the input features of the samples, $f : X \subset \mathbb{R}^m \longrightarrow \{-1, 1\}$, and apply it to the whole input dataset, splitting it into two groups (the samples mapped to 1 and those mapped to $-1$). Then, they statistically compare whether the data received before and after a certain timestamp is equally distributed across those groups. If it is not, a drift is detected. The map $f$ has to verify certain conditions for this method to work with a high level of precision. One