participants does not present significant differences. Nonetheless, in
most real-life problems and situations this assumption is not satisfied:
each client acts in a particular way and thus collects biased data that differ from those collected by other participants. Further, clients may be interested in applying the obtained global model to make predictions in different scenarios. All of these possibilities are gathered under the equivalent concepts of non-IID data or data heterogeneity, which makes these terms rather ambiguous. Also, in real-world problems it is important to account for another source of heterogeneity: devices constantly extract new data from their environment and process it sequentially over time, so it is crucial to implement some Continual Learning strategy to adapt the current model to the newly arriving data samples [9,10].
There is another difficulty when evaluating heterogeneous data. To test whether a method is able to provide the clients with accurate models, a dataset that reflects the heterogeneity they present is required. Currently, there are almost no specific benchmarks designed to evaluate the quality of a model that tries to tackle realistic non-IID problems. One example is the LEAF benchmark [32], a framework that includes five datasets (FEMNIST, Shakespeare, Sent140, CelebA and Reddit) and reflects domain heterogeneity scenarios. However, some other non-IID cases are not considered, such as the ones that affect the clients' behaviours. We will further explain these terms shortly. Despite the existence of such a benchmark, most of the works we present in this paper do not employ currently existing benchmarks; instead, they obtain their empirical results by modifying some datasets to produce the heterogeneity they require for their experiments. An example of this is the MNIST dataset [33]. Many works use this dataset in their experiments,
but in some works images are rotated, different types of noise are added to the samples (domain shift), or the labels of two classes are swapped in a fixed proportion of the samples (behaviour changes). In addition, some of these modifications had such an impact that they were preserved as distinct datasets, such as MNIST-M [34] or MNIST-c [35]. In the end, each strategy is experimentally tested with a dataset customized in a unique way, distinct from the ones used by other works with the same objective. This is a huge problem, as it stands in the way of properly comparing the different methods and analysing which one provides better results.
Establishing at least one common benchmark dataset that represents
several kinds of heterogeneous data should be a priority.
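As an illustration of the kind of ad-hoc customization described above, the following sketch (our own illustration, not taken from any of the surveyed works) shows how one might derive three heterogeneous client partitions from an MNIST-like dataset: rotated or noisy images (domain shift) and a partial swap of two labels (behaviour change). The function names, the toy data and the chosen parameters are illustrative assumptions rather than part of any existing benchmark.

```python
import numpy as np
from scipy.ndimage import rotate  # arbitrary-angle image rotation

rng = np.random.default_rng(0)

def rotate_images(images, angle):
    """Domain shift: rotate every image by a fixed angle (degrees)."""
    return np.stack([rotate(img, angle, reshape=False, mode="nearest")
                     for img in images])

def add_gaussian_noise(images, std):
    """Domain shift: corrupt images with additive Gaussian noise."""
    noisy = images + rng.normal(0.0, std, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def swap_labels(labels, class_a, class_b, proportion):
    """Behaviour change: exchange the labels of two classes for a
    fixed proportion of the affected samples."""
    labels = labels.copy()
    idx = np.flatnonzero((labels == class_a) | (labels == class_b))
    chosen = rng.choice(idx, size=int(proportion * idx.size), replace=False)
    labels[chosen] = np.where(labels[chosen] == class_a, class_b, class_a)
    return labels

# Toy stand-in for MNIST: 100 grey-scale 28x28 images with values in [0, 1].
images = rng.random((100, 28, 28))
labels = rng.integers(0, 10, size=100)

# Three hypothetical clients, each exhibiting a different kind of heterogeneity.
client_rotated = (rotate_images(images, angle=30), labels)
client_noisy = (add_gaussian_noise(images, std=0.2), labels)
client_swapped = (images, swap_labels(labels, class_a=1, class_b=7, proportion=0.5))
```

Each work applies transformations of this sort with its own angles, noise levels or swap proportions, which is precisely why the resulting experimental settings are hard to compare across papers.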