Definition 1. Data is said to be IID if the probability of drawing a
data sample does not change as other samples are drawn, and every
randomly selected sample can belong to any local dataset with the
same probability. In mathematical terms, if we consider the union of
the datasets $D = \bigcup_{i=1}^{N} D_i \subset X \times Y$, we say that
$D$ is an IID dataset if, and only if,
$$
\begin{cases}
P_i\big((x, y), (x', y')\big) = P_i(x, y) \cdot P_i(x', y'), \\
D_i(x, y) = D_j(x, y),
\end{cases}
\tag{2}
$$
for all $i, j \in \{1, \dots, N\}$ and for all
$(x, y), (x', y') \in D_i \cup D_j$.
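As an illustration of the second condition in Eq. (2), the following minimal sketch compares the empirical distributions of several clients through their total variation distance. It assumes finite, discrete samples and only probes whether clients are identically distributed; it says nothing about independence. The function names and the toy dataset are hypothetical and are not taken from any referenced method.

```python
import random
from collections import Counter
from itertools import combinations


def empirical_distribution(samples):
    """Return the empirical probability of each distinct sample."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def max_pairwise_tv(client_datasets):
    """Largest total variation distance between any two clients."""
    dists = [empirical_distribution(d) for d in client_datasets]
    return max(total_variation(p, q) for p, q in combinations(dists, 2))


def iid_partition(dataset, num_clients, seed=0):
    """Shuffle the pooled data and deal it out evenly (an IID-style split)."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_clients] for i in range(num_clients)]


def label_sorted_partition(dataset, num_clients):
    """Sort by label before splitting, so each client sees few labels.

    Any remainder left by the integer division is dropped in this sketch.
    """
    ordered = sorted(dataset, key=lambda sample: sample[1])
    size = len(ordered) // num_clients
    return [ordered[i * size:(i + 1) * size] for i in range(num_clients)]


# Toy pooled dataset of (x, y) pairs: x is a constant placeholder feature
# and y is one of 4 labels, with 250 samples per label.
pooled = [(0, label) for label in range(4) for _ in range(250)]

iid_clients = iid_partition(pooled, num_clients=4)
skewed_clients = label_sorted_partition(pooled, num_clients=4)

print("IID-style split, max TV:", max_pairwise_tv(iid_clients))
print("Label-sorted split, max TV:", max_pairwise_tv(skewed_clients))
```

In the label-sorted split each client holds a single label, so the clients have disjoint supports and the distance reaches its maximum of 1. The shuffled split gives a small but generally non-zero distance, because finite samples only approximate the shared distribution.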
This definition has one important consequence: under IID data, every
client distribution equals the global distribution, because the
weights $M_j / M$ sum to one:
$$
D_G(x, y) = \sum_{j=1}^{N} \frac{M_j}{M} D_j(x, y)
          = D_i(x, y) \sum_{j=1}^{N} \frac{M_j}{M}
          = D_i(x, y) \quad \forall i \in \{1, \dots, N\}.
$$
Conversely, we say data is non-IID if either of the two conditions given
in Eq. (2) is not satisfied. However, this statement is far less
informative than its IID counterpart, since it tells us neither which
condition fails nor where the differences between the distributions
lie.
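The consequence above can be checked numerically. The sketch below builds the global distribution as the size-weighted mixture of the client distributions and compares it with each client. It assumes discrete distributions stored as dictionaries and uses exact fractions to avoid rounding; the helper name `global_distribution` and the toy values are hypothetical.

```python
from fractions import Fraction


def global_distribution(client_dists, client_sizes):
    """Weighted mixture D_G = sum_i (M_i / M) * D_i, with M = sum_i M_i."""
    total = sum(client_sizes)
    mixture = {}
    for dist, size in zip(client_dists, client_sizes):
        weight = Fraction(size, total)
        for outcome, prob in dist.items():
            mixture[outcome] = mixture.get(outcome, 0) + weight * prob
    return mixture


half = Fraction(1, 2)

# IID case: three clients of different sizes share one distribution.
shared = {("a", 0): half, ("b", 1): half}
iid_global = global_distribution([shared, shared, shared], [10, 30, 60])

# Non-IID case: two clients concentrate on different outcomes.
client_1 = {("a", 0): Fraction(1)}
client_2 = {("b", 1): Fraction(1)}
non_iid_global = global_distribution([client_1, client_2], [1, 3])

print(iid_global == shared)        # the mixture matches every client
print(non_iid_global == client_1)  # the mixture matches neither client
```

When the clients differ, the mixture depends on the weights $M_i / M$ and in general coincides with none of the local distributions.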
Notice that a local dataset is always IID if the samples it holds are
independent. In other words, when the data is centralized or gathered
in one place, it is always identically distributed, since only one
device is involved in training. The term non-IID in machine learning
therefore presupposes the existence of several participants, or sets
of data, and it is mostly used in the decentralized paradigm.
3.1. Taxonomy of data heterogeneity

Data can be non-IID for many reasons. For instance, there may exist
a partition of the clients such that each group presents an IID dataset,
but the mixture of them turns out to be non-IID. In fact, this is the
reasoning behind the Group-level personalization techniques developed
in FL (Section 3.2.2). Another possibility is that the participants'
local datasets differ slightly in some properties while sharing
others. This situation could be handled with Client-level
personalization strategies (Section 3.2.1). On the whole, the reason why
the data is non-IID is an important piece of information for deciding
the most convenient approach to deal with it. Hence, we want to dig deeper into the
possible causes of discrepancies in the joint probabilities. These causes
can stem from multiple elements and lead to unequal distributions among
the clients [31, 39]. To characterize the possible causes of non-IID data,
it is more useful to think in terms of the probability density
functions, $P(x, y)$ and $P_i(x, y)$, rather than distributions, since
they can be factorized in two different ways:
$$
P(x, y) = P(x) \cdot P(y \mid x), \tag{3}
$$
$$
P(x, y) = P(y) \cdot P(x \mid y). \tag{4}
$$
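The two factorizations in Eqs. (3) and (4) can be illustrated on a small discrete joint table. The sketch below is a minimal example under stated assumptions: the joint probabilities are invented for illustration, every value of $x$ and $y$ has a positive marginal, and the helpers `marginal` and `conditional` are hypothetical names. It rebuilds the joint from each factorization and then compares two hypothetical clients that differ in only one factor.

```python
from fractions import Fraction as F


def marginal(joint, axis):
    """Marginal of x (axis=0) or of y (axis=1) from a table {(x, y): p}."""
    out = {}
    for pair, prob in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + prob
    return out


def conditional(joint, given_axis):
    """Conditional of the other variable given the one on `given_axis`.

    The result is keyed by the full (x, y) pair.
    """
    given = marginal(joint, given_axis)
    return {pair: prob / given[pair[given_axis]] for pair, prob in joint.items()}


# Hypothetical joint distribution of client A over (x, y).
joint_a = {("cat", 0): F(3, 8), ("cat", 1): F(1, 8),
           ("dog", 0): F(1, 8), ("dog", 1): F(3, 8)}

# Eq. (3): P(x, y) = P(x) * P(y | x)
p_x, p_y_given_x = marginal(joint_a, 0), conditional(joint_a, 0)
rebuilt_3 = {(x, y): p_x[x] * p_y_given_x[(x, y)] for (x, y) in joint_a}

# Eq. (4): P(x, y) = P(y) * P(x | y)
p_y, p_x_given_y = marginal(joint_a, 1), conditional(joint_a, 1)
rebuilt_4 = {(x, y): p_y[y] * p_x_given_y[(x, y)] for (x, y) in joint_a}

print(rebuilt_3 == joint_a, rebuilt_4 == joint_a)

# Hypothetical client B: same P(y | x) as client A, but a different P(x).
joint_b = {("cat", 0): F(3, 16), ("cat", 1): F(1, 16),
           ("dog", 0): F(3, 16), ("dog", 1): F(9, 16)}

print(conditional(joint_b, 0) == p_y_given_x)  # shared factor
print(marginal(joint_b, 0) == p_x)             # differing factor
```

Both factorizations recover the same joint table. For the two clients, the difference shows up in $P(x)$ while $P(y \mid x)$ is shared.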
Given these factorizations, we can better distinguish which term
represents the clients' particularities. If we are dealing with clients who