3. Non-IID data in Federated Learning

Much research has addressed the issue of dealing with non-IID data, especially in the context of Federated Learning, where it acquires great importance. In this paper, we use the term 'heterogeneous data' as a synonym for non-IID data. Existing works focus both on developing new techniques to tackle data heterogeneity [5,36] and on proving the convergence of traditional FedAvg trained with non-IID data under some restrictive assumptions [37,38]. However, in most cases, little is specified about the source of the data heterogeneity, if anything at all.
First, we give a formal definition of Independent and Identically Distributed (IID) data. For that, we need to specify the data probability distributions of the different data owners. Recall that we denote
a client by 𝑖 ∈ 𝑁, with 𝑁 = {1, … , 𝑛} being the set of all clients. We also
denote each data sample as 𝑑 = (𝑥, 𝑦) ∈ 𝑋 × 𝑌 , where 𝑥 is the feature
vector of the sample, and 𝑦 is its corresponding label, in supervised
settings. Each client collects its own data 𝐷 𝑖 , and therefore it has a
data probability distribution 𝐷 𝑖 (𝑥, 𝑦)
, where each data sample (𝑥, 𝑦) has
probability 𝑃 𝑖 (𝑥, 𝑦)
. The overall data distribution is a weighted mean of
these local data distributions:
𝐷 𝐺 (𝑥, 𝑦) =
𝑁 ∑
𝑖 =1
𝑀 𝑖 𝑀 ⋅ 𝐷 𝑖 (𝑥, 𝑦)
(1)
where 𝑀_𝑖 is the amount of data collected by the 𝑖-th device, and 𝑀 = ∑_{𝑖=1}^{𝑛} 𝑀_𝑖 is the total number of data samples. Recall that, at this point, we are working under standard FL assumptions, i.e., the local
datasets are fully available from the beginning, so we do not have to
deal with shifts in the distributions over time. For that reason, we do
not use any temporal index.
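Eq. (1) can be illustrated with a small sketch that computes the global label distribution as the 𝑀_𝑖∕𝑀-weighted mean of the clients' empirical label distributions. The client datasets below are hypothetical synthetic examples chosen for illustration, not data from this work:

```python
import numpy as np

# Hypothetical label arrays for n = 3 clients (non-IID: skewed labels).
num_classes = 3
client_labels = [
    np.array([0, 0, 0, 1]),     # client 1: mostly class 0
    np.array([1, 1, 2]),        # client 2: classes 1 and 2
    np.array([2, 2, 2, 2, 0]),  # client 3: mostly class 2
]

# M_i: number of samples per client; M: total number of samples.
M_i = np.array([len(y) for y in client_labels])
M = M_i.sum()

# P_i(y): each client's empirical label distribution (rows sum to 1).
P_i = np.stack([np.bincount(y, minlength=num_classes) / len(y)
                for y in client_labels])

# Eq. (1): the global distribution is the weighted mean of the local ones.
P_G = (M_i / M) @ P_i

print(P_G)        # -> [0.33333333 0.25       0.41666667]
print(P_G.sum())  # -> 1.0 (a valid probability distribution)
```

Note that the weighted mean coincides with the empirical distribution of the pooled dataset, which is exactly why 𝐷_𝐺 serves as the reference distribution against which local heterogeneity is measured.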