the parents’ smartphone, hence informing them what their baby
might want to tell them.
III. SIGNAL PROCESSING TASK
In this section, we describe in detail the algorithms used to analyze the infant's cries. As shown in Fig. 4, our method consists of five main steps.
A. Signal Collecting
The first step is collecting the signal. We use a microphone to record the sound of the surroundings. This step includes not only collecting the raw data but also normalizing it to reduce the differences between crying segments. We record sound in clips of ten seconds and normalize the collected signal by converting it to WAV format with 16-bit resolution and a sampling rate of 8 kHz.
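The normalization could be sketched as follows. This is a minimal NumPy sketch under the assumption of simple peak normalization into the 16-bit range; the paper does not specify the exact normalization scheme, and the function name `normalize_clip` is ours:

```python
import numpy as np

def normalize_clip(signal: np.ndarray) -> np.ndarray:
    """Peak-normalize a raw float signal to the 16-bit integer range.

    This mirrors the normalization idea: every 10-second recording is
    rescaled so that recordings of different loudness become comparable.
    """
    peak = np.max(np.abs(signal))
    if peak == 0:
        return np.zeros_like(signal, dtype=np.int16)
    scaled = signal / peak * 32767          # map the peak to int16 full scale
    return scaled.astype(np.int16)

# Example: a 10-second clip at the paper's 8 kHz sampling rate
sr = 8000
t = np.arange(10 * sr) / sr
clip = 0.3 * np.sin(2 * np.pi * 440 * t)    # quiet 440 Hz tone
out = normalize_clip(clip)
print(out.dtype, out.max())                  # int16 32767
```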
B. Signal Preprocessing
In the signal pre-processing step, we remove unwanted noise and silent fragments from the initial signal. It comprises the following steps:
1) Framing: In accordance with best practice, we split the signal into smaller segments, referred to as frames, and set the length of each frame to 256 sample points. In addition, we use an overlap of 50% between neighboring frames (i.e., 128 sample points) to avoid negative edge effects caused by splitting the signal.
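The framing step with 50% overlap can be sketched as follows (a minimal NumPy sketch; the function name `frame_signal` is ours):

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Split a 1-D signal into overlapping frames.

    With frame_len=256 and hop=128, neighboring frames overlap by 50%
    (128 sample points), as described in the framing step.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Build a (n_frames, frame_len) index matrix, then gather the samples.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

x = np.arange(1024, dtype=float)
frames = frame_signal(x)
print(frames.shape)   # (7, 256)
```

The second half of each frame is identical to the first half of the next one, which is exactly the 50% overlap.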
2) End-Point Detection: To detect voiced segments, we also detect end-points, i.e., we remove silent pieces. Various methods can be employed to detect end-points, such as double-threshold detection based on short-time energy and the short-time zero-crossing rate, or detection based on cepstrum features [10]. Here, we choose a single-threshold detection method based on intensity, as it obtained better results than the double-threshold method. The intensity of the n-th frame is calculated as follows:
$$ \mathit{Intensity}_n = \sum_{i=1}^{N} |S_n(i)| \qquad (1) $$

where $N$ is the number of sample points per frame and $S_n(i)$ is the value of the i-th sample point.
Then we set the threshold as follows:

$$ \mathit{Threshold} = 0.1 \times \max_{1 \le i \le N} I_i \qquad (2) $$

where $N$ is the number of frames and $I_i$ is the intensity of the i-th frame.
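A minimal sketch of Eqs. (1)-(2), assuming the frames are the rows of a NumPy array (the function name `voiced_mask` is ours):

```python
import numpy as np

def voiced_mask(frames: np.ndarray) -> np.ndarray:
    """Single-threshold end-point detection following Eqs. (1)-(2).

    The intensity of each frame is the sum of absolute sample values;
    a frame is kept as voiced when its intensity exceeds 10% of the
    maximum frame intensity in the recording.
    """
    intensity = np.sum(np.abs(frames), axis=1)   # Eq. (1), one value per frame
    threshold = 0.1 * intensity.max()            # Eq. (2)
    return intensity > threshold

# Example: three silent frames and one loud frame
frames = np.zeros((4, 256))
frames[2] = 1.0
print(voiced_mask(frames))   # [False False  True False]
```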
3) Detecting frames containing crying: The next step is to detect the crying signals within the voiced signal. Here we use double-threshold detection based on short-time energy and the short-time zero-crossing rate, as suggested in [10]. We first determine the short-time zero-crossing rate and the short-time energy of the n-th frame using the following equations:
$$ \mathit{ZCR}_n = \frac{1}{2} \sum_{i=1}^{N} \left| \mathrm{sgn}\bigl(S_n(i)\bigr) - \mathrm{sgn}\bigl(S_n(i-1)\bigr) \right| \qquad (3) $$

$$ E_n = \sum_{i=1}^{N} S_n(i)^2 \qquad (4) $$

where $N$ is the number of sample points per frame and $S_n(i)$ is the value of the i-th sample point.
We then use these two equations to compute both values for every frame and set the thresholds accordingly. Frames are assigned one of three states: silent, voiced, and uncertain. We examine each frame sequentially to obtain the voiced frames and finally concatenate these frames to obtain the voiced signal.
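Eqs. (3)-(4) can be sketched directly in NumPy (function names are ours; the threshold logic itself is omitted here):

```python
import numpy as np

def zcr(frame: np.ndarray) -> float:
    """Short-time zero-crossing rate of one frame, Eq. (3)."""
    s = np.sign(frame)
    return 0.5 * float(np.sum(np.abs(s[1:] - s[:-1])))

def energy(frame: np.ndarray) -> float:
    """Short-time energy of one frame, Eq. (4)."""
    return float(np.sum(frame ** 2))

# A quiet noise frame crosses zero often; a loud tone carries energy.
rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(256)
tone = np.sin(2 * np.pi * 4 * np.arange(256) / 256)
print(zcr(noise) > zcr(tone))        # True: noise crosses zero more often
print(energy(tone) > energy(noise))  # True: the tone carries far more energy
```

The double-threshold scheme combines exactly these two quantities: low-energy, high-ZCR frames are treated as noise, while frames exceeding both thresholds are kept as crying.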
C. Feature Extraction
As long series of raw signal values alone do not allow us to judge and classify cries, the next step is to extract specific characteristics on which our classifier can be trained. Various methods have been introduced to extract features [11]. In our algorithm, we extract four time domain features and six frequency domain features.
Consider the following definitions:
• $S_n(i)$: the signal of the n-th frame in the time domain.
• $S'_n(i)$: the signal of the n-th frame in the frequency domain after framing, Hamming windowing, and FFT.
Figure 4. Algorithm workflow: Signal Collecting → Signal Preprocessing → Feature Extraction → Feature Selection → Crying Classification.
• T: the features extracted in the time domain.
• F: the features extracted in the frequency domain.
• N: the number of sample points in a frame.
Reference [11] supplies the details of each feature. Due to space limitations, the equations for the features are only summarized as follows:
1) Magnitude

$$ \mathit{TM}_n = \sum_{i=1}^{N} |S_n(i)| \qquad (5) $$
2) Average

$$ \mathit{TA}_n = \frac{1}{N} \sum_{i=1}^{N} S_n(i) \qquad (6) $$
3) Root mean square

$$ \mathit{TRMS}_n = \sqrt{\frac{1}{N} \sum_{i=1}^{N} S_n(i)^2} \qquad (7) $$
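The three time-domain features of Eqs. (5)-(7) can be sketched for a single frame as follows (the function name `time_features` is ours):

```python
import numpy as np

def time_features(frame: np.ndarray):
    """Magnitude (Eq. 5), average (Eq. 6), and RMS (Eq. 7) of one frame."""
    N = len(frame)
    magnitude = float(np.sum(np.abs(frame)))
    average = float(np.sum(frame)) / N
    rms = float(np.sqrt(np.sum(frame ** 2) / N))
    return magnitude, average, rms

frame = np.array([1.0, -1.0, 2.0, -2.0])
print(time_features(frame))   # (6.0, 0.0, ~1.581)
```

Note how the average of a symmetric frame is zero while the magnitude and RMS still capture its loudness, which is why all three are kept as separate features.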
4) Spectral centroid

$$ \mathit{FC}_n = \frac{\sum_{i=1}^{N} |S'_n(i)|^2 \times i}{\sum_{i=1}^{N} |S'_n(i)|^2} \qquad (8) $$
5) Spectral bandwidth

$$ \mathit{FB}_n = \sqrt{\frac{\sum_{i=1}^{N} |S'_n(i)|^2 \times (i - \mathit{FC}_n)^2}{\sum_{i=1}^{N} |S'_n(i)|^2}} \qquad (9) $$
6) Spectral roll-off

$$ \sum_{i=1}^{\mathit{FR}_n} |S'_n(i)|^2 = 0.85 \times \sum_{i=1}^{N} |S'_n(i)|^2 \qquad (10) $$

where $\mathit{FR}_n$ is the roll-off bin of the n-th frame.
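Eqs. (8)-(10) can be sketched for one frame's magnitude spectrum as follows. This is a minimal sketch; `spectral_features` is our name, bins are indexed from 1 as in the equations, and the roll-off is found as the first bin where the cumulative power reaches 85% of the total:

```python
import numpy as np

def spectral_features(spectrum: np.ndarray, roll_percent: float = 0.85):
    """Spectral centroid (Eq. 8), bandwidth (Eq. 9), and roll-off (Eq. 10).

    `spectrum` holds |S'_n(i)| for one frame.
    """
    power = spectrum ** 2
    i = np.arange(1, len(spectrum) + 1)          # 1-based bin indices
    centroid = np.sum(power * i) / np.sum(power)
    bandwidth = np.sqrt(np.sum(power * (i - centroid) ** 2) / np.sum(power))
    # Roll-off: smallest bin whose cumulative power reaches 85% of the total.
    cumulative = np.cumsum(power)
    rolloff = int(i[np.searchsorted(cumulative, roll_percent * cumulative[-1])])
    return centroid, bandwidth, rolloff

spec = np.array([0.0, 1.0, 0.0, 0.0])   # all energy concentrated in bin 2
c, b, r = spectral_features(spec)
print(c, b, r)   # 2.0 0.0 2
```

With all energy in a single bin, the centroid sits on that bin, the bandwidth collapses to zero, and the roll-off bin equals the centroid, which is a quick sanity check for an implementation.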
7) Valley

$$ \mathit{FValley}_{n,k} = \log\left\{ \frac{1}{\alpha N} \sum_{i=1}^{\alpha N} S'_{n,k}(N - i) \right\} \qquad (11) $$
8) Peak

$$ \mathit{FPeak}_{n,k} = \log\left\{ \frac{1}{\alpha N} \sum_{i=1}^{\alpha N} S'_{n,k}(i) \right\} \qquad (12) $$
where k is the number of sub-bands and $\alpha$ is a constant. We set k and $\alpha$ to 7 and 0.2, respectively.
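Eqs. (11)-(12) can be sketched as follows. We assume, as is common for these features, that the sub-band magnitudes are sorted in descending order so that the peak averages the largest $\alpha N$ values and the valley the smallest $\alpha N$ values; this sorting convention and the function name are our assumptions:

```python
import numpy as np

def peak_valley(subband: np.ndarray, alpha: float = 0.2):
    """Spectral peak and valley of one sub-band, sketching Eqs. (11)-(12).

    Assumes the sub-band magnitudes are sorted descending, so the peak is
    the log-average of the top alpha*N values and the valley the
    log-average of the bottom alpha*N values.
    """
    N = len(subband)
    k = max(1, int(alpha * N))
    s = np.sort(subband)[::-1]            # descending order
    peak = float(np.log(np.mean(s[:k])))  # largest alpha*N values
    valley = float(np.log(np.mean(s[-k:])))  # smallest alpha*N values
    return peak, valley

band = np.array([10.0, 1.0, 1.0, 1.0, 1.0])
p, v = peak_valley(band)
print(p >= v)   # True: the peak can never fall below the valley
```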
9) MFCC

MFCC is an abbreviation for Mel-Frequency Cepstral Coefficients. In the first step, we obtain $S'_n$ by framing, windowing, and FFT. The next step is to filter $S'_n$ with the Mel-filter bank. Then, we apply the Discrete Cosine Transform (DCT) and extract dynamic difference parameters. We extract MFCC1-MFCC12 [10].
D. Feature Selection
The sequential forward floating search (SFFS) algorithm is often used to select an optimal set of feature vectors. Starting with an empty set, in each round a subset x of the unselected features is chosen such that the evaluation function is optimized after adding x; then a subset z of the selected features is chosen such that the evaluation function is optimal after removing z [11]. We use support vector machines for classification and k-fold cross-validation to calculate the classification accuracy that serves as the evaluation function. Finally, we use the SFFS algorithm to obtain the feature sets. The details of SFFS are given in the pseudocode listing.
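A sketch of the standard SFFS loop (adding/removing one feature at a time, a common special case of the subset formulation above). Here `evaluate` stands in for the cross-validated SVM accuracy; the toy scoring function below is purely illustrative:

```python
def sffs(features, evaluate, max_size=None):
    """Sequential forward floating search (sketch).

    `features` are the candidate features; `evaluate(subset)` returns the
    score of a classifier trained on that subset (higher is better).
    Each round adds the single best feature, then keeps removing
    features for as long as removal improves the score (the "floating"
    backward step).
    """
    selected, remaining = [], list(features)
    best_score = evaluate(selected)
    while remaining and (max_size is None or len(selected) < max_size):
        # Forward step: add the feature that improves the score most.
        score, f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
        # Floating step: drop any feature whose removal improves the score.
        improved = True
        while improved and len(selected) > 1:
            improved = False
            for g in list(selected):
                trial = [x for x in selected if x != g]
                if evaluate(trial) > best_score:
                    selected = trial
                    remaining.append(g)
                    best_score = evaluate(trial)
                    improved = True
                    break
    return selected, best_score

# Toy evaluation: features 'a' and 'b' are useful, 'c' contributes nothing.
useful = {'a': 0.3, 'b': 0.2, 'c': 0.0}
sel, score = sffs(['a', 'b', 'c'], lambda s: sum(useful[f] for f in s))
print(sel, score)   # ['a', 'b'] selected; 'c' never improves the score
```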
E. Crying Classification
Now that we have a highly abstract feature vector for the crying signal, we use it to train an SVM classifier. SVM is a supervised learning model from the field of machine learning [13]. Its principle can be described starting from linearly separable problems, then extended to linearly inseparable cases, and further to non-linear problems, with a deep theoretical background.
We use Python's existing SVC (Support Vector Classifier) implementation for training. The training-set labels are hunger, sleepiness, pain, and non-crying. SVC uses the "one-versus-one" method to achieve multi-class classification. This algorithm adopts the
Input: F is the set of all unselected features;
result := {∅};
E() is the evaluation function;
done := false;
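The classifier of Section E can be sketched as follows. We assume the "Python's existing SVC" mentioned in the text refers to scikit-learn's `SVC`; the feature vectors below are synthetic stand-ins for the extracted features, not real cry data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in data: rows are per-clip feature vectors such as
# those produced by the pipeline above; labels follow the paper's classes.
rng = np.random.default_rng(42)
labels = ["hunger", "sleepiness", "pain", "non-crying"]
X = np.vstack([rng.normal(loc=3 * k, scale=0.5, size=(20, 10))
               for k in range(4)])
y = np.repeat(labels, 20)

# SVC handles multi-class problems with the one-versus-one strategy:
# one binary SVM per pair of classes, 4*3/2 = 6 classifiers here.
clf = SVC(kernel="rbf", decision_function_shape="ovo")
clf.fit(X, y)
print(clf.decision_function(X[:1]).shape)   # (1, 6): one value per class pair
print(clf.score(X, y))                      # training accuracy on the toy data
```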