1) Framing: Following the common practice of splitting longer signals into shorter segments, referred to as frames, we set the length of each frame to 256 sample points. In addition, we use an overlap of 50% between neighboring frames (i.e. 128 sample po...
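As an illustration of the framing step, the following is a minimal sketch in Python with NumPy, using the frame length of 256 samples and the 50% overlap (a hop of 128 samples) stated above. The function name `frame_signal` and the choice to drop trailing samples that do not fill a complete frame are assumptions made for this sketch, not details taken from the method itself.

```python
import numpy as np

FRAME_LEN = 256  # frame length in samples, as stated in the text
HOP_LEN = 128    # 50% overlap between neighboring frames


def frame_signal(signal, frame_len=FRAME_LEN, hop_len=HOP_LEN):
    """Split a 1-D signal into overlapping frames, one frame per row.

    Assumption of this sketch: trailing samples that do not fill a
    whole frame are dropped rather than zero-padded.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i of the index matrix selects samples
    # i*hop_len ... i*hop_len + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]


# Toy input: 1024 samples whose values equal their own index.
frames = frame_signal(np.arange(1024))
print(frames.shape)
```

With a hop of 128 samples, the second half of each frame is identical to the first half of the next one, which is what a 50% overlap means in practice.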
2) End-Point Detection: To isolate the voiced segments, we detect end-points, i.e., we remove the silent portions of the signal. Various methods can be employed to detect end-points, such as double-threshold detection based on short-time energy an...
3) Detecting frames containing crying: The next step is to detect the crying signals within the voiced signal. Here we use double-threshold detection based on short-time energy and short-time zero-crossing rates, as suggested in [10]. We first d...
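To make the two features concrete, the sketch below computes the short-time energy and the short-time zero-crossing rate per frame and then applies a generic double-threshold rule. It does not reproduce the exact procedure of [10]: the function names, the seed-and-extend rule, and the threshold values `e_high`, `e_low`, and `z_thr` are all hypothetical choices made for illustration.

```python
import numpy as np


def short_time_energy(frames):
    """Sum of squared samples in each frame (frames: one frame per row)."""
    return np.sum(frames ** 2, axis=1)


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def double_threshold(energy, zcr, e_high, e_low, z_thr):
    """Mark active frames with a generic double-threshold rule.

    Assumed rule (not taken from [10]): frames whose energy exceeds
    e_high seed a segment; the segment is extended in both directions
    while the energy stays above e_low or the ZCR stays above z_thr.
    """
    n = len(energy)
    active = np.zeros(n, dtype=bool)
    for seed in np.flatnonzero(energy > e_high):
        if active[seed]:
            continue  # already covered by a segment grown from an earlier seed
        lo = seed
        while lo > 0 and (energy[lo - 1] > e_low or zcr[lo - 1] > z_thr):
            lo -= 1
        hi = seed
        while hi < n - 1 and (energy[hi + 1] > e_low or zcr[hi + 1] > z_thr):
            hi += 1
        active[lo:hi + 1] = True
    return active


# Toy frames: one alternating-sign frame and one constant frame.
toy_frames = np.array([[1.0, -1.0, 1.0, -1.0],
                       [1.0, 1.0, 1.0, 1.0]])
toy_energy = short_time_energy(toy_frames)
toy_zcr = zero_crossing_rate(toy_frames)

# Toy per-frame features with hypothetical thresholds.
energy = np.array([0.1, 0.6, 2.0, 0.7, 0.1, 0.1])
zcr = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.0])
mask = double_threshold(energy, zcr, e_high=1.0, e_low=0.5, z_thr=0.3)
print(mask)
```

In the toy example, only the third frame exceeds the high energy threshold. The segment then grows to its neighbors through the low energy threshold and, for the fifth frame, through the zero-crossing threshold. In practice the thresholds would have to be tuned on the recordings at hand.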