Data preprocessing. To train and test the neural network, the video files had to be converted into the required data format. The dataset consisted of video split into individual frames. On each frame, OpenPose detects the characteristic points of the skeleton and outputs 25 key points as X and Y coordinates, together with a confidence score for each detected point. For all video data, the plausibility of the skeleton key points detected by OpenPose was assessed. This assessment combined data filtering, heuristic algorithms, and a check of how natural the actions and the transitions between them appear in the video. Filtering was carried out using two methods. The first excluded all poorly detected points, that is, points with an OpenPose confidence of zero. The second applied varying confidence thresholds to the remaining points. It was used together with interpolation over the points that remained, with an assessment of the confidence with which they were obtained. These operations were carried out for each dimension. [8]
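The two filtering steps and the interpolation can be illustrated with a minimal sketch for a single coordinate of a single key point across frames. The function name filter_and_interpolate, the threshold value of 0.3, the use of linear interpolation, and the handling of frames at the edges of the sequence are assumptions made for illustration; the text does not specify them.

```python
from typing import List, Sequence


def filter_and_interpolate(
    values: Sequence[float],
    confidences: Sequence[float],
    threshold: float = 0.3,  # assumed value; the text only says thresholds varied
) -> List[float]:
    """Drop unreliable samples of one coordinate track and fill the gaps.

    A sample is treated as unreliable if its confidence is zero (first
    filter) or below the threshold (second filter). Gaps are filled by
    linear interpolation between the nearest reliable neighbours
    (an assumption; the interpolation scheme is not stated in the text).
    """
    if len(values) != len(confidences):
        raise ValueError("values and confidences must have the same length")

    reliable = [
        i for i, c in enumerate(confidences) if c > 0.0 and c >= threshold
    ]
    if not reliable:
        raise ValueError("no reliable points to interpolate from")

    result = [float(v) for v in values]

    # Frames before the first or after the last reliable sample have only
    # one neighbour, so the nearest reliable value is held (assumption).
    first, last = reliable[0], reliable[-1]
    for i in range(first):
        result[i] = float(values[first])
    for i in range(last + 1, len(values)):
        result[i] = float(values[last])

    # Linear interpolation inside each gap between reliable samples.
    for left, right in zip(reliable, reliable[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            result[i] = values[left] + t * (values[right] - values[left])

    return result
```

In keeping with "for each dimension", such a function would be called once for the X track and once for the Y track of every key point.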
Heuristic algorithms checked the skeletal models for plausibility. Three methods were used. The first assessed the proportional similarity of the detected body parts. For example, the detected upper arm and forearm should not differ in length by more than 20%, since even perspective distortion cannot significantly change the lengths of the arm segments. This check was applied both between different body parts and between symmetric (left and right) body parts. The second method checked the mutual consistency of certain points in the frame: the two shoulder joints; the two hip joints; and the shoulder joints, hip joints, chest point and pelvis point. The third method assessed the smoothness of human movement by checking the speed of the detected key points. In this way, outliers and noisy points were excluded before the subsequent interpolation. Finally, the dataset was reviewed manually, and only the fastest and most natural transitions between actions were kept.
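The first and third heuristics can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the 20% difference is measured relative to the longer segment, the speed limit max_step (in pixels per frame) is a free parameter, and the first frame of a track is assumed to be reliable.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def segment_length(a: Point, b: Point) -> float:
    """Euclidean length of the segment between two key points."""
    return math.dist(a, b)


def lengths_are_similar(
    len_a: float, len_b: float, tolerance: float = 0.20
) -> bool:
    """Check that two segment lengths differ by at most `tolerance`.

    The difference is taken relative to the longer segment (assumption).
    """
    longer = max(len_a, len_b)
    if longer == 0.0:
        return False  # a degenerate segment is not plausible
    return abs(len_a - len_b) / longer <= tolerance


def speed_outliers(track: Sequence[Point], max_step: float) -> List[int]:
    """Return indices of frames where a key point moved implausibly fast.

    Each frame is compared with the last accepted frame, and the allowed
    displacement grows with the number of frames elapsed since then, so a
    single bad frame does not also flag the good frame that follows it.
    Frame 0 is assumed to be reliable.
    """
    outliers: List[int] = []
    last_good = 0
    for i in range(1, len(track)):
        allowed = max_step * (i - last_good)
        if math.dist(track[last_good], track[i]) > allowed:
            outliers.append(i)
        else:
            last_good = i
    return outliers
```

The same lengths_are_similar check can serve for the symmetry test, for example by comparing the left and right forearm lengths. Frames returned by speed_outliers would then be treated as missing and filled by interpolation.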
Construction of a movement model. A multi-stage convolutional neural network (CNN) is used. Such a convolutional model makes predictions from a history of fixed duration, which can give better results than simpler models because it can take changes over time into account. In multi-stage (multi-step) forecasting, the model must learn to predict a range of future values. Thus, unlike a single-stage model, which predicts only one future point, a multi-stage model predicts a sequence of future values.
The model makes a set of predictions based on a time window of data. The main features of the input windows are:
1. Width: the number of time steps in the input and label windows;
2. The time shift between them.
A Window Generator program was created. It handles the indexes and offsets, splits windows of features into (features, labels) pairs, arranges the contents of the resulting windows, and generates batches of these windows from the training, evaluation and test data. The initialization method contains all the logic needed to handle the input and label indexes. Frames from the training (70%), evaluation (20%) and test (10%) samples are accepted as input data. Given a list of consecutive inputs, the split_window method converts them into an input window and a label window.
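The windowing logic described above can be sketched in plain Python. Only the name split_window comes from the text; the class layout, the parameter names input_width, label_width and shift, the helpers make_windows and split_frames, and the chronological order of the 70/20/10 split are assumptions made for illustration. Batching and any framework-specific dataset handling are omitted.

```python
from typing import List, Sequence, Tuple


class WindowGenerator:
    """Cuts a frame sequence into (inputs, labels) window pairs."""

    def __init__(self, input_width: int, label_width: int, shift: int):
        if min(input_width, label_width, shift) < 1:
            raise ValueError("widths and shift must be positive")
        if label_width > input_width + shift:
            raise ValueError("label window does not fit in the total window")
        self.input_width = input_width
        self.label_width = label_width
        self.shift = shift
        # A full window covers the inputs plus the shift to the last label.
        self.total_window_size = input_width + shift
        # Labels occupy the last `label_width` steps of the full window.
        self.label_start = self.total_window_size - label_width

    def split_window(self, window: Sequence) -> Tuple[list, list]:
        """Convert one full window into an input window and a label window."""
        if len(window) != self.total_window_size:
            raise ValueError("window has the wrong length")
        inputs = list(window[: self.input_width])
        labels = list(window[self.label_start:])
        return inputs, labels

    def make_windows(self, frames: Sequence) -> List[Tuple[list, list]]:
        """Slide over the frames one step at a time and split every window."""
        last_start = len(frames) - self.total_window_size
        return [
            self.split_window(frames[s: s + self.total_window_size])
            for s in range(last_start + 1)
        ]


def split_frames(frames: Sequence) -> Tuple[list, list, list]:
    """Split frames 70/20/10 into training, evaluation and test samples.

    The split is chronological (assumption) and uses integer arithmetic
    to avoid floating-point rounding at the boundaries.
    """
    n = len(frames)
    train_end = n * 7 // 10
    val_end = n * 9 // 10
    return (
        list(frames[:train_end]),
        list(frames[train_end:val_end]),
        list(frames[val_end:]),
    )
```

For example, with input_width=3, label_width=2 and shift=2, each full window spans five frames: the first three are the inputs and the last two are the labels that the multi-stage model is trained to predict.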