CN111798486B - Multi-view human motion capture method based on human motion prediction

Info

Publication number: CN111798486B
Application number: CN202010546274.6A
Authority: CN
Inventors: 周晓巍; 鲍虎军; 方琦; 帅青
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2020-06-16
Filing date: 2020-06-16
Publication date: 2022-05-17
Anticipated expiration: 2040-06-16
Also published as: CN111798486A

Abstract

The invention discloses a multi-view human motion capture method based on human motion prediction. Three-dimensional human reconstruction is performed on pictures shot synchronously by a plurality of cameras from different views to obtain a three-dimensional skeleton of each human body. For each subsequent frame, a predicted position and a confidence are given for each three-dimensional human key point of the current frame according to the three-dimensional skeleton reconstructed in the previous frame. For key-frame pictures, a human body detector is run to detect a two-dimensional bounding box of each human body. For non-key-frame pictures, the three-dimensional skeleton obtained by motion prediction from the previous frame is projected into the images of all views of the current frame to quickly obtain the two-dimensional bounding box of each human body in every view, thereby reducing the cost of the human body detector and improving the efficiency of the algorithm. The invention also uses temporal information to calculate the visibility of human key points and, based on this visibility, improves the accuracy of three-dimensional human skeleton reconstruction.

Description

Multi-view human motion capture method based on human motion prediction

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a multi-view human motion capture method based on human motion prediction.

Background

Multi-view human motion capture refers to recovering the three-dimensional motion of a human body from multi-view video. Most existing methods comprise two stages, detection and reconstruction. Although they achieve good results on public datasets, their detection and reconstruction modules are separate, so the detector cannot benefit from the reconstruction result of the previous frame, and motion prediction is lacking. Furthermore, the visibility of key points is not well exploited.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a human body detection method based on motion prediction, and to calculate the visibility of each key point using temporal information, so as to improve the accuracy and efficiency of multi-person motion capture.

The purpose of the invention is realized by the following technical scheme: a multi-view human motion capture method based on human motion prediction comprises the following steps:

(1) inputting pictures synchronously shot by a plurality of cameras at different visual angles, and performing three-dimensional human body reconstruction to obtain a three-dimensional skeleton of each human body as an initial frame result;

(2) motion prediction: for each subsequent frame, maintaining a series of motion-related variables according to the three-dimensional skeleton reconstructed in the previous frame, and using a motion prediction method to give a predicted position and a confidence for each three-dimensional human key point of the current frame;

(3) for key-frame pictures, running a human body detector to detect a two-dimensional bounding box of each human body; for non-key-frame pictures, projecting the three-dimensional skeleton obtained by motion prediction from the previous frame into the image of each view of the current frame to quickly obtain the two-dimensional bounding box of each human body in every view of the current frame;

(4) cropping a single human body from the image using the two-dimensional bounding box, inputting it into a human body two-dimensional key point detector, outputting a heat map for each key point, and taking the position and confidence of each two-dimensional human key point from the heat map; calculating the visibility of each key point using temporal information, and setting the confidence of key points judged to be invisible to zero; and reconstructing the three-dimensional skeleton by triangulation, thereby achieving multi-view human motion capture.

Further, step (1) is specifically: performing two-dimensional key point detection on each picture, establishing a matching relation between the two-dimensional key points across views, and reconstructing the three-dimensional human body by triangulation to obtain the three-dimensional coordinates of each human key point.

Further, in the step (2), the motion-related variables include positions and velocities of three-dimensional key points of the human body.

Further, in step (2), a Kalman filter is adopted for the motion prediction, with the position and velocity of the three-dimensional key point as state variables and the position of the three-dimensional key point as the observation variable, so as to establish a linear system; the Kalman filter gives a predicted position of the three-dimensional key point that takes the movement velocity into account, and the resulting covariance matrix provides confidence information; the prediction process of the Kalman filter can be expressed as:

$$x_t = F x_{t-1}$$

$$P_t = F P_{t-1} F^T + Q$$

wherein $x_{t-1}$ and $x_t$ are the state vectors at times $t-1$ and $t$, respectively; $P_{t-1}$ and $P_t$ are the covariance matrices of $x_{t-1}$ and $x_t$, respectively; $Q$ is the noise covariance matrix; $F$ is the state transition matrix; and $\Delta t$ is the time interval between adjacent frames.
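The explicit form of the state transition matrix is not given in the text above. Purely as an illustration, and assuming a constant-velocity model (an assumption, not something the text states), the state of one key point could stack its position $p_t$ and velocity $v_t$; the observation matrix $H$ is likewise a hypothetical name for the matrix that selects the position:

```latex
x_t = \begin{bmatrix} p_t \\ v_t \end{bmatrix}, \qquad
F = \begin{bmatrix} I_3 & \Delta t \, I_3 \\ 0 & I_3 \end{bmatrix}, \qquad
H = \begin{bmatrix} I_3 & 0 \end{bmatrix}
% Under this assumed model the prediction x_t = F x_{t-1} reads
% p_t = p_{t-1} + \Delta t \, v_{t-1}, \quad v_t = v_{t-1}.
```

Under this assumed form, $\Delta t$ enters the prediction only through $F$, which would be consistent with $\Delta t$ being listed among the symbols of the prediction equations.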

Further, in step (2), a neural network is adopted for the motion prediction; the network memorizes the reconstruction results over a past period of time, outputs the predicted position of each three-dimensional key point at the current time according to this temporal memory, and gives a confidence.

Further, in step (3), the key frames are determined as follows: one frame is taken as a key frame at regular intervals, or the interval is adjusted according to the confidence obtained in the prediction process, and if the confidence remains unsatisfactory, the density of key frames is increased.

Further, in step (3), given the three-dimensional key point positions of each human body estimated in the previous frame, each human bone is approximated by a cylinder, and the visibility of each human key point is judged according to the occlusion relation between the cylinders and the front-back relation of the human bodies in a given camera coordinate system.

Further, in step (3), the visibility judgment is specifically: approximating each human bone by a cylinder with radius r and height h, wherein the center of the cylinder is located at the mean of the three-dimensional positions, reconstructed in the previous frame, of the two adjacent key points of the corresponding bone; in each view, a line segment is determined from the camera center to a given key point, whether the line segment intersects each cylinder in three-dimensional space is calculated, and the occlusion relation between the line segment and each cylinder is determined according to the front-back position relation of the human bodies reconstructed in the previous frame, that is, it is judged whether the joints of the person behind are occluded by the joints of the person in front, and the visibility of all key points in that view is calculated.

The invention has the beneficial effects that: according to the invention, the three-dimensional human body skeleton of the previous frame is projected to the image of each visual angle of the current frame after motion prediction to obtain the two-dimensional human body bounding box, so that the expense of a human body detector is reduced, and the algorithm efficiency is improved. The invention also utilizes the time sequence information to calculate the visibility of key points of the human body, and improves the accuracy of the reconstruction of the three-dimensional skeleton of the human body based on the visibility.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic overall flow chart of an embodiment of the present invention, wherein sub-figure (a) is a schematic diagram of the detection result of the human body two-dimensional bounding boxes; sub-figure (b) is a schematic diagram of the human body two-dimensional key point prediction result; and sub-figure (c) shows the reconstruction result of the three-dimensional key points.

Fig. 2 is a schematic diagram of a human body three-dimensional key point reconstruction result and a motion prediction result thereof according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of a two-dimensional key point heat map of a human body according to an embodiment of the invention.

Fig. 4 is a schematic view illustrating the visibility determination of key points of a human body according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and it will be appreciated by those skilled in the art that the present invention may be practiced without departing from the spirit and scope of the present invention and that the present invention is not limited by the specific embodiments disclosed below.

As shown in fig. 1, the present invention provides a multi-view human motion capture method based on human motion prediction, which specifically includes the following steps:

1. Pictures shot synchronously from different views by a plurality of calibrated cameras are input, and an existing multi-view human motion capture method (such as mvpose) is first used to perform three-dimensional human reconstruction as the result of the initial frame.

Specifically, two-dimensional key point detection is carried out on each picture, a matching relation between two-dimensional key points among all visual angles is established by using a matching algorithm, a three-dimensional human body is reconstructed by using triangulation, and a three-dimensional coordinate of each human body key point is obtained, namely a three-dimensional skeleton of each human body is finally obtained.
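The triangulation step can be illustrated with a minimal sketch. It assumes that each calibrated camera is described by a 3x4 projection matrix and uses a standard linear (DLT) triangulation; the text does not say which triangulation variant is used. The function names `project` and `triangulate_point` and the optional per-view `weights` are hypothetical.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X with a 3x4 projection matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_point(proj_mats, points_2d, weights=None):
    """Linear (DLT) triangulation of one key point seen in several views.

    proj_mats: (V, 3, 4) projection matrices, assumed known from calibration.
    points_2d: (V, 2) pixel coordinates of the same key point in each view.
    weights:   optional (V,) per-view weights; a zero weight removes that view.
    """
    proj_mats = np.asarray(proj_mats, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    if weights is None:
        weights = np.ones(len(proj_mats))
    weights = np.asarray(weights, dtype=float)
    if np.count_nonzero(weights > 0) < 2:
        raise ValueError("need at least two views with non-zero weight")

    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, weights):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)

    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


# Usage with three synthetic cameras that differ only by a shift along x.
cams = [np.hstack([np.eye(3), np.array([[tx], [0.0], [0.0]])]) for tx in (0.0, -1.0, 1.0)]
X_true = np.array([0.5, 0.2, 4.0])
observations = np.array([project(P, X_true) for P in cams])
X_est = triangulate_point(cams, observations)
```

Repeating this for every key point of a person yields that person's three-dimensional skeleton. The `weights` argument is included because the later steps weight views by confidence.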

2. Motion prediction: for each subsequent frame, a series of motion-related variables, including the positions and velocities of the three-dimensional human key points, are maintained according to the three-dimensional skeleton reconstructed in the previous frame, and a motion prediction method such as a Kalman filter or a neural network is used to give a predicted position and a confidence for each three-dimensional human key point of the current frame. This step is indicated in the figure by the arrow pointing from the reconstruction result of the previous frame to the picture of the current frame. The advantage of this step is as follows: without motion prediction, tracking by matching only the projection of the previous frame against the observation of the current frame easily loses a fast-moving human body, whereas motion prediction solves this problem.

Specifically, for the Kalman filter, a linear system may be established using the position and velocity of the three-dimensional key point as state variables and the position of the three-dimensional key point as the observation variable. The filter gives a predicted position of the three-dimensional key point that takes the movement velocity into account, and the resulting covariance matrix provides confidence information. Let $x_{t-1}$ and $x_t$ be the state vectors at times $t-1$ and $t$, respectively; let $P_{t-1}$ and $P_t$ be the covariance matrices of $x_{t-1}$ and $x_t$, respectively; let $Q$ be the noise covariance matrix and $F$ the state transition matrix; and let $\Delta t$ denote the time interval between adjacent frames. The prediction process of the Kalman filter can be expressed as:

$$x_t = F x_{t-1}$$

$$P_t = F P_{t-1} F^T + Q$$

wherein the state transition matrix $F$ depends on the time interval $\Delta t$ between adjacent frames.
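As a minimal sketch of the prediction step, the following code applies the two prediction equations to a single key point. It assumes a constant-velocity state transition matrix and an isotropic noise covariance, neither of which is spelled out in the text. The mapping from covariance to a scalar confidence in `position_confidence` is also a hypothetical choice.

```python
import numpy as np


def constant_velocity_model(dt, process_noise=1e-2):
    """Build F and Q for a 6-D state [position (3), velocity (3)].

    Assumption: constant velocity between frames and isotropic noise;
    the source only states that F is a state transition matrix.
    """
    I3 = np.eye(3)
    Z3 = np.zeros((3, 3))
    F = np.block([[I3, dt * I3], [Z3, I3]])
    Q = process_noise * np.eye(6)
    return F, Q


def kalman_predict(x_prev, P_prev, F, Q):
    """Prediction step: x_t = F x_{t-1},  P_t = F P_{t-1} F^T + Q."""
    x_pred = F @ x_prev
    P_pred = F @ P_prev @ F.T + Q
    return x_pred, P_pred


def position_confidence(P_pred):
    """Hypothetical confidence in (0, 1]: larger positional variance, lower confidence."""
    return 1.0 / (1.0 + np.trace(P_pred[:3, :3]))


# Usage: a key point at the origin moving with velocity (1, 2, 3), frame interval 0.5.
F, Q = constant_velocity_model(dt=0.5)
x_prev = np.array([0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
P_prev = np.eye(6)
x_pred, P_pred = kalman_predict(x_prev, P_prev, F, Q)
confidence = position_confidence(P_pred)
```

In a full tracker, each prediction would be followed by the usual Kalman update once the key point has been reconstructed in the current frame. That update is omitted here because the text only gives the prediction equations.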

Specifically, for a neural network such as an LSTM, the network memorizes the reconstruction results over a past period of time, outputs the predicted position of each three-dimensional key point at the current time based on this temporal memory, and gives a confidence.

3. For key-frame pictures, the invention runs a human body detector (such as YOLO) to detect the two-dimensional bounding box of each person, consistent with existing methods: the input of the human body detector is the current frame picture, and the output is the two-dimensional bounding box of each person. For non-key-frame pictures, the human body detector does not need to be run; as shown in fig. 1(a), the three-dimensional skeleton obtained by motion prediction from the previous frame is projected into the image of each view of the current frame, so that the two-dimensional bounding box of each human body in every view of the current frame is obtained quickly. The advantage of this step is that, because the position of the human body in the image at the next moment is estimated by motion prediction, the human body detector does not have to be run on every frame but only on a subset of key frames, which effectively reduces the running time.
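The projection of a predicted skeleton to a bounding box can be sketched as follows. This is an illustration under stated assumptions: the camera is given by a 3x4 projection matrix, and the box is the extent of the projected key points enlarged by a relative `margin`. The margin, its default value and the function names are hypothetical, since the text does not specify how the box is derived from the projected points.

```python
import numpy as np


def project_points(P, points_3d):
    """Project (N, 3) points with a 3x4 projection matrix P to (N, 2) pixels."""
    pts_h = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    uvw = pts_h @ P.T
    return uvw[:, :2] / uvw[:, 2:3]


def bbox_from_skeleton(P, skeleton_3d, image_size, margin=0.2):
    """Return (x_min, y_min, x_max, y_max) around the projected skeleton.

    margin enlarges the box relative to its width and height, because key
    points lie inside the body silhouette (hypothetical choice).
    """
    uv = project_points(P, np.asarray(skeleton_3d, dtype=float))
    x_min, y_min = uv.min(axis=0)
    x_max, y_max = uv.max(axis=0)
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    box = np.array([x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y])

    # Keep the box inside the image.
    width, height = image_size
    box[[0, 2]] = np.clip(box[[0, 2]], 0, width - 1)
    box[[1, 3]] = np.clip(box[[1, 3]], 0, height - 1)
    return box


# Usage: a toy camera with focal length 100 px and principal point (50, 50).
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
predicted_skeleton = [[0.0, 0.0, 2.0], [0.2, 0.4, 2.0], [-0.2, -0.4, 2.0]]
box = bbox_from_skeleton(P, predicted_skeleton, image_size=(100, 100))
```

Running this once per view and per person replaces the detector call on non-key frames. Its cost is a few matrix products rather than a network forward pass.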

In particular, the determination of key frames depends on the actual requirements. In most cases, one frame can simply be taken as a key frame at regular intervals. The decision can also be based on the confidence obtained in the prediction process: if the confidence remains low, which indicates that the human body may be moving too fast or performing complex motion at that moment, key frames can be made denser.
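One possible key-frame policy combining the two rules is sketched below. The interval lengths, the confidence threshold and the function name are hypothetical values chosen for illustration. "Confidence remains low" is interpreted here as every recent confidence lying below the threshold.

```python
def is_key_frame(frame_idx, last_key_idx, recent_confidences,
                 base_interval=10, dense_interval=3, conf_threshold=0.5):
    """Decide whether the detector should run on frame_idx.

    recent_confidences: prediction confidences of the last few frames.
    All numeric defaults are illustrative assumptions.
    """
    interval = base_interval
    # If the confidence has stayed low, schedule key frames more densely.
    if recent_confidences and max(recent_confidences) < conf_threshold:
        interval = dense_interval
    return frame_idx - last_key_idx >= interval


# Usage: with confident predictions the detector runs every 10th frame;
# with persistently low confidence it runs every 3rd frame.
run_detector_confident = is_key_frame(5, 0, [0.9, 0.8])
run_detector_uncertain = is_key_frame(5, 0, [0.2, 0.3])
```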

As shown in fig. 1(b), a single person is cropped from the image using the two-dimensional bounding box, and the crop is used as the input to a human body two-dimensional key point detector (such as HRNet), which outputs a heat map for each key point; the position and confidence of each two-dimensional human key point can be extracted from the heat map. Finally, the three-dimensional skeleton is reconstructed by triangulation (no cross-view matching is required at this point, because the two-dimensional detections in all views derive from the projection of the same three-dimensional skeleton), achieving multi-view human motion capture, as shown in fig. 1(c).
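Reading key point positions and confidences from the heat maps can be sketched as below. For simplicity the sketch takes the global maximum of each heat map as its peak and maps heat-map cells back to the full image through the bounding box. Real detectors often add sub-pixel refinement, which is omitted here, and the function name is hypothetical.

```python
import numpy as np


def keypoints_from_heatmaps(heatmaps, bbox):
    """Extract one 2D key point and its confidence from each heat map.

    heatmaps: (K, H, W) array, one heat map per key point, computed on the crop.
    bbox:     (x_min, y_min, x_max, y_max) of the crop in full-image pixels.
    Returns (K, 2) image coordinates and (K,) confidences.
    """
    heatmaps = np.asarray(heatmaps, dtype=float)
    K, H, W = heatmaps.shape

    # Peak location of each heat map (global maximum as a simplification).
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_idx, (H, W))
    confidences = heatmaps[np.arange(K), rows, cols]

    # Map heat-map cell centres back into the full image.
    x_min, y_min, x_max, y_max = bbox
    xs = x_min + (cols + 0.5) * (x_max - x_min) / W
    ys = y_min + (rows + 0.5) * (y_max - y_min) / H
    return np.stack([xs, ys], axis=1), confidences


# Usage with two synthetic 4x4 heat maps for a crop covering (10, 20)-(50, 60).
hm = np.zeros((2, 4, 4))
hm[0, 1, 2] = 0.9
hm[1, 3, 0] = 0.4
points, conf = keypoints_from_heatmaps(hm, bbox=(10, 20, 50, 60))
```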

Fig. 2 shows a three-dimensional human skeleton at a certain time together with the skeleton predicted by the motion prediction method for a future time. With the motion prediction mechanism, even for large-amplitude motion such as jumping, the two-dimensional bounding box obtained from the projected prediction stays as close as possible to the human body in the current frame, and tracking loss is avoided.

The invention also provides a method for improving the accuracy of three-dimensional human skeleton reconstruction by using the visibility of key points. As mentioned above, the cropped single person is input to the two-dimensional key point detector, and the direct output of the detector is a set of two-dimensional key point heat maps, as shown in fig. 3. The local peak of a heat map gives the position of the two-dimensional key point, and the peak value gives the confidence of that key point. However, the confidence may be affected by many factors, such as whether the key point is occluded, the size of the key point, and whether the overall pose of the person is common. For example, in fig. 3, the head of the person behind is occluded by the head of the person in front, yet the corresponding heat map still shows a large response area, which may lead to an unreliable estimate. Confidence therefore does not fully represent the visibility of a key point. Because visibility acts as a weight in the subsequent triangulation, the method, unlike prior methods that use the confidence directly, calculates the visibility of each key point from temporal information, thereby suppressing invisible two-dimensional key points and preventing them from interfering with the subsequent reconstruction.

The visibility judgment method is specifically as follows: given the three-dimensional key point positions of each person estimated in the previous frame, the method approximates each human bone by a cylinder, and the visibility of each human key point can be judged according to the occlusion relation between the cylinders and the front-back relation of the human bodies in a given camera coordinate system. The resulting key point visibility is then used to suppress some of the two-dimensional key points, providing more reliable input for the reconstruction.

Specifically, as shown in fig. 4, each human bone is approximated by a cylinder with radius r and height h, where r and h are determined according to the physical size of an actual person or obtained from statistical data. For example, with the human body statistical model SMPL, a cylinder parameterized by r and h can be fitted to the point cloud of the corresponding SMPL bone to obtain statistical values of r and h. The center of the cylinder is located at the mean of the three-dimensional positions, reconstructed in the previous frame, of the two adjacent key points of the bone. In each view, a line segment is determined from the camera center to a given key point, and whether this line segment intersects each cylinder in three-dimensional space can be calculated geometrically (by comparing the distance from the line segment to the cylinder axis with the radius). The occlusion relation between the line segment and each cylinder is determined according to the front-back position relation of the persons reconstructed in the previous frame, that is, it is judged whether the joints of the person behind are occluded by the joints of the person in front, and the visibility of all key points in that view is calculated. For key points judged to be invisible, the confidence is set to zero so that they are not considered in the subsequent triangulation.
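The geometric test can be sketched as follows, under several simplifying assumptions that are not taken from the text. The cylinder axis is the segment between the two key points of a bone, so h equals the bone length. A single radius is used for all bones. Only bones of other persons are tested, so self-occlusion is ignored. Because the sight line ends at the key point, bones lying behind it cannot intersect the segment, which accounts for the front-back relation. All function names are hypothetical.

```python
import numpy as np


def segment_distance(p1, q1, p2, q2, eps=1e-12):
    """Minimum distance between the 3D segments p1-q1 and p2-q2."""
    d1, d2, r = q1 - p1, q2 - p2, p1 - p2
    a, e, f = d1 @ d1, d2 @ d2, d2 @ r
    if a <= eps and e <= eps:          # both segments are points
        s = t = 0.0
    elif a <= eps:                     # first segment is a point
        s, t = 0.0, np.clip(f / e, 0.0, 1.0)
    else:
        c = d1 @ r
        if e <= eps:                   # second segment is a point
            t, s = 0.0, np.clip(-c / a, 0.0, 1.0)
        else:
            b = d1 @ d2
            denom = a * e - b * b      # zero when the segments are parallel
            s = np.clip((b * f - c * e) / denom, 0.0, 1.0) if denom > eps else 0.0
            t = (b * s + f) / e
            if t < 0.0:
                t, s = 0.0, np.clip(-c / a, 0.0, 1.0)
            elif t > 1.0:
                t, s = 1.0, np.clip((b - c) / a, 0.0, 1.0)
    return np.linalg.norm((p1 + s * d1) - (p2 + t * d2))


def keypoint_visibility(camera_center, keypoints, person_ids, bones, radius):
    """Return a boolean visibility flag per key point for one camera.

    keypoints:  (N, 3) previous-frame 3D key points of all persons.
    person_ids: (N,) index of the person each key point belongs to.
    bones:      list of (i, j) key point index pairs, one cylinder per pair.
    """
    camera_center = np.asarray(camera_center, dtype=float)
    keypoints = np.asarray(keypoints, dtype=float)
    visible = np.ones(len(keypoints), dtype=bool)
    for k, target in enumerate(keypoints):
        for i, j in bones:
            if person_ids[i] == person_ids[k]:
                continue  # simplification: a person does not occlude themselves
            # The sight line is blocked if it passes within `radius` of the bone axis.
            if segment_distance(camera_center, target, keypoints[i], keypoints[j]) < radius:
                visible[k] = False
                break
    return visible


# Usage: person 0 stands in front of person 1 as seen from a camera at the origin.
kps = [[0.0, -0.3, 2.0], [0.0, 0.3, 2.0],   # person 0: one vertical bone at depth 2
       [0.0, 0.0, 4.0], [1.0, 0.0, 4.0]]    # person 1: two key points at depth 4
ids = [0, 0, 1, 1]
vis = keypoint_visibility([0.0, 0.0, 0.0], kps, ids, bones=[(0, 1), (2, 3)], radius=0.1)
conf_2d = np.where(vis, np.array([0.8, 0.7, 0.9, 0.6]), 0.0)  # zero out invisible key points
```

In this toy scene the key point of person 1 that lies directly behind the bone of person 0 is flagged as invisible, and its confidence is set to zero before triangulation. In a multi-camera setup the test would be repeated for each camera center.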

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.

The foregoing is only a preferred embodiment of the present invention. Although the present invention has been disclosed by way of preferred embodiments, these are not intended to limit it. Using the methods and techniques disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments, without departing from its scope. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A multi-view human motion capture method based on human motion prediction is characterized by comprising the following steps:

(1) inputting pictures shot by a plurality of cameras synchronously at different visual angles, and performing three-dimensional human body reconstruction to obtain a three-dimensional skeleton of each human body as an initial frame result;

2. The multi-view human motion capture method based on human motion prediction of claim 1, wherein step (1) is specifically: performing two-dimensional key point detection on each picture, establishing a matching relation between the two-dimensional key points across views, and reconstructing the three-dimensional human body by triangulation to obtain the three-dimensional coordinates of each human key point.

3. The multi-view human motion capture method based on human motion prediction of claim 1, wherein in the step (2), the motion-related variables comprise positions and velocities of three-dimensional key points of the human body.

4. The multi-view human motion capture method based on human motion prediction of claim 1, wherein in step (2), a Kalman filter is adopted for the motion prediction, with the position and velocity of the three-dimensional key point as state variables and the position of the three-dimensional key point as the observation variable, so as to establish a linear system; the Kalman filter gives a predicted position of the three-dimensional key point that takes the movement velocity into account, and the resulting covariance matrix provides confidence information; the prediction process of the Kalman filter can be expressed as:

$$x_t = F x_{t-1}$$

$$P_t = F P_{t-1} F^T + Q$$

wherein $x_{t-1}$ and $x_t$ are the state vectors at times $t-1$ and $t$, respectively; $P_{t-1}$ and $P_t$ are the covariance matrices of $x_{t-1}$ and $x_t$, respectively; $Q$ is the noise covariance matrix; $F$ is the state transition matrix; and $\Delta t$ is the time interval between adjacent frames.

5. The multi-view human motion capture method based on human motion prediction of claim 1, wherein in step (2), a neural network is adopted for the motion prediction; the neural network memorizes the reconstruction results over a past period of time, outputs the predicted position of each three-dimensional key point at the current time according to this temporal memory, and gives a confidence.

6. The multi-view human motion capture method based on human motion prediction of claim 1, wherein in step (3), the key frames are determined as follows: one frame is taken as a key frame at regular intervals, or the interval is adjusted according to the confidence obtained in the prediction process, and if the confidence remains unsatisfactory, the density of key frames is increased.

7. The multi-view human motion capture method based on human motion prediction of claim 1, wherein in step (3), given the three-dimensional key point positions of each human body estimated in the previous frame, each human bone is approximated by a cylinder, and the visibility of each human key point is judged according to the occlusion relation between the cylinders and the front-back relation of the human bodies in a given camera coordinate system.

8. The multi-view human motion capture method based on human motion prediction of claim 7, wherein in step (3), the visibility judgment is specifically: approximating each human bone by a cylinder with radius r and height h, wherein the center of the cylinder is located at the mean of the three-dimensional positions, reconstructed in the previous frame, of the two adjacent key points of the corresponding bone; in each view, a line segment is determined from the camera center to a given key point, whether the line segment intersects each cylinder in three-dimensional space is calculated, and the occlusion relation between the line segment and each cylinder is determined according to the front-back position relation of the human bodies reconstructed in the previous frame, that is, it is judged whether the joints of the person behind are occluded by the joints of the person in front, and the visibility of all key points in that view is calculated.