Kinematic-based Markerless Human Tracking in 3D Probabilistic Occupancy Grids

—Markerless human motion tracking can be employed in many applications such as automatic surveillance, motion capture, human-machine interface and activity recognition. This problem has been extensively studied in the computer vision research community in the last years. In this context, the present paper presents an approach for 3D markerless human motion tracking based on a skeletal kinematic model of the human body. This method is applied over a 3D probabilistic occupancy grid of the environment, which is constructed by means of a Bayesian fusion of images from multiple synchronized sensoring cameras. Although the use of kinematic models in 3D human tracking is widely employed, its use over 3D probabilistic occupancy grids still was not vastly investigated in the literature. The experiments were performed using a public dataset with video sequences of people in motion. The results show that the method is capable of dealing adequately with the 3D markerless human motion tracking problem.


I. INTRODUCTION
Visual tracking, which is the recursive detection and location of objects (or more generally, visual patterns) in videos [1], is a classical computer vision problem.In this context, the visual tracking of people has been studied extensively in the literature [2], [3].Among many applications, one could mention automatic surveillance, motion capture, human-machine interface and activity recognition, as examples of relevant problems where the visual tracking of people is employed.
Tracking articulated targets, such as the human body, using 2D images is a difficult problem to be treated mainly due to: i) the complex nature of 3D movements; ii) the loss of information in images because of 2D space restriction; iii) the color changes caused by luminosity variations; iv) the existence of others objects moving into the scene.Thus, to minimize some of these issues, multiple synchronized cameras can be used to sensor the environment where people are moving.From the set of images captured by the cameras a 3D reconstruction of the environment can be performed.
The reconstruction technique used in the present approach is called 3D probabilistic occupancy grid [4].This technique was proposed as a way to overcome some problems, for instance, the existence of phantom volumes and holes in the reconstructions.These problems occur when other popular methods are employed, such as the shape-from-silhouette.In the 3D probabilistic occupancy grid approach, the images obtained by the multiple synchronized cameras are fused by means of a Bayesian inference [5].Doing so, the decision about the occupancy or not of a position (voxel) into the 3D space is taken accounting all the information coming from each camera, allowing a better inference compared with shapefrom-silhouette, which takes the decision accounting each 2D image separately.
Although one can find some model-free proposals in the literature [6], many 3D motion tracking methods employ predefined representation models of the targets [7], [8].Usually, the object appearance model is associated with the object kinematic model that describes the possible movements and valid poses [9].The main contribution of the present work, which is still not vastly investigated in the literature, is the use of a human body skeletal kinematic model to perform the markerless visual tracking of people in a 3D probabilistic occupancy grid.Experiments were performed with videos sequences from a public dataset [10].The results show that the method is capable of dealing adequately with the 3D markerless human motion tracking problem.

II. RELATED WORK
Markerless human tracking and motion analysis have been studied for some years and remarkable achievements were already accomplished [2], [3].Several approaches can be found in the literature, which deal with many different application's scenarios, data acquisition approaches and tracking methodologies.A vast review of works can be find in [12], [13], [14], [15].
In this context, it is hard to establish a sufficiently general taxonomy that could be used to classify and analyze the works in the field.Thus, some relevant attributes must be chosen in a way that some extent of comparison among different approaches is allowed.It can be noticed, for instance, that many methods focus on the recovery of human poses from monocular images or videos.Agarwal and Triggs [16] propose a learning-based method along with a histogram-of-shapecontext silhouette shape descriptor which allows the recovery of 3D human body pose from monocular images.Urtasun et al. [17] also present a method for 3D pose recovery from monocular videos, but this method employs a strong motion model in the tracking process and do not use a learning approach.
Bourdev and Malik [18], as well as Yang and Ramanan [19], focus on the identification of body parts to pursue the pose estimation.The former propose the concept of poselets, which maps the appearance of body parts to their 3D pose; while the latter learn the different body parts based on the HOG descriptor [20].More recently, deep neural networks have also been applied to the monocular pose estimation problem, as presented by Toshev and Szegedy [21] and Tompson et al. [22].
Another extremely relevant methods are the ones based on depth images.The advent of low cost devices, such as the Microsoft Kinect, called attention of the research community for such kind of input data.Certainly a remarkable work in this subject is the approach presented by Shotton et al. [23] in which deep decision random forests [24] are used to recover the 3D human pose in real-time from a single depth image.In more recent works, Vemulapalli et al. [25], also recover 3D human poses from depth images, in the context of action recognition, by mapping the skeletons models to a Lie group; and Ionescu et al. [26] estimate the 3D human pose classifying body parts using SIFT [27] and random forests.
Multiple views methods are another kind of approach which still achieves the highest accuracy among the markerless methodologies [28].In this context, Sigal et al. [29], [30] present a probabilistic graphical model to represent and track a human body in a multiple camera environment.Simas et al. [6] propose a method based on nonparametric belief propagation which can be applied to track people and other previous unknown moving objects.In the works of Starck and Hilton [31] and De Aguiar et al. [32], the employed human body representation model is a mesh-based surface.More recently, Elhayek et al. [28] and Belagiannis et al. [33] proposed approaches in multiple views scenarios facing challenging situations, such as self-occlusions, moving cameras, outdoor scenes and cluttered background.In this context, despite the several methodologies found in the literature concerning the markerless human motion tracking, the use of a kinematic model over a 3D probabilistic occupancy still was not well explored.

III. PROPOSED METHODOLOGY
A typical visual tracking system is formed by four main components: the observation model, the representation model, the movement model and the tracking algorithm.In the proposed approach the target (a moving person) is observed by means of a 3D probabilistic occupancy grid.The representation model is composed of two parts: a set of Gaussian blobs, modeling the volume occupied by each rigid body part (appearance model); and a kinematic hierarchical model (human body skeletal model), representing the spatial relation between the rigid body parts.In this work no movement model has been used.Finally, the tracking algorithm is based on the Expectation-Maximization (EM) algorithm and on the Cyclic-Coordinate Descent (CCD) inverse kinematic method.The proposed approach is summarized by the diagram shown in the Fig. 1.Each one of the components will be explained in further details in the following sections.

A. Observation Model
The observation model defines which kind of sensor information about the targets is extracted from the environment.In the present method, the environment is sensored by multiple synchronized and calibrated cameras.The captured images are used to build a probabilistic volumetric reconstruction of the scene.This reconstruction is composed of voxels, which present a probability to be occupied by the interested objects.Many volumetric reconstruction methods use a simple binary background segmentation of each image and analyze those images individually.Despite being simple and broadly used, these methods can lead to some problems in the determination of the object's volume and position, such as phantom volumes and holes into objects.
To deal with these difficulties Franco and Boyer [4] propose to obtain a fusion of all image information using a 3D probabilistic occupancy grid.This technique tackles some difficulties presented by a number of uncertainties associated to the image capturing stage, like sensor noise, calibration errors and lightning changes.
In this method, every pixel of a camera is treated as a statistical sensor susceptible to uncertainties.The problem is then treated as a Bayesian estimation.The 3D space is discretized into volume elements, called voxels.The Bayesian estimation is used to calculate the probability of each voxel to be occupied by the object of interest.
In the determination of a voxel's occupancy status, the value of the projected pixels are taken into consideration along with a statistical background model for those pixels [35].The background model is obtained using a video sequence of the background scene without moving objects.The projection between pixels and corresponding voxels is done using the cameras' calibration matrices.A brief review of the method is presented below, while a detailed explanation can be found in [4].
1) Occupancy Grid Theory Review: The voxel occupancy inference is performed using the Bayes' rule.The probability of each voxel of the grid to be occupied is given by where I is the set of n current images, and I i p is the image data at pixel p in the image of camera i, i = 1...n.It is assumed that the image data corresponding to the set of m observed background images can be summarized into a single statistical model B i p for each pixel p in the image of each camera i, i = 1...n.τ symbolizes the prior knowledge about the scene, about the sensor characteristics and the general knowledge about the system.Finally, G is the space occupancy grid.For each space point X in the grid discretization the voxel occupancy probability is inferred according to the Bayes' rule shown in the Equation (1).
a) Generic sensor model: is the term p(I i p |G X , B i p , τ ) used to define the Bayes' rule Equation ( 1).This term directly relates the pixel's color observation to voxel occupancy, using the image formation p(I i p |F i p , B i p , τ ) and silhouette formation p(F i p |G X , τ ) terms, that will be presented after, and marginalizing over silhouette detection F i p as expressed below: where F i p is a binary silhouette detection variable for the pixel p in the image i, i = 1...n.F i p = 1 if the pixel sensor p in image i reports the presence of an object of interest anywhere along its viewing line, and F i p = 0 otherwise.b) Silhouette formation term: models the silhouette detection response of a single pixel sensor (i, p) to the occupancy state of the analysed voxel G X .The silhouette formation term is defined by two expressions, considering the case when the voxel is occupied (G X = 1) and when it is not (G X = 0): where S and R are hidden variables.S, the sampling variable, is equal to 1 if the voxel X is on the viewing line of pixel (i, p) and equal to 0 if it is not.The external detection cause R equals to 1, accounts for the possibility that some other object lies on the same viewing line as the voxel, and it is equals to 0 if no other object obstructs the viewing line.U(F i p ) is the uniform distribution for the silhouette detection, when the voxel and pixel are not aligned (S = 0), while P d (F i p ) and P f (F i p ) are respectively, the detection and the false alarm probability distributions (when S = 0) defined as: Still in the Equations 3 and 4, it is needed to define the parametric forms of p(R|τ ) (external detection term) and p(S|τ ) (sampling term).Both terms are considered uniforms.Concerning the external detection term it means that the detection is equally likely to be triggered by the voxel occupancy or by other causes anywhere along the viewing line of p.
Considering the sampling term, the uniform sampling means that all the voxels that fall within a k × k window around the pixel p have equal weight.c) Image formation term: seeks to explain the color information of a pixel (i, p), given the knowledge of the background color model at this pixel and whether or not a silhouette detection occurred at this pixel.This term is defined as by two expressions, the first one for the case of silhouette detection at pixel (i, p), and second for the case when no silhouette detection occurred: where U i p is the uniform distribution over the observed colors when the silhouette is detected, since there is no knowledge about the color of the objects of interest.And N (I i p |µ i p , σ i p ) is a normal distribution in (Y,U,V) space for each pixel, that defines the color background model.The parameters µ i p and σ i p are estimated from a set of background sample images.Considering the brief presentation until here, the voxel occupancy inference algorithm can be defined as shown in the Algorithm 1.

Algorithm 1 Voxel occupancy inference
for each voxel X in the grid do for each image i from each one of the cameras do -calculate the X voxel projection in the image i for each pixel p in the k × k window around the projection of the voxel X do p(G X |I, B, τ end for end for end for * Sum of log probabilities

B. Representation Model
The representation model defines how the interested objects are "seen" by the motion tracking method.It is composed of two main parts: the appearance model and the skeletal kinematic model.
The appearance model is composed of a set of associated Gaussian blobs [36].As the human body is an articulated object composed of rigid parts, each blob models a rigid body part of the tracked person.A blob is a Gaussian distribution in tree dimensions often represented by an ellipsoidal shape.Its mean position is given by while the surface is defined by a standard deviation around the mean, defined by These geometric shapes enclose the occupied voxels belonging to objects' rigid parts.The relation between the occupied voxels of the grid and the blobs is given by the Mahalanobis distance: closer voxels have greater probabilities of belonging to a certain blob.The basic components of the representation model are illustrated in Fig. 2.This statistical approach is good to model body parts and fits well with the probabilistic reconstruction method employed, however it carries no information about the relationship between connected body parts, e.g. the position of the hand depends on the position of the arm, which depends on the chest.To tackle these difficulties the skeletal kinematic model is introduced.
The skeletal kinematic model, adapted by the authors from the model presented by Caillette [36], is shown in Fig. 3.It describes the kinematic relationship between rigid body parts.In the human body the moving parts are the limbs which are hold together by joints.A hierarchical skeletal kinematic model is used, in which the bones are attached to joints in a tree structure.These joints are linked one to another from the root of the skeleton (in the pelvis) to the leaves (hands and feet).The kinematic skeleton presents two important functions: it is used to constrain the movement of the blobs to valid poses; and it can be used to interpret the resulting poses of the body.
Into the present approach only the main parts of the human body are represented in the skeletal model, such as torso, arms and legs.Smaller parts, e.g.hands and fingers, are neglected, since the resolution of the 3D reconstruction is not high enough to allow the effective observation of these tiny parts.Another important issue related to the model is the definition of the joints.The interdependence between the adjacent body parts movements creates complex constraints in the rotation of the joints.
Basically, a kinematic model can be employed in direct or inverse kinematic problems.In the first scenario the model state is determined according to a defined set of joints parameters.In the latter situation, the set of joints's angles must be found as a function of a determined model pose.The inverse kinematic problem is usually solved as an optimization problem.
The global position of the model, concerning the root of the kinematic hierarchical tree, is defined by 6 parameters (i.e. 3 for translation and 3 for rotation).The position of each node (joint) is defined according to the position of the parent node.Considering this, the position of a leaf (e.g.foot) is obtained through the application of recursive transformations from the root to the leafs.The Euler angles were used to encode the joints' configurations.
The joints of the model are represented by {J1, J2, ..., J21} and the root is located in the pelvis region.All the joints are modeled with 1 degree of freedom (rotation around a fixed axis) because simplicity and performance reasons.More complex joints are indeed ensembles of single joints.To exemplify the functioning of the kinematic model, a transformation between a joint Jt i−1 and its child joint Jt i is defined by a rotation θ i around the axis w i in the Jt i−1 coordinate system.The rotation is followed by a translation l i which is correspondent to the length of the bone which link both joints.The global and final position of the joint Jt i is defined by P i after the applied rotation and translation.The maximum and minimum possible values for each joint angle θ i are in the interval [θ − i , θ + i ].

C. Tracking Algorithm
The tracking algorithm matches the representation model to the 3D occupancy grid at each time instant, finding the most likely pose of the tracked person in a given frame.The proposed approach is composed of four steps: i) initialization, ii) updating blobs' parameters using Expectation-Maximization (EM), iii) kinematic model pose estimation by the means of Cyclic-Coordinate Descent (CCD) method, and iv) realignment of the blobs with the kinematic model (bones).Each of the steps is detailed below and shown in diagram of Fig. 4.
i) Initialization: In the first frame of the video sequence the pose of the representation model needs to be initialized.In this work a manual adjustment of the model has been employed.The user must adjust the pose of the skeletal kinematic model, along with the associated Gaussian blobs, to the reconstructed 3D volume in the first frame of the sequence.The defined pose is saved and can be reused in other videos without the necessity of repeating the process.This mechanism, although manual, allows the tracking of people in any initial pose.
ii) Updating Blobs with EM: From a previous frame of a video sequence, the representation model needs to be matched to the current human body pose.To perform this iterative matching through the frames, the EM algorithm is used [37].In this process the Gaussian blobs' parameters are updated from the previous 3D reconstruction frame.In the Expectation step, each voxel of the current frame is associated with the blob with the highest probability of representing that body part, given by the minimal Mahalanobis distance between the voxel and the blob.Then, in Maximization step, the mean vector and covariance matrix of the blobs are estimated according to the associated voxels.
iii) Kinematic Model Pose Estimation: Using the new position of the blobs, given by the EM algorithm in the previous step, the skeletal kinematic model is updated in two phases.First, goal unconstrained positions for the joints are obtained from the associated blobs parameters.Second, which is an inverse kinematic problem, the joints' parameters (joints' angles) must be determined, given the goal positions defined by the blobs in the previous step.A simple approach for the problem is the Cyclic-Coordinate Descent (CCD) algorithm, an iterative local optimization method [36].Each joint of the skeleton is optimized in an independent way, from leaves to the root.The optimization consists in minimizing the error between the joint positions and their goal positions.Joint angular limitation is applied by clamping the resulting angle.
As shown in Fig. 2, each blob is attached to a bone in the skeletal model.This attachment can be seen as a virtual spring, which pulls the bone towards the blob's position and orientation; the former is defined by the mean, while the latter is given by the biggest axis of the blob.The Algorithm 2 calculates the goal positions for the kinematic model joints.It typically needs only a few iterations to generate a satisfying solution.Further details are provided in [36].
After the goal joints positions calculation, the inverse kinematic problem must be solved, which means that the global configuration of the skeletal model needs to be determined according to the goal positions and to the models' constraints.In this context, the global position and orientation of the root's kinematic tree is defined by the ordered pair (P 0 , R 0 ); r i defines the local rotation of θ i over the local axis w i , and t i the translation of length l i along the first axis of the local coordinate system.The global position P i of the joint Jt i is calculated recursively in the kinematic chain {Jt 1 , ..., Jt NJ } as follows: Algorithm 2 Calculation the goal joints positions while the sum of the squared distances between the goals from the last iteration and the current ones is below a predefined threshold do 1.compute the goal position for the tip of the current joint using the base of the joint as a fixed rotation point; 2. translate the goal positions of the base and tip of each joint so as to minimize the projection error of the mean of the blobs onto the bone; 3. optimize the goal position of the base of the joint using this time the goal position of the extremity of the joint as a fixed rotation point; 4. the goal position coming from both the current joint and its parent are merged into a single goal position; end while where, Considering these two equations, the direct kinematic formulation is defined by This is the formulation for the definition of the entire skeletal kinematic model.The intermediate and recurrent calculations are stored to reuse, avoiding unnecessary computation.Another important optimization adopted was the implementation of the local rotations r i using quaternions [38], which allow the speed up of rotations calculations, specially when t i is null.
Considering this formulation of the direct kinematic, the inverse kinematic problem is solved using the Cyclic-Coordinate Descent (CCD) approach.The CCD is a local iterative optimization method.In this method each joint of the kinematic skeletal model is optimized independently, from the leafs up to the root of the hierarchical tree.The CCD method tries to minimize the distance between each joint, and its children joints, and their goal positions calculated in the previous step.
In this context, lets consider a joint Jt i and its children joints {Jt i,1 , ..., Jt i,n }, with global position given by {P i , P i,1 , ..., P i,n } and goal positions {G i , G i,1 , ..., G i,n }.Considering the simplest optimization problem, which would be the optimization of the joint Jt i alone or just with one child, the unique degree of freedom of such joint is the rotation θ i over the axis w i .Thus, the aim is to calculate the variation ∆θ i which minimize the distance between the joint position P i,1 and the goal position G i,1 .
The position of the joint Jt i base is In this context, the angular variation ∆θ i which minimizes the distance between P i,1 and G i,1 also maximizes the scalar product between P and G, having the following expression [39]: The Equation 14 find a solution in the interval ∆θ i ∈ − π 2 , π 2 .To allow a smooth movement between all the joint of the kinematic tree, an attenuation coefficient η is introduced.This attenuation also limits undesirable oscillations while all the joints are being optimized simultaneously towards contradictory goal positions.In each iteration the angle θ i is updated according to the following equation: where the coefficient η controls the convergence rate.The employed value was determined empirically as η = 0.3.
From this point, lets consider the existence of more than one child of the joint Jt i with goal positions to be achieved.The analytical calculation of ∆θ i becomes a complex task.Thus, a heuristic is employed to combine all the individual optimizations of the the joints.Lets denote {∆θ i,1 , ..., ∆θ i,k } as the variations calculated in the Equation 14.The angular variation for the current joint Jt i is the weighted summation of the individual angles, as follows: where {λ i,1 , ..., λ i,k } are the weights which gave more importance to the goals closer to the root of the kinematic tree.
If the weights were uniform, the goal positions closer to the root of the kinematic model would just be partially optimized, while the the goal positions near to the kinematic chain's extremities would be favored by their more frequent updates.Considering this, λ i,j is equal to the inverse of the number of joints separating Jt i to the joint associated with the goal G i,j .iv) Realignment of the Blobs with the Bones: The last step is to realign the blobs with the bones of the skeletal model.Since the blobs' shapes are supposed to remain the same through the tracking, the regeneration of the blobs consists in a series of rotations and translations along the skeleton kinematic tree.According to [36], lets assume that the blob B is attached at an offset α along a bone of the skeletal model, lets denote by P the global position of this bone obtained after application of the kinematic constraints, and lets denote by R the associated rotation matrix.The corrected mean µ ′ X is computed as a simple conversion from local to global coordinates given by while the corrected covariance matrix σ ′ X of blob B is given by IV. IMPLEMENTATION The proposed methodology was implemented using the C++ programming language and OpenCV and OpenGL libraries.The developed software executes all the stages of the method, from the 3D reconstruction until the markerless human tracking.The software tool includes a visualization functionality, as can be observed in Fig. ??, and it also allows the entire configuration of method's parameters by the means of the graphical interface shown in Fig. ??.Besides this, the software is also fundamental in the initialization of the representation model, which is performed as a interactive process.The user should select each joint and visually adjust the approximated angles between articulations.The blobs must also be attached to the bones of the skeleton.The initialization defined by the user can be saved by the software tool and reloaded to be used as the initial configuration of other videos sequences.

V. EXPERIMENTS AND RESULTS
Several experiments were performed to assess all the different parts of the proposed methodology.The video sequences of moving people were obtained in the public online dataset 4D Repository [10].The repository contains a set of live and dynamic events, such as human activities, captured using the multi-camera platform GRImage [11].The dataset provides, for each sequence: i) the calibration information for the multicamera set up, ii) images acquired from multiple cameras, iii) silhouettes extracted from these images by eliminating the background, iv) reconstructed mesh geometry at different time frames.Fig. 7 shows a 3D probabilistic occupancy grid, while a isosurface showing only the voxels with a defined probability of occupancy is shown in Fig. 8. Fig. 7. Four frames of the 3D occupancy grid of the observed environment where a person moves.The red regions represents the higher probabilities of the voxels to be occupied.Visualization performed with the Vis5D software [34].The CCD algorithm functioning is illustrated in Fig. 10.From a initial state of the kinematic model and a set of goal positions, the method iteratively minimizes the distance between the joint and the goal positions.It can be noticed that in the final configuration some of the joints are not exactly located over the goal position.It happens because the optimization is performed considering the kinematic constraints over the global model.Besides the evaluation and the experiments performed with some of the individual parts of the methodology, an overall assessment was realized.To this purpose a video sequence was used, which consists of a dancer performing dance movements captured by eight synchronized and calibrated RGB cameras positioned around the environment, shown in Fig. 11.In Fig. 12 a volumetric reconstruction of the human body is shown, along with the correspondent representation model.The tracking process was executed over the sequence and Fig. 13 shows three frames of the volumetric reconstruction from the eight cameras' views.Fig. 14 shows the tracked representation model pose for the same frames presented in Fig. 13.As can be observed, the kinematic model imposes restrictions to the body pose.Doing so, the tracking process became more robust, once the model cannot assume invalid poses.The appearance model just along with the tracking algorithm is not capable of achieve such performance, because it does not present the semantic information concerning the human body configuration embedded in the skeletal kinematic model.As future works, a quantitative evaluation of the results is desirable.More experiments also must be performed to precisely determine strengths and limitations of the methodology.

Fig. 1 .
Fig. 1.Diagram showing the main building blocks of the proposed methodology.

Fig. 2 .
Fig. 2. Representation model: (a) the appearance model is composed of ellipsoidal geometric shapes (Gaussian blobs), which represent the human body rigid parts; (b) the skeletal kinematic model is composed of a set of joints and limbs with the ellipsoids attached to it.

Fig. 4 .
Fig. 4. The tracking process is summarized by the diagram showing all algorithm's steps.
(a) New goal positions denoted by the (red) dots.(b) Kinematic model pose estimation with the CCD method.(c) Realignment of the blobs to the kinematic model.

Fig. 10 .
Fig. 10.Four iterations of the Cyclic-Coordinate Descent (CCD) algorithm, from the goal positions (red dots) and the initial model state (left), to the optimized final configuration (right).

Fig. 11 .
Fig. 11.Sample input images from eight synchronized cameras at a given instant.

Fig. 12 .
Fig. 12. Sample results: the 3D probabilistic reconstruction of the tracked human body (left), the resulting pose of the human body representation model (skeletal kinematic model and associated Gaussian blobs) (right).

Fig. 14 .
Fig. 14.Dancer representation model pose, composed of the skeletal kinematic model and the associated Gaussian blobs.