Ego-motion Localization using Lucas-Kanade Optical Flow

One Sentence Summary: This project aims to identify the feasibility of tracking an object’s position from the moving object’s frame using optical flow (ego-motion) and test its resilience in diverse environmental contexts. Source Code

Overview

Optical flow tracking is a technique that estimates the motion of objects, surfaces, and edges in a visual scene based on the relative motion between the observer and the scene. This project focuses on the Lucas-Kanade algorithm, a sparse optical flow estimation method that is particularly useful for tracking movement where object detection may not be possible. Using a moving average filter to refine the optical flow vectors, we can determine the average motion’s magnitude and direction, which can be mapped onto an XY coordinate plane. Rather than remaining stationary and tracking a moving object, the camera is attached to the moving object (ego-motion) to track the landmarks around it and determine its own position relative to them. The applicability of the method spans from ground-based robot localization (visual odometry) to aerial drone navigation. This investigation delves into the method’s adaptability and resilience across diverse environmental contexts and operational ranges, as well as pertinent limitations and challenges, including susceptibility to plane alterations, processing overhead, environmental variables, and motion-induced blur.

Steps within Algorithm:

  1. A Canny Edge Detection filter is applied to bring out the edges.
  1. The Shi-Tomasi corner detection identifies the most reliable tracking features of the image.
  1. The Lucas-Kanade method assumes that the displacement of the pixels between two consecutive frames is small and constant within a neighborhood of the pixel. The algorithm then solves a system of linear equations to find the optical flow vector for each pixel that has a feature to track.
  1. The dx and dy of each frame are appended to a stack with fixed length and averaged, creating a moving average filter to smooth the vectors over a window of frames. The moving average filter reduces the noise and outliers in the optical flow vectors and provides a more stable estimate of the motion.
  1. The algorithm then computes the magnitude and directional vector of the optical flow for each landmark, which represents the speed and direction of the motion of the object relative to the camera.
  1. The positional graph can be created by tracking cumulative x and y values and updating it each frame using the dx and dy.

Check out the Source Code or In-Depth Explanation at the bottom for more.

Testing and Demonstration


The following are demonstrations from different distances, speeds, and environments. Each path starts and ends at the same location to help provide quantifiable error without another source of localization.

Drone Navigation Location Accuracy

Ideal Scenario - Textured White Surface Panning

Wood Pattern Panning

The wood panning is recorded in the same way as the white surface above, except with a greater variety in the magnitude and direction of the vectors. The built-up error is noticeable but still very minute.


Video camera mounted on Chassis Facing the Roof

A video camera is strapped onto the chassis and moved in a parallelogram, starting and ending at the same location. The algorithm tracked the roof’s position from the moving camera's frame, showing little error, since the roof stayed consistently on roughly the same plane as the camera and there were no abrupt movements. However, slight back-and-forth jitters are present because the camera is not firmly attached. Since the movement in this video covers a very short distance, less motion smoothing produces the most accurate result.

Limitations


The algorithm has some limitations that affect its performance and applicability in certain scenarios. Some of the main limitations are:

  1. Inability to handle changes in plane: The algorithm assumes that the object to be tracked stays on the same plane, at a constant depth from the camera. If the object changes its plane or depth relative to the camera, the algorithm will produce inaccurate results, as it does not account for the change in perspective and scale. This can happen if the object moves closer to or farther from the camera, or if the camera tilts out of the plane. In the Lucas-Kanade paper, the authors indicate that these problems can be addressed by generalizing beyond 2D planar translation to handle “rotation, scaling, and shearing” [5].
    Hand Held Recording

    Slight variations in the plane result in drastic levels of error.

  1. Processing time: The algorithm requires a significant amount of processing time to compute the optical flow vectors for each frame, especially if the number of features to track is large. This can limit the real-time performance of the algorithm, as it may not be able to keep up with the frame rate of the camera. This can be mitigated by techniques such as tracking fewer features or processing downscaled frames.
    Screen Recording Real time Speed of Processing.

    The processing time is slower than the input video's frame rate, so the output does not match real time until the entire video is finished and exported to an output file. This makes the algorithm difficult to use in live applications, although there are many steps that could be taken to improve it for that purpose.

  1. Fast motion: The algorithm relies on the assumption that the displacement of the pixels between two consecutive frames is small. However, if the motion of the object or the camera is too fast, the displacement may exceed the window size of the Lucas-Kanade method, and the algorithm will fail to find a match for the features. This can result in tracking errors or failures, as well as motion blur in the frames, which decreases the clarity of the edges within the frame. This also introduces what I call spiky artifacts within the image that look like sparks. To decrease the effect of fast motion, we can change the window size so that it can capture the displacement of the points even if they are spread farther apart.

    Using the same close-up input video from the processing-time test, with a greater window size and Canny threshold as well as less moving-average smoothing, the spiky artifacts become apparent, disrupting the accuracy of the tracking until the data is unusable.

    Before vs. After Parameters tuned to minimize spiky artifacts

    Note: The window size and Canny thresholds have been decreased to accommodate the rapid speed of the video.

  1. Moving objects: Apart from the frame of reference itself, other objects cannot be moving, since the model determines its position from the relative motion of the environment. One way to overcome this is to create a bounding box around each moving object and omit landmarks/corners within that region.

Explanation Behind the Algorithm


  1. Preprocessing: The Canny Edge Detection function is used to bring out the edges in each frame of the video. The process consists of 4 main steps.
    1. Smoothing an image using a 5x5 Gaussian filter to reduce noise, which can be done by convolving the image with a Gaussian kernel.
    1. Calculating the intensity gradient of the image using the Sobel operator, a 3x3 kernel that computes the horizontal and vertical gradients of the image, from which the gradient magnitude and direction can be found.
      • Edge gradient: $G = \sqrt{G_x^2 + G_y^2}$
      • Direction: $\theta = \tan^{-1}\left(\frac{G_y}{G_x}\right)$

    1. Thin out the edges using non-maximum suppression by finding the pixels that constitute local maxima. The rest of the pixels are set to zero.
    1. Compare edges to a high and low threshold, determined by the built-in OpenCV Canny function’s parameters. Edges above the high threshold are considered strong edges. Pixels below the low threshold are considered non-edges. Pixels between the two thresholds are considered weak edges, which are discarded unless connected to a sure edge. [1]
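      A minimal sketch of this preprocessing step; the blur kernel and the 50/150 thresholds are illustrative choices:

      ```python
      import cv2
      import numpy as np


      def preprocess(gray):
          """5x5 Gaussian smoothing followed by Canny edge detection."""
          blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # step 1: noise reduction
          # Steps 2-4 (Sobel gradients, non-maximum suppression, and
          # hysteresis thresholding at 50/150) happen inside cv2.Canny.
          return cv2.Canny(blurred, 50, 150)
      ```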
    1. Identifying Landmarks: The goodFeaturesToTrack function is used to identify the best tracking landmarks in the frame.
      1. The function implements the Shi-Tomasi method based on the eigenvalues of the second-moment matrix [2][3][4]. In a grayscale image, this matrix is defined as

        $M = \sum_{x,y} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$

        • $I_x$ and $I_y$ are the image derivatives in the $x$ and $y$ directions respectively.
        • $w(x,y)$ is a window function that gives weights to the pixels underneath.
        • The corner response function in the Shi-Tomasi method is the minimum eigenvalue of the second-moment matrix $M$: $R = \min(\lambda_1, \lambda_2)$, where $\lambda_1$ and $\lambda_2$ are the eigenvalues of $M$.
        • If $R$ is greater than a certain threshold, the point is classified as a corner.
        • Non-maximum suppression is applied to the detected corners, essentially meaning only the corner with the highest corner response function value is selected from the neighborhood.
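        This response can be checked numerically with OpenCV's `cornerMinEigenVal`, which computes the minimum eigenvalue of the second-moment matrix per pixel. A small sketch on a synthetic image; the probed coordinates are illustrative:

        ```python
        import cv2
        import numpy as np

        # Synthetic image: one bright square gives corners, edges, and flat areas.
        img = np.zeros((60, 60), dtype=np.uint8)
        img[20:40, 20:40] = 255
        img = cv2.GaussianBlur(img, (5, 5), 0)        # soften so gradients exist

        # Per-pixel Shi-Tomasi response: min eigenvalue of M over a 3x3 block.
        response = cv2.cornerMinEigenVal(img, 3)

        corner_r = response[20, 20]   # square corner: both eigenvalues large
        edge_r = response[20, 30]     # mid-edge: one eigenvalue near zero
        flat_r = response[5, 5]       # flat region: both eigenvalues near zero
        ```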
    1. Feature Tracking: The calcOpticalFlowPyrLK function is used to track the corners in the subsequent frames [6]. Simplified Explanation of the Steps:
      1. Find the disparity vector $h$ which minimizes the difference between $F(x+h)$ and $G(x)$, the pixel values, for $x$ in $R$ (the region of interest) [5].

        Equation for each pixel in the area:

        $I_x(q_i)v_x + I_y(q_i)v_y = -I_t(q_i)$

        where $q_i$ is the pixel; $I_x$, $I_y$, $I_t$ are the changes in image brightness along the $x$, $y$, and $t$ directions; and $v_x$ and $v_y$ are the components of the pixel velocity that we want to find.

      1. Because this technique assumes the difference in pixel positions is small, the linear approximation $F(x+h) \approx F(x) + hF'(x)$ can be used, giving $h \approx \frac{G(x) - F(x)}{F'(x)}$ at a single point.
        1. Using this approximation, we can find the $h$ which minimizes the $L_2$ norm of the difference between the curves, $E = \sum_x [F(x + h) - G(x)]^2$. To minimize the error with respect to $h$, we set $\frac{\partial E}{\partial h} = 0$ to get

          $h \approx \frac{\sum_x F'(x)[G(x) - F(x)]}{\sum_x [F'(x)]^2}$. This can be generalized to multiple dimensions in a similar way by minimizing $E$.
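          As a numeric sanity check of this estimate (the sinusoid and the shift value are arbitrary choices):

          ```python
          import numpy as np

          # Numeric check of the one-dimensional estimate on a known shift:
          # G(x) = F(x + h) for a smooth F, with true disparity h = 0.05.
          x = np.linspace(0.0, 2.0 * np.pi, 400)
          F = np.sin(x)
          G = np.sin(x + 0.05)
          Fp = np.cos(x)                                   # analytic derivative F'(x)
          h_est = np.sum(Fp * (G - F)) / np.sum(Fp ** 2)   # the formula above
          ```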
      • Choice of the Lucas-Kanade method: Its advantage over the Farneback method and others is computational efficiency, as it computes sparse optical flow for a selected set of features rather than dense optical flow over every pixel. The Farneback processing time is over twice as long, and its vector accuracy does not compare either.
        Farneback vs. Lucas-Kanade Comparison
    1. Postprocessing: The average flow is calculated and displayed on the OpenCV frame using an arrowed line drawn to indicate the movement’s magnitude and direction. The position is also plotted in real time on an XY positional graph using matplotlib.

      The moving average value is the average over the last $k$ values of a stack containing $n$ entries [7].
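      A minimal sketch of this moving average (the `FlowSmoother` name and default window size are illustrative):

      ```python
      from collections import deque

      import numpy as np


      class FlowSmoother:
          """Moving average over the last k optical-flow displacements."""

          def __init__(self, k=5):
              self.window = deque(maxlen=k)   # oldest entries fall off automatically

          def update(self, dx, dy):
              self.window.append((dx, dy))
              adx, ady = np.mean(self.window, axis=0)
              return float(adx), float(ady)
      ```

      Each `update(dx, dy)` returns the averaged displacement used for the drawn arrow and the position plot.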



References:

[1] https://docs.opencv.org/4.x/da/d22/tutorial_py_canny.html

[2] https://docs.opencv.org/3.4/d4/d8c/tutorial_py_shi_tomasi.html

[3] https://en.wikipedia.org/wiki/Harris_Corner_Detector

[4] https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_shi_tomasi/py_shi_tomasi.html

[5] https://cseweb.ucsd.edu/classes/sp02/cse252/lucaskanade81.pdf

[6] https://docs.opencv.org/3.4/dc/d6b/group__video__track.html#ga5d10ebbd59fe09c5f650289ec0ece5af

[7] https://en.wikipedia.org/wiki/Moving_average

[8] https://people.csail.mit.edu/lpk/mars/temizer_2001/Optical_Flow/

[9] https://www.sciencedirect.com/science/article/pii/S1474667017331543