{"id":1299,"date":"2025-01-20T07:03:03","date_gmt":"2025-01-20T07:03:03","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/20\/zero-shot-player-tracking-in-tennis-with-kalman-filtering-80bba73a4247\/"},"modified":"2025-01-20T07:03:03","modified_gmt":"2025-01-20T07:03:03","slug":"zero-shot-player-tracking-in-tennis-with-kalman-filtering-80bba73a4247","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/20\/zero-shot-player-tracking-in-tennis-with-kalman-filtering-80bba73a4247\/","title":{"rendered":"Zero-Shot Player Tracking in Tennis with Kalman Filtering"},"content":{"rendered":"<p>    Zero-Shot Player Tracking in Tennis with Kalman Filtering<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Automated tennis tracking without labels: GroundingDINO, Kalman filtering, and court homography<\/h4>\n<p><iframe loading=\"lazy\" src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FE8KiqH8uM5g%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DE8KiqH8uM5g&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FE8KiqH8uM5g%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"854\" height=\"480\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/6f735abc63f905de122bb8a0679f97fd\/href\">https:\/\/medium.com\/media\/6f735abc63f905de122bb8a0679f97fd\/href<\/a><\/iframe><\/p>\n<p>With the recent surge in sports tracking projects, many inspired by <a href=\"https:\/\/x.com\/skalskip92\/status\/1816162584049168389\">Skalski\u2019s popular soccer tracking project<\/a>, there\u2019s been a notable shift towards using automated player tracking for sport hobbyists. 
Most of these approaches follow a familiar workflow: collect labeled data, train a YOLO model, project player coordinates onto an overhead view of the field or court, and use this tracking data to generate advanced analytics for potential competitive insights. In this project, however, we provide the tools to bypass the need for labeled data, relying instead on GroundingDINO\u2019s zero-shot detection capabilities combined with a Kalman filter that smooths over its noisy outputs.<\/p>\n<p>Our dataset originates from a set of <a href=\"https:\/\/github.com\/HaydenFaulkner\/Tennis\">broadcast videos<\/a>, publicly available under an MIT License thanks to Hayden Faulkner and team.\u00b9 This data includes footage from various tennis matches during the 2012 Olympics at Wimbledon; we focus on a match between Serena Williams and Victoria Azarenka.<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Ff5Ig7CL6nAc%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Df5Ig7CL6nAc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Ff5Ig7CL6nAc%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"854\" height=\"480\" frameborder=\"0\" scrolling=\"no\"><a href=\"https:\/\/medium.com\/media\/52659e7e9e29b83dcfe248555616f546\/href\">https:\/\/medium.com\/media\/52659e7e9e29b83dcfe248555616f546\/href<\/a><\/iframe><\/p>\n<p>GroundingDINO, for those not familiar, merges object detection with language: users supply a prompt like \u201ca tennis player,\u201d and the model returns candidate detection boxes that fit the description. 
Roboflow has a great tutorial <a href=\"https:\/\/colab.research.google.com\/github\/roboflow-ai\/notebooks\/blob\/main\/notebooks\/zero-shot-object-detection-with-grounding-dino.ipynb\">here<\/a> for those interested in using it, but I have pasted some very basic code below as well. As seen below, you can prompt the model to identify objects that would rarely, if ever, be tagged in an object detection dataset, like a dog\u2019s\u00a0tongue!<\/p>\n<pre>from groundingdino.util.inference import load_model, load_image, predict, annotate<br><br># placeholder paths (assumptions): point these at your local GroundingDINO config and weights<br>CONFIG_PATH = \"groundingdino\/config\/GroundingDINO_SwinT_OGC.py\"<br>WEIGHTS_PATH = \"weights\/groundingdino_swint_ogc.pth\"<br>model = load_model(CONFIG_PATH, WEIGHTS_PATH)<br><br>BOX_THRESHOLD = 0.35<br>TEXT_THRESHOLD = 0.25<br><br># load_image returns the original image plus a transformed tensor for the model<br>image_source, image = load_image(\"dog.jpg\")<br><br>prompt = \"dog tongue, dog\"<br>boxes, logits, phrases = predict(<br>    model=model,<br>    image=image,<br>    caption=prompt,<br>    box_threshold=BOX_THRESHOLD,<br>    text_threshold=TEXT_THRESHOLD,<br>)<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A71jEpkc5smUd2N7NvFDW9w.png?ssl=1\"><figcaption>GroundingDINO output when prompted with \u201cDog\u201d and \u201cDog tongue.\u201d Picture is owned by the\u00a0author.<\/figcaption><\/figure>\n<p>However, distinguishing players on a professional tennis court isn\u2019t as simple as prompting for \u201ctennis players.\u201d The model often misidentifies other individuals on the court, such as line judges, ball people, and umpires, causing jumpy and inconsistent annotations. 
Additionally, the model sometimes fails to even detect the players in certain frames, leading to gaps and non-persistent boxes that don\u2019t reliably appear in each\u00a0frame.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/proxy\/1%2A2-1BsD7WVTQzwc9Z74XNtw.png?ssl=1\"><figcaption>Tracking picks up a lines person in the first example and a ball person in the second. Image made by the\u00a0author.<\/figcaption><\/figure>\n<p>To address these challenges, we apply a few targeted methods. First, we narrow down the detection boxes to just the top three probabilities from all possible boxes. Often, line judges have a higher probability score than players, which is why we don\u2019t filter to only two boxes. However, this raises a new question: how can we automatically distinguish players from line judges in each\u00a0frame?<\/p>\n<p>We observed that detection boxes for line and ball personnel typically have shorter time spans, often lasting just a few frames. Based on this, we hypothesize that by associating boxes across consecutive frames, we could filter out people that only appear briefly, thereby isolating the\u00a0players.<\/p>\n<p>So how do we achieve this kind of association between objects across frames? Fortunately, the field of multi-object tracking has extensively studied this problem. Kalman filters are a mainstay in multi-object tracking, often combined with other identification metrics, such as color. For our purposes, a basic Kalman filter implementation is sufficient. In simple terms (for a deeper dive, check this <a href=\"https:\/\/towardsdatascience.com\/what-i-was-missing-while-using-the-kalman-filter-for-object-tracking-8e4c29f6b795\">article<\/a> out), a Kalman filter is a method for probabilistically estimating an object\u2019s position based on previous measurements. 
It\u2019s particularly effective with noisy data, but it also works well for associating objects across time in videos, even when detections are inconsistent, such as when a player is not detected in every frame. We implement an entire Kalman filter <a href=\"https:\/\/github.com\/dcaustin33\/kalman_filter_object_detection\">here<\/a> but will walk through some of the main steps in the following paragraphs.<\/p>\n<p>A Kalman filter state for two dimensions is quite simple, as shown below. All we have to do is keep track of the x and y location as well as the object\u2019s velocity in both directions (we ignore acceleration).<\/p>\n<pre>from dataclasses import dataclass<br><br>@dataclass<br>class KalmanStateVector2D:<br>    x: float  # horizontal position<br>    y: float  # vertical position<br>    vx: float  # horizontal velocity<br>    vy: float  # vertical velocity<\/pre>\n<p>The Kalman filter operates in two steps: it first predicts an object\u2019s location in the next frame, then updates this prediction based on a new measurement (in our case, from the object detector). However, a new frame could contain multiple new objects, or it could drop objects that were present in the previous frame, which raises the question of how to associate boxes we have seen previously with those seen currently.<\/p>\n<p>We do this using the Mahalanobis distance, coupled with a chi-squared test, to assess the probability that a current detection matches a past object. Additionally, we keep a queue of past objects so we have a longer \u2018memory\u2019 than just one frame. Specifically, our memory stores the trajectory of any object seen over the last 30 frames. Then, for each object we find in a new frame, we iterate over our memory and find the previous object most likely to match it, as measured by the probability derived from the Mahalanobis distance. However, it\u2019s possible we are seeing an entirely new object as well, in which case we should add a new object to our memory. 
If a new detection has a &lt;30% probability of matching every object in our memory, we add it to our memory as a new\u00a0object.<\/p>\n<p>We provide our full Kalman filter below for those preferring code.<\/p>\n<pre>import numpy as np<br>from scipy import stats<br><br>class KalmanStateVectorNDAdaptiveQ:<br>    state_matrix: np.ndarray # for 2 dimensions these are [x, y, vx, vy]<br>    cov: np.ndarray # 4x4 covariance matrix<br><br>    def __init__(self, states: np.ndarray) -&gt; None:<br>        self.state_matrix = states<br>        # process noise covariance, adapted online in update_q<br>        self.q = np.eye(self.state_matrix.shape[0])<br>        self.cov = None<br>        # state transition matrix; assumes a single step transition (dt = 1)<br>        self.f = np.eye(self.state_matrix.shape[0])<br><br>        # the first half of the state holds positions and the second half velocities,<br>        # so each position is advanced by its velocity<br>        index = self.state_matrix.shape[0] \/\/ 2<br>        self.f[:index, index:] = np.eye(index)<br><br>    def initialize_covariance(self, noise_std: float) -&gt; None:<br>        self.cov = np.eye(self.state_matrix.shape[0]) * noise_std**2<br><br>    def predict_next_state(self, dt: float) -&gt; None:<br>        self.state_matrix = self.f @ self.state_matrix<br>        self.predict_next_covariance(dt)<br><br>    def predict_next_covariance(self, dt: float) -&gt; None:<br>        self.cov = self.f @ self.cov @ self.f.T + self.q<br><br>    def __add__(self, other: np.ndarray) -&gt; np.ndarray:<br>        return self.state_matrix + other<br><br>    def update_q(<br>        self, innovation: np.ndarray, kalman_gain: np.ndarray, alpha: float = 0.98<br>    ) -&gt; None:<br>        # exponentially weighted update of the process noise from the latest innovation<br>        innovation = innovation.reshape(-1, 1)<br>        self.q = (<br>            alpha * self.q<br>            + (1 - alpha) * kalman_gain @ innovation @ innovation.T @ kalman_gain.T<br>        )<br><br>class KalmanNDTrackerAdaptiveQ:<br><br>    def __init__(<br>        self,<br>        state: KalmanStateVectorNDAdaptiveQ,<br>        R: float,  # measurement noise standard deviation<br>        Q: float,  # initial state standard deviation<br>        
h: np.ndarray = None,<br>    ) -&gt; None:<br>        self.state = state<br>        self.state.initialize_covariance(Q)<br>        self.predicted_state = None<br>        self.previous_states = []<br>        # measurement matrix; defaults to observing the full state<br>        self.h = np.eye(self.state.state_matrix.shape[0]) if h is None else h<br>        self.R = np.eye(self.h.shape[0]) * R**2<br>        self.previous_measurements = []<br>        self.previous_measurements.append(<br>            (self.h @ self.state.state_matrix).reshape(-1, 1)<br>        )<br><br>    def predict(self, dt: float) -&gt; None:<br>        # store a snapshot of the state vector; appending self.state itself would<br>        # only add another reference to the same, later-mutated object<br>        self.previous_states.append(self.state.state_matrix.copy())<br>        self.state.predict_next_state(dt)<br><br>    def update_covariance(self, gain: np.ndarray) -&gt; None:<br>        self.state.cov -= gain @ self.h @ self.state.cov<br><br>    def update(<br>        self, measurement: np.ndarray, dt: float = 1, predict: bool = True<br>    ) -&gt; None:<br>        \"\"\"Measurement is an (x, y) position\"\"\"<br>        self.previous_measurements.append(measurement)<br>        assert dt == 1, \"Only single step transitions are supported due to F matrix\"<br>        if predict:<br>            self.predict(dt=dt)<br>        # innovation: difference between the measurement and the predicted measurement<br>        innovation = measurement - self.h @ self.state.state_matrix<br>        # innovation covariance, inverted to compute the Kalman gain<br>        gain_invertible = self.h @ self.state.cov @ self.h.T + self.R<br>        gain_inverse = np.linalg.inv(gain_invertible)<br>        gain = self.state.cov @ self.h.T @ gain_inverse<br><br>        new_state = self.state.state_matrix + gain @ innovation<br><br>        self.update_covariance(gain)<br>        self.state.update_q(innovation, gain)<br>        self.state.state_matrix = new_state<br><br>    def compute_mahalanobis_distance(self, measurement: np.ndarray) -&gt; float:<br>        innovation = measurement - self.h @ self.state.state_matrix<br>        return np.sqrt(<br>            innovation.T<br>            @ np.linalg.inv(<br>                self.h @ self.state.cov @ self.h.T + self.R<br>            )<br>            @ innovation<br>        
)<br><br>    def compute_p_value(self, distance: float) -&gt; float:<br>        # the squared Mahalanobis distance is chi-squared distributed, so square the distance first<br>        return 1 - stats.chi2.cdf(distance**2, df=self.h.shape[0])<br><br>    def compute_p_value_from_measurement(self, measurement: np.ndarray) -&gt; float:<br>        \"\"\"Returns the probability that the measurement is consistent with the predicted state\"\"\"<br>        distance = self.compute_mahalanobis_distance(measurement)<br>        return self.compute_p_value(distance)<\/pre>\n<p>Having tracked every detected object over the past 30 frames, we can now devise heuristics to pinpoint which boxes most likely represent our players. We tested two approaches: selecting the boxes nearest the center of the baseline, and picking those with the longest observed history in our memory. Empirically, the first strategy often flagged line judges as players whenever the actual player moved away from the baseline, making it less reliable. Meanwhile, we noticed that GroundingDINO tends to \u201cflicker\u201d between different line judges and ball people, while genuine players maintain a somewhat stable presence. As a result, our final rule is to pick the boxes in our memory with the longest tracking history as the true players. As you can see in the initial video, it\u2019s surprisingly effective for such a simple\u00a0rule!<\/p>\n<p>With our tracking system now established on the image, we can move toward a more traditional analysis by tracking players from a bird\u2019s-eye perspective. This viewpoint enables the evaluation of key metrics, such as total distance traveled, player speed, and court positioning trends. For example, we could analyze whether a player frequently targets their opponent\u2019s backhand based on location during a point. To accomplish this, we need to project the player coordinates from the image onto a standardized court template viewed from above, aligning the perspective for spatial analysis.<\/p>\n<p>This is where homography comes into play. 
A homography describes the mapping between two planes, which, in our case, means mapping points on the court in our original image to an overhead court view. By identifying at least four keypoints in the original image, such as line intersections on the court, we can calculate a homography matrix that translates any point on the court to a bird\u2019s-eye view. To create this homography matrix, we first need to identify these \u2018keypoints.\u2019 Various open-source, permissively licensed models on platforms like Roboflow can help detect these points, or we can label them ourselves on a reference image to use in the transformation.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/proxy\/1%2ADNMKY2b2dTKy7_lEQjJ-ow.png?ssl=1\"><figcaption>As you can see, the predicted keypoints are not perfect, but we find that small errors do not affect the final transformation matrix\u00a0much.<\/figcaption><\/figure>\n<p>After labeling these keypoints, the next step is to match them with corresponding points on a reference court image to generate a homography matrix. Using OpenCV, we can then create this transformation matrix with a few simple lines of\u00a0code!<\/p>\n<pre>import numpy as np<br>import cv2<br><br># keypoints: (x, y) pixel locations in the broadcast frame<br># court_coords: matching (x, y) locations on the reference court<br># the two lists must be in the same order and contain at least four points<br>source = np.array(keypoints, dtype=np.float32) # (n, 2) matrix<br>target = np.array(court_coords, dtype=np.float32) # (n, 2) matrix<br>m, _ = cv2.findHomography(source, target)<\/pre>\n<p>With the homography matrix in hand, we can map any point from our image onto the reference court. For this project, our focus is on the player\u2019s position on the court. 
To determine this, we take the midpoint at the base of each player\u2019s bounding box, using it as their location on the court in the bird\u2019s-eye view.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/proxy\/1%2AnNrggu1pxTHrt4jWUoSA9Q.png?ssl=1\"><figcaption>We use the middle point at the bottom of the box to map to where each player is on the court. The illustration shows the keypoint translated with our homography matrix to the tennis court seen from a bird\u2019s-eye view.<\/figcaption><\/figure>\n<p>In summary, this project demonstrates how we can use GroundingDINO\u2019s zero-shot capabilities to track tennis players without relying on labeled data, turning raw object detections into usable player tracking. By tackling key challenges (distinguishing players from other on-court personnel, ensuring consistent tracking across frames, and mapping player movements to a bird\u2019s-eye view of the court), we\u2019ve laid the groundwork for a robust tracking pipeline.<\/p>\n<p>This approach doesn\u2019t just unlock insights like distance traveled, speed, and positioning but also opens the door to deeper match analytics, such as shot targeting and strategic court coverage. 
With further refinement, including distilling a YOLO or RT-DETR model from GroundingDINO outputs, we could even develop a real-time tracking system that rivals existing commercial solutions, providing a powerful tool for both coaching and fan engagement in the world of\u00a0tennis.<\/p>\n<ol>\n<li>\n@inproceedings{faulkner2017tenniset,<br \/> title={TenniSet: A Dataset for Dense Fine-Grained Event Recognition, Localisation and Description},<br \/> author={Faulkner, Hayden and Dick, Anthony},<br \/> booktitle={2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA)},<br \/> pages={1\u20138},<br \/> organization={IEEE}<br \/>}<\/li>\n<\/ol>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/zero-shot-player-tracking-in-tennis-with-kalman-filtering-80bba73a4247\">Zero-Shot Player Tracking in Tennis with Kalman Filtering<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Derek Austin<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fzero-shot-player-tracking-in-tennis-with-kalman-filtering-80bba73a4247\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Zero-Shot Player Tracking in Tennis with Kalman Filtering Automated tennis tracking without labels: GroundingDINO, Kalman filtering, and court homography https:\/\/medium.com\/media\/6f735abc63f905de122bb8a0679f97fd\/href With the recent surge in sports tracking 
projects, many inspired by Skalski\u2019s popular soccer tracking project, there\u2019s been a notable shift towards using automated player tracking for sport hobbyists. Most of these approaches follow a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,221,301,1371,92,1370],"tags":[103,1373,1372],"class_list":["post-1299","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-computer-vision","category-object-detection","category-tennis","category-thoughts-and-theory","category-tracking","tag-model","tag-tennis","tag-tracking"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1299"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1299"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1299\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1299"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1299"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1299"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}