As mentioned in our previous blog post, we automatically collect data from other local taxi apps, as it helps us build a better understanding of the local industry. Coming from a technology background, we’re used to having data to back up business decisions, and luckily the solutions used locally make this relatively easy.
From an automation perspective, it’s nothing groundbreaking. As TaxiCaller provides anonymous access through their apps and web interface, we’re able to use their internal API to fetch pickup ETAs, along with vehicle locations (when enabled).
For reasons that haven’t been made clear, the JTDA have explicitly disabled the feature showing vehicle locations on the map when searching for a ride. This is a shame, as it’s a very intuitive way to see driver availability (and is coming to goto in the next update), and allows you to request a specific driver directly.
As we can’t get data on available vehicles the usual way, we were curious to see whether it could be approximated using the data we have available. As it turns out, it can – click here to jump to the demo.
If you're interested in how it works, our method is detailed below. It's slightly technical, but should be relatively easy to follow even without a statistics background.
Given that we can fetch pickup ETAs for any location on the map, the most obvious solution was to divide the island into a grid, and fetch the pickup ETA for each point on the grid (shown here in seconds).
Throughout the rest of this post I'll use the notation [x, y] to describe points on this grid, where [0, 0] is bottom left, and [15, 15] is top right. Points with a value of -1 encountered an error while fetching the ETA.
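To make the grid concrete, here is a minimal Python sketch of how the points and their ETAs could be collected. The bounding box coordinates and the `fetch_eta` callable are assumptions for illustration only; the real request to the TaxiCaller API isn't described in this post, so it is passed in rather than guessed at.

```python
from typing import Callable, List, Tuple

GRID_SIZE = 16  # points per side, giving indices 0..15 as described above

# Hypothetical bounding box (south, west, north, east) roughly covering Jersey.
# These coordinates are an assumption, not the values used in production.
BOUNDS = (49.16, -2.26, 49.27, -2.00)


def grid_point(x: int, y: int) -> Tuple[float, float]:
    """Return (lat, lng) for grid index [x, y], where [0, 0] is bottom left."""
    south, west, north, east = BOUNDS
    lat = south + (north - south) * y / (GRID_SIZE - 1)
    lng = west + (east - west) * x / (GRID_SIZE - 1)
    return lat, lng


def collect_etas(fetch_eta: Callable[[float, float], int]) -> List[List[int]]:
    """Build etas[y][x] in seconds, storing -1 wherever the fetch fails."""
    etas = []
    for y in range(GRID_SIZE):
        row = []
        for x in range(GRID_SIZE):
            try:
                row.append(fetch_eta(*grid_point(x, y)))
            except Exception:
                row.append(-1)  # same error marker as in the grid above
        etas.append(row)
    return etas
```

Note that the grid is stored row by row, so the point [x, y] lives at `etas[y][x]`.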
This works surprisingly well. You can identify vehicles around [5, 2] and [11, 5] (possibly), as the pickup ETA for these locations is significantly lower than at the surrounding points. In order to do anything useful though, we need to be able to identify these vehicles programmatically, which is harder than it sounds.
If you knew the number of vehicles you're looking for, you could just sort the points by pickup ETA and take the smallest n points – your vehicles are likely to be around there.
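As a rough illustration of the known-count case, the sketch below picks the n lowest ETAs from a grid. It assumes the grid is a list of rows indexed as `etas[y][x]`, with -1 marking failed fetches; the function name is made up for this example.

```python
import heapq
from typing import List, Tuple


def lowest_eta_points(etas: List[List[int]], n: int) -> List[Tuple[int, int]]:
    """Return the [x, y] indices of the n points with the smallest ETA."""
    candidates = [
        (eta, x, y)
        for y, row in enumerate(etas)
        for x, eta in enumerate(row)
        if eta >= 0  # skip points where the fetch failed (-1)
    ]
    # nsmallest orders the tuples by ETA first, so the lowest ETAs win
    return [(x, y) for _, x, y in heapq.nsmallest(n, candidates)]
```

Without the `eta >= 0` filter, every failed fetch would sort to the front and be mistaken for a vehicle.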
With an unknown number of vehicles, it's clear that the most important metric is the relative difference between the point and the surrounding points. We're looking for points where the ETA is significantly lower than those around it.
The first approach was to calculate the average for the neighbours of each point, and flag any locations where the ETA is n% lower than the average. However, this approach generates many false positives. For example, point [1, 7] has an ETA over 30% lower than that of its neighbours, but it's clear that there isn't a vehicle there.
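For reference, the percentage approach could look something like the sketch below. It assumes "neighbours" means the (up to) eight surrounding points, which the post doesn't specify, and that the grid is indexed as `etas[y][x]` with -1 for failed fetches.

```python
from statistics import mean


def neighbour_values(etas, x, y):
    """ETAs of the (up to) eight points surrounding [x, y], ignoring errors."""
    values = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue  # the point itself is not a neighbour
            nx, ny = x + dx, y + dy
            in_grid = 0 <= ny < len(etas) and 0 <= nx < len(etas[ny])
            if in_grid and etas[ny][nx] >= 0:
                values.append(etas[ny][nx])
    return values


def flag_by_percentage(etas, percent):
    """Flag [x, y] where the ETA is more than `percent`% below the neighbour average."""
    flagged = []
    for y, row in enumerate(etas):
        for x, eta in enumerate(row):
            neighbours = neighbour_values(etas, x, y)
            if eta < 0 or not neighbours:
                continue
            if eta < mean(neighbours) * (1 - percent / 100):
                flagged.append((x, y))
    return flagged
```

A fixed percentage takes no account of how much the neighbours vary among themselves, which is one way to see why noisy areas of the map produce false positives.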
Luckily, this is the perfect application for standard deviation. Rather than try to explain the concept here, if you're unsure about anything below, please check out the Investopedia resource on standard deviation.
For each point, we calculate the average and standard deviation of its neighbouring points, skipping the outermost 'ring' of points (as they don't have a full set of neighbours). We then flag points where the ETA is more than n standard deviations below the average. The greater the value of n, the more false negatives, and the lower the value, the more false positives. We've got the best results for values of
Taking [10, 2] as an example, the average of the neighbouring points is 299, and the standard deviation is 94. With n = 1, this point is flagged, as 28 < 299 - (n * 94).
Repeating this for each point on the map, we can flag each vehicle automatically, and by adjusting the value of n we can choose whether to show more false negatives or more false positives.
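Put together, the standard deviation check might be sketched as follows. This is an approximation under a few assumptions rather than our production code: it uses the eight surrounding points as neighbours, the population standard deviation (`statistics.pstdev`), and ignores neighbours whose fetch failed (-1).

```python
from statistics import mean, pstdev


def flag_by_std_dev(etas, n):
    """Flag [x, y] where the ETA is more than n standard deviations
    below the average of its neighbours. etas is indexed as etas[y][x]."""
    flagged = []
    height, width = len(etas), len(etas[0])
    # Skip the outermost ring, which lacks a full set of neighbours
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            eta = etas[y][x]
            neighbours = [
                etas[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dx, dy) != (0, 0) and etas[y + dy][x + dx] >= 0
            ]
            if eta < 0 or len(neighbours) < 2:
                continue  # not enough data to judge this point
            if eta < mean(neighbours) - n * pstdev(neighbours):
                flagged.append((x, y))
    return flagged
```

With the figures from the worked example (average 299, standard deviation 94, ETA 28), the threshold for n = 1 is 299 - 94 = 205, so the point is flagged; at n = 3 the threshold drops to 17 and it no longer would be.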
No blog post would be complete without a proof of concept, so check out the demo below which shows the live location of online JERSEY TAXIAPP drivers. You can use the slider to control whether you see more false negatives, or more false positives. This data is automatically updated every 10 minutes.
The results of this experiment reinforce the conclusion made in our first blog post – we believe the JERSEY TAXIAPP has a maximum of 6-10 active drivers.
We've reached out to the JTDA for comment on this analysis; you can read their response here.
While this is a cool example, it has a few caveats. The most obvious is granularity: multiple vehicles within the same grid square are shown as a single cluster. As there is no way of uniquely identifying a single cluster, it's also hard to track movements. With ten-minute updates, each cluster can move significantly, so it's difficult to track clusters across updates.
If a vehicle is between two grid points, the ETA for both points can be affected equally, causing them both to be highlighted. Where you see two markers directly next to each other, it's usually a single vehicle. As you decrease n, you'll see more instances of this.
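One possible way to account for this when counting vehicles is to treat touching markers as a single cluster. The sketch below is only an illustration of that idea, not a description of how our counts are produced; it assumes flagged points are given as (x, y) tuples and that diagonal contact also counts as "next to each other".

```python
def count_clusters(flagged):
    """Count groups of flagged (x, y) points that touch horizontally,
    vertically or diagonally."""
    remaining = set(flagged)
    clusters = 0
    while remaining:
        clusters += 1
        stack = [remaining.pop()]  # start a new cluster from any point
        while stack:
            x, y = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (x + dx, y + dy)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        stack.append(neighbour)
    return clusters
```

For example, `count_clusters([(5, 2), (6, 2), (11, 5)])` treats the first two markers as one vehicle and returns 2. The granularity caveat still applies: two vehicles in adjacent squares would also be merged.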
Even though this isn't suited to use in production, it gives us an easy way to count the approximate number of vehicles online, along with their locations, which is fed into our stats dashboard.