A friend left his job in banking and decided to open a pizzeria with his mates. They were quite successful, and within a few years they went from 1 restaurant to 8. As this happened, they gradually went from running individual restaurants to macro managing them remotely. To help them do so, they built a realtime view of all the pizzerias.
Inspired by their experiences, an idea appeared when we came across the New York City taxi dataset. We thought: How would business analytics look for the NYC taxi business? Could we monitor it in real-time? How would it look?
To answer this, we constructed a pseudo-realtime dashboard to test the concept. The graph uses real data from New York City cabs, replayed with an offset of a few years to simulate streaming data. Let's see how it works.
See the NYC Taxi Grafana dashboards in real-time.
Unfortunately, we do not have access to true real-time data for NYC taxis. To simulate "real-timely-ness", we instead tune data from real historical files.
To do so, the data is loaded into QuestDB and then the queries are run against the current time, offset by 8 years.
To import the data into QuestDB, first download the parquet files for the intervals you are interested in from the above address. Then, we can use the python client to insert the rows into a QuestDB table in just a few lines of code:
We can simulate the offset to current time in two steps:
First we filter the tables using an offset against the current time. For example, to simulate the last 5 minutes of taxi rides:
Second, we adjust the Grafana time window to match our target. For example, for a chart covering the whole day up to the current time, but 8 years ago, we use
now-8y/d as start, and
now-8y as end points:
The first thing we wanted was a set of ticking metrics to show the health of our "business" so far "today" -- err 8 years ago. A strong start is to get the count of trips so far today:
Next we replicated the query to get total fare, average fare, total trip distance and so on:
After that, we wondered if we could simulate how many taxis and passengers are
currently in a trip. The data only shows one line per trip and the designated
If we wanted to filter directly on
dropoff_datetime it would make queries
significantly slower as this is not a designated timestamp field. Instead we
took all the trips that started within a 2 hour window but were not yet
finished... as of the current time, eight years ago:
In a similar way, we applied
sum(passenger_count) instead of
count() to get
the simulated number of passengers currently in a trip:
The next step is ambitious. Could we construct an overall view of the rides flow? Whenever a pickup or a dropoff is happening, alongside other metrics such as passenger count, the cab type and other key indicators?
Sure we can. We can use similar logic for filtering by
pickup_datetime and for
dropoffs, and then apply further filtering on
To highlight the recent rides, a status based on how recently the trip started
case statement will work. For instance,
filters all rides from today, then
ORDER BY gets the most recent rides.
From there, truncate to 100 results:
Similarily for dropoffs and with applying some value mappings to generate cell colors in Grafana, we end up with the following view that replays the flow of rides as if in real-time:
We're simulating business, here! What else is essential to our prospective taxi empire? What about cumulative metrics, such as the total amount of fares received today, tips, distance, and that sort of thing?
We apply two Grafana queries with different windows to compare the results for today and the previous day:
The above uses
sum() over() to calculate the cumulative sum of fares and tips
received. Since the query could return hundreds of thousands of rows, we use
SAMPLE BY 1m with the
last() point to down-sample to one-minute intervals
and reduce the points Grafana needs to draw on the chart.
The result is as follows, and it updates throughout the day:
The same logic can then be applied to multiple other metrics, like distance driven, average fare, time driven, cumulative rides and cumulative passengers:
Ahh... Tipping. Very contentious these days! The tip is most interesting to unpack. What influences tipping tendencies? Does a driver get more tips for driving faster or for taking night rides or for picking up rides in certain areas? There are many possible ways we can approach this, but an XY plot felt most visually impactful.
In the above query, we use some filters to remove outliers such as
There's many compelling insights:
- Some people tip 0, oof!
- Some people tip a percentage of the fare, shown via the straight diagonal lines. Though none of us have been to NYC, we assume that the payment systems in the cab suggest different tip amounts, and that these lines reflect the various possible tip percentages. Perhaps this sort of chart has influenced the ever-creeping minimum tip percentage...
- Some tip a flat round value, like $5 or $10
- There is another likely dimension, which is the 'round up' tip, so that the overall value is a round number, though it's harder to see on the chart below:
To check, we can separate the different groups with a few SQL filters. For example, the following will highlight the tips that round up the fare to a multiple of $5 or $10:
Again, similar logic can help us find tips that are a certain percentage of the fare amount, such as 12.5% and 20%:
If we operated a fleet of taxis, we'd be very interested in the passenger hot spots. A hot spot is an area with many pickups/dropoffs. There's the obvious, like an airport or a major tourist spot like Broadway. But there could be hot new venues or emerging restaurants, and getting there first gives us an edge.
A Grafana heatmap with the pickup and dropoff coordinates to highlight recent activity does the trick:
And finally... the dashboard. The location of every taxi, at every moment. The eye in the sky. Big brother! This is difficult to do with high precision because the NYC taxi dataset only contains pickup and dropoff coordinates. But we can get something representative through time-based interpolation. In a real-world scenario, GPS coordinates would provide the needed precision.
Calculate the ride duration and interpolate between the pickup and dropoff point based on how much time has elapsed:
Admittedly, this is a naive interpolation. But we do want to see the taxis moving. While we can expect some inconsistencies, such as cars going over water and through buildings…
... We can still get a rough sense of where cabs are gravitating. Given we lack real-time data, seeing taxis scoot through their routes is still very satisfying:
Our inspiration took us on an exploration of the New York City taxi industry. Using historical taxi data, we simulated a real-time business dashboard. Through time-based interpolation, we visualized the approximate locations and movements of taxis, offering a bird's-eye view of the city's taxi flow.
While clean and specific data would lead to stronger analysis, our demo showcases the power of data visualization in understanding and managing a business. It gave us a glimpse into the daily ebb and flow of New York City's iconic taxi industry.
For data ingest and analysis, QuestDB and Grafana make an excellent pair. If you're looking to interpret massive flows of data and to compose dashboards of your very own, consider downloading QuestDB. It's open source!