Pandas is one of the most popular open-source Python libraries used for data
analysis and manipulation. It provides standard data structures (e.g.,
Dataframe, Series) and utility functions that make it easier to sanitize,
transform, and analyze various data sets. Pandas also has great support for
timedelta64 types as well as integrations with other
popular libraries such as
scikits.timeseries, which makes for a great choice
to process and analyze time series data.
The extensive capabilities of Pandas for time-series data make it a great complementary piece for QuestDB for a data analytics tech stack. Pandas provides the data structure for working with various time-series data types and analytic libraries in memory, while QuestDB provides fast and reliable storage options. When used in combinations, users can easily manipulate data either in Pandas or QuestDB for insights.
In this article, we will explore crypto prices with Pandas and QuestDB, keeping in mind the various scenarios for integration. We will walk through several options to ingest data into a Python workspace or QuestDB, analyze the dataset using Pandas or SQL, and finally visualize the results.
Depending on the size and the format of the data, we have a few options to import data for analysis. The example crypto dataset from Kaggle is relatively small and in a nice CSV format that is well-supported by QuestDB. In this case, we can easily use the console UI to upload the CSV files.
Download and run the latest version of QuestDB:
localhost:9000, click on the “Upload” icon on the left-hand panel
and import the
csv files of interest.
This example will use the Bitcoin dataset, but any of the coins from the dataset will also work.
If you have datasets in formats that QuestDB does not natively support (e.g.,
parquet), you can also use the official
Python client to
load Pandas data structure into QuestDB tables. For example, we can build our
own historical coin pricing information with the
Once you have the parquet file, you can load it into Pandas and use the
questdb package to load
it into QuestDB.
Note that we are first loading the data into QuestDB prior to doing any analysis. We could also load it into Pandas and do some preliminary exploration and only import a subset of the data to QuestDB. Either option can work, but for large datasets, it’s often a good idea to load data first into QuestDB and downsample before analyzing via Pandas. This is because Pandas loads all the data in memory, making it inherently limited by the available memory of the machine it is running on.
Ultimately the order of operation to load data into Pandas or QuestDB first depends on the structure and source of the data, as well as the makeup of your data organization. For example, if a different team is responsible for data ingestion either via Kafka or daily CSV upload, it may make sense to load it into QuestDB first. On the other hand, if your data science team is closer to the source of the data, you could do some data cleaning first in Pandas before shipping sanitized datasets to downstream teams for visualization.
For simplicity's sake, we will use the Kaggle dataset in the rest of this tutorial.
Now install the packages we will use to explore the dataset:
If you used Anaconda Navigator, go under Environments > Search Packages to install:
Now we’re ready to launch the notebook and start exploring the data:
First, we need to use the
psycopg library to connect to QuestDB and import the
Now we can run some quick queries to make sure our import was successful. The
tail functions are useful in this case for a quick sanity check:
Alternatively, we can use the
describe commands to get a sense of
the data types and distribution:
For good measure, we can also check for
na values to make sure we’re
working with a clean dataset.
These queries should return
False for all the columns. If you have missing
values in your dataset, you can use the
dropna function to remove that row or
In this example, we are simply loading the entire dataset via a select * query. But in general, if your dataset is big, it’ll be more performant to select a smaller subset by running SQL queries on QuestDB and only exporting the final results to Pandas. This will circumvent the memory limitations of Pandas as noted before and will also speed up the export process.
Also, this approach may also make more sense if your team is more comfortable
with running SQL queries to narrow down the data first. Standard SQL queries to
filter/join/order columns can be combined with time series-related helper
functions such as aggregate functions (e.g.,
to manipulate and transform the data of interest. Then a smaller subset of this
query can be imported into Python for further analysis or visualization with
Now that we have our dataset in Jupyter, we can run answer some simple
questions. For example, we can find the five lowest prices of Bitcoin by running
nsmallest function on the column we’re interested in (e.g. High, Low,
Open, Close). Similarly, we can use the
nlargest function for the opposite:
We can also find days when the open price was lower than the closing price by
To get a better sense of the trends, we can resample the dataset. For example,
we can get the mean prices by week by using the
resample function on the
Now our data is aggregated by weeks. Similarly, we can get a rolling average
over a set window via the
seaborn chose some sane defaults for the x-axis. We can also specify
our own labels as well as legends like the plot below comparing the High, Low,
and Opening prices of Bitcoin over time:
From a quick glance, it is hard to tell the trends of any of these price actions. Let’s dive into a time period when the Bitcoin price starts to become more volatile (i.e. after Jan 2021). We can specify a subset of the dataset using the loc function and apply it to our full dataset, weekly mean resample, and seven-day rolling mean.
After giving each of the datasets a different marker, we can plot all of them on the same graph to see when the opening price was above or below the weekly or moving window average:
We can zoom in further in the May to July timeframe to capture the volatility.
Alternatively, we can also apply the popular 200-day moving average metric to
analyze price trends. Finally, we can also use the
ewm function in Pandas to
calculate an exponentially weighted moving average to gather different momentum
indicators. While past performance is not indicative of future performance,
these momentum indicators can be used to backtest and formulate new trading or
price analysis strategies.
Pandas and QuestDB make for a powerful combination to perform complex data analysis. In this article, we walked through an example of analyzing historical crypto data with Pandas and QuestDB. We also discussed some options for importing and analyzing the data in Pandas or QuestDB based on data size, format, and team structure.
At the end of the day, the most ideal workflow will depend on the size of your data, the expertise of your team, or where responsibilities are divided amongst various teams. The good news is that all of these workflows can be mixed together as your data or team shifts over time.
If you are interested in more financial tick analysis, make sure to check out: