Apache Spark and Time-Series Analytics
High-level instructions for loading data from QuestDB to Spark and back.
#What is Spark?
Apache Spark is an analytics engine for large-scale data engineering, well-known in the big data landscape. It is suitable for executing data engineering, data science, and machine learning on single-node machines or clusters.
#QuestDB Spark integration
A typical Spark application processes data in the following steps:
- Loading data from different sources
- Transforming and analyzing the data
- Saving the result to a data storage
Our example demonstrates these steps using QuestDB as both the data source and the storage. It loads data from QuestDB into a Spark DataFrame; the data is then enriched with new features and eventually written back into QuestDB.
#Prerequisites
- Package manager: this depends on your choice of OS. The instructions below are for macOS using Homebrew.
- QuestDB: an instance must be running and accessible. You can start one from Docker, the binaries, or Homebrew.
#Installing Apache Spark
Spark can be installed and set up in many ways, depending on requirements. Typically, it is part of a big data stack, installed on multiple nodes with an external cluster manager, such as YARN or Apache Mesos. In this tutorial, we will work with a single-node standalone Spark installation.
Spark has a multi-language environment. It is written in Scala, runs on the Java Virtual Machine, and also integrates with R and Python. Our example is written in Python. Running the commands below installs Spark with all required dependencies:
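As a sketch, a Homebrew-based setup could look like the following; the formula and package names are the standard Homebrew/PyPI ones, and you may want to adjust versions to your environment:

```shell
# Install a JDK (Spark runs on the JVM) and Spark itself
brew install openjdk@11
brew install apache-spark

# PySpark, so the example can be written in Python
pip install pyspark
```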
The exact versions used for this example:
#Installing the JDBC driver
Spark communicates with QuestDB via JDBC, connecting to its Postgres Wire Protocol endpoint. This requires the Postgres JDBC driver to be present.
- Create a working directory:
- Download the PostgreSQL JDBC driver from https://jdbc.postgresql.org/download/ into the working directory. The exact version used for this example:
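As a sketch, assuming a working directory named questdb-spark (the name is arbitrary) and a recent driver version (substitute the one you use), the two steps could look like:

```shell
# Create and enter the working directory (the name is an assumption)
mkdir questdb-spark
cd questdb-spark

# Download the PostgreSQL JDBC driver; replace the version as needed
curl -O https://jdbc.postgresql.org/download/postgresql-42.7.3.jar
```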
#Setting up database tables
First, start QuestDB. If you are using Docker, run the following command:
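A minimal invocation using the official image, exposing both endpoints mentioned below:

```shell
# 9000: REST API and Web Console, 8812: Postgres wire protocol
docker run -p 9000:9000 -p 8812:8812 questdb/questdb
```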
The port mappings allow us to connect to QuestDB's REST and PGWire endpoints. The former is required for opening the Web Console, and the latter is used by Spark to connect to the database.
Open the Web Console in your browser at http://localhost:9000.
Run the following SQL commands using the console:
The INSERT command generates three days' worth of test data and stores it in the trades table.
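As a sketch of what such commands might look like, using QuestDB's rnd_* and timestamp_sequence functions; the column names, symbols, and value ranges are assumptions, not taken from the original:

```sql
CREATE TABLE trades (
  symbol SYMBOL,
  price DOUBLE,
  amount DOUBLE,
  ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY;

INSERT INTO trades
  SELECT
    rnd_symbol('BTC-USD', 'ETH-USD') symbol,
    rnd_double() * 1000 price,
    rnd_double() amount,
    timestamp_sequence(
      to_timestamp('2023-01-01T00:00:00', 'yyyy-MM-ddTHH:mm:ss'),
      1000000L  -- one row per second, in microseconds
    ) ts
  FROM long_sequence(3 * 24 * 60 * 60);  -- three days of rows
```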
#Feature engineering examples
Save the below Python code into a file called sparktest.py inside the working directory:
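A hedged sketch of what sparktest.py might contain: the connection settings are QuestDB's PGWire defaults, while the column names (symbol, price, ts), the one-minute sampling, and the target table name trades_enriched are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("questdb-spark-example").getOrCreate()

jdbc_url = "jdbc:postgresql://localhost:8812/questdb"

# Load one-minute aggregates of the trades table via the PGWire endpoint;
# QuestDB's SAMPLE BY runs inside the dbtable subquery
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "org.postgresql.Driver")
    .option("user", "admin")
    .option("password", "quest")
    .option(
        "dbtable",
        "(SELECT symbol, avg(price) price, ts FROM trades SAMPLE BY 1m) AS t",
    )
    .load()
)

# 10-minute moving average and standard deviation: a sliding window over the
# current row and the 9 preceding one-minute buckets, computed per symbol
w = Window.partitionBy("symbol").orderBy("ts").rowsBetween(-9, 0)
enriched = (
    df.withColumn("moving_avg", F.avg("price").over(w))
      .withColumn("std_dev", F.stddev("price").over(w))
)

# Write the enriched DataFrame back into QuestDB
(
    enriched.write.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "org.postgresql.Driver")
    .option("user", "admin")
    .option("password", "quest")
    .option("dbtable", "trades_enriched")  # target table name is an assumption
    .mode("append")
    .save()
)
```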
This Spark application loads aggregated data from the trades table into a DataFrame, then adds two new features: a 10-minute moving average and the standard deviation of the price. Finally, it writes the enriched data back into QuestDB.
#Run the example
Submit the application to Spark for execution with spark-submit. The example requires the JDBC driver at runtime; this dependency is passed to Spark using the --jars option:
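Assuming the driver jar downloaded earlier sits next to the script (substitute the jar name for the version you actually downloaded), the submission could look like:

```shell
spark-submit --jars postgresql-42.7.3.jar sparktest.py
```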
After the execution completes, we can check the content of the target table. The enriched data should be displayed in the Web Console.
For a more detailed explanation of the QuestDB Spark integration, please also see our tutorial Integrate Apache Spark and QuestDB for Time-Series Analytics.