Querying datasets¶

This guide describe how to query DataSet objects using SQL-like constructs. Basic query constructs are discussed and then common query examples are provided. For instructions on how to create datasets from different data sources see this guide.

Before starting with the description of the query generation process, it is important to make the following remark: a good understanding of the DBMS where the data is stored is required to write efficient queries. For example, in ArdaDB, which is implemented using Clickhouse, knowing how primary indexes are implemented will result in better performance when querying the data. See this article as a reference.

Now, we are ready to review the query generation process. First, we load a dataset to use in the examples:

import algoseek_connector as ac

manager = ac.ResourceManager()
data_source = manager.create_data_source("ArdaDB")
group = data_source.groups.USEquityMarketData.fetch()
dataset = group.datasets.TradeOnlyMinuteBar.fetch()

The Dataset class is built on top of SQLAlchemy tables and columns and supports most of the standard SQL operations. Dataset objects do not store data and are just a mean to retrieve data from an external data source.

Columns and simple select statements¶

Each dataset column is represented as a sqlalchemy.Column and is accessed by using either attribute or dict-like access:

# access by attribute
col = dataset.ASID

# access by index
col = dataset["ASID"]

The ColumnHandle class also allows access to a dataset columns and it can be created using the get_column_handle() method:

c = dataset.get_column_handle()

# access by attribute
c.ASID

# access by index
c["ASID"]

The columns of a dataset allow us to build select statements using the select() method:

Python

stmt = dataset.select()

SQL

SELECT
    USEquityMarketData.TradeOnlyMinuteBar.TradeDate,
    USEquityMarketData.TradeOnlyMinuteBar.BarDateTime,
    USEquityMarketData.TradeOnlyMinuteBar.Ticker,
    USEquityMarketData.TradeOnlyMinuteBar.ASID,
    USEquityMarketData.TradeOnlyMinuteBar.FirstTradePrice,
    USEquityMarketData.TradeOnlyMinuteBar.HighTradePrice,
    USEquityMarketData.TradeOnlyMinuteBar.LowTradePrice,
    USEquityMarketData.TradeOnlyMinuteBar.LastTradePrice,
    USEquityMarketData.TradeOnlyMinuteBar.VolumeWeightPrice,
    USEquityMarketData.TradeOnlyMinuteBar.Volume,
    USEquityMarketData.TradeOnlyMinuteBar.TotalTrades
FROM
    USEquityMarketData.TradeOnlyMinuteBar

The SQL tab displays the equivalent select statement generated. A subset of columns can be selected by passing them as arguments:

Python

stmt = dataset.select(
    dataset.TradeDate,
    dataset.Ticker,
    dataset.Volume,
)

SQL

SELECT
    USEquityMarketData.TradeOnlyMinuteBar.TradeDate,
    USEquityMarketData.TradeOnlyMinuteBar.Ticker,
    USEquityMarketData.TradeOnlyMinuteBar.Volume
FROM
    USEquityMarketData.TradeOnlyMinuteBar

It is often the case that we want to exclude a subset of columns from the query. This is done by passing them as a sequence to the exclude parameter:

Python

exclude_columns = (dataset.ASID, dataset.TotalTrades)
stmt = dataset.select(exclude=exclude_columns)

SQL

SELECT
    USEquityMarketData.TradeOnlyMinuteBar.TradeDate,
    USEquityMarketData.TradeOnlyMinuteBar.Ticker,
    USEquityMarketData.TradeOnlyMinuteBar.BarDateTime,
    USEquityMarketData.TradeOnlyMinuteBar.FirstTradePrice,
    USEquityMarketData.TradeOnlyMinuteBar.HighTradePrice,
    USEquityMarketData.TradeOnlyMinuteBar.LowTradePrice,
    USEquityMarketData.TradeOnlyMinuteBar.LastTradePrice,
    USEquityMarketData.TradeOnlyMinuteBar.VolumeWeightPrice,
    USEquityMarketData.TradeOnlyMinuteBar.Volume,
FROM
    USEquityMarketData.TradeOnlyMinuteBar

Aggregating rows with GROUP BY¶

Row aggregation is available through the group_by() method, which accepts a sequence of columns that are used perform the aggregation. Arbitrary aggregation functions can be created with SQLAlchemy func function generator:

Python

from sqlalchemy import func

c = dataset.get_column_handle()
stmt = (
    dataset
    .select(
        c.TradeDate,
        c.Ticker,
        func.avg(c.HighTradePrice).label("MeanHighPrice")
    )
    .group_by(c.TradeDate, c.Ticker)
)

SQL

SELECT
    USEquityMarketData.TradeOnlyMinuteBar.TradeDate,
    USEquityMarketData.TradeOnlyMinuteBar.Ticker,
    avg(USEquityMarketData.TradeOnlyMinuteBar.HighTradePrice) AS MeanHighPrice
FROM
    USEquityMarketData.TradeOnlyMinuteBar
GROUP BY
    USEquityMarketData.TradeOnlyMinuteBar.TradeDate,
    USEquityMarketData.TradeOnlyMinuteBar.Ticker,

Note that column aliasing is supported through the label method. It is important to check the SQL reference to see which aggregate functions are available. For ArdaDB, check the supported functions here. The HAVING clause is also supported through the having method:

Python

from sqlalchemy import func

c = dataset.get_column_handle()
stmt = (
    dataset
    .select(
        c.TradeDate,
        c.Ticker,
        func.min(c.HighTradePrice).label("MinHighPrice")
    )
    .group_by(c.TradeDate, c.Ticker)
    .having(func.min(c.HighTradePrice) > 1000.0)
)

SQL

SELECT
    USEquityMarketData.TradeOnlyMinuteBar.TradeDate,
    USEquityMarketData.TradeOnlyMinuteBar.Ticker,
    min(USEquityMarketData.TradeOnlyMinuteBar.HighTradePrice) AS MinHighPrice
FROM
    USEquityMarketData.TradeOnlyMinuteBar
GROUP BY
    USEquityMarketData.TradeOnlyMinuteBar.TradeDate,
    USEquityMarketData.TradeOnlyMinuteBar.Ticker,
HAVING
    min(USEquityMarketData.TradeOnlyMinuteBar.HighTradePrice) > 1000.0

Examples¶

This section contain a series of typical queries to showcase library usage patterns in real scenarios. These examples use seaborn and matplotlib and seaborn for plotting query results. These libraries can be installed with the following command:

pip install matplotlib seaborn

The examples work with the US Equities Primary Exchange Daily OHLC dataset. As a first step, we will load the dataset:

from sqlalchemy import func

from algoseek_connector import ResourceManager

manager = ResourceManager()
ardadb = manager.create_data_source("ArdaDB")
us_equity_market_data_group = ardadb.groups.USEquityMarketData.fetch()
daily_ohlc = us_equity_market_data_group.datasets.PrimaryOHLCDaily.fetch()
c = daily_ohlc.get_column_handle()

In the first example, top 5 traded symbols are queried from ArdaDB:

top5_stmt = (
    daily_ohlc.select(c.Ticker)
    .where(func.toYear(c.TradeDate) == 2023)
    .group_by(c.Ticker)
    .order_by(func.sum(c.ListedTotalTrades).desc())
    .limit(5)
)

example1_result = daily_ohlc.fetch(top5_stmt)
top5 = example1_result["Ticker"]
print(top5)

Using these symbols, we can easily create a time series plot using Pandas and matplotlib:

top5_trades_stmt = (
    daily_ohlc
    .select()
    .where(func.toYear(c.TradeDate) == 2023)
    .where(c.Ticker.in_(top5))
)
top_5_trades_df = daily_ohlc.fetch_dataframe(top5_trades_stmt)

sns.relplot(
    data=top_5_trades_df,
    x="TradeDate",
    y="ListedTotalTrades",
    hue="Ticker",
    kind="line",
    aspect=2,
)

../../_images/top5-ts.png — time series of top 5 most traded symbols.¶

A candle chart can also be created with this data:

ticker = "TSLA"
mask = top_5_trades_df.Ticker == ticker
df_ticker = top_5_trades_df[mask]

fig = plt.figure(figsize=(10, 4), constrained_layout=True)
gs = fig.add_gridspec(4, 4)
ax = fig.add_subplot(gs[0:3, :])
ax2 = fig.add_subplot(gs[3, :])


# set color based on bullish/bearish
color = pd.Series("tab:green", index=df_ticker.index).where(df_ticker.ClosePrice >= df_ticker.OpenPrice, "tab:red")

# candle shadow
ax.vlines(df_ticker.TradeDate, df_ticker.LowPrice, df_ticker.HighPrice, colors="tab:grey", alpha=0.5)

# candle body plot
ax.bar(
    x=df_ticker.TradeDate,
    height=df_ticker.loc[:, ["ClosePrice", "OpenPrice"]].max(axis=1)
    - df_ticker.loc[:, ["ClosePrice", "OpenPrice"]].min(axis=1),
    bottom=df_ticker.loc[:, ["ClosePrice", "OpenPrice"]].min(axis=1),
    width=1,
    color=color,
    alpha=1,
)

# volume plot
ax2.bar(
    x=df_ticker.TradeDate,
    height=df_ticker.ListedTotalVolume,
    width=1,
    color=color,
    alpha=1,
)

# axis config
ax.set_title(ticker)
ticks = ax.get_xticks()
ax.set_ylim(50, 250)
ax.set_xticks([])
ax2.set_xticks(ticks)
ax2.set_yticks([])
ax.set_xlim(left=pd.Timestamp("2023-01-01 00:00:00"), right=pd.Timestamp("2023-06-01 00:00:00"))
ax2.set_xlim(left=pd.Timestamp("2023-01-01 00:00:00"), right=pd.Timestamp("2023-06-01 00:00:00"))

../../_images/candles.png — Candle chart for a selected symbol.¶

Finally, we can also cross-reference data using reference datasets:

reference_group = ardadb.groups.USEquityReferenceData.fetch()

lookup = reference_group.datasets.LookupBase.fetch()

lookup_stmt = lookup.select(lookup.SecId).where(lookup.Ticker == top5[0])
lookup_query_result = lookup.fetch(lookup_stmt)["SecId"]
top1_secid = lookup_query_result[0]

print(f"The most traded equity in 2023 SecId is {top1_secid}")

master = reference_group.datasets.SecMasterBase.fetch()

top1_master_stmt = master.select().where(master.SecId == top1_secid)
top1_ref_data = master.fetch(top1_master_stmt)

print(top1_ref_data)

Navigation

Related Topics

Querying datasets¶

Columns and simple select statements¶

Filter rows with the WHERE clause¶

Aggregating rows with GROUP BY¶

Examples¶