Big Data Analysis of Futures

September 27, 2017

Quick Intro

Machine Learning Engineer
Masters in applied mathematics and statistics from Johns Hopkins University
Bachelors in pure mathematics from the University of Pennsylvania
Spent years working in Healthcare, Finance, and Telecommunications industries in roles spanning data analyst, software engineer, and data scientist
Talk is based on my graduate thesis

Problem Statement

How can we accurately predict trade volume?

many trading algorithms depend on volume
accurate volume predictions over a given interval allows traders to be more effective
volume prediction increases trading strategy capacity, controls trading risk

How can we understand the relationships between trade volume and price volatility?

new information on the market causes agents such as hedgers and speculators to trade until prices reach a revised equilibrium which then changes price and trading volume
provides information on the efficiency of futures markets which regulators can then use to decide upon market restrictions

About the Data

Hundreds of millions of rows of futures trading in the form of order books
Comes from the Chicago Mercantile Exchange (CME)
Collected from May 2, 2016 to November 18, 2016
Raw data from the CME included extended hours trading
Fields included instrument name, maturity, date, time stamp, price, and quantity
Futures were comprised of 22 financial instruments spanning six markets
- foreign exchange, metal, energy, index, bond, and agriculture
Trades were recorded roughly every half second
Product defined as instrument/maturity pair
- 149 products in total

Raw Data

Methods

Processing Data

How can we efficiently process the overwhelming amount of data such that it is amenable to analysis?

We use Apache Spark to process the data

What is Spark?
- Spark is a fault-tolerant and general-purpose cluster computing system providing APIs in Java, Scala, Python, and R
One node cluster set up
- Linux Server
- 32 cores
Used Python API, i.e. Pyspark
Used SparkSQL for parsing and reformating, treating raw data as table
Data became manageable to visualize with pure python and analyze in R

Regression Data Set

Exploratory Data Analysis

Checking for Normality

Checking Normality at Market Level

Day Effects at Market Level

Day of Month Effects at Market Level

Time to Maturity Effects at Market Level

Time to Maturity Effects at Instrument Level

Regression

Spline Prediction

What is a spline?

Spline basis functions are piecewise polynomials used in fitting curves which are linear in terms of the basis function.
Cubic splines in particular have been found to have nice properties with good ability to fit nonlinear curves.
Cubic splines are made to be smooth at the knots, endpoints of intervals on the x-axis
- Force first and second derivatives of the function to agree at knots

Use Penalized Spline in GAM

Called a smoothing spline
Penalty is integrated second derivative
Cost function

\(\sum_{i=1}^N [y_i-f(x_i)]^2 + \lambda \int f''(t)^2 dt\)
GAM = Generalized Additive Model
- Instead of:
  \(E[Y|X_1, X_2,...,X_p]=\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_pX_p\)
- We have:
  \(g(E[Y|X_1, X_2,...,X_p])=\beta_0+f(X_1)+f(X_2)+...+f(X_p)\), where g is the link function.

Models

One Model Paradigm: One model for all markets
Multiple Model Paradigm: One model for each market
Primes are before instrument is added
- We want to know if knowing the instrument makes model better
- "s" denotes spline basis function on predictor variable

Regression Results

Spline Curves

Day of month spline curves reinforce patterns seen in the exploratory analysis visualizations of essentially constant daily trade volume
Also see anticipated periodicity, but shows a dip around day 8 of 30 for all markets with the exception of FX.
The time to maturity spline curve for bond serves as reinforcement of the differences given that the curve begins to rise sharply around 150 days

Comparing All Models at Market Level

Consider the best model
Agriculture, energy,and index
- Relative errors of 0.182, 0.113, and 0.164 respectively
Metal, bond, and FX
- Relative errors of 0.379, 0.254, and 0.301 respectively

Comparing Models at Instrument Level

Relative error is under .20 for most instruments.
Exceptions are both metals (GC and SI) as on the market level, both FX instruments (6E and GE) as on the market level, and three bonds: 30 Day Fed Fund (ZQ), 30 Yr U.S. Treasury Bond (ZB), and 10Y Treasury Note (ZN).
Relative errors are 0.360, 0.405, 0.2676, 0.459, 0.257, 0.266, and 0.227 respectively.

Relationship Between Trade Volume and Price Volatility

Trade Volume Data Set

Price Volatility Data Set

volatility measured with standard deviation

Consistency of Correlations

We want to know if correlation doesn't change with time
Product must have at least 30 data points
Look at hourly data and calculate correlation between trade volume and price volatility.
Regress correlations on day
Test for \(\beta_1=0\)
Correct for multiple testing error using B-H Procedure

Look at Active Trading period

Relationship Results

Significant Products at FDR=.2

5 year treasury note (ZF)
Rbob gasoline (RB)
30 Day Fed Fund (ZQ)
Mini Dow Jones index (YM)
Heating oil (HO)

Significant Correlation Plots with Best Fit Line

Looking at Deemed Active Period

Dashes denote 60th-90th volume percentiles

Anything Truly Interesting?

ZQ and 5 year treasury note (ZF) showing idealized pattern of dense population above zero.
Moreover, none of the products manifest a sharp increase, which typically delineates low and high volumes.
We can safely exclude the effects of nascent and near maturity periods from all but Rbob gasoline (RB) since the selected area of this product is still very close to maturity.
Non-constant correlation significance is legitimate for ZQ, mini Dow Jones index (YM), heating oil (HO), and ZF
These products price volatilities will not be well predicted by trade volume over time because their correlation does not remain constant.

Checking Daily and Hourly Correlations

This is what we expect based on past research.

Positive correlation
Consistency accross time intervals
Most products manifest both

Deviating Products

Deviating Products Cont'd

What we Conclude

Apache Spark is an efficient way to process massive datasets.
Using a penalized regression spline is an effective way of predicting trade volume, granted products are similar and we have large set of data to fit models.
The covariates most indicative of trade volume are time to maturity and instrument name
Penalized regression splines are a useful tool for tasks such as building trading algorithms, improving traders' effectiveness, and controlling trading risk.

What we Conclude Cont'd

Predicting price volatility from trade volume is viable in the majority of cases, as long as forecasting is made over a trade's active period since correlations observed are constant over time.
Theory behind speculator and hedger behavior since positive correlation between trade volume and price volatility, validating
Positive correlation between trade volume and price volatility is not maintained across all time increments.
- Positive correlation is mainly observed in measurements made daily, but for the same product, the correlation is not necessarily observed hourly.
In cases of product seasonality, clusters exist and negative correlations may be seen for trades maturing in six or more months in the future.

Acknowledgements

Thanks to my graduate advisor, Professor Daniel Naiman.

Thanks also to the Johns Hopkins University Acheson J. Duncan Fund for their support of this research.