September 27, 2017

Quick Intro

  • Machine Learning Engineer

  • Master's in applied mathematics and statistics from Johns Hopkins University

  • Bachelor's in pure mathematics from the University of Pennsylvania

  • Spent years working in the healthcare, finance, and telecommunications industries in roles spanning data analyst, software engineer, and data scientist

  • Talk is based on my graduate thesis

Problem Statement

How can we accurately predict trade volume?

  • many trading algorithms depend on volume
  • accurate volume predictions over a given interval allow traders to be more effective
  • volume prediction increases trading strategy capacity, controls trading risk

How can we understand the relationships between trade volume and price volatility?

  • new information arriving in the market causes agents such as hedgers and speculators to trade until prices reach a revised equilibrium, changing both price and trading volume
  • provides information on the efficiency of futures markets, which regulators can use when deciding on market restrictions

About the Data

  • Hundreds of millions of rows of futures trading in the form of order books
  • Comes from the Chicago Mercantile Exchange (CME)
  • Collected from May 2, 2016 to November 18, 2016
  • Raw data from the CME included extended hours trading
  • Fields included instrument name, maturity, date, time stamp, price, and quantity
  • The futures comprised 22 financial instruments spanning six markets
    • foreign exchange, metal, energy, index, bond, and agriculture
  • Trades were recorded roughly every half second
  • Product defined as instrument/maturity pair
    • 149 products in total
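
Since a product is just an instrument/maturity pair, enumerating products from trade records is a deduplication over that pair. A minimal sketch (the record layout and ticker names here are illustrative, not the actual CME schema):

```python
# Hypothetical trade records: (instrument, maturity, price, quantity).
# Field names and values are illustrative only.
trades = [
    ("ES", "2016-12", 2170.25, 5),
    ("ES", "2017-03", 2168.50, 2),
    ("GC", "2016-12", 1325.10, 1),
    ("ES", "2016-12", 2170.50, 3),
]

# A product is an instrument/maturity pair, so deduplicate on that pair.
products = {(instrument, maturity) for instrument, maturity, _, _ in trades}
print(len(products))  # 3 distinct products in this toy sample
```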

Raw Data


Processing Data

How can we efficiently process this overwhelming amount of data so that it is amenable to analysis?

We use Apache Spark to process the data

  • What is Spark?
    • Spark is a fault-tolerant and general-purpose cluster computing system providing APIs in Java, Scala, Python, and R
  • Single-node cluster setup
    • Linux server
    • 32 cores
  • Used the Python API, i.e., PySpark
  • Used SparkSQL for parsing and reformatting, treating the raw data as a table
  • Data became manageable to visualize with pure Python and analyze in R
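
One way the SparkSQL step might look. The raw-record layout and function names here are assumptions for illustration; only the listed fields (instrument, maturity, date, timestamp, price, quantity) come from the data description above.

```python
def parse_row(line):
    """Parse one comma-separated raw record into typed fields.
    The comma-separated layout is an assumption for illustration."""
    instrument, maturity, date, ts, price, qty = line.strip().split(",")
    return (instrument, maturity, date, ts, float(price), int(qty))

def daily_volume(spark, path):
    """Aggregate traded quantity per product (instrument/maturity) per day,
    treating the parsed raw data as a SQL table."""
    cols = ["instrument", "maturity", "date", "ts", "price", "qty"]
    rows = spark.sparkContext.textFile(path).map(parse_row)
    spark.createDataFrame(rows, cols).createOrReplaceTempView("trades")
    return spark.sql("""
        SELECT instrument, maturity, date, SUM(qty) AS volume
        FROM trades
        GROUP BY instrument, maturity, date
    """)
```

On a cluster one would obtain the session with `SparkSession.builder.getOrCreate()` and write the aggregate back out, e.g. `daily_volume(spark, "raw_trades.csv").write.csv("daily_volume")`.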

Regression Data Set

Exploratory Data Analysis

Checking for Normality
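
As a complement to the diagnostic plots, a minimal sketch of a moment-based normality check (a Jarque-Bera style statistic) in pure Python, assuming the volume series is simply a list of floats; the statistic itself is standard, but its use here is illustrative, not the thesis's exact test:

```python
import random

def jarque_bera(xs):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4).

    S is sample skewness, K is sample kurtosis. Large values indicate
    a distribution far from normal (compare against a chi-squared(2)
    critical value, e.g. 5.99 at the 5% level).
    """
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

random.seed(0)
normal_like = [random.gauss(0, 1) for _ in range(2000)]
skewed = [random.expovariate(1.0) for _ in range(2000)]
# The skewed (exponential) sample should score far higher than the
# approximately normal one.
print(jarque_bera(normal_like), jarque_bera(skewed))
```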

Checking Normality at Market Level

Day Effects at Market Level
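
One simple way day-of-week effects might be examined: bucket per-day volumes by weekday and compare the group means. The `(date, volume)` record layout is an assumption for illustration:

```python
from collections import defaultdict
from datetime import date

def mean_volume_by_weekday(records):
    """Average volume per weekday (0 = Monday ... 4 = Friday)."""
    buckets = defaultdict(list)
    for d, volume in records:
        buckets[d.weekday()].append(volume)
    return {wd: sum(v) / len(v) for wd, v in buckets.items()}

# Toy records; May 2, 2016 (the start of the collection window) was a Monday.
records = [
    (date(2016, 5, 2), 100.0),  # Monday
    (date(2016, 5, 3), 140.0),  # Tuesday
    (date(2016, 5, 9), 120.0),  # Monday
]
print(mean_volume_by_weekday(records))  # {0: 110.0, 1: 140.0}
```

A formal version of this comparison would test whether the weekday means differ significantly (e.g. via ANOVA) rather than eyeballing the averages.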