
Does Volume And Length Of Data Impact Model Performance?

Tian Khean Ng


In our models (and in traditional Technical Analysis), the Input variables are the Open, High and Low prices, and the Output is the Close price. Another variable that is sometimes added is Volume, which measures the trading liquidity of the ETF. To construct models, we also need to specify the length of data we will use, expressed as a number of trading days. This article examines whether adding Volume as an Input variable has an impact on the performance of the model, and what a suitable length of data for modeling is.


The ETFs we will use for our study


We will choose two ETFs, one from each of the lists of 100 Best Performing ETFs for 52 weeks and 100 Worst Performing ETFs for 52 weeks as published by ETFDB.com. We will exclude leveraged and inverse ETFs as well as those with low liquidity (low average daily trading volume). Using these criteria, our selected ETFs are:


Best Performance: BITO ProShares Bitcoin Strategy ETF (an actively managed ETF that does not track any market index)

Worst Performance: UNG United States Natural Gas Fund LP



Mutual Information

To get an indication of whether Volume has a significant impact on your model’s performance (the prediction of Close x days ahead), we can use a statistical measure called Mutual Information.


Mutual Information (MI) is a measure used in statistics to quantify the amount of information obtained about one random variable through observing another random variable. It is a key concept in information theory and is particularly useful in understanding the dependency between variables.


Quantification of Dependency: Mutual Information measures the reduction in uncertainty about one variable given the knowledge of another. It provides a numerical value indicating how much knowing one variable reduces uncertainty about the other.

Mutual Information is always non-negative. A value of zero indicates that the variables are independent (no information about one variable is gained from the other), while higher values indicate greater dependency.
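As a rough illustration of the definition above, the sketch below estimates Mutual Information by binning two series into a histogram and summing p(x, y) · log(p(x, y) / (p(x) · p(y))) over the occupied cells. It is a minimal sketch and not the method used to produce the tables in this article: the function name `mutual_information`, the choice of 8 bins, and the synthetic Open, Close and Volume series are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of mutual information, in nats."""
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant series
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(x)
    joint = Counter(zip(bx, by))      # counts for each (x-bin, y-bin) cell
    px, py = Counter(bx), Counter(by)  # marginal counts
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i, j) * log(p(i, j) / (p(i) * p(j))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[i] * py[j]))
    return mi


# Synthetic data for illustration only (not real BITO or UNG prices).
random.seed(0)
close = [100.0]
for _ in range(499):
    close.append(close[-1] + random.gauss(0, 1))
open_ = [c + random.gauss(0, 0.3) for c in close]          # hypothetical Open, closely tied to Close
volume = [random.lognormvariate(10, 0.5) for _ in close]   # hypothetical Volume, independent of Close

mi_open = mutual_information(open_, close)
mi_volume = mutual_information(volume, close)
print(f"MI(Open, Close)   = {mi_open:.3f}")
print(f"MI(Volume, Close) = {mi_volume:.3f}")
```

In this synthetic setup the Volume series is generated independently of Close, so its estimated Mutual Information should come out far smaller than that of Open. With real data, the result depends on the number of bins and on the estimator chosen.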


In the two Mutual Information Tables below, we can see that Volume has less than half the mutual information with Close that Open, High and Low have. Thus, adding Volume as an input variable does not significantly reduce the uncertainty in the model, i.e. it will have no significant impact on the accuracy of the prediction of future values of Close.

 

 

BITO Mutual Information Table


UNG Mutual Information Table


What is a suitable length of data to use in your modeling?

When modeling financial instruments using their prices, we intuitively know that there is no point in using a data length of many hundreds of data points. In modern financial markets, investors have access to real-time information. The market reacts almost instantaneously to news, and information has a short shelf-life: within a short period of time it becomes outdated and irrelevant. On the other hand, we need a data length that is able to capture the various trading ‘situations’ experienced by the ETF we are modeling. In modeling terms, this means we want to capture more features that the model can learn from.


Most quantitative modelers do not use more than 200 data points for their models. For short-term investing models, 200 data points (200 trading days) are sufficient. We can demonstrate this with the ACF (Autocorrelation Function) chart.


The ACF plot measures the correlation between the time series data and its lagged values. It tells you how the current value is related to its past values at different lags.

As the lag increases, the impact of past values on the current value diminishes. Below are the ACF charts of UNG and BITO.

 

 

UNG ACF Chart



BITO ACF Chart



The BITO ACF chart shows that at a lag of 24 days, the autocorrelation drops below the UCI (Upper Confidence Interval) red line, which means that any autocorrelation beyond 24 days could be due to random noise.


The UNG ACF chart shows that at a lag of 18 days, the autocorrelation drops below the UCI (Upper Confidence Interval) red line, which means that any autocorrelation beyond 18 days could be due to random noise.
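The cut-off reading described above can be sketched in a few lines: compute the sample autocorrelation at each lag and report the first lag at which it falls below an upper confidence bound. This is a minimal sketch under stated assumptions, not the procedure behind the charts in this article. The bound used here is the simple large-sample value 1.96/√n for white noise (charting packages may draw a different bound), the helper names `acf` and `first_lag_below_bound` are made up for the example, and the input is a synthetic autoregressive series rather than real UNG or BITO prices.

```python
import math
import random


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def first_lag_below_bound(series, max_lag=60):
    """First lag (>= 1) whose autocorrelation is under the approximate 95% bound, else None."""
    bound = 1.96 / math.sqrt(len(series))  # assumed white-noise bound
    for lag, r in enumerate(acf(series, max_lag)):
        if lag > 0 and r < bound:
            return lag
    return None


# Synthetic AR(1) series standing in for 200 trading days (illustration only).
random.seed(1)
prices = [0.0]
for _ in range(199):
    prices.append(0.9 * prices[-1] + random.gauss(0, 1))

cutoff = first_lag_below_bound(prices)
print("Autocorrelation first drops below the bound at lag:", cutoff)
```

Applied to a real series of daily Close prices, the returned lag plays the role of the 24-day and 18-day points read off the charts, although the exact number depends on how the confidence bound is computed.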


So, there is no point in having too much data for your models.
