Before a quantitative model can be constructed, three processes must be carried out: partitioning, rescaling, and scoring.
PARTITIONING

In the Table above, we can see that the 162 records of FXI have been partitioned into 81 for training, 49 for validation, and 32 for testing. This follows a 50:30:20 proportion, a ratio commonly used by quantitative modelers.
Partitioning the dataset into three distinct sets—training, validation, and testing—is essential for the following reasons:
Training Set: This is the portion of the data used to train the model. The model learns the patterns, features, and relationships from this data. The aim is for the model to generalize these patterns so that it can make accurate predictions.
Validation Set: After training, we need to tune the model's parameters (like weights or learning rates). The validation set is used to assess the model's performance and make adjustments. This helps prevent the model from simply memorizing the training data and encourages it to generalize better to new, unseen data.
Testing Set: This set is used to evaluate the final performance of the model. It is crucial that the testing data remains unseen by the model during training and validation phases. This set provides an unbiased evaluation of the model's ability to generalize to new data, ensuring that the performance metrics are accurate and reliable.
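The 50:30:20 split described above can be sketched as follows. This is a minimal illustration: the split is sequential (earlier records train, later records test), which is the usual practice for financial time series, and the array here is a stand-in for the 162 FXI records, not actual price data.

```python
import numpy as np

def partition(data, train_frac=0.5, val_frac=0.3):
    """Split a time series sequentially into training, validation, and testing sets."""
    n = len(data)
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    # Remaining records (here ~20%) form the testing set
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

records = np.arange(162)  # stand-in for 162 FXI records
train, val, test = partition(records)
print(len(train), len(val), len(test))  # 81 49 32
```

With 162 records, the 50:30:20 ratio reproduces the 81/49/32 counts shown in the Table.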
RESCALING

The variables in a model may be of different types and magnitudes. For example, if we were to add Volume to our price variables (inputs = Open, High, Low; output = Close), the Volume column would have vastly larger values. Or, if we wish to add Assets Under Management (AUM) as an input, it will be in millions of dollars. So we need to rescale the data to a common scale. For financial time series, there are two types of rescaling that can be used. (Our models use Standardization.)
1. Standardization
Explanation: Standardization involves transforming data so that it has a mean of zero and a standard deviation of one. This is done by subtracting the mean and dividing by the standard deviation of the data.
Advantages:
Helps in handling features with different scales, making the data more comparable.
Useful for algorithms that assume normally distributed data (e.g., linear regression).
Less distorted by outliers than min-max scaling, since a single extreme value does not define the bounds of the transformed range.
Disadvantages:
Does not bind data within a fixed range, which might be required for certain algorithms.
Still sensitive to outliers: extreme values affect the mean and standard deviation, leading to a skewed transformation.
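A minimal sketch of standardization as described above: subtract the mean and divide by the standard deviation, yielding z-scores with mean zero and unit standard deviation. The volume figures are illustrative values, not actual FXI data.

```python
import numpy as np

def standardize(x):
    """Return z-scores: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std()

volume = np.array([1_200_000.0, 900_000.0, 1_500_000.0, 1_100_000.0])
z = standardize(volume)
# z now has mean ~0 and standard deviation ~1, so Volume is
# comparable in scale to standardized price columns.
```

In practice the mean and standard deviation should be computed on the training set only and then applied to the validation and testing sets, so that no information leaks from unseen data.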
2. Normalization
Explanation: Normalization scales the data to a fixed range, typically [0, 1] or [-1, 1], by subtracting the minimum value and dividing by the range (maximum value minus minimum value).
Advantages:
Useful when the data has different scales and needs to be brought to a common scale for comparison.
Keeps the relative distribution of the data intact while constraining it within a fixed range.
Suitable for algorithms that require bounded input values, like neural networks and K-nearest neighbors.
Disadvantages:
Sensitive to outliers; a single extreme value can significantly affect the min-max range and skew the scaling.
May not be suitable for data that is not uniformly distributed, as it can distort the overall data structure.
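For comparison, min-max normalization as described above can be sketched as follows (again with illustrative numbers, not actual FXI closes):

```python
import numpy as np

def normalize(x):
    """Min-max scale to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

close = np.array([27.5, 28.1, 26.9, 27.8])
scaled = normalize(close)
# The minimum value maps to 0.0 and the maximum to 1.0;
# all other values keep their relative positions in between.
```

Note how one extreme value would stretch the denominator and compress all other values toward 0, which is the outlier sensitivity noted above.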
SCORING

After a model has been run, we need to score it to see whether its performance meets our expectations. Scoring is done on testing data for the reasons stated above under Partitioning, and also to compare different types of models. There is no single fixed model to be applied; we may find, for example, that while Boosting ensembles work best for FXI, Neural Nets perform better on SPY.
There are many metrics used for scoring (see Table above); in our models we use MAD (Mean Absolute Deviation) for the following reasons.
Simplicity and Intuitiveness: MAD is straightforward to understand and interpret. It provides a clear indication of the average error in predictions, making it easily communicable to stakeholders who may not have a technical background. Use Case: It is useful when you need a simple measure of error magnitude without involving complex calculations.
Not Sensitive to Outliers: Unlike squared-error metrics like Mean Squared Error (MSE), MAD is less sensitive to outliers because it does not square the errors. This makes MAD a more robust measure when the presence of outliers is expected or when outliers do not represent significant issues in the decision-making process. Use Case: MAD is suitable when the goal is to understand general error patterns rather than being disproportionately influenced by large errors.
Interpretability in Original Units: MAD expresses the error in the same units as the data itself, making it directly meaningful without further conversion. Use Case: MAD is well suited for comparing different models on the same dataset, since the model with the lower MAD makes smaller errors on average.
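Computing MAD as the mean absolute deviation between predicted and actual values can be sketched as follows; the prices are illustrative, not actual model output.

```python
import numpy as np

def mad(actual, predicted):
    """Mean absolute deviation between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted))

actual    = [27.5, 28.1, 26.9, 27.8]
predicted = [27.2, 28.4, 27.0, 27.5]
score = mad(actual, predicted)  # absolute errors 0.3, 0.3, 0.1, 0.3 -> MAD = 0.25
```

Because the errors are not squared, the single smallest error (0.1) and the repeated 0.3 errors contribute in proportion to their size, which is the robustness to outliers noted above.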