Aug. 20, 2018: The application of Machine Learning (ML) has advanced substantially in recent years, driven by several industry developments that have changed how these techniques are used in risk management and beyond. This primer covers the key drivers behind the high adoption rates, some of the techniques, and how to assess their utility within credit risk.
First, data has expanded along several dimensions: size, velocity and variety. At the same time, the ability to record, store, combine and process large datasets from many disparate sources has improved across the board. This applies not only to traditional sources but also to alternative data, which has fueled the need to extract information value from these sources. A side effect of this expansion, however, is an elevated level of data pollution that must be contended with: datasets that are noisy, conflicting or difficult to link.
Second, easier access to computational efficiency has changed how Machine Learning techniques are integrated. This comes both from hardware that can run specialized operations at large scale and from programming language enhancements that have moved toward functional programming. Languages such as R have become hubs for numerical computing using functional programming, building on a long history of providing numerical interfaces to computing libraries. Supervised and unsupervised algorithms allow data scientists to turn these datasets into actionable insights with relative ease, using code that runs on inexpensive hardware.
Third, reproducible research and analysis have been widely adopted by the data science community. Reproducibility is a set of principles for conducting quantitative, data science driven analysis, under which the data and code that lead to a decision or conclusion can be replicated efficiently and clearly.
Finally, the pervasiveness of open source libraries, packages and toolkits has opened doors for the community to contribute through teams of specialists who share code bases and package them into simple, modular functions.
ML Techniques in Risk and Considerations in Their Application
The typical phases of applying ML within a Risk context include the following pipeline:
Fig 1: Generalized Machine Learning Pipeline
Assessing which ML techniques to use and when is an important step that needs to be done thoughtfully with the target context in mind. There is no prescriptive method that is purely tied to a particular class of algorithms; the risk context always needs to be kept in mind in order to assess the tradeoffs.
A simple example to consider is the bias-variance tradeoff. Variance reflects how unstable the model is with respect to changes in its inputs. For example, if small changes to the data result in big changes to the model, then the technique has high variance. Bias reflects how far the model systematically departs from the underlying pattern; a low-bias model shows high fidelity to that pattern. See Fig 2 for a simple example of this.
Fig 2: Demonstration of model fit comparison visualization
In the above figure we see that Random Forest exhibits low bias but high variance on this dataset. Quadratic Regression exhibits low variance but high bias on the same dataset. Nonlinear regression, in this trivial example where the data generating process is known ex ante, appears to achieve both low bias and low variance and to provide an appropriate fit. In the real world, however, finding the sweet spot between over-fitting and under-fitting is less trivial; it requires well-defined model selection criteria and exploration of different levels of model complexity. The key takeaway is that none of these techniques is categorically wrong; the choice depends on the tradeoffs we must make to get as close as possible to both low bias and low variance. We need the model to adapt as the real world changes and, ideally, to contend with polluted information under minimal supervision while remaining as transparent as possible. These are competing objectives that need to be weighed within the applied risk domain.
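The bias and variance of a technique can be estimated by refitting it on many resampled training sets and comparing its predictions with a known pattern. The following is a minimal Python sketch under stated assumptions, not the analysis behind Fig 2: the sine-shaped data generating process, the noise level, and the two stand-in models (a constant mean predictor for a rigid model, a 1-nearest-neighbor predictor for a flexible one) are all hypothetical choices made for illustration.

```python
import math
import random
import statistics

def true_fn(x):
    # Assumed (hypothetical) data generating process.
    return math.sin(x)

def sample_training_set(rng, n=30, noise_sd=0.3):
    # Draw one noisy training set from the assumed process.
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [true_fn(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_mean(xs, ys):
    # Rigid stand-in model: always predicts the training mean.
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    # Flexible stand-in model: predicts the target of the nearest training point.
    def predict(x):
        nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[nearest]
    return predict

def bias_variance(fit, test_xs, n_repeats=300, seed=0):
    # Refit on many training sets and collect predictions at fixed test points.
    rng = random.Random(seed)
    preds = [[] for _ in test_xs]
    for _ in range(n_repeats):
        model = fit(*sample_training_set(rng))
        for k, x in enumerate(test_xs):
            preds[k].append(model(x))
    # Squared bias: gap between the average prediction and the true value.
    bias_sq = statistics.fmean(
        [(statistics.fmean(p) - true_fn(x)) ** 2 for p, x in zip(preds, test_xs)]
    )
    # Variance: spread of predictions across training sets.
    variance = statistics.fmean([statistics.pvariance(p) for p in preds])
    return bias_sq, variance

TEST_XS = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]

if __name__ == "__main__":
    for name, fit in [("mean", fit_mean), ("1-NN", fit_1nn)]:
        b, v = bias_variance(fit, TEST_XS)
        print(f"{name}: bias^2={b:.3f} variance={v:.3f}")
```

In this toy setting the mean predictor should show the larger squared bias and the nearest-neighbor predictor the larger variance, which mirrors the contrast the figure draws between a rigid and a flexible model.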
Within a risk scoring context, Fig 3 gives a simple example of how the tradeoff between supervision and complexity can be communicated to the business.
Fig 3: Supervision and Complexity Trade-offs
Here we see that given the characteristics of the dataset, there is a trade-off between coupling the model with the data and the level of transparency of the ultimate model.
Another application of ML in credit risk is within sentiment analysis. A generalized sentiment analysis pipeline is provided below:
Fig 4: Generalized Sentiment Analysis Pipeline
Sentiment analysis methods can generally be split into deterministic models that rely on a dictionary (bag of words) and neural network models that typically involve deep learning. Sentiment analysis can be further divided into ‘classification’ and ‘attribution’. Given a target variable, classification assigns a sentiment polarity label to a particular article, whereas attribution identifies the segments within articles that are actually relevant to, and would impact, the target variable.
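The dictionary-based approach can be illustrated with a short Python sketch. It is a toy under stated assumptions: the word lists are a hypothetical lexicon invented for illustration, not a published sentiment dictionary, and the function names are likewise hypothetical. The sentence-level function is only a crude stand-in for attribution, since it flags sentences containing lexicon words rather than testing their relevance to a target variable.

```python
import re

# Hypothetical toy lexicon; a real application would use a curated dictionary.
POSITIVE = {"upgrade", "growth", "profit", "strong", "improved"}
NEGATIVE = {"default", "downgrade", "loss", "weak", "bankruptcy"}

def tokenize(text):
    # Lowercase the text and keep alphabetic tokens only (bag of words).
    return re.findall(r"[a-z]+", text.lower())

def polarity_score(text):
    # Net count of positive minus negative words, normalized by token count.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

def classify(text, threshold=0.0):
    # Classification: one polarity label for the whole article.
    score = polarity_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

def attribute(text):
    # Crude attribution stand-in: sentences that carry nonzero polarity.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = [(s, polarity_score(s)) for s in sentences if s]
    return [(s, score) for s, score in scored if score != 0.0]

if __name__ == "__main__":
    article = "Revenue was flat. The firm faces bankruptcy."
    print(classify(article))
    print(attribute(article))
```

A deterministic scorer like this is fully transparent and needs no training data, but it cannot handle negation or context; that limitation is the kind of tradeoff against more complex neural network models that the text describes.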
Fig 5: Usage in Sentiment Analysis
Once again we see considerable tradeoffs between supervision and complexity. Depending on the risk context, any of these techniques could be applicable.
We have covered the key drivers of ML adoption within a credit risk context and shown a few simple examples of its uses. It is important to consider the tradeoffs, which depend largely on the final application. ML techniques are a complementary class of tools, but they are not a panacea for every use case within credit risk. Ultimately, being able to communicate to a business audience their value and why they are being used in a given context is of critical importance.
Senior Director – Innovation & Product Research
S&P Global Market Intelligence
Head of Relationship Management, Americas
S&P Global Market Intelligence
All figures are for illustrative purposes only. Source: S&P Global Market Intelligence as of July 2018. Content including credit-related and other analyses are statements of opinion as of the date they are expressed and are not statements of fact, investment recommendations or investment advice. S&P Global Market Intelligence and its affiliates assume no obligation to update the content following publication in any form or format.
The authors would like to express their thanks to Max Kuhn and Jonathan Regenstein from R Studio who provided their expertise and input into the article contents. R Studio is not affiliated with S&P Global or its divisions.