AI, Machine Learning and Big Data in Financial Services | Chapter 4 | Part 2

Emerging risks from the use of AI/ML/Big Data and possible risk mitigation tools

3.4. Explainability 

Explainability is a significant challenge for Machine Learning models. It is often difficult to understand why and how a model generates its results, and this difficulty is commonly referred to as a lack of 'explainability'. AI-based models are inherently complex, making understanding how they work even harder. Some market players intentionally conceal the mechanics of their AI models to protect their intellectual property, further reinforcing the lack of explainability. The gap in technical literacy among consumers also exacerbates this issue, as access to the underlying code is often insufficient to explain the mechanics of the model. Additionally, the complexity of AI models does not always align with the demands of human-scale reasoning or interpretation, further limiting our ability to understand these models (Burrell, 2016). 

The lack of explainability of ML models is a significant concern for users and supervisors alike regarding AI applications. This is especially true for AI-powered approaches in finance, which are becoming increasingly opaque. Even if the underlying mathematical principles of such models can be explained, they still lack 'explicit declarative knowledge', which can further erode the trust of financial consumers and regulators/supervisors (Holzinger, 2018). Therefore, improving the explainability levels of AI applications is crucial to maintaining trust in critical financial services (FSB, 2017). From an internal control and governance perspective, it is essential to ensure a minimum level of explainability so that a model committee can analyse a model brought before it and be comfortable with its deployment.

The lack of explainability of algorithms can conflict with existing regulations that require the underlying logic of a decision to be interpreted and reported in an intelligible way. Examples of rules that require algorithms to be understandable and explainable throughout their lifecycle include IOSCO requirements and the GDPR, which grant citizens the right to an explanation for decisions made by algorithms and to information on the logic involved, for instance in credit decisions or insurance pricing. In addition, the use of machine learning for calculating regulatory requirements like risk-weighted assets (RWA) for credit risk must also be explainable or at least subject to human oversight and judgement, as required by the Basel Framework for Calculation of RWA for credit risk – Use of models 36.33. 

The lack of transparency in ML-driven models utilised by financial market players may pose a significant risk on a macro level if not overseen correctly by micro-prudential supervisors. It becomes increasingly difficult for both parties to anticipate how these models could affect the market, thereby amplifying systemic risks related to pro-cyclicality. This is mainly because there is an elevated risk of herding and convergence of strategies among users of third-party provider models. Furthermore, without a good grasp of the underlying mechanics of a model, users have limited knowledge of how their models can impact market conditions and whether they contribute to market shocks. This situation makes it challenging for users to adjust their strategies in times of poor performance or moments of stress, leading to potential episodes of increased market volatility and bouts of illiquidity during periods of acute stress, exacerbating flash crash-like events. Additionally, without a solid understanding of the detailed mechanics behind a model, there are risks of market manipulation, such as spoofing and tacit collusion.

Financial market practitioners using AI-powered models are facing increased scrutiny regarding the explainability of their models. In response to this heightened attention, many market participants are working to improve the explainability of their models. This will allow them to understand better how the models behave in normal market conditions and in times of stress and to manage associated risks more effectively. However, incorporating explainability by design into AI mechanisms is challenging due to several factors. First, the audience may be unable to grasp the model’s logic. Second, some models, such as neural networks, are impossible to comprehend fully. Third, fully revealing the mechanism could mean giving away intellectual property. 

Explainability in AI raises a debate on whether it should be treated differently from the explainability required for other complex mathematical models in finance. There is a concern that AI applications are held to a higher standard and subjected to a more demanding explainability requirement than other technologies. This, in turn, might hinder innovation in the field. Instead, explainability analysis at the committee level should focus on the underlying risks the AI model might expose the firm to and whether these risks can be managed. The mathematical promise of the model should take a back seat in this analysis.

It’s essential to balance the model’s explainability with its accuracy and performance to ensure that financial services providers achieve the best results from their models. By providing some insight into the model’s workings and underlying logic, firms can prevent the model from being considered a “black box” and comply with regulatory requirements. This can also help build trust with consumers. It’s worth noting that some jurisdictions, such as Germany, do not accept completely opaque models that lack any degree of explainability.

It is important to note that there is no one-size-fits-all approach for explaining ML models, and the level of explainability required will depend on the context (Brainard Lael, 2020) (Hardoon, 2020). When assessing the interpretability of the model, it is necessary to consider who is asking the question and what the model is predicting. Moreover, ensuring the explainability of the model alone does not guarantee its reliability (Brainard Lael, 2020). Therefore, contextual alignment of explainability with the audience needs to be complemented with a shift of focus towards 'explainability of the risk', which means understanding the resulting risk exposure from using the model instead of the methodology underlying the model. The UK Information Commissioner's Office has recently issued guidance on using five contextual factors – domain, impact, data used, urgency, and audience (see Box 4.3) – to determine the type of explanation needed (UK Information Commissioner's Office, 2020).
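As one concrete illustration of model-agnostic explainability tooling of the kind this discussion refers to, the sketch below uses permutation importance: each input feature is shuffled in turn and the drop in held-out accuracy indicates how much the model actually relies on that feature. This is a minimal sketch assuming scikit-learn is available; the dataset and model are made up for illustration and are not from the report.

```python
# Illustrative sketch: model-agnostic explainability via permutation importance.
# The dataset and model choices here are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test accuracy:
# large drops flag the features the model actually relies on.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```

Techniques of this kind explain *which* inputs drive outcomes without requiring access to the model's internals, which is one pragmatic answer to the audience-dependent explainability problem described above.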

Box 4.3. Guidance by the UK Information Commissioner's Office

3.4.1. Auditability of AI algorithms and models 

The complexity of “black box” models presents challenges in the transparency and auditing of such models in many financial services use cases, such as lending. Auditing a machine learning model is impossible if one cannot break down the model’s outcome into its underlying drivers. The lack of explainability hinders the supervisor from following the process that led to the model outcome, reducing the possibility of auditing. Many laws or regulations in some jurisdictions have been created based on a basic expectation of auditability and transparency, which may not be easily achieved when AI-powered models are used. Audit trails can only be followed if there is evidence of the sequence of activities or processes, which is limited by the lack of interpretability of some AI models. Since decisions made by such models no longer follow a linear process, and the models themselves have limited interpretability, there is a need to improve the explainability of AI outcomes to ensure accountability and robust governance dynamics in AI-based systems. Research efforts are underway to improve the interpretability of AI-driven applications and make machine learning models more amenable to ex-ante and ex-post inspection in academia and the industry. 

These efforts to make machine learning models more easily inspected both before and after deployment are being carried out by academic researchers, such as Vellido, Martín-Guerrero and Lisboa (2012), as well as by the industry. 

3.4.2. Disclosure 

Transparency and responsible disclosure are essential to ensure that people understand AI-based outcomes and can challenge them, according to the OECD AI Principles. The opacity of algorithm-based systems can be addressed by providing clear information about the AI system’s capabilities and limitations, as argued by the European Commission. Separate disclosure about using AI systems in product delivery and interaction can inform consumers’ choices and help them make conscious decisions among competing products, such as robo-advisors.

It is still unclear what level of disclosure should be provided to investors and financial consumers regarding the use of AI. Market regulators suggest that transparency should vary depending on the investor type (retail vs. institutional) and the area of implementation (front vs. back office). Suitability requirements for the sale of investment products could help firms evaluate whether their clients understand how AI impacts the delivery of their services. 

Documentation of the operational details and design characteristics of models used in finance has been a requirement for financial firms even before the emergence of AI. Some regulators are now using documentation of the algorithm's logic to ensure that the outcomes generated by the model can be explained, traced, and repeated (FSRA, 2019). The EU is evaluating the possibility of introducing requirements around documentation disclosure of the programming and training methodologies, processes, and techniques used to create, test, and validate AI systems. This would also include documentation of the algorithm's objectives, weight distribution, and parameters (European Commission, 2020). The US Public Policy Council of the Association for Computing Machinery (USACM) has proposed a set of principles aimed at promoting transparency and auditability in the use of algorithms. Specifically, the council suggests that models, data, algorithms, and decisions should be recorded so that they can be audited in case of suspected harm (ACM US Public Policy Council, 2017). The Federal Reserve's model risk management guidance also calls for detailed model development and validation documentation to help parties unfamiliar with a model understand its operation, limitations, and critical assumptions (Federal Reserve, 2011).

Financial service providers face challenges in documenting the model process of AI-enabled models used for supervisory purposes, regardless of their size. This is due to the difficulty in explaining how the model works. Some jurisdictions have proposed a two-pronged approach to AI model supervision: an analytical approach, which combines analysis of the source code and data with methods for documenting AI algorithms, predictive models, and datasets, and an empirical approach, which leverages practices providing explanations for an individual decision or for the overall algorithm's behaviour. Additionally, two techniques can be used for testing an algorithm as a black box: challenger models and benchmarking datasets, both curated by the auditor (Bank of England and FCA, 2020; ACPR, 2020).
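The challenger-model technique mentioned above can be sketched in a few lines: the model under audit ("champion") is compared against a simpler, transparent challenger on a benchmark dataset curated by the auditor. This is a hypothetical illustration assuming scikit-learn; the models and simulated data stand in for whatever an auditor would actually use.

```python
# Illustrative sketch of black-box testing with a challenger model and an
# auditor-curated benchmark dataset. All model and data choices are hypothetical.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Auditor's benchmark dataset (here: simulated).
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_train, X_bench, y_train, y_bench = train_test_split(X, y, random_state=1)

champion = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)  # model under audit
challenger = LogisticRegression(max_iter=1000).fit(X_train, y_train)         # simple, interpretable baseline

champ_acc = champion.score(X_bench, y_bench)
chall_acc = challenger.score(X_bench, y_bench)

# A champion that cannot clearly outperform a transparent challenger on the
# auditor's benchmark warrants closer scrutiny.
print(f"champion accuracy: {champ_acc:.3f}, challenger accuracy: {chall_acc:.3f}")
```

The point of the comparison is not the absolute scores but the gap: if the opaque model adds little over an interpretable baseline, the explainability cost may not be justified.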

Beyond explainability-related challenges, AI-based models also depend on the setting of numerous parameters that significantly affect model performance and results. These parameters may seem subjective and arbitrary, as they are often based on intuition rather than validation, making them highly dependent on the model designer. While transparency around the selection of parameters can address part of the issue, it remains difficult to explain how the model works under these parameters.

3.5. Robustness and resilience of AI models: training and testing performance  

AI systems play an essential role in our lives, and they must always function in a reliable, safe and secure manner. As per OECD’s guidelines, potential risks should be continuously evaluated and managed to ensure that AI systems operate effectively throughout their life cycles. Careful training of models and rigorous performance testing based on their intended use can help enhance AI systems’ robustness.

3.5.1. Training, validating and testing AI models

To capture higher-order interactions, models might require larger datasets for training, as higher-order effects are harder to identify and require more data to capture. The datasets used for training should be large enough to capture non-linear relationships and tail events in the data. However, this is difficult to achieve in practice, as tail events are rare and the dataset may not be robust enough for optimal outcomes. In addition, using ever-larger datasets for training models could make the models static, reducing their performance and learning ability.

The financial system is at risk due to the industry's inability to train models on datasets that include tail events. This weakens the reliability of such models during unpredictable crises and renders AI a tool that can be used only when market conditions are stable. It is also important to note that ML models risk over-fitting, which happens when a trained model performs exceptionally well on the samples used for training but poorly on new, unknown samples, meaning the model does not generalise well. To mitigate this risk, modellers split the data into training and test/validation sets and use the training set to build the supervised model under multiple parameter settings. The test/validation set is then used to challenge the trained model, assess its accuracy, and optimise its parameters. The validation set contains samples with known labels, but these labels are withheld from the model; predictions on the validation set therefore allow the operator to assess the model's accuracy. Based on the errors observed on the validation set, the optimal parameter setting is determined as the one with the lowest validation error.
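The split-and-validate procedure described above can be sketched as follows: candidate models are trained under different parameter settings, and the setting with the lowest validation error is selected. This is a minimal sketch assuming scikit-learn; the dataset and parameter grid are illustrative.

```python
# Minimal sketch of selecting model parameters by lowest validation error.
# Dataset and candidate parameter values are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

best_depth, best_err = None, np.inf
for depth in [2, 4, 8, 16, None]:          # candidate parameter settings
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)            # build the model on the training set
    err = 1.0 - model.score(X_val, y_val)  # challenge it on the validation set
    if err < best_err:
        best_depth, best_err = depth, err

print(f"selected max_depth={best_depth} with validation error {best_err:.3f}")
```

An unconstrained tree (`max_depth=None`) will typically fit the training data perfectly while doing worse on validation, which is exactly the over-fitting behaviour the paragraph describes.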

Scientists previously considered the measured validation performance of a model to be an unbiased estimate of its true performance. However, multiple recent studies have shown that this assumption does not always hold. As these studies highlight, an additional blind test set of data, not used during the model selection and validation process, is necessary to better estimate the model's generalisation performance. Such validation processes go beyond the simple backtesting of a model using historical data to examine its predictive capabilities ex post and ensure that the model's outcomes are reproducible. 
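The blind-test-set idea can be illustrated with a three-way split: the test set is carved off first and touched exactly once, after model selection on the validation set is complete. This is a hedged sketch using scikit-learn with illustrative data and parameter values.

```python
# Sketch of holding out a "blind" test set, untouched during model selection,
# to estimate generalisation performance. Data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, random_state=0)
# First carve off the blind test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest,
                                                  test_size=0.25, random_state=0)

# Model selection uses the validation set only.
best = max(
    (LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
     for c in [0.01, 0.1, 1.0, 10.0]),
    key=lambda m: m.score(X_val, y_val),
)

# Only now touch the blind test set, once, for the generalisation estimate.
print(f"estimated generalisation accuracy: {best.score(X_test, y_test):.3f}")
```

Because the test set played no role in choosing the model, its score is a less optimistic estimate than the validation score used for selection.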

Synthetic datasets are increasingly being used as test sets for validation, offering a compelling alternative due to their ability to provide unlimited amounts of simulated data. This is particularly useful in cases where real data is scarce and expensive. Additionally, synthetic datasets can potentially improve the predictive power and enhance the robustness of machine learning models. In some instances, regulators require the evaluation of AI models in test scenarios set by supervisory authorities, such as in Germany, according to IOSCO's 2020 report.
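One simple way to obtain the "unlimited simulated data" mentioned above is to fit a distribution to the scarce real observations and sample from it. The sketch below fits a multivariate Gaussian; this is a deliberately minimal, hypothetical illustration (real synthetic-data pipelines use far richer generators).

```python
# Sketch of generating a synthetic validation set by sampling from a
# distribution fitted to scarce "real" data. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the few real observations available (200 rows, 3 features).
real = rng.normal(loc=0.0, scale=1.0, size=(200, 3))

# Fit a simple multivariate Gaussian to the real data...
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and draw as much simulated data as needed for validation.
synthetic = rng.multivariate_normal(mu, cov, size=10_000)

print(synthetic.shape)  # (10000, 3)
```

The design trade-off is that the synthetic set inherits the fitted distribution's assumptions, so a generator that misses tail behaviour will also miss it in the validation data.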

It is essential to continuously monitor and validate models for effective risk management (Federal Reserve, 2011) (see Box 3.4). Once models have been trained, validation ensures they are used appropriately and perform as intended. This involves a series of processes and activities to confirm that the models are achieving their design objectives and meeting business needs while ensuring that the models are reliable. The process involves identifying potential limitations and assumptions and assessing their possible impact. All aspects of the model, including input, processing, and reporting, should be validated, whether the model was developed in-house or provided by a third party (Federal Reserve, 2011). Validation should be an ongoing process to track existing limitations and identify new ones, particularly during difficult economic or financial times, which may not have been reflected in the training set.

Continuous testing of machine learning models is crucial to detect and correct model drift caused by concept or data drift. As defined by Widmer (1996), concept drift refers to situations where the statistical characteristics of the target variable studied by the model change, leading to a shift in the concept that the model aims to predict. For instance, the definition of fraud could evolve with new methods of conducting illegal activities, resulting in concept drift.

Data drifts can adversely affect the model’s predictive power as they occur when the statistical properties of the input data change. One such example is the shift in consumer attitudes towards e-commerce and digital banking, which wasn’t captured by the initial dataset on which the model was trained. As a result, this can lead to performance degradation.

Ongoing monitoring and validation of ML models can help prevent and address drifts, and standardised procedures can improve model resilience and identify if adjustments, redevelopment, or replacement are needed. To achieve this, it is crucial to have an effective architecture that enables models to be quickly retrained with new data as data distribution changes. This can mitigate the risks of model drifts and ensure the model remains accurate and effective. 
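The monitoring-and-retrain loop described in the preceding paragraphs can be sketched with a simple statistical drift check: compare the live distribution of an input feature against its training-time distribution and trigger retraining when they diverge. The two-sample Kolmogorov-Smirnov test and the threshold below are illustrative choices, not a prescribed standard.

```python
# Sketch of detecting data drift and triggering retraining. The KS test and
# p-value threshold are illustrative choices for this hypothetical pipeline.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_data = rng.normal(0.0, 1.0, size=5000)  # feature as seen at training time
live_data = rng.normal(0.8, 1.0, size=5000)      # same feature in production, shifted

stat, p_value = ks_2samp(training_data, live_data)

DRIFT_P_THRESHOLD = 0.01
if p_value < DRIFT_P_THRESHOLD:
    # In a real pipeline this branch would kick off retraining on recent data.
    print(f"drift detected (KS statistic {stat:.3f}) - retrain the model")
else:
    print("no significant drift detected")
```

In practice each input feature (data drift) and the model's error rate (a proxy for concept drift) would be monitored on a schedule, with the retraining architecture the paragraph describes sitting behind the trigger.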

Box 3.3. Guidance on model risk management applicable to AI models in the US and EU 

The Federal Reserve issued supervisory and regulatory letter SR 11-7 back in 2011. The letter provides guidance on model risk management that has been technology-neutral and has stood the test of time, and it is instrumental in managing risks related to AI-driven models (Federal Reserve, 2011). 

The letter focuses on guiding banking institutions regarding model development, implementation, and use. It delves into three main areas, namely (i) model development, implementation, and use; (ii) model validation; and (iii) governance, policies and controls.

Recently, the European Banking Authority (EBA) published guidelines on loan origination and monitoring, including rules for appropriately managing model risks. The EBA aims to ensure that these guidelines are future-proof and technology-neutral.

In addition to continuous monitoring and reviewing of the code/model used, some regulatory bodies have mandated the presence of ‘kill switches’ or other automatic control mechanisms that trigger alerts under high-risk circumstances. These kill switches are control mechanisms that can swiftly shut down an AI-based system if it deviates from its intended purpose. For example, in Canada, companies must have built-in ‘override’ functionalities that automatically disengage the system’s operation or allow the company to do so remotely if necessary (IIROC, 2012). It’s important to note that these kill switches must be tested and monitored to ensure companies can rely on them in times of need.
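The 'kill switch' idea above amounts to wrapping the automated system in a control layer that disengages it when a risk limit is breached. The sketch below is a hypothetical illustration: the `KillSwitchWrapper` class, the loss-limit trigger, and the trivial stand-in model are all invented for this example, not drawn from any regulator's specification.

```python
# Hypothetical sketch of a kill-switch wrapper around an automated trading
# model: breaching a daily loss limit automatically disengages the system.
class KillSwitchWrapper:
    def __init__(self, model, max_daily_loss):
        self.model = model                 # callable: market state -> decision
        self.max_daily_loss = max_daily_loss
        self.daily_pnl = 0.0
        self.engaged = True

    def record_pnl(self, pnl):
        self.daily_pnl += pnl
        if self.daily_pnl < -self.max_daily_loss:
            self.engaged = False           # automatic override fires

    def decide(self, market_state):
        if not self.engaged:
            return "FLAT"                  # disengaged: take no new positions
        return self.model(market_state)


# Usage with a trivial stand-in model:
system = KillSwitchWrapper(model=lambda state: "BUY", max_daily_loss=100.0)
print(system.decide({}))       # "BUY" while within limits
system.record_pnl(-150.0)      # breach the loss limit...
print(system.decide({}))       # ..."FLAT": the kill switch has fired
```

As the text notes, such a mechanism is only useful if it is itself tested and monitored; a kill switch that has never been exercised cannot be relied on in times of need.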

There is a growing need to reinforce risk management functions and processes related to models to account for emerging risks and unintended consequences associated with using AI-based models. One example is the need to test models in extreme market conditions to prevent systemic threats and vulnerabilities that may arise during stress. It is worth noting that the data used to train the models may not fully reflect stressed market conditions or changes in exposures, activities, or behaviours, which can limit the model’s performance and create other issues. Additionally, the use of these models is relatively new, and they remain untested in addressing risk under shifting financial conditions. To mitigate these risks, it is essential to use a range of scenarios for testing and backtesting that allow for consideration of shifts in market behaviour and other trends, hopefully reducing the potential for underestimating risk in such systems (FSB, 2017).
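Testing a model across a range of stress scenarios, as recommended above, can be sketched by applying simple shocks to the input distribution and measuring how performance degrades. The scenarios, shock sizes, and model below are illustrative assumptions, not a supervisory stress-testing standard.

```python
# Sketch of evaluating a trained model under shocked input scenarios.
# The scenarios and shock magnitudes are hypothetical.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, n_features=4, noise=5.0, random_state=0)
model = LinearRegression().fit(X, y)

scenarios = {
    "baseline": lambda X: X,
    "vol_spike": lambda X: X * 3.0,                    # volatility tripled
    "level_shift": lambda X: X + 2.0 * X.std(axis=0),  # shift in factor levels
}
for name, shock in scenarios.items():
    score = model.score(shock(X), y)  # R^2 under the shocked inputs
    print(f"{name}: R^2 = {score:.3f}")
```

A sharp deterioration under a plausible scenario is exactly the signal that the model's training data did not reflect stressed conditions, which is the risk the paragraph flags.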

Interestingly, research shows that explanations need to be 'human-meaningful' in order to affect users' perception of a system's accuracy, regardless of the actual accuracy observed (Nourani et al., 2020). This means that when less human-meaningful explanations are provided, users may misjudge the accuracy of a technique that does not operate on a human-understandable rationale. 

3.5.2. Correlation without causation and meaningless learning

The intersection of causal inference and machine learning is a fascinating field constantly evolving. According to Cloudera’s report from 2020, this area of research is rapidly expanding. One of the main challenges of machine learning is the lack of understanding of cause-and-effect relationships, a crucial element of human intelligence. However, researchers in deep learning are now acknowledging the importance of such questions and using them to inform their research, even though this type of research is still in its early stages.

Users of ML models need to understand the difference between correlation and causation to avoid interpreting meaningless correlations as causal relationships and obtaining questionable model outputs. Causal inference is crucial for understanding the conditions under which a model may fail and ensuring that the pattern remains predictive over time. Additionally, causal inference is essential for replicating empirical findings of a model in new environments, settings, or populations. The concept of transportability, or the ability to transfer causal effects learned in the test dataset to a new dataset, is also fundamental for the usefulness and robustness of ML models. For supervisors, understanding the causal assumptions that AI model users make can help them better assess potential associated risks. 
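A classic, minimal demonstration of the correlation-versus-causation pitfall described above: two completely independent random walks routinely exhibit large spurious correlation in levels, which largely disappears once the series are differenced. The simulation below is illustrative and uses only NumPy.

```python
# Illustration: correlation need not imply causation. Two independent random
# walks often show large spurious correlation in levels.
import numpy as np

rng = np.random.default_rng(42)
walk_a = np.cumsum(rng.normal(size=1000))  # independent random walk A
walk_b = np.cumsum(rng.normal(size=1000))  # independent random walk B

corr_levels = np.corrcoef(walk_a, walk_b)[0, 1]
corr_changes = np.corrcoef(np.diff(walk_a), np.diff(walk_b))[0, 1]  # near zero

print(f"correlation of levels:  {corr_levels:.3f}")
print(f"correlation of changes: {corr_changes:.3f}")
```

A model trained naively on the level series could "learn" a strong relationship that has no causal basis and will not transport to new data, which is precisely the robustness concern raised in this subsection.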

The outputs of ML models need to be evaluated appropriately, and human judgment plays a fundamental role in achieving this, particularly when determining causation. It is crucial to approach the results of AI-based models with a certain degree of scepticism or caution, as relying solely on correlations without establishing causation may lead to biased or false decision-making. Some studies indicate that models are likely to learn suboptimal policies if they do not consider human advice, even if human decisions are less accurate than the models' own (Zhang and Bareinboim, 2020).

3.5.3. AI and tail risk: the example of the COVID-19 crisis

Although AI models are adaptive, as they evolve by learning from new data, they may not be able to perform under idiosyncratic one-time events that have not been experienced before, such as the COVID-19 crisis, and which are therefore not reflected in the data used to train the model. As AI-managed trading systems are based on dynamic models trained on long time series, they are expected to be effective as long as the market environment has some consistency with the past. A survey of UK banks found that around 35% of banks experienced a negative impact on ML model performance during the pandemic (Bholat, Gharbawi and Thew, 2020). This is likely because the pandemic created significant shifts in macroeconomic variables, such as rising unemployment and mortgage forbearance, which required ML (as well as traditional) models to be recalibrated.

Tail events and unforeseen circumstances, like the recent pandemic, can cause disruptions in datasets, leading to model drift that compromises the models' predictive capabilities (see Section 3.5.1). These events can trigger sudden changes in the behaviour of the target variable that the model is trying to forecast and previously unrecorded variations in the structure and underlying patterns of the dataset the model uses, all due to changes in market dynamics during such events. These changes are typically not captured by the original dataset on which the model was trained and can result in a decline in performance. To overcome this issue, synthetic datasets incorporating tail events of a similar nature, as well as data from the COVID-19 period, can be created and used to retrain and redeploy the affected models.  

Ongoing testing of models with validation datasets that incorporate extreme scenarios, together with continuous monitoring for model drift, is crucial to minimising risks during stress. It is worth noting that models based on reinforcement learning, where the model is trained in simulated conditions, are expected to perform better during one-off tail risk events, because they can be trained on simulated scenarios of unexpected market conditions that may not have been observed before.
