AI, Machine Learning and Big Data in Financial Services | Chapter 4 | Part 1

Emerging risks from the use of AI/ML/Big Data and possible risk mitigation tools (Chapter 4 | Part 1)

The use of AI/ML technology is expanding in the financial markets, and as a result, various challenges and risks associated with this practice are being revealed. These challenges are visible at different levels, including data, model, firm, social, and systemic. Industry participants, users, and policymakers must consider and address these challenges appropriately. 

In this section, we will take a closer look at some of the challenges that arise when using AI-driven techniques in finance. As AI grows in the financial sector, it is essential to consider these challenges and potential mitigants. The challenges we will discuss include data management and concentration, the risk of bias and discrimination, explainability, the robustness and resilience of AI models, governance and accountability in AI systems, regulatory considerations, employment risks, and skills.  

3.1. Data management 

Data is at the core of any AI application. The deployment of AI/ML models and big data offers cost-efficient opportunities for higher-quality service and product delivery and greater customer satisfaction.

This section delves into the potential risks of using big data in AI-powered financial products and services. These risks include challenges related to data quality, data privacy and confidentiality, cybersecurity, and fairness considerations. There is a risk of unintended bias and discrimination when data is misused or when unsuitable data is used by the model, such as in credit underwriting. The importance of data in training, testing, and validating ML models, and the models’ capacity to retain their predictive power in tail events, are also discussed. In addition to financial consumer protection considerations, potential competition issues may arise from the use of big data and ML models, particularly regarding the high concentration among market providers. It is important to note that the challenges related to data use and management discussed below are not unique to big data or alternative data but apply to data more broadly.

3.1.1. Representativeness and relevance of data  

One of the industry-defined ‘Vs’ of big data is veracity, which refers to the level of uncertainty regarding the truthfulness of the data (IBM, 2020). This uncertainty may arise due to unreliable sources, poor quality, or inadequate nature of the data used. In the case of big data, certain behaviours, such as those found in social networks and noisy or biased data collection systems like sensors and IoT, can affect the integrity of observations. This, in turn, may lead to disparate impact dynamics, which may prove difficult to prevent or mitigate. 

In addition to data integrity, the representativeness and relevance of the data used in AI applications are critical. Data representativeness means that the data should provide a complete and balanced representation of the population being studied, including all relevant subpopulations. This is particularly important in financial markets, where it can prevent over- or under-representing certain groups and ensure that models are accurately trained. In credit scoring, it can also help promote financial inclusion for minority groups. Data relevance, on the other hand, is about ensuring that the data used helps describe the phenomenon being studied and does not include irrelevant or misleading information. For example, in credit scoring, it is essential to carefully assess whether information about a person’s behaviour or a company’s reputation is relevant before including it in the model. While assessing datasets on a case-by-case basis can be challenging given the sheer volume of data involved, doing so can help improve the accuracy and appropriateness of the data used, even if it reduces the efficiencies of AI deployment.
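As a concrete illustration, here is a minimal sketch of how a training sample's representativeness might be checked against known population shares. The groups, shares, and tolerance are hypothetical, and real checks would use proper statistical tests rather than a fixed threshold:

```python
from collections import Counter

def representativeness_gaps(sample_groups, population_shares, tolerance=0.05):
    """Compare subgroup shares in a training sample against known
    population shares; flag groups deviating by more than `tolerance`."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, pop_share in population_shares.items():
        sample_share = counts.get(group, 0) / total
        if abs(sample_share - pop_share) > tolerance:
            gaps[group] = round(sample_share - pop_share, 3)
    return gaps

# Hypothetical sample in which group "B" is under-represented
# relative to its share of the reference population.
sample = ["A"] * 70 + ["B"] * 10 + ["C"] * 20
population = {"A": 0.50, "B": 0.30, "C": 0.20}
print(representativeness_gaps(sample, population))  # {'A': 0.2, 'B': -0.2}
```

A flagged gap would prompt re-sampling or re-weighting before the data is used for training.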

3.1.2. Data privacy and confidentiality 

The use of data in AI systems is vast, widespread, and continuous, which raises concerns about data protection and privacy. Beyond concerns about collecting and using personal data, AI raises additional issues: its ability to make inferences from large datasets, possible difficulties in applying privacy protections within machine learning models, and questions about data connectivity and cross-border data flows. Data connectivity is crucial for development in the financial sector, and the ability to aggregate, store, process, and transmit data across borders is essential. It is vital to have appropriate data governance safeguards and rules in place to ensure that privacy is protected (Hardoon, 2020).

Combining multiple datasets can offer new possibilities for aggregating data, but it also creates new analytical challenges. Databases collected under different conditions, such as different populations, regimes, or sampling methods, provide opportunities for analysis that cannot be achieved through individual data sources. However, combining such diverse environments may result in analytical difficulties and pitfalls, including confounding, sampling selection bias, and cross-population biases (Bareinboim and Pearl, 2016).

Cybersecurity risks such as hacking and other operational risks are prevalent in digital financial products and services, and they have a direct impact on data privacy and confidentiality. While AI may not introduce fundamentally new forms of cyber breach, it can amplify existing ones by connecting cyber breaches with the manipulation of data: attacks can modify an algorithm’s operation by introducing falsified data into models or by altering the data already in use (ACPR, 2018).

Consumers’ financial and non-financial data are increasingly being shared and used, often without their full understanding or informed consent. While informed consent is crucial for any use of data, consumers may not be well informed about how their data is managed and where it is used. This lack of knowledge increases the risks associated with advanced modes of tracking and data sharing by third-party providers. The risk of violating privacy policies and data protection laws grows further when observed data not provided by the customer, such as geolocation or credit card transaction data, is used. These datasets are particularly vulnerable to privacy breaches.

The industry is responding with new methods of non-disclosive computation designed to protect consumer privacy. These approaches include using tailor-made synthetic datasets for machine learning modelling and employing Privacy Enhancing Technologies (PETs). PETs aim to maintain the overall properties and characteristics of the original data while ensuring that individual data samples remain undisclosed. Examples of PETs include federated analysis, homomorphic encryption, differential privacy, and secure multi-party computation. Differential privacy, in particular, offers mathematical guarantees on the desired privacy level, allowing better accuracy compared to synthetic datasets. The benefit of these techniques is that models trained on synthetic data instead of actual data do not show a significant performance loss. Plain data anonymization approaches, by contrast, do not provide rigorous privacy guarantees, especially considering the inferences that AI-based models can make.
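To make the idea of differential privacy more concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. The dataset and epsilon value are hypothetical, and a production system would use a vetted differential-privacy library rather than hand-rolled noise:

```python
import math
import random

def dp_count(values, predicate, epsilon=0.5):
    """Differentially private count: the true count plus Laplace noise.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    # Inverse-CDF sampling from Laplace(0, 1/epsilon).
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Hypothetical query: how many customers in a dataset defaulted?
defaults = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
noisy = dp_count(defaults, lambda v: v == 1)
print(round(noisy, 1))  # close to the true count of 4, but randomized
```

Smaller epsilon means more noise and stronger privacy; the mathematical guarantee is that the presence or absence of any single record changes the output distribution only by a bounded factor.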

Using big data by AI-powered models can expand the scope of sensitive data since these models can become highly skilled in identifying individual users. Facial recognition technology and other inferred data, such as customer profiles, can be used to identify users or infer other characteristics, like gender, when combined with other data. AI models can even re-identify anonymized databases by comparing them with publicly available databases and narrowing down matches to attribute sensitive information to individuals ultimately. Furthermore, the higher dimensionality in ML data sets, meaning the possibility to consider an unlimited number of variables compared to conventional statistical techniques, increases the likelihood of sensitive information being included in the analysis.
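The re-identification risk described above can be illustrated with a toy linkage attack, in which an "anonymized" dataset is joined to a public one on shared quasi-identifiers. All names, records, and field names here are fabricated for illustration:

```python
# "Anonymized" records: direct identifiers removed, quasi-identifiers kept.
anonymized = [
    {"zip": "10115", "birth_year": 1985, "gender": "F", "balance": 52000},
    {"zip": "10117", "birth_year": 1990, "gender": "M", "balance": 1200},
]

# Publicly available records (e.g. a voter roll) that still carry names.
public = [
    {"name": "Alice", "zip": "10115", "birth_year": 1985, "gender": "F"},
    {"name": "Bob",   "zip": "10117", "birth_year": 1990, "gender": "M"},
    {"name": "Carol", "zip": "10119", "birth_year": 1990, "gender": "F"},
]

def reidentify(anon, pub, keys=("zip", "birth_year", "gender")):
    """Link anonymized records to public records sharing the same
    quasi-identifiers; a unique match re-identifies the individual."""
    results = []
    for record in anon:
        matches = [p["name"] for p in pub
                   if all(p[k] == record[k] for k in keys)]
        if len(matches) == 1:  # unique combination => re-identified
            results.append((matches[0], record["balance"]))
    return results

print(reidentify(anonymized, public))  # [('Alice', 52000), ('Bob', 1200)]
```

Because a handful of quasi-identifiers is often unique to one person, removing direct identifiers alone gives no rigorous privacy guarantee.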

Regulators are increasingly focusing on data privacy and protection in response to the growing digitalization of the economy. For instance, the EU implemented the General Data Protection Regulation (GDPR) to reinforce consumer protection across markets. Its main objective is to rebalance the power relationship between corporations and individuals by shifting power back to consumers, ultimately increasing transparency and trust in how companies use consumer data. One of the G20/OECD High-Level Principles on Financial Consumer Protection is the ‘Protection of Consumer Data and Privacy’ (OECD, 2011). Similarly, the Monetary Authority of Singapore promotes fairness, ethics, accountability, and transparency through its principles, which place the protection of individuals’ data at the heart of the use of AI in finance (MAS, 2019).

From the financial industry’s perspective, one of the significant challenges in achieving better data governance is the perceived fragmentation of regulatory and supervisory responsibilities regarding data. There is confusion about which institutions should be held accountable for implementing best practices in data governance, particularly in data quality, definitions, standardization, architecture, deduplication, etc. This fragmentation is further complicated in cases of cross-border activities.  

How data is used in finance is changing rapidly, particularly in deploying machine learning (ML) models. As a result, a small number of companies that specialize in alternative datasets have emerged. They are taking advantage of the growing demand for datasets that can inform AI techniques. However, there is limited visibility and oversight over their activity at this stage. This raises concerns about the lawful purchase and use of datasets by financial service providers. Compliance costs associated with regulations aimed at protecting consumers may further redefine the economics of using big data for financial market providers. Consequently, their approach to using AI and big data may also change.

3.2. Data concentration and competition in AI-enabled financial services/products

Advancements in AI could create competitive advantages that harm efficient and competitive markets if consumers cannot make informed decisions due to high market provider concentrations. Using proprietary models and AI techniques may provide an edge against the competition, leading to limited participation by smaller financial service providers with insufficient resources for adoption. BigTech’s potential dominance in sourcing big data and unequal access to information could further limit smaller players’ capacity to compete in the AI-based products/services market. 

The concentration of power and reliance on a few prominent players pose a significant risk, especially considering the potential for network effects, which could lead to the emergence of new, systemically important players. BigTech companies are a prime example of such risks, and their operating outside the regulatory perimeter only adds to the challenges involved. The main driver behind this issue is BigTech’s access to and use of data, further amplified through AI techniques that monetize such data. Additionally, a few alternative data providers are increasingly shaping the economics of database provision, which carries a risk of concentration developing in that market.

Data-driven barriers to entry in the AI market can be especially challenging for smaller firms, which may have to bear disproportionately high costs in deploying such technologies. They need expensive complementary assets, such as advanced data-mining and ML software, and physical infrastructure like data centres, all of which require significant upfront investment.

Moreover, AI algorithms require access to diverse data collected from multiple sources, which involves economies of scope. This can make it hard for small firms that do not have the necessary complementary assets or are not simultaneously present in multiple markets. As a result, they may face barriers to entry that prevent them from developing algorithms capable of exerting effective competitive pressure (OECD, 2016).

Healthy competition in the market for AI-based financial products and services is crucial for providers to fully unleash the benefits of this technology, especially in trading and investing. Using outsourced or third-party vendor models could potentially eliminate the advantages of such tools for firms that adopt them, leading to one-way markets and herding behaviour by financial consumers or convergence of trading and investment strategies by finance practitioners. Therefore, financial organizations must maintain their competitive edge by developing and implementing AI-based solutions rather than relying solely on external vendors.

3.2.1. Risk of tacit collusion

The use of AI-based models may pose a challenge to competition, as it can make tacit collusion easier to achieve without a formal agreement or human interaction. In a tacit collusion scenario, each participant independently decides its profit-maximizing strategy, yet the market reaches a non-competitive outcome. Algorithms thus make it easier for market participants to sustain profits above the competitive level without an explicit agreement, effectively replacing explicit collusion with tacit coordination.

Tacit collusion, that is, collusion sustained without an explicit agreement, is most often found in markets with few players and high transparency. However, the use of algorithms in digital markets with frequent interaction has made such coordination easier to sustain and more likely to occur, according to evidence cited by the OECD (2017).

The capacity of self-learning and deep learning AI models to adapt dynamically increases the risk that these models recognize mutual interdependencies and adapt to the behaviour and actions of other market participants or AI models. This raises the possibility of reaching a collusive outcome without human intervention, sometimes even without human awareness of the collusion (OECD, 2017). Although such outcomes may not necessarily be illegal from a competition law perspective, they raise questions about how enforcement action can be applied to the model and its users in such cases.
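A stylized toy simulation can illustrate how independent pricing rules, with no communication at all, can sustain supra-competitive prices. The rules and numbers below are purely illustrative and do not model any real trading or pricing algorithm:

```python
def simulate(rule, p1=10.0, p2=10.0, cost=1.0, rounds=20):
    """Two independent pricing algorithms repeatedly respond to each
    other's last posted price; no communication or agreement occurs."""
    for _ in range(rounds):
        p1, p2 = rule(p2, cost), rule(p1, cost)  # simultaneous best responses
    return round(p1, 2), round(p2, 2)

# Rule 1: always undercut the rival (classic price war).
undercut = lambda rival, cost: max(cost, rival - 1.0)
# Rule 2: match the rival's price but never undercut.
match = lambda rival, cost: max(cost, rival)

print(simulate(undercut))  # (1.0, 1.0)   -> prices competed down to cost
print(simulate(match))     # (10.0, 10.0) -> supra-competitive prices persist
```

Under the "match, never undercut" rule, neither algorithm ever deviates from the high starting price, so a collusive-looking outcome emerges with no agreement between the firms, which is precisely what makes enforcement difficult.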

3.3. Risk of bias and discrimination 

AI methods can either help prevent discrimination stemming from human interaction or, on the contrary, reinforce biases, unfair treatment, and discrimination in financial services. By allowing the algorithm to make decisions instead of humans, the user of an AI-powered model can avoid biases associated with human judgment. However, AI applications may also create a risk of prejudice or discrimination due to existing biases in the data, the training of models with such biased data, or the identification of spurious correlations (US Treasury, 2018).

The accuracy and fairness of AI systems depend heavily on the quality of the data they are trained on. Using flawed or inadequate data can lead to inaccurate, biased, and discriminatory decision-making. Poor-quality data influences decision-making in two primary ways. First, ML models trained on insufficient data risk producing incorrect results even when fed good-quality data. Second, even well-designed, high-quality ML models can produce questionable output if provided with unsuitable data. It is important to note that even well-intentioned ML models can have biased outcomes that discriminate against certain groups of people. The use of incorrect, inaccurate, incomplete, or fraudulent data in ML models leads to poor results; ultimately, the quality of the data determines the quality of the model’s output.

It is important to note that biases can be present in the data used as input variables. Machine learning models train themselves on data from external sources that may have already incorporated certain tendencies, thereby perpetuating historical biases. Additionally, biased or discriminatory decisions by ML models are not necessarily intentional and can occur even with good-quality, well-labelled data. This happens through inference and proxies, or because correlations between sensitive and non-sensitive variables may be difficult to detect in vast databases. As big data consists of enormous amounts of data reflecting society, AI-driven models may perpetuate biases that already exist in society and are reflected in such databases.

Box 3.1. Labelling and structuring of data used in ML models

Labelling and structuring data is crucial for the accurate performance of Machine Learning (ML) models, but it can be a tiresome job. AI can only differentiate the signal from the noise if it successfully identifies and recognizes patterns, and to recognize patterns in data, models require well-labelled data. Supervised learning models – the most common form of AI – require pre-tagged examples classified consistently so that the software can independently learn to identify the category of the data.

Although observed data points can provide some insight, determining the correct data labels is not always straightforward. Data labelling is a meticulous process that requires the analysis of vast amounts of data, and the task is often outsourced to specialized firms or distributed to a remote workforce (The Economist, 2019). Human analysis and data labelling offer opportunities to identify errors and biases in the data used. However, some experts argue that this process may inadvertently introduce other biases, since it involves subjective decision-making.

Because manual data cleansing and labelling are prone to human error, AI-powered solutions for these tasks have emerged. Ensuring the quality of the data and its level of representativeness can help avoid unintended biases in the output. It is therefore essential to consider these factors while developing AI-based solutions.

Given the high dimensionality of data, it is essential for users of ML models to correctly identify the features of the data that are pertinent to the scenarios being tested by the model. To improve the performance of ML models, various methods are being developed to reduce the prevalence of irrelevant features, or ‘noise’, in datasets. One promising alternative is to employ artificial or ‘synthetic’ datasets generated explicitly for this purpose, as well as for testing and validation (see Section 3.5).
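As one simple illustration of reducing irrelevant features, here is a sketch that drops near-constant columns, whose variance carries little signal for a model. The feature names, values, and threshold are hypothetical, and real pipelines would use a library implementation of such filters:

```python
def drop_low_variance(features, threshold=1e-3):
    """Drop feature columns whose variance falls below `threshold`;
    near-constant columns contribute noise rather than signal."""
    kept = {}
    for name, column in features.items():
        mean = sum(column) / len(column)
        variance = sum((x - mean) ** 2 for x in column) / len(column)
        if variance >= threshold:
            kept[name] = column
    return kept

# Hypothetical feature columns; "constant" never varies and is dropped.
data = {
    "income":   [30.0, 55.0, 42.0, 61.0],
    "constant": [1.0, 1.0, 1.0, 1.0],
    "age":      [25.0, 40.0, 33.0, 51.0],
}
print(sorted(drop_low_variance(data)))  # ['age', 'income']
```

Variance filtering is only the crudest form of feature selection; more elaborate methods score features by their relevance to the target rather than by spread alone.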

Humans play a critical role in the decision-making processes that AI informs: they can identify and correct biases in the data or in the model design. However, the feasibility of humans explaining the model’s output remains an open question (US Treasury, 2018). The human element is crucial both at the data input stage and when querying the model, and it is essential to approach model results with scepticism to minimize the risk of biased output or decision-making.

The design and auditing of a machine learning model can significantly enhance its robustness by avoiding potential biases, whereas poorly designed and controlled AI/ML models risk reinforcing existing prejudices and making discrimination even harder to detect (Klein, 2020). To ensure fairness and accuracy in AI/ML models, auditing mechanisms and algorithms that sense-check the models’ results against baseline datasets can be employed (see Section 3.4.1). Users and supervisors should be able to test scoring systems, and tests can be run to identify and rectify discrimination in ML models, such as checking whether protected classes can be inferred from other attributes in the data (Feldman et al., 2015). It is also important to assign accountability to the human element of the project and to govern AI/ML models so as to safeguard prospective borrowers against unfair biases. When measuring potential biases, it is essential to avoid comparing ML-based decision-making to a hypothetical unbiased state and instead use realistic reference points, comparing such methods to traditional statistical models and human-based decision-making, which are themselves imperfect and not entirely fair.
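One widely used sense-check in this spirit is the disparate impact ratio associated with the 'four-fifths' rule of thumb. The sketch below uses fabricated approval data, and this is only one of several fairness metrics discussed in the literature (including Feldman et al., 2015):

```python
def selection_rate(decisions):
    """Share of favourable outcomes (1 = approved) in a group."""
    return sum(decisions) / len(decisions)

def disparate_impact(protected, unprotected):
    """Ratio of favourable-outcome rates (protected / unprotected);
    a value below 0.8 flags potential adverse impact under the
    'four-fifths' rule of thumb."""
    return selection_rate(protected) / selection_rate(unprotected)

# Fabricated loan approvals (1 = approved) for two hypothetical groups.
group_a = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # 20% approved
group_b = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # 80% approved

ratio = disparate_impact(group_a, group_b)
print(round(ratio, 2))  # 0.25 -- well below 0.8, so the model warrants review
```

A low ratio does not prove unlawful discrimination on its own, but it is a cheap, auditable trigger for the deeper review of model design and data that the paragraph above describes.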
