Data quality issues are not always easy to differentiate from outliers that are attributable to a specific business reason. That’s why it’s important to have solid background knowledge of the business before attempting any data quality management exercise. In our experience, data quality issues tend to fall into the following categories:
- Missing data, i.e., data gaps whose causes may be missing completely at random, missing at random or missing not at random
- Global duplicates, i.e., double entries for what should be one distinct entry
- Local (field-related) quasi-duplicates, i.e., unintentionally near-identical input in free text fields
- Local (field-related) outliers, i.e., values which deviate significantly from what we can reasonably expect
- Global outliers/anomalies, i.e., atypical data points other than those with an underlying business cause
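The first three categories above can be illustrated with a few lines of code. The sketch below is a minimal, stdlib-only example on hypothetical client records (all field names and values are invented for illustration); it flags missing required fields, exact global duplicates, and a local outlier using a robust median-based deviation check rather than a trained model.

```python
import statistics

# Hypothetical client records; field names and values are illustrative only.
records = [
    {"id": 1, "name": "Acme Ltd", "country": "LU", "balance": 1_200.0},
    {"id": 2, "name": "Acme Ltd", "country": "LU", "balance": 1_200.0},    # global duplicate
    {"id": 3, "name": "Beta SA", "country": None, "balance": 980.0},       # missing field
    {"id": 4, "name": "Gamma AG", "country": "DE", "balance": 950_000.0},  # local outlier
    {"id": 5, "name": "Delta BV", "country": "NL", "balance": 1_100.0},
]

# Missing data: required fields left empty.
missing = [r["id"] for r in records if any(v is None for v in r.values())]

# Global duplicates: identical entries apart from the technical key.
seen, duplicates = {}, []
for r in records:
    key = tuple(sorted((k, v) for k, v in r.items() if k != "id"))
    if key in seen:
        duplicates.append(r["id"])
    seen.setdefault(key, r["id"])

# Local outliers: balances far from the field's typical range, measured
# against the median absolute deviation (robust to the outlier itself).
balances = [r["balance"] for r in records]
med = statistics.median(balances)
mad = statistics.median(abs(b - med) for b in balances)
outliers = [r["id"] for r in records if abs(r["balance"] - med) > 5 * mad]
```

In practice these checks would run against real reference data and tuned thresholds; the point is that each issue category maps to a distinct, automatable test.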
Artificial intelligence (AI) promises exciting potential not only in mining data to deliver valuable insights but also in addressing the quality of data. One subset of AI is machine learning (ML), an algorithmic approach that recognizes patterns and learns from data without being explicitly programmed. Machine learning is a useful way to address data quality problems and support active data quality management at every stage of the data lifecycle.
At the creation phase, for example, data quality machine learning (DQ-ML) methods may be applied during onboarding, when the basic data (e.g., client name, client type, client gender, address, country, client-specific features) is collected and entered into the financial institution’s IT system. Even where the process should not normally allow required fields to be left empty, machine learning methods can detect inconsistencies in the client profile (e.g., between the provided addresses and countries), thereby improving the quality of data for any subsequent aspect of customer relationship management and flagging suspicious or erroneous client data.
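One way such an address/country consistency check can work is by learning which combinations are plausible from historically validated records. The sketch below is a deliberately simplified, stdlib-only stand-in for a trained model: it counts city/country pairs seen in (hypothetical) clean onboarding history and flags any new profile whose combination was never observed.

```python
from collections import Counter

# Historical, validated onboarding records (hypothetical training data).
history = [
    ("Luxembourg", "LU"), ("Luxembourg", "LU"), ("Paris", "FR"),
    ("Frankfurt", "DE"), ("Paris", "FR"), ("Amsterdam", "NL"),
]

# "Learn" which city/country combinations are plausible from past data.
pair_counts = Counter(history)

def flag_profile(city: str, country: str) -> bool:
    """Return True if the address/country combination looks inconsistent."""
    return pair_counts[(city, country)] == 0

flag_profile("Paris", "FR")  # consistent with history, not flagged
flag_profile("Paris", "DE")  # never observed together, flagged
```

A production DQ-ML system would replace the raw frequency lookup with a model that generalizes to unseen but valid combinations, but the principle is the same: patterns learned from clean data define what "consistent" means.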
Later in the data lifecycle, DQ-ML methods, using various unsupervised and supervised techniques as well as natural language processing, may be developed in order to detect inconsistencies, duplicates and missing values in transaction data. This enables more accurate monitoring of suspicious transactions, for example, and can help financial institutions flag potential compliance issues in areas like money laundering, market manipulation or insider trading.
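For the quasi-duplicate case in free-text fields, a common baseline is string similarity after normalization. The sketch below uses the standard library's `difflib.SequenceMatcher` on hypothetical counterparty names (the names and the 0.9 threshold are illustrative assumptions, not values from the text):

```python
from difflib import SequenceMatcher

# Hypothetical counterparty names from free-text transaction fields.
names = ["ACME Holdings S.A.", "Acme Holdings SA", "Beta Trading GmbH"]

def similarity(a: str, b: str) -> float:
    # Normalize case and strip punctuation/whitespace before comparing.
    clean = lambda s: "".join(c for c in s.lower() if c.isalnum())
    return SequenceMatcher(None, clean(a), clean(b)).ratio()

# Flag pairs above a similarity threshold as quasi-duplicates.
pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) > 0.9
]
```

Here the two ACME variants collapse to the same normalized string and are flagged as one entity; more sophisticated NLP techniques (tokenization, embeddings) extend the same idea to messier text.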
Another example is the area of credit risk models, where the discriminatory power of a model computing the probabilities of default can be improved by applying DQ-ML methods. By detecting (and potentially also remediating) data quality issues, organizations benefit from significantly improved model performance and can also, for example, calculate their regulatory capital more accurately.
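Discriminatory power is commonly measured with the area under the ROC curve (AUC). The stdlib sketch below computes AUC via the Mann-Whitney statistic and uses hand-made, illustrative numbers to show how a single data quality issue, a defaulter's score distorted by a bad input, can degrade it:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen defaulter (label 1) scores above a randomly
    chosen non-defaulter (label 0); ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
clean_scores = [0.9, 0.8, 0.3, 0.2]  # scores from clean input data
corrupted = [0.9, 0.1, 0.3, 0.2]     # one defaulter's score distorted
                                     # by a bad input (e.g., income = 0)

auc(labels, clean_scores)  # 1.0 — perfect ranking
auc(labels, corrupted)     # 0.5 — no better than chance
```

The example is deliberately tiny, but it makes the mechanism concrete: remediating the corrupted input restores the ranking and, with it, the model's discriminatory power.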
AI enables computers to perceive, learn and reason. The prerequisite of good data quality remains, but the possibility of addressing quality issues directly at source using AI is growing.