How Can Machine Learning Improve Data Quality and Data Integrity to Produce Better Analytics

According to Yahoo, the global data and business analytics market was valued at $198.08 billion in 2020 and is projected to reach $684.12 billion by 2030. However, Gartner's Data Quality Market Survey estimates that data quality issues cost organizations approximately $15 million in losses. The reason is that monitoring data quality and ensuring data integrity are difficult. The good news is that when data integrity is high, Facebook can predict which posts will be most interesting to you, Uber can predict pick-up and drop-off locations, and law enforcement can better direct resources to prevent crimes. This white paper discusses how The Brite Group has successfully developed machine learning algorithms that streamline data clean-up and improve data integrity, resulting in more effective use of data.

The challenge of correcting and using large sets of data has plagued scientists for centuries. In the 16th century, when the legendary astronomer Tycho Brahe began painstakingly recording and tracking the positions of numerous celestial objects, data collection already posed challenges. The types of errors introduced into these astronomical records (perhaps an early version of a Data Warehouse) affected the accuracy of future predictions. Centuries later, data and analytics teams continue to work on the same issue: identifying anomalies in data that compromise overall integrity and reduce the accuracy of the raw data and of subsequent analyses and predictions.

Keeping data in good shape is increasingly problematic as companies automate their data collection workflows and combine data from myriad sources. Traditionally, good data management required a host of quality assurance rules, processes, and people to monitor for, and mitigate the introduction of, questionable data into an organization's ecosystem. Today, however, quality assurance rules and procedures are not consistently followed and are often poorly implemented, because the pressure to incorporate more data faster takes precedence.

One method we have developed to streamline data cleansing incorporates Machine Learning strategies that adapt to changing data and scale to accommodate large datasets. Much like the human cognitive process, Machine Learning learns as it gains experience by being exposed to more data. Machine Learning has been around for decades but has become far more prominent in recent years following the success of Facebook's facial recognition program, Netflix's movie recommendations, and Google's speech recognition.

Once we have fine-tuned our Machine Learning processes, they can operate unsupervised and iteratively improve new data as it is added to the dataset. Because they remove the need for extensive manual labor and can easily scale up to accommodate any size of dataset, such approaches are ideal for data management. A good machine learning approach can be cost-effective in identifying and repairing data quality issues while reducing reliance on manual, error-prone processes.

Central Issues with Data Quality Management
The main steps of data quality management are 1) identifying errors, anomalies, and inconsistencies in the source data and 2) fixing the issues identified in the first step. The first step, identification, can often be difficult and time-consuming, and these difficulties are compounded by the decentralized data most commonly found in large business enterprises. There are often no simple identification rules that can be applied consistently across all possible datasets.

A Data Warehouse may include information scraped from an application programming interface (API), manual entries from an external database, and single records saved as small segments of formatted text (e.g., CSV files). Each of these sources has a different input schema. However, data extraction algorithms can be used to restructure the input data to fit a standard template before storage. This usually requires implementing standards and error checks during the data entry and/or extraction step, combined with automated processes that build on the known data schema.
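As a simplified illustration of such restructuring, the Python sketch below maps two hypothetical input formats (an API payload and a CSV fragment) onto one standard template and applies a basic error check. The field names and layouts are placeholder assumptions, not a client schema.

import csv
import io

STANDARD_FIELDS = ["site_id", "reading", "recorded_at"]

def from_api(record: dict) -> dict:
    # A hypothetical API payload might nest values under different keys.
    return {
        "site_id": record["station"]["id"],
        "reading": float(record["value"]),
        "recorded_at": record["timestamp"],
    }

def from_csv(text: str) -> list[dict]:
    # Small CSV fragments can be parsed and mapped column by column.
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"site_id": r["SiteID"], "reading": float(r["Temp"]), "recorded_at": r["Date"]}
        for r in rows
    ]

def validate(record: dict) -> dict:
    # A simple error check applied at the extraction step.
    missing = [f for f in STANDARD_FIELDS if record.get(f) in (None, "")]
    if missing:
        raise ValueError(f"Record is missing required fields: {missing}")
    return record

sample_csv = "SiteID,Temp,Date\nA-12,71.3,2020-05-01\n"
print([validate(r) for r in from_csv(sample_csv)])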

Examples of Data Quality Problems
After working on automating data cleansing, we have discovered a few basic types of data quality issues that crop up repeatedly. These include missing data, duplicate data, and inconsistent data. All of these issues can be considered a type of anomaly. Thus, the first step for any data quality issue is to identify the root cause of anomalous data and specify what type of problem it is.

Ideally, missing data is represented by a value such as “NULL” or “NA” and is consistently labeled with the same indicator. However, in practice, missing data is often represented differently, such as entries like “Don’t know” or a blank field. Some of the simplest methods to identify missing data are to search for “NULL” or “NA” or to create an ever-growing list of possible missing data indicators.
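The sketch below, a minimal illustration rather than a production rule set, shows how such a growing list of missing-data indicators can be applied across a pandas DataFrame; the indicator list and column names are hypothetical.

import pandas as pd

MISSING_TOKENS = {"null", "na", "n/a", "none", "don't know", ""}

df = pd.DataFrame({
    "building": ["Building #1", "NA", "Bldg 3", " "],
    "occupancy": ["120", "Don't know", "", "85"],
})

def is_missing(value) -> bool:
    # Treat true NaNs, blanks, and known placeholder strings as missing.
    if pd.isna(value):
        return True
    return str(value).strip().lower() in MISSING_TOKENS

missing_mask = df.apply(lambda col: col.map(is_missing))  # per-cell True/False flags
print(missing_mask.sum())  # count of suspected missing values per column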

Duplicate data is another type that should, in theory, be easy to detect by looking for repeated entries, but it can prove much more difficult in practice, because the same data may initially look distinct. For example, if a database contains an entry for "Building #1" and "Bldg 1," these could be duplicate records in two different formats.
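One common way to catch such near-duplicates, sketched below purely as an illustration, is to normalize the text (lowercasing, stripping punctuation, expanding known abbreviations) before comparing entries with a similarity score. The abbreviation map and threshold are assumptions, not an exhaustive rule set.

import re
from difflib import SequenceMatcher

ABBREVIATIONS = {"bldg": "building", "st": "street", "ave": "avenue"}

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and expand known abbreviations.
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def likely_duplicates(a: str, b: str, threshold: float = 0.85) -> bool:
    # Compare the normalized strings with a simple similarity ratio.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(likely_duplicates("Building #1", "Bldg 1"))  # True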

Inconsistent data is perhaps the most difficult to identify, as data can be "inconsistent" in a number of ways. One example is when a single measurement, say temperature, is recorded using several different units or labels, i.e., temperature readings in Celsius in one set of records and Fahrenheit in others. These inconsistencies can be resolved by transforming the data or applying conversion formulas, but that requires foreknowledge of the units and formats used at data entry.
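For the temperature example, assuming each record carries an explicit unit label, a standardization step might look like the following sketch; the labels and record layout are hypothetical.

def to_celsius(value: float, unit: str) -> float:
    # Convert a labeled temperature reading to a single standard unit.
    unit = unit.strip().upper()
    if unit in ("C", "CELSIUS"):
        return value
    if unit in ("F", "FAHRENHEIT"):
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"Unknown temperature unit: {unit}")

readings = [(212.0, "F"), (100.0, "C"), (32.0, "Fahrenheit")]
standardized = [to_celsius(v, u) for v, u in readings]
print(standardized)  # [100.0, 100.0, 0.0]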

Machine Learning Solutions

Luckily, modern machine learning techniques can be brought to bear on many data quality problems. Machine learning is often defined as automated analytical model building. An initial set of "training data" is used to identify the best-fit model or algorithm for that data. Then, that model is applied to "live" data, and with each new set of data, the model is refined and fine-tuned to increase accuracy. For data management, this approach requires building initial models from "clean" data; the machine learning algorithm can then be improved iteratively with each additional run.
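As a schematic illustration of this train-then-refine loop, the sketch below fits an initial model on synthetic "training data" and then updates it with each new batch, using scikit-learn's incremental SGDClassifier. The data and model choice are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Initial "training data": clean, labeled examples used to fit the first model.
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = SGDClassifier(loss="log_loss", random_state=0)
model.partial_fit(X_train, y_train, classes=[0, 1])

# Each new batch of "live" data refines the same model in place.
for _ in range(10):
    X_batch = rng.normal(size=(100, 4))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch)

print(model.score(X_batch, y_batch))  # accuracy on the most recent batch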

Machine learning algorithms are effective data quality management techniques because they flip the traditional paradigm of computer programming on its head. Instead of explicitly outlining how a piece of software should behave, machine learning algorithms allow a computer to teach itself which patterns to look for.

Currently, machine learning approaches have been developed for several aspects of data management, including anomaly detection, imputing missing data, and deduplication.

Data Management Case History: Anomaly Detection
It is particularly advantageous to use automated machine learning algorithms for data cleaning where the volume of data is so large that cleaning it manually is impossible. In fact, larger data volumes are an advantage, as machine learning programs improve when trained and tested with more examples. Typical anomalies include fraudulent or unanticipated behavior and data systems going down. Multivariate statistics and autoencoding neural networks are modern approaches for detecting potential threats to data quality and can be used to trigger alerts for intervention. These models predict whether data is likely to fall outside an acceptable threshold based on prior training data.
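The sketch below illustrates the multivariate-statistics flavor of this idea on synthetic data: a robust Gaussian envelope is fitted to historical "clean" records, and new observations that fall outside the learned threshold are flagged for intervention. It is an illustration of the technique, not a client model.

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(42)

# Historical "clean" data used to learn what normal looks like.
clean_history = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
detector = EllipticEnvelope(contamination=0.01).fit(clean_history)

# New data to score, with one obvious outlier appended.
new_batch = np.vstack([
    rng.normal(size=(5, 3)),
    [[8.0, -7.5, 9.0]],
])

labels = detector.predict(new_batch)  # +1 = normal, -1 = anomaly
print(labels)  # the last record should be flagged (-1) and could trigger an alert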

Proof of Concept
The Brite Group has experience identifying and repairing bad data. In 2020, a leading federal agency responsible for collecting survey responses made it easier for the public to respond to its questions anytime and anywhere by offering more response options and Internet-based submissions. However, these changes also increased the risk of fraudulent responses. The Brite Group assisted the agency in establishing a Survey Response Quality Assurance platform that detects patterns of inconsistent behavior in survey responses and takes appropriate actions in real time.

This work includes custom-built statistical models and machine learning algorithms to detect fraudulent responses and other data quality issues. Core survey response data is stored in Hadoop and extracted to a SAS-based analytic platform via daily batches and low-latency jobs. At various points in this process, custom machine learning models, written in R and Python and trained on thousands of records, analyze the data to flag fraudulent cases.
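For readers unfamiliar with how such scoring works, the sketch below shows a generic supervised fraud-scoring step on synthetic data. The features, labels, and model choice are illustrative assumptions and do not represent the agency's production models.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)

# Hypothetical per-response features (e.g., responses per minute, repeated-answer
# rate, off-hours flag); labels mark previously confirmed fraudulent cases.
X = rng.random((2000, 3))
y = (X[:, 0] > 0.9).astype(int)  # stand-in fraud pattern for this sketch only

model = GradientBoostingClassifier().fit(X, y)

incoming = rng.random((5, 3))
fraud_probability = model.predict_proba(incoming)[:, 1]
print(fraud_probability)  # scores that could trigger review or real-time action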

We also help clients by creating systems to automatically impute and predict missing data. The Department of Justice (DOJ) maintains the National Incident-Based Reporting System (NIBRS), an incident-based system for collecting and reporting data on crimes. Local, state, and federal agencies generate NIBRS data from their records management systems. The FBI compiles and distributes this data, including to the DOJ's Bureau of Justice Statistics (BJS). However, NIBRS records contain many missing data values. The Brite Group has assisted the Bureau of Justice Statistics by building machine learning solutions to better impute multiple years' worth of missing historical record values. The solution utilizes custom-trained machine learning models maintained in Python and deployed on a cloud platform.
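As a generic illustration of ML-based imputation (not the custom NIBRS models), the sketch below fills gaps in a small synthetic table using scikit-learn's KNNImputer, which infers each missing cell from the most similar complete records.

import numpy as np
from sklearn.impute import KNNImputer

# Records with gaps; np.nan marks missing values.
records = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [1.2, 2.1, 3.3],
    [2.9, 4.1, 5.8],
])

imputer = KNNImputer(n_neighbors=2)
completed = imputer.fit_transform(records)  # missing cells filled from similar rows
print(completed)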

Summary
The benefits of Data Analytics are undeniable, and government agencies and business enterprises are searching for methods to improve this process and achieve better results. The main stumbling block is that most organizations have not yet overcome fundamental data quality challenges. Evidence of these problems comes from a study published in MIT Sloan Management Review, reporting that companies are losing around 15% to 25% of their revenues due to poor data quality. IBM estimated that the annual impact on the US economy is a staggering $3.1 trillion.

Working with several agencies, we have delivered proofs of concept demonstrating that Machine Learning can clean data, streamline data input, and result in more accurate data and better analysis. These case histories prove the benefits of our Machine Learning processes. If you are interested in learning more about how Machine Learning can benefit your data, your processes, and your desired results, please email us at info@TheBriteGroup.com.

Visit us at www.thebritegroup.com to learn more about our capabilities.