Of course, the vastness of such data makes it impossible for individuals to analyze and interpret it using conventional methods. This has driven an explosion in AI technology, and it is AI that has fueled much of the excitement about the potential of data lakes in the life sciences.
But what happens when a data lake becomes “polluted”? What happens when missing or incorrectly entered data skews AI models, or when malicious actors deliberately inject bad information?
Whether it’s a single string in a malicious data file or harmless inconsistencies in how data is entered, the result is data lake pollution. It can fundamentally change how an AI model interprets information, costing companies millions of dollars and rendering their AI and machine learning unreliable.
The problem of reproducibility
To sharpen the problem of contaminated data lakes, consider the example of clinical trials. Trials of emerging treatments in the pharmaceutical industry can be significantly accelerated by using larger data sets, and those data sets can be pooled in data lakes from many different research organizations. According to at least one software vendor, this volume of data, interpreted through AI modeling, could reduce development costs and time to market by as much as 30%.
Now imagine how badly clinical trials could be derailed by small but significant data quality issues that compound over time. Reliability and reproducibility are essential for drawing sound conclusions from data, and this has already proven to be a problem in the life sciences due to incomplete data.
At the end of 2021, studies by cancer biologists from the Center for Open Science found that 59% of experiments across 23 studies failed to replicate, largely due to missing or unavailable data. Extend that problem across the multiple sources feeding a data lake, and it becomes clear that the industry may need to place less trust in AI modeling of data than the technology initially promised.
This has prompted the software and IT industry to create a new category of technology known as ‘data observability’. However, because this technology has emerged so quickly, most players in the field approach the problem somewhat backward.
Tools can often reveal quality issues once data has already landed in the lake, but this is akin to closing the barn door after the horse has bolted. Far fewer companies can gain a deep understanding of data as it is introduced into the data lake, yet that is exactly where conclusions must be drawn – before the AI modeling itself can be called into question.
Potential data problems for AI
There are several ways in which a data lake can become polluted. Each underlines the need for data observability from the earliest stages of building a data lake:
Data volume deficiencies. Data pipelines and batch Extract-Transform-Load (ETL) processes typically produce a consistent number of files on an hourly, daily, or even monthly basis. Monitoring the file count across these windows is a fairly simple way to gauge the health of your flows.
Knowing how many files are expected over a measured period makes it easier to determine whether the full volume of data has arrived. Receiving fewer files than expected indicates a problem upstream in the data supply chain; receiving significantly more can mean data has been duplicated.
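This kind of volume check can be expressed as a simple comparison against a baseline. A minimal sketch in Python (the function name, baseline, and tolerance values are illustrative assumptions, not from any specific product):

```python
def check_file_volume(received: int, expected: int, tolerance: float = 0.2) -> str:
    """Flag a batch window whose file count deviates from its baseline.

    `expected` is the baseline count for the window (hourly, daily, ...);
    `tolerance` is the accepted fractional deviation from that baseline.
    """
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    if received < lower:
        return "MISSING_DATA"          # fewer files than expected: upstream problem
    if received > upper:
        return "POSSIBLE_DUPLICATES"   # significantly more data than expected
    return "OK"
```

In practice the baseline would be derived from historical window counts rather than hard-coded, but the alerting logic stays the same.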
Potentially corrupted data. Structured data schemas should be checked for conformance. Receiving data whose schema is not as expected (such as extra columns introduced by formatting errors) can cause significant operational problems.
Data quality monitoring and data observability should detect quality flaws as early as possible in the data transformation process. Schemas within the data supply chain can drift over time, and if those changes go unnoticed, downstream production analytics can end up subtly but fundamentally flawed.
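A schema conformance check of this kind can be sketched as a comparison of incoming column names against an expected definition. The function and column names below are hypothetical, for illustration only:

```python
def validate_schema(columns: list[str], expected: list[str]) -> list[str]:
    """Return a list of human-readable schema violations for an incoming file."""
    issues = []
    extra = [c for c in columns if c not in expected]
    missing = [c for c in expected if c not in columns]
    if extra:
        issues.append(f"unexpected columns: {extra}")   # e.g. formatting errors
    if missing:
        issues.append(f"missing columns: {missing}")
    if not issues and columns != expected:
        issues.append("column order changed")           # possible silent drift
    return issues
```

Running such a check on every file as it arrives, rather than after loading, is what lets schema drift be caught before it reaches production analytics.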
Incomplete data. Data can often be densely packed, with millions of rows and hundreds of columns—in other words, potentially billions of “data points” per file.
Such densely packed data can contain empty or null values, which wreak havoc on machine learning models. Hundreds of such sparsely populated datasets, each containing billions of data points, will inevitably alter the training data, which can drastically skew analytical outcomes and predictions.
It is vital to know how many null or empty values to expect in dense files. That requires establishing a known baseline of such problem values as a yardstick for monitoring. If null values exceed a certain percentage of the total data content, that should be a warning that something significant is going wrong.
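The threshold idea above amounts to computing the null ratio of a file and comparing it to a baseline. A minimal sketch, with an assumed 5% default threshold purely for illustration:

```python
def null_ratio_alert(rows, threshold: float = 0.05):
    """Return (alert, ratio): alert is True when the fraction of
    null/empty cells exceeds the baseline threshold."""
    total = 0
    nulls = 0
    for row in rows:
        for value in row:
            total += 1
            if value is None or value == "":
                nulls += 1
    ratio = nulls / total if total else 0.0
    return ratio > threshold, ratio
```

For billion-cell files this would be done with a columnar engine rather than a Python loop, but the baseline-versus-observed comparison is the same.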
Duplicate data. By some estimates, companies without data quality initiatives can experience data duplications of 10% to as much as 30%.
Storing duplicate data can be expensive and can lead to biased results when used for analytics or machine learning training initiatives. Knowing if your data is duplicated can dramatically reduce operational costs.
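One common way to detect duplication is to fingerprint each record with a content hash and flag repeats. A minimal sketch (hashing the `repr` of a record is a simplification; real pipelines would hash a canonical serialization):

```python
import hashlib

def find_duplicates(records) -> list[int]:
    """Return the indices of records whose content hash was already seen."""
    seen = {}
    dupes = []
    for i, rec in enumerate(records):
        digest = hashlib.sha256(repr(rec).encode()).hexdigest()
        if digest in seen:
            dupes.append(i)     # exact duplicate of an earlier record
        else:
            seen[digest] = i
    return dupes
```

Reporting the duplicate ratio per batch gives a direct measure against the 10–30% duplication estimates cited above.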
Late data. While this is more common in areas other than life sciences, it’s important to understand if data is received later than expected. Late data can become a real problem if organizations have already performed transformations, aggregations and analytics without the late data.
For ongoing training of machine learning models, late data may require models to be retrained, which is obviously a time-consuming process that should be avoided if possible.
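Lateness checks reduce to comparing a file’s arrival time against the end of its batch window plus a grace period. A minimal sketch, with an assumed one-hour grace period:

```python
from datetime import datetime, timedelta, timezone

def is_late(arrival: datetime, window_end: datetime,
            grace: timedelta = timedelta(hours=1)) -> bool:
    """True if a file arrived after its batch window's grace period,
    meaning transformations may already have run without it."""
    return arrival > window_end + grace
```

Flagging late arrivals at ingest lets teams decide whether to re-run aggregations or retrain models before stale results propagate.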
Simple principles of data observability
So what should be done to limit the potential for data lake contamination? How can data observability help?
The technology underlying data observability is vast and has a range of potential applications. To make it simple, there are two things that are of utmost importance:
Start before implementing the extraction process. As described earlier, ETL is the most fundamental process that occurs with data before it is included in a data lake. The process can become more complex when used to combine multiple data sets into one.
It’s important to check data for quality defects (or embedded threats) before that data is extracted and loaded into the data lake. Once it is combined with other data in the ETL process, upstream quality defects and malicious payloads become part of the lake, and the more complex the ETL process, the harder it is to identify and maintain quality standards.
Don’t wait to monitor data quality after ETL processes have been run. At that point, the data lake is already polluted.
Be alert for missing data. As mentioned before, large data sets can contain significant amounts of missing or empty fields. When data, especially AI training data, starts to contain an unexpected number of missing values, it will almost certainly impact the performance and outcomes of the AI model.
Establish threshold benchmarks that flag missing values outside a defined minimum or maximum. Minimizing how much missing data inadvertently ends up in a data lake reduces the chance of errors when training accurate AI models.
AI holds the promise of unlocking discoveries in massive amounts of data that individuals could never manage alone. But if data lakes, inadvertently or maliciously, are contaminated, AI’s promise may not be fulfilled and the discovery process may be set back. Data observability is an important element in preventing data lake pollution and the first step towards a successful implementation of AI.
Dave Hirko is founder and director of Zectonal. He can be reached at [email protected]