Data Quality: Why It Matters in Machine Learning

By Liudmyla Taranenko

Posted on December 28, 2020

It’s likely that you’ve come into contact with some of the hype and excitement related to how machine learning and data science can enhance businesses. You may have heard buzzword-laden phrases talking about ‘actionable insights’ and ‘competitive edges’. It’s absolutely the case that machine learning and data science can improve your business, but only if you’ve prepared the data and know what you’re doing.

Data Sources

For a retail demand forecasting system, we can collect data over multiple years, meaning we’ve got ample data for our model. To begin, it’s necessary to analyze and eventually extract our data. After extracting it, we’ll need to determine the quality of the data, and remedy any issues that may exist.
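As a rough illustration of what determining data quality can look like in practice, here is a minimal Python sketch that counts missing values, duplicate rows, and implausible quantities in a handful of sales records. The field names (date, store, units_sold), the sample records, and the quality_report helper are assumptions made up for this example, not part of any particular system.

```python
from collections import Counter


def quality_report(rows, required_fields):
    """Summarize missing values, duplicate rows, and negative quantities."""
    missing = Counter()
    seen = set()
    duplicates = 0
    negative_units = 0
    for row in rows:
        # Count required fields that are absent or empty.
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
        # Treat rows with identical required fields as duplicates.
        key = tuple(row.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        # Negative sales are implausible for this hypothetical data set.
        units = row.get("units_sold")
        if isinstance(units, (int, float)) and units < 0:
            negative_units += 1
    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicates": duplicates,
        "negative_units": negative_units,
    }


# Hypothetical sales records, invented for illustration only.
sales = [
    {"date": "2020-01-01", "store": "A", "units_sold": 12},
    {"date": "2020-01-01", "store": "A", "units_sold": 12},    # duplicate row
    {"date": "2020-01-02", "store": "A", "units_sold": None},  # missing value
    {"date": "2020-01-03", "store": "B", "units_sold": -4},    # implausible value
]

report = quality_report(sales, ["date", "store", "units_sold"])
print(report)
```

A real project would likely add checks specific to the business case, such as gaps in the date range or outliers around holidays.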

We’ll then need to identify the data’s overarching structure. Finally, a machine learning engineer will put that data into context, relating it both to the machine learning algorithms under consideration and to the business point of view.

It’s worth noting that in some cases, there’s simply no existing data available. Fortunately, there are still solutions, but they take a bit more planning and work. A machine learning engineer will need to evaluate the likely use cases and what the client wishes to accomplish.

From here, we can work toward a technical roadmap to reach those targets. Possible solutions would include purchasing an existing data set or using a pre-trained machine learning model. It’s even possible to create artificial data that fills the need.

As an example, a biometrics-based security system can potentially make use of a model previously trained on facial or voice recognition data sets. In that case, the model will need to be fine-tuned to improve its accuracy as more specific data comes in.

Role of a Machine Learning Engineer

It’s vital that any machine learning engineer understands the needs of their client and the client’s customers. The primary role of an ML engineer is to understand and solve the issues a client faces, and to work out how to deliver value to customers and end-users.

This means a business needs machine learning consulting to build a comprehensive roadmap detailing the precise way in which machine learning can fit into that business landscape.

When machine learning engineers begin to process data, they may need to consult a domain expert while labeling and categorizing it. However, many machine learning projects are undertaken without any such domain experts. As a result, a project may run into problems caused by faulty categorization of the data, operator error, and/or mistaken assumptions about the form of the output the machine learning model delivers. The model can even produce incorrect values from the start, which compounds issues down the chain over time.

Machine learning engineers devote more than 75% of their effort to processing the starting data, all before doing any training of the machine learning model. However, even that effort doesn’t rule out bias or error within the data. It’s frequently challenging to produce data of the desired quality and meet the requirements set as goals at the beginning. Here, we come to the concept of unsupervised machine learning.

Unsupervised Machine Learning

With supervised learning, the machine learning process uses example input/output pairs to learn a function that maps an input to its corresponding output. The training examples, which consist of labeled data, allow the process to infer the function. Because the labels delineate the categories in the data explicitly, they are extremely useful to the machine learning model. With the supervised learning model, the model’s performance can be measured from the start, and we know what an accurate output looks like.
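To make the idea of learning from labeled pairs concrete, here is a minimal sketch of a one-nearest-neighbor classifier in plain Python. It is only one of many supervised techniques, chosen here for brevity; the features (weekly visits, average basket size), the segment labels, and the data values are hypothetical. Because the test examples are labeled too, accuracy can be measured directly.

```python
import math


def nearest_neighbor_predict(train, point):
    """Return the label of the training example closest to `point`."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], point))
    return label


def accuracy(train, test):
    """Fraction of labeled test examples that are predicted correctly."""
    correct = sum(
        1
        for features, label in test
        if nearest_neighbor_predict(train, features) == label
    )
    return correct / len(test)


# Hypothetical labeled pairs: (weekly_visits, avg_basket) -> customer segment.
train = [
    ((1.0, 20.0), "occasional"),
    ((1.5, 25.0), "occasional"),
    ((6.0, 80.0), "loyal"),
    ((7.0, 95.0), "loyal"),
]

# Held-out labeled examples used only to measure performance.
test = [
    ((1.2, 22.0), "occasional"),
    ((6.5, 90.0), "loyal"),
]

print(accuracy(train, test))
```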

The unsupervised machine learning model, in contrast, works without any data labels, so it’s not possible to directly measure how the algorithm performs.

With unsupervised machine learning, the goal is to discover the underlying structure of the data and to separate the data into various categories. Often, the algorithm is designed around overarching business goals, and when the process works well it can produce a solution that is more powerful and helpful than one built with supervised learning.
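As a sketch of how an algorithm can separate unlabeled data into categories, here is a bare-bones k-means clustering routine in plain Python. K-means is just one common unsupervised technique, picked for illustration; the sample points, the choice of two clusters, and the simple "first k points" initialization are all assumptions of this example.

```python
import math


def k_means(points, k, iterations=20):
    """Plain k-means: returns (centroids, cluster index for each point)."""
    # Simple deterministic initialization: use the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical unlabeled observations with two visible groups.
points = [
    (1.0, 1.0), (9.0, 9.5), (1.2, 0.8),
    (0.9, 1.1), (8.8, 9.0), (9.2, 9.1),
]

centroids, assignments = k_means(points, k=2)
print(assignments)
print(centroids)
```

Note that no labels are supplied: the algorithm only reports which points belong together, and it is up to the engineer and the business to decide what each group means.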

Unlike supervised learning, unsupervised machine learning algorithms can uncover patterns in the data that human users may not even have been aware of. It’s also worth noting that an unsupervised approach can use new data as it arrives to uncover brand new, previously unknown patterns, which improves on the initial solution.

When determining whether to use an unsupervised or supervised approach, the primary deciding factor should be your specific business case, not whether one approach is more popular than another.

Business Use Cases

In one example, a project made use of unsupervised machine learning algorithms to evaluate how virtual machines performed. It was difficult to evaluate the project at the outset, meaning the best approach was not to use any labels that might impose a subjective bias on the project.

The resulting project started with creating a cloud-based virtual machine. It was then benchmarked by running a suite of tests to establish baseline performance, resulting in approximately 2,000 characteristics within the raw data. The next step was to pull out the most relevant and valuable benchmarks from the data and to compress them into coefficients such as single-core performance, database performance, RAM, stability, and parallelization. From there, it was possible to calculate a custom coefficient determining an optimal balance between price and performance, allowing for the selection of the most suitable instance type.
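The exact formulas used in that project are not given here, so the following is only a hypothetical sketch of the general idea: average groups of normalized benchmark scores into a few coefficients, combine them with business-defined weights, and divide by price. The metric names, groupings, weights, prices, and instance names are all invented for illustration.

```python
from statistics import mean

# Hypothetical mapping from each coefficient to the benchmarks that feed it.
GROUPS = {
    "single_core": ["cpu_int", "cpu_float"],
    "ram": ["mem_read", "mem_write"],
}


def compress(benchmarks, groups):
    """Collapse normalized benchmark scores into one coefficient per group."""
    return {
        name: mean(benchmarks[metric] for metric in metrics)
        for name, metrics in groups.items()
    }


def price_performance(coefficients, weights, hourly_price):
    """Weighted average of the coefficients divided by price (higher is better)."""
    total_weight = sum(weights.values())
    score = sum(coefficients[n] * w for n, w in weights.items()) / total_weight
    return score / hourly_price


# Invented instance types with normalized scores (0..1) and hourly prices.
instances = {
    "vm_small": {
        "price": 0.10,
        "benchmarks": {"cpu_int": 0.4, "cpu_float": 0.6,
                       "mem_read": 0.5, "mem_write": 0.5},
    },
    "vm_large": {
        "price": 0.40,
        "benchmarks": {"cpu_int": 0.9, "cpu_float": 0.9,
                       "mem_read": 0.8, "mem_write": 1.0},
    },
}

# Assumed business priorities: single-core speed matters twice as much as RAM.
weights = {"single_core": 2.0, "ram": 1.0}

scores = {
    name: price_performance(compress(vm["benchmarks"], GROUPS), weights, vm["price"])
    for name, vm in instances.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

In this made-up data the larger machine is faster in absolute terms, but the smaller one delivers more performance per unit of price, which is the kind of trade-off a custom coefficient is meant to expose.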

This project was an instructive instance of the technique of dimensionality reduction. Rather than using all characteristics, it was preferable to collapse information into something representative of desired results.
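For readers who want to see dimensionality reduction in its most basic form, here is a small sketch of principal component analysis (PCA) restricted to the first component, computed by power iteration in plain Python. PCA is one standard technique and is not necessarily what the project above used; the three-metric benchmark rows are invented for the example.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the top principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return means, vec


def project(rows, means, direction):
    """Reduce each row to one number: its coordinate along `direction`."""
    return [
        sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
        for r in rows
    ]


# Hypothetical benchmark rows: three strongly correlated metrics per machine.
rows = [
    [1.0, 2.0, 3.0],
    [2.0, 4.1, 6.0],
    [3.0, 6.0, 9.1],
    [4.0, 8.1, 12.0],
]

means, direction = first_principal_component(rows)
reduced = project(rows, means, direction)
print(direction)
print(reduced)
```

Here three correlated columns collapse into a single value per machine while preserving the ordering of the machines, which is the essence of trading many characteristics for a few representative ones.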

A Final Summary

In conclusion, unsupervised machine learning can deliver more precise business insights from the data behind AI-based projects. However, there’s no one-size-fits-all pattern that can be applied to every business case. Unsupervised machine learning is one tool for getting results, suitable for some cases but not for others.