by Michael Zammuto

You won’t clean all that data, so let AI clean it for you

Opinion
Dec 08, 2017
Artificial Intelligence | Data Quality | Machine Learning

By focusing machine learning on systematically getting smarter about how it analyzes, rates and utilizes data, we can not only reduce coding-hours but also worry less about imperfect data.


Don’t cheat and Google the answer to this question:

How many zeros are in a zettabyte?

A zettabyte is one sextillion bytes, and a single zettabyte is roughly a tenth of all the data created each year. Picture a 1 followed by 21 zeros. If that sounds like an unnecessary factoid, you would be forgiven.

The reason you should care is that market intelligence firm IDC says that by 2025 we will create 180 zettabytes of data annually. This massive acceleration of data creation means the work enterprises must do to make use of their data will keep growing exponentially. It is time to give up on trying to organize and clean it all.

The big project before the big project

A data architect friend says that every enterprise’s data is great until you try to use it. Analytics, business intelligence, systems integration and collaboration projects tend to be envisioned and approved based on great business cases. But they also reveal the scary realities of enterprise data.

So, it is common for there to be a data-cleaning project before the project the business really wants done. These cleanup efforts are often brute-force, trial-and-error affairs, and between the uncertainty, the shortage of big data skills and the limitations of the tools, they tend to be high in expense and headache but light in measurable business ROI.

Don’t clean your data for AI – let AI clean it for you

Unlike other data-centric initiatives, don't pitch a project to corral and clean your data before launching your artificial intelligence initiatives; use machine learning to get you there faster and more easily.

Traditional enterprise data strategies suffer from a central flaw: they do not scale. The more data you throw at them, the more the inconsistencies in the original data collection, storage and manipulation come through, so you must keep developing new rules to handle them. It's the worst kind of project scope creep because, too often, every new exception you find takes you farther from your goal, not closer to it. Essentially, you have trapped yourself into trying to spend money to clean data faster than your enterprise can generate it.
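To make that treadmill concrete, here is a rough sketch (in Python, with invented company names) of what rule-based cleaning tends to look like in practice: a lookup table of hand-written fixes that has to grow every time a new feed introduces a new variant.

```python
# Sketch of the rule-based approach described above: every new data source
# adds another hand-written exception. The company names are made up.
COMPANY_FIXES = {
    "ACME CORP": "Acme Corporation",
    "Acme Corp.": "Acme Corporation",
    "acme corporation inc.": "Acme Corporation",
    # ...and another entry every time a new variant turns up in a new feed
}

def clean_company(raw: str) -> str:
    """Normalize a company name if we happen to have a rule for it."""
    return COMPANY_FIXES.get(raw.strip(), raw.strip())
```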

Leveraging the power of user-generated content and user-activity content

Most data will always be unstructured and the amount of user-generated data, or raw material, is increasing constantly. It’s amazing what has been accomplished in the past just by using SQL.

I have worked on some massive, internet-scale data platforms, including user-generated content from sixty million people a month, feeds from sensor devices, a search engine and social media platforms. Working on massive flows of data quickly teaches you that cleaning it is like trying to mop up the ocean: clean some up, and more rushes in faster than you can react to it.

Investing in machine learning  

In recent years I have turned to machine learning to address these data challenges. You can automate data matching by using a learning model to predict matches: the more data that is submitted to the model for standardization, the better it gets. So, unlike traditional data management and cleaning strategies, machine learning algorithms do better with scale.
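As a rough illustration of the idea, and not the exact system my team built, a learned matcher can be as simple as a classifier trained on per-field similarity scores between pairs of records. The field names and training pairs below are invented:

```python
# Illustrative sketch: learn to predict whether two records describe the same
# customer, rather than hand-writing matching rules. Fields and pairs are made up.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def similarity(a, b):
    """Normalized string similarity between two field values."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def features(rec_a, rec_b):
    """Per-field similarity scores used as model inputs."""
    return [similarity(rec_a[f], rec_b[f]) for f in ("name", "email", "city")]

# Labeled pairs: 1 = same customer, 0 = different customer.
training_pairs = [
    ({"name": "Acme Corp.", "email": "info@acme.com", "city": "Boston"},
     {"name": "ACME Corporation", "email": "info@acme.com", "city": "Boston"}, 1),
    ({"name": "Acme Corp.", "email": "info@acme.com", "city": "Boston"},
     {"name": "Zenith LLC", "email": "sales@zenith.io", "city": "Denver"}, 0),
    ({"name": "Jane Doe", "email": "jdoe@example.com", "city": "Chicago"},
     {"name": "J. Doe", "email": "jdoe@example.com", "city": "Chicago"}, 1),
    ({"name": "Jane Doe", "email": "jdoe@example.com", "city": "Chicago"},
     {"name": "John Smith", "email": "jsmith@example.com", "city": "Austin"}, 0),
]

X = [features(a, b) for a, b, _ in training_pairs]
y = [label for _, _, label in training_pairs]
model = LogisticRegression().fit(X, y)

# A new, messy pair arriving from another system:
pair = features(
    {"name": "ACME corp", "email": "info@acme.com", "city": "boston"},
    {"name": "Acme Corporation", "email": "info@acme.com", "city": "Boston"},
)
print(model.predict_proba([pair])[0][1])  # estimated probability of a match
```

Every new labeled pair that comes out of review becomes more training data, which is exactly why this approach gets better with scale instead of worse.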

This approach let us accomplish a great deal in a short time, even with a small team. There are challenges to implementing machine learning: you need to understand the process, including the different algorithms available and the kinds of problems they can be applied to. But implemented correctly, it can help solve all kinds of problems and genuinely propel a business forward.

It is important to have a well-designed database, but generating, integrating, identifying and aging data still requires a great deal of code. Most large organizations have huge quantities of data that can be used to understand how their customers behave and to surface insights that drive strategic decisions and growth. But analyzing and understanding data at that scale is nearly impossible by hand.

Traditionally, the impact of bad data, and the effort to clean it up once a system went live, was so painful that it encouraged an old-fashioned approach to development. Many agile-minded technologists get increasingly "waterfall-oriented" when designing large-scale data solutions. You cannot plan for every contingency or situation, so systems that learn are naturally a better fit.

Focusing on machine learning offers a more agile approach to development than traditional data-driven solutions. Machine learning makes it possible to analyze the data, make predictions, and then learn and adjust based on how accurate those predictions turn out to be. As more data is analyzed, the predictions improve.
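One way to picture that loop, as a sketch rather than a prescription, is an incrementally trained model that is scored against each new batch of reviewed outcomes before it updates itself. The feedback stream below is simulated:

```python
# Sketch of the predict -> check accuracy -> adjust loop using incremental
# learning. The feedback stream here is simulated random data.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # 0 = no match, 1 = match

def feedback_batches():
    """Stand-in for a stream of reviewed predictions (features plus true labels)."""
    rng = np.random.default_rng(0)
    for _ in range(5):
        X = rng.random((32, 3))                 # e.g. per-field similarity scores
        y = (X.mean(axis=1) > 0.5).astype(int)  # reviewer-confirmed labels
        yield X, y

for X_batch, y_batch in feedback_batches():
    if hasattr(model, "coef_"):                 # skip scoring before the first fit
        print("accuracy on new batch:", model.score(X_batch, y_batch))
    model.partial_fit(X_batch, y_batch, classes=classes)
```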

When it comes to powering specific functions, machine learning can do most of the work for us. By focusing machine learning on systematically getting smarter about how it analyzes, rates and utilizes data, we can not only reduce coding hours but also worry less about imperfect data. While there are challenges to using machine learning, the benefits to a business outweigh the disadvantages.