16 Oct, 2019

Avoiding Garbage in Machine Learning

“Garbage in, garbage out” – it’s a cliché in machine learning circles. Anyone who works with artificial intelligence (AI) knows that the quality of the data goes a long way toward determining the quality of the result. But “garbage” is a broad and expanding category in data science – poorly labeled or inaccurate data, data that reflects underlying human prejudices, incomplete data. To paraphrase Tolstoy, great datasets are all alike, but all garbage datasets are garbage in their own, unique and horrible ways.

People believe in machine learning. Israeli philosopher and historian Yuval Noah Harrari coined the term “dataism” to describe a blind faith in algorithms. This faith extends beyond machine learning’s ability to analyze data. Many people believe machine learning is auto-magically able to predict the future. This is inaccurate. Machine learning is excellent at identifying patterns in large well-labeled datasets. In certain cases, those patterns will continue to unfold in the future. In other cases, they won’t. However, the machine-learning researcher must assume responsibility for the perception that AI is predictive. Mitigating the effects of garbage data becomes a moral imperative when your algorithm is being used to make parole decisions or to invest billions in hard-earned pension dollars.

Faith in machine learning isn’t misplaced. The reason machine learning is considered more objective than other methods of data analysis is because it is demonstrably superior. Machine learning is able to produce models that can analyze larger, more complex datasets than conventional methods, and deliver more accurate results, all at scale. When used properly, machine learning can help companies to identify profitable opportunities and avoid risks. But even a technology so advanced can’t perform the alchemy of turning garbage into gold. Learning to identify “garbage” data is the first step in unlocking machine learning’s vast potential.

“Machine learning is not magic,” cautions Nick Cafferillo, chief data and technology officer at S&P Global. “It’s about augmenting our thought processes to help prove – or disprove – a hypothesis we have; a hypothesis that is founded on real business questions that we understand in the abstract.”

Garbage data is a concern when you’re working with machine learning because there are two opportunities for the garbage to mess up your results. First, if you train your machine learning model with garbage data, you have baked bad data into the underlying algorithm. Feed good data into a neural network trained on garbage and your results may be inaccurate. Alternately, you could train your neural network on great data and then run garbage data through the well-trained algorithm. Either way, your output will be questionable.

Content Type

Articles

Language

English

S&P Global Offerings

Market Intelligence

Ratings

Energy

S&P Dow Jones Indices

Featured Topics

Featured Products

S&P Capital IQ Pro

S&P Global Energy Core

S&P Global ESG Scores

Ratings360

SPICE: The Index Source for ESG Data

Events

Careers

Market Intelligence

Ratings

Energy

S&P Dow Jones Indices

S&P Capital IQ Pro

S&P Global Energy Core

S&P Global ESG Scores

Ratings360

SPICE: The Index Source for ESG Data

S&P Global Offerings

Market Intelligence

Ratings

Energy

S&P Dow Jones Indices

Featured Topics

Featured Products

S&P Capital IQ Pro

S&P Global Energy Core

S&P Global ESG Scores

Ratings360

SPICE: The Index Source for ESG Data

Events

Careers

Market Intelligence

Ratings

Energy

S&P Dow Jones Indices

S&P Capital IQ Pro

S&P Global Energy Core

S&P Global ESG Scores

Ratings360

SPICE: The Index Source for ESG Data

Featured Products

S&P Capital IQ Pro

S&P Global Energy Core

S&P Global ESG Scores

Ratings360

SPICE: The Index Source for ESG Data

Ratings & Benchmarks

Overview

Find a Rating

By Topic

S&P Capital IQ Pro

S&P Global Energy Core

S&P Global ESG Scores

Ratings360

SPICE: The Index Source for ESG Data

Overview

Find a Rating

Market Insights

Energy & Commodities

Capital Markets

Sustainability

Artificial Intelligence

Economy

Global Trade

Technology & Innovation

S&P Global Institute

Events

Featured S&P Global Events

Webinar Replays

Inland26

CERAWeek

World Petrochemical Conference

Energy & Commodities

Capital Markets