articles Corporate /en/research-insights/articles/avoiding-garbage-in-machine-learning-shell content
Log in to other products

Login to Market Intelligence Platform

 /


Looking for more?

Request a Demo

You're one step closer to unlocking our suite of comprehensive and robust tools.

Fill out the form so we can connect you to the right person.

If your company has a current subscription with S&P Global Market Intelligence, you can register as a new user for access to the platform(s) covered by your license at Market Intelligence platform or S&P Capital IQ.

  • First Name*
  • Last Name*
  • Business Email *
  • Phone *
  • Company Name *
  • City *
  • We generated a verification code for you

  • Enter verification Code here*

* Required

In This List

Avoiding Garbage in Machine Learning

S&P Global Platts

China petrochemicals sector braces for global recession as coronavirus leaves economies reeling

Consumer, autos sectors drive February’s fall in U.S. trade activity

S&P Dow Jones Indices

Distressed in Distressing Times

S&P Global

COVID-19 Daily Update: March 31, 2020


Avoiding Garbage in Machine Learning

“Garbage in, garbage out” – it’s a cliché in machine learning circles. Anyone who works with artificial intelligence (AI) knows that the quality of the data goes a long way toward determining the quality of the result. But “garbage” is a broad and expanding category in data science – poorly labeled or inaccurate data, data that reflects underlying human prejudices, incomplete data. To paraphrase Tolstoy, great datasets are all alike, but all garbage datasets are garbage in their own, unique and horrible ways.

People believe in machine learning. Israeli philosopher and historian Yuval Noah Harrari coined the term “dataism” to describe a blind faith in algorithms. This faith extends beyond machine learning’s ability to analyze data. Many people believe machine learning is auto-magically able to predict the future. This is inaccurate. Machine learning is excellent at identifying patterns in large well-labeled datasets. In certain cases, those patterns will continue to unfold in the future. In other cases, they won’t. However, the machine-learning researcher must assume responsibility for the perception that AI is predictive. Mitigating the effects of garbage data becomes a moral imperative when your algorithm is being used to make parole decisions or to invest billions in hard-earned pension dollars.

Faith in machine learning isn’t misplaced. The reason machine learning is considered more objective than other methods of data analysis is because it is demonstrably superior. Machine learning is able to produce models that can analyze larger, more complex datasets than conventional methods, and deliver more accurate results, all at scale. When used properly, machine learning can help companies to identify profitable opportunities and avoid risks. But even a technology so advanced can’t perform the alchemy of turning garbage into gold. Learning to identify “garbage” data is the first step in unlocking machine learning’s vast potential.

“Machine learning is not magic,” cautions Nick Cafferillo, chief data and technology officer at S&P Global. “It’s about augmenting our thought processes to help prove – or disprove – a hypothesis we have; a hypothesis that is founded on real business questions that we understand in the abstract.” 

Garbage data is a concern when you’re working with machine learning because there are two opportunities for the garbage to mess up your results. First, if you train your machine learning model with garbage data, you have baked bad data into the underlying algorithm. Feed good data into a neural network trained on garbage and your results may be inaccurate. Alternately, you could train your neural network on great data and then run garbage data through the well-trained algorithm. Either way, your output will be questionable.

View the Full Article
Read More