The terminology would not be out of place in a science fiction novel: “Machine Learning” and “Artificial Intelligence”; “Deep Learning” and “Neural Nets”—Silicon Valley capitalists are pouring vast amounts of money into technologies going by these names. Universities are racing to outdo each other in creating new departments and programs for “Data Science.” College students are rushing into any course with “Machine Learning” in its title. Corporations, political parties, and intelligence agencies are anxious to maintain an edge in the use of these technologies.
Statisticians might be forgiven for thinking that this hype simply reflects the success of the marketing speak of Silicon Valley entrepreneurs vying for venture capital. All these fancy new terms are just describing something statisticians have been doing for at least two centuries. The vast majority of machine learning methods deal with what is now called “supervised learning” and was traditionally called “prediction.” Statistical prediction methods were introduced to astronomy and the physical sciences in the early 1800s; somewhat later the social sciences followed suit.
Prediction works like this: Suppose you have a dataset with a bunch of variables, called covariates (or features, or predictors), and an outcome variable (or label) of interest. The goal of the game is to come up with a good guess (prediction) of what the outcome will be for new observations, based only on the covariates for those new observations. For instance, you might try to predict which party someone is going to vote for in an election, based on your knowledge of their demographics, address, income, etc., after having analyzed a dataset including both these variables and historical voting behavior for a number of people.
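In code, this setup might look as follows. The toy dataset and the one-nearest-neighbour rule below are purely illustrative (a hypothetical handful of voters, not any real voter file); the point is only the shape of the problem: covariates in, predicted outcome out.

```python
# Toy supervised learning: predict party from (age, income) covariates.
# Training data: covariate vectors paired with observed outcomes (labels).
train = [
    ((25, 30000), "A"),
    ((60, 90000), "B"),
    ((40, 50000), "A"),
    ((55, 85000), "B"),
]

def predict(x):
    """Predict the label of a new observation x by copying the label
    of the closest training point (one-nearest-neighbour)."""
    def dist(a, b):
        # Squared Euclidean distance between covariate vectors.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(train, key=lambda pair: dist(pair[0], x))
    return nearest[1]

print(predict((30, 35000)))  # a young, lower-income voter -> "A"
```

Real prediction methods use many more covariates and far more sophisticated rules than nearest-neighbour matching, but they share this interface: learn from labeled data, then map new covariates to a predicted outcome.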
Our skeptical statisticians are only partially right, though, in thinking that there is nothing new under the sun. The problem of predicting an outcome given covariates is indeed old, and statisticians have had a deep understanding of this problem for a long time. Nonetheless, recent years have seen impressive new achievements for various prediction problems. This is true in the case of image recognition, where the covariates correspond to the pixels of an image file and the outcome might be the name of a person appearing in the image. This is also true for speech recognition, where the covariates correspond to the amplitudes in a sound file and the outcome is a written sentence. And this is true for the use of written text as data.
How do machine learning algorithms achieve their feats? Beyond the central ingredients of “big data” and computational power—both of which have advanced exponentially in recent years—most machine learning algorithms rely on two key ideas, regularization and tuning. And to go from prediction to causality, they furthermore rely on (randomized) experimentation.
Regularization – automated skepticism
Any association you observe between predictors and outcomes in a data set might be due to coincidence. This is a particularly important problem whenever you have many predictors to work with. With many predictors, the laws of probability dictate that in your data a fair number of predictors will be associated with the outcome by pure coincidence. If you were to assume that the same associations are going to hold for new data, you would get poor predictions. To avoid this problem, you regularize by discounting associations observed in the data to some extent, assuming for instance that the associations observed in the data are exaggerated.
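As a minimal numerical sketch of this idea, here is ridge (L2) regularization for a one-variable linear predictor. The data and the penalty strengths are hypothetical; what matters is that a positive penalty shrinks the estimated association toward zero, encoding the assumption that associations observed in the data are exaggerated.

```python
def ridge_slope(xs, ys, lam):
    """Least-squares slope through the origin, with an L2 (ridge) penalty.
    lam = 0 gives the raw observed association; larger lam discounts it more."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

print(ridge_slope(xs, ys, lam=0.0))   # raw association
print(ridge_slope(xs, ys, lam=10.0))  # regularized: shrunk toward zero
```

The same principle carries over to settings with thousands of predictors, where shrinking (or zeroing out) coefficients is what keeps coincidental associations from dominating the predictions.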
Tuning – how skeptical to be
But how much regularization is optimal? Any machine learning algorithm has to try to answer this question in order to perform well. At this point tuning comes into play. In order to tune a prediction method, some “validation” data are initially put to the side. After training on the main data, the algorithm then tries to predict outcomes for these validation data, which it has not yet looked at. Tuning compares the predictions for the validation data to the actually observed outcomes. Based on the relationship between the two, the amount of regularization is chosen to get the best possible predictions for the validation data.
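As a sketch, tuning might look like this for a ridge-penalized one-variable predictor. The training data, validation data, and candidate penalty values are all made up for illustration; the mechanism is the point: fit on the main data, score on held-out data, keep the penalty that predicts best.

```python
def ridge_slope(xs, ys, lam):
    # Slope through the origin with an L2 penalty of strength lam.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Main (training) data and held-out validation data.
train_x, train_y = [1.0, 2.0, 3.0], [1.2, 1.8, 3.3]
val_x, val_y = [4.0, 5.0], [3.9, 5.1]

def val_error(lam):
    """Squared prediction error on the validation data for a given lam."""
    slope = ridge_slope(train_x, train_y, lam)
    return sum((slope * x - y) ** 2 for x, y in zip(val_x, val_y))

# Tuning: pick the candidate penalty with the best validation predictions.
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=val_error)
print(best_lam)
```

Note that neither no regularization nor heavy regularization wins here: the validation data adjudicate how much skepticism pays off.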
Prediction, causation, and multi-armed bandits
Being able to make accurate predictions is a good start, but by itself prediction does not solve anyone’s problem. Prediction only becomes useful when it translates into recommendations for actions—actions that serve whichever goals the algorithm’s designer might have, such as making you buy stuff or vote in the intended way. To figure out how to make you buy stuff, how to make you vote in the intended way, and so on, the machines need to go from correlation to causation, which requires another feature added to machine learning, namely experimentation—usually randomized experimentation.
Any time you surf the Internet, many experiments are being performed on you. The content and design of websites are varied randomly to figure out which versions make you most likely to engage in the intended way. As the algorithms learn more about you and people like you, they adapt their behavior to get what they want. Such algorithms, which adapt their behavior over time, go by the colorful name of multi-armed bandit algorithms, in reference to their one-armed cousins living in Vegas. The idea is that you are a slot machine, and the algorithm is the calculating gambler. The algorithm tries out different arms to see which of them best achieves its goals. Over time, it focuses more and more on pulling the winning arms.
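A minimal sketch of such an algorithm is epsilon-greedy arm selection. The click probabilities below are hypothetical and, crucially, unknown to the algorithm; real systems use more sophisticated bandit variants, but the explore-then-concentrate behavior is the same.

```python
import random

random.seed(0)

# Two versions of a webpage ("arms"), each with an unknown click probability.
true_click_prob = [0.05, 0.12]  # hypothetical; hidden from the algorithm

clicks = [0, 0]  # observed clicks per arm
shows = [0, 0]   # times each arm was shown

def choose_arm(eps=0.1):
    """Epsilon-greedy: usually show the arm with the best observed
    click rate, but explore a random arm with probability eps."""
    if random.random() < eps or shows[0] == 0 or shows[1] == 0:
        return random.randrange(2)
    rates = [clicks[i] / shows[i] for i in range(2)]
    return rates.index(max(rates))

for _ in range(10000):
    arm = choose_arm()
    shows[arm] += 1
    if random.random() < true_click_prob[arm]:
        clicks[arm] += 1

# Over time the algorithm concentrates its pulls on the better arm.
print(shows)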
What are machine learning prediction methods used for? Suppose for a moment we had no idea. An educated guess would be that new technologies—including machine learning—will be used more or less consistently to advance the objectives of the powers that be: profit-maximizing corporations competitively optimizing their interactions with consumers and workers; political parties seeking to win elections; the penal system making incarceration decisions; intelligence agencies seeking to identify targets of their operations; and so on. Such an educated guess would not go amiss.
Maybe the single largest application of machine learning methods to date is the targeting of advertisements. It is quite mind-boggling to consider: the entire business model of several corporations that are worth billions of dollars is to collect data on users in order to better choose which ads to show them. These companies are paid for each click on an advertisement, so they focus their efforts on displaying ads that have the highest probability of being clicked by each user.
A related application, likely to expand a lot over the next few years, is differentiated pricing. Airlines already do it on a large scale, and many others will follow suit. The idea is to predict how sensitive a given individual is to prices in their purchasing decisions. Profit maximizing companies want to charge higher prices to less price-sensitive customers, and lower prices to more price-sensitive customers. What will happen as more data on individuals becomes available to companies, and prices become more differentiated? Standard economics predicts that corporate profits will go up, and that some individuals will face higher prices than before, while others will face lower prices. As we become more perfectly predictable, each of us will be charged the maximum amount that we are able and willing to pay for the products we purchase.
A third corporate use of machine learning, somewhat more limited to date, is its use in hiring and promotion decisions. Job candidate CVs and past employee performance might be automatically processed to predict future performance, and to hire or fire accordingly.
Lastly, banks and insurance providers have long relied on credit scores and risk assessments to set interest and insurance rates. These scores, used in innumerable aspects of American society, are effectively predictions of default risk, and they are becoming increasingly refined with new data and prediction methods.
Not only corporations leverage the new predictive capabilities; so do political parties. One of the key innovations that allowed the Obama campaign to win the White House in 2008 was that it was the first political campaign to systematically target its resources and efforts using statistical prediction methods. To maximize the chance of winning, this approach predicts which states are most likely to be pivotal, which voters in these states are most likely to switch their voting decision in response to which campaign messages, and which sympathetic individuals might be persuaded to vote rather than stay at home. The end result is that huge databases on voters are used to deliver individually tailored messages to a subset of potentially pivotal voters. Since the first Obama campaign, this approach has become widespread—used by both parties in the US, including in lower-profile races, and in other electoral democracies around the world.
Law enforcement and militarism
Predictive methods are also used in various ways by agencies of the state. Bail setting in US courts is based on an assessment by the judge of whether the defendant might commit a crime while awaiting trial. Various private companies now sell tools that provide such predictions to judges, using a host of information about defendants. These predictions, which have caused the largest controversies, effectively determine whether defendants are incarcerated or free.
Beyond the US border, intelligence agencies are involved in the most far-reaching mining of data, to identify suspects of terrorism. Among other things, the resulting predictions are used by the army and intelligence services to carry out targeted killings with drones—which have seen a massive expansion over the last decade.
There is a thread that ties all these widely disparate applications of machine learning together: the use of data on individuals to treat different individuals differently. In part two, we will look at some of the effects of such differentiated treatment.