By now, political scientists have gotten pretty good at predicting the outcomes of national elections in the United States, especially presidential elections. The fact of the matter is that fundamentals-based modeling ahead of time and polls-based modeling as an election draws nearer result in very accurate predictions of how a given electoral season is going to shake out; Emory's Alan Abramowitz has a fundamentals-based model that explains 94 percent of variation in presidential election results, for example. What's more, the fundamentals models tell us something important about what explains electoral outcomes: the fact that economic variables track results so closely is a strong indication that voter swings are caused in large part by economic fluctuations.
We haven't gotten nearly that far in predicting court cases. But three scholars — South Texas College of Law's Josh Blackman, Michigan State's Daniel Martin Katz, and Bommarito Consulting's Michael Bommarito — have built a model that comes close. As Blackman noted in a blog post announcing the model, it "correctly identifies 69.7% of the Supreme Court’s overall affirm and reverse decisions and correctly forecasts 70.9% of the votes of individual justices across 7,700 cases and more than 68,000 justice votes."
How it works
Blackman, Katz, and Bommarito built the model off data compiled for Washington University at St. Louis's Supreme Court Database, plus additional variables they generated not present in the database. It is meant to cover decisions from 1953 to the present. The latter point is key, because past attempts at building Supreme Court models have been more constrained time-wise. Blackman notes that the best previous model, one by researchers at Washington University and Harvard and published in the Columbia Law Review, was only effective for the eleven-year period from 1994 to 2005 in which the Court's composition was unchanged. "About three years after that study was done, we had O’Connor and Rehnquist leave the Court and we had Chief Justice Roberts and Justice Alito come on board," he says. "At that point their model wasn’t very useful anymore."
Blackman and his coauthors wanted to achieve something more generally applicable, and, indeed, the accuracy rate of the resulting model remains high for every year since 1953.
The model itself is exceptionally complicated. It uses a total of about 95 variables with very precise weights ("to four or five decimal places," Blackman says), and each justice's vote is predicted by creating about 4,000 randomized decision trees. Each step in the tree asks a question about the case (is it about employment law? what is the lower court the case was appealed from?) and then funnels the answers into conclusions about the justice's ultimate vote.
"Have you seen some of those memes that go around where there’s a big flow chart where you're supposed to answer 'Yes,' or 'No,' to a bunch of questions and there’s a sassy response at the bottom?" Bommarito asks. The model is composed of thousands of trees like that. The modal prediction, meaning the vote that the largest share of the 4,000 trees arrive at, is taken as the model's prediction.
What it tells us
The process seems extremely complicated, but Bommarito and Blackman note that you can still draw conclusions about the way the court behaves from it. For one thing, Bommarito notes that ideological variables seem to make a major difference, which seems to refute the naive view that the Court is somehow above politics. "If there were an argument ongoing between political scientists and lawyers as to what mattered, as to whether judges are really independent judicial reasoning machines on high, or whether they're just political animals like anyone else, then in terms of the features that the model uses to successfully predict, it appears they're just political animals," he concludes.
Blackman caveats that a bit. For one thing, a lot of the Court's decisions are uncontroversial 9-0 reversals of lower courts: a lower court got it wrong, every justice agrees about it, and they act together. The model gets those right very often, but it struggles with the roughly one in three cases where the court ultimately affirms the lower court's ruling.
The model can only do that well if it brings non-ideological variables into play. "The set of 'case information variables' (which includes the lower court where the case originated, the issue, who the petitioner and respondent are, etc.) contributed 23% of predictive power," Blackman explains. "These were among the most predictive factors, and are factors that most people in the press don’t think about."
All the same, if anyone still labors under the misimpression that the Court's political views don't matter, the model should give them reason to reconsider. Bommarito puts it in statistical terms: "The null hypothesis for legal academia is that ideology doesn’t matter; we've rejected that hypothesis."
Where to go from here
To keep improving the model, the authors plan on pitting it against FantasySCOTUS, a case prediction league that Blackman founded. FantasySCOTUS's prediction accuracy has reached 75 percent in some years, and they expect humans to win their first bout against the model. But they kind of want that to happen. "We actually are looking forward to the humans winning, because what we really want is lots of data about how humans predict to contribute to our ensemble analysis," Blackman says. "Where did the humans perform better and where did the algorithms perform better?"
From there, it's on to lower courts. Supreme Court cases, while high-impact, are pretty few in number and are already widely predicted. The authors plan on using the model as private consultants, and the real growth market there is in predicting outcomes in district and appellate court cases. There isn't as strong of a database in that area yet, but Blackman and Bommarito are optimistic. "When you look at the average law firm, they're swimming in data that’s not very well collected and not very well structured," Bommarito says. Collating that data into practical models could make a real difference for lawyers plotting their court and negotiation strategies. It could also potentially help legislators get a sense of how vulnerable laws they pass could be to a legal challenge. And of course, academics studying lower courts could find an effective model valuable too.
Want to learn more about the model? Check out David Kravets' post at Ars Technica on it, Blackman's blog post, the article the authors wrote describing it, or the model's Github page. My full interview with Blackman and Bommarito follows.
First of all, I appreciate just how well documented your model is. I was wondering how many questions I was going to have to ask, just given how much detail you put into answering everything on the Github page and in your blog post.
I appreciate that. We've been working on this for a really long time to make it as foolproof as humanly possible, to try to anticipate all the possible objections that people might have.
Effectively, this upsets a lot of previous stuff on this topic and takes things to a different level. We were very deliberate in getting everything to be as strong as we could. Hopefully, we’ll see this get to publication in the very near future. And it’ll go through the peer review process and hopefully we'll be vetted and approved. We're pretty confident in our chances.
Just for readers who may not be familiar with the literature, what did past attempts to build models like this look like, and where does your approach differ from them?
There have been a couple of major efforts at Supreme Court prediction. One of them was my own. In 2009, I launched FantasySCOTUS, which is a crowd-sourced Supreme Court prediction model. At the time we had about 5,000 people making predictions; now it’s already over 20,000.
I limited my analysis to what I called the "power predictors," those who predicted in a majority of the cases each term. In other words, I weeded out people who came and made one or two predictions and never played again. The accuracy of the "power predictors" was somewhere around the mid-70s, which was pretty good.
A few years before that, a number of professors wrote an article in the Columbia Law Review, which we cite. They got a series of Supreme Court experts, people who argue before the court or were Supreme Court clerks, very well-respected people. They were able to get an accuracy rate of just about fifty-nine to sixty percent, which is interesting. It’s an instance of the wisdom of the crowds: FantasySCOTUS averaged together lots of people from different areas, and it beat the experts. That might be counterintuitive but makes lots of sense if you follow the wisdom of the crowd literature.
The same article, the Columbia article, also did something pretty cool. They built a series of decision trees that were able to predict cases. They had a very unique circumstance. Justice Breyer was appointed to the Supreme Court in 1994. The court then remained exactly the same until Chief Justice Rehnquist died in 2005. The composition of the court was the same for over a decade. This was the second longest "natural court" after the Marshall Court, which was consistent from 1812-1823. They were able to use that consistent set of the same justices, the sustained cohort, to put together a very sophisticated decision tree of how the justices will decide cases. That decision tree hit seventy-five percent accuracy, which was, at the time, amazing. No one had come even that close.
That’s probably the best model I'm aware of and this was over ten years ago. They hit seventy-five percent accuracy. The problem with that decision tree was that it was based only on that one unique set of justices at that one particular time. About three years after that study was done, we had O’Connor and Rehnquist leave the Court and we had Chief Justice Roberts and Justice Alito come on board. At that point their model wasn’t very useful anymore. Since then we've had two new justices, we've had Kagan and Sotomayor, and the model won’t predict much.
In terms of how you went about doing that, was your model based on data spanning that whole period, or did you try to fit it to certain time periods and then test the resulting model in other periods to make sure you weren’t over-fitting?
Basically, we start with the gold standard. This is something called the Supreme Court Database, based at Washington University in St. Louis. They've been curating data on Supreme Court cases for sixty years. This is NSF-funded, top of the line research. They code every single case along almost two hundred dimensions: who are the Justices, what are the issues, who’s the plaintiff, who’s the defendant, what did the lower court do, what was the lower court’s decision like? They do very rigorous political science coding of every case.
We started with that as our data set. Instead of making this backward-looking, we wanted to try to be forward-looking. Say a case was decided on March 12th, 1954. What information was known to the world on March 11th, 1954, the day before it was decided? (That was actually Antonin Scalia’s eighteenth birthday; sorry, I’m a nerd.) What did we know on March 11th, 1954? Based only on that data, we started using a machine learning process called extremely randomized trees.
In the same way that other studies generate a decision tree, our algorithms basically spit out lots of trees, randomly designed. Each tree put different weights on different variables. Then we checked the trees. Some trees happen to work better than others. The weights of the variables are calculated to four or five decimal places. They use very precise weights. By creating enough trees, we were able to figure out which ones did best.
When we first started, our accuracy rate was maybe in the sixties, and then went to the mid-sixties. Then we hit around sixty-nine, which was our Maginot line for a while. Where we really took off was when we broke seventy. That’s when we felt confident enough to release this. Our goal was seventy. That was our mental block. We were able to beat that in the last month or so with our data. That’s when we ran with the paper.
This extremely randomized trees method is really sophisticated stuff. This is machine learning, where the computer trains itself to figure out what is the best way of predicting cases, which variables work best in which trees, and then we're able to spit it out. All of our code is up on Github and we encourage other people to play with it and to read it, and hopefully to make it better.
We've already had forty people who have checked out, who have forked our source code and started a replication process. It’s been exciting to see it.
Even in just about seven days here, we've had people from all over the world, who've already pulled all of our coding data and are working on their own, either replication or trying to improve the model themselves.
Dylan Matthews: Are there other technical points that you think are interesting, that you wanted to get into?
The general practice here was to take the workflow from something like quantitative or algorithmic trading, which is where most of my background is, and apply it to this space.
So, when you're out in the market you want to make sure that your model will work, not just for the task at hand but going forward. We call that kind of training a "walk-forward" approach to building models.
The key is that, as Josh was explaining, all of the information that you use to build that model is information that it would have been fair to know at that point in time. Josh made the analogy to almost going back in a time-machine. You have to build these models as if you were in a time-machine. Even though you know the data, you don’t want to let yourself see the data in advance.
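The walk-forward idea described here can be sketched in a few lines of Python. This is only a minimal illustration and not the authors' code: the record fields (`id`, `decided`, `outcome`), the function name `walk_forward_splits`, and the toy cases are all hypothetical.

```python
from datetime import date


def walk_forward_splits(cases):
    """Yield (history, case) pairs in date order.

    Each case is paired only with cases decided strictly before it,
    so nothing from the future can leak into the training data.
    """
    ordered = sorted(cases, key=lambda c: c["decided"])
    for i, case in enumerate(ordered):
        # Drop same-day cases too: they were not known the day before.
        history = [c for c in ordered[:i] if c["decided"] < case["decided"]]
        yield history, case


# Hypothetical toy records; dates and outcomes are made up for illustration.
cases = [
    {"id": "A", "decided": date(1954, 3, 12), "outcome": "reverse"},
    {"id": "B", "decided": date(1953, 11, 9), "outcome": "affirm"},
    {"id": "C", "decided": date(1954, 5, 17), "outcome": "reverse"},
]

splits = [
    ([c["id"] for c in history], case["id"])
    for history, case in walk_forward_splits(cases)
]
print(splits)
```

In a real pipeline a model would presumably be refit on each `history` before predicting the next case; here the split logic alone is shown.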
We had 60 variables. This was information that was coded per Justice and per case. That is a large number but not necessarily enough to get the performance that we wanted.
So we did something called feature engineering or feature generation, which is a process of taking human knowledge about the domain and adding it into the model. A good example of this is if I'm looking at a particular Justice that has had five hundred decisions in their past, about five hundred record samples. I know whether they've voted in a liberal or conservative fashion based on the Supreme Court database’s coding. I might care not just about what their vote was in each case per se. I might also care about the average direction.
This is really intuitive, because it’s just the average ideology of the Justice. That’s not in the Supreme Court Database (SCDB). That is a feature that we have constructed, generated, engineered, those are all the terms of art if you will, that we used in this. So, we've taken about 60 features and added 35 to them. Many of those you'll see in the appendix of the paper. The end result there is that we have not just the material available in SCDB, but also these features that we've added.
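As a rough sketch of what such an engineered feature might look like, the snippet below computes each Justice's average past vote direction, using only votes cast before the case at hand. The field names, the 1 = conservative / 0 = liberal coding, and the function name are assumptions made for illustration; they are not taken from the paper or from SCDB.

```python
from collections import defaultdict


def add_prior_average_direction(votes):
    """Attach each Justice's average direction over earlier votes only.

    `votes` must already be sorted by decision date. Direction coding is
    an assumption for this sketch: 1 = conservative, 0 = liberal.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    enriched = []
    for vote in votes:
        justice = vote["justice"]
        # Average is computed before the current vote is added (no lookahead).
        prior = totals[justice] / counts[justice] if counts[justice] else None
        enriched.append({**vote, "prior_avg_direction": prior})
        totals[justice] += vote["direction"]
        counts[justice] += 1
    return enriched


# Hypothetical toy votes for two made-up Justices.
votes = [
    {"justice": "X", "case": 1, "direction": 1},
    {"justice": "Y", "case": 1, "direction": 0},
    {"justice": "X", "case": 2, "direction": 0},
    {"justice": "X", "case": 3, "direction": 1},
]

enriched = add_prior_average_direction(votes)
priors = [row["prior_avg_direction"] for row in enriched]
print(priors)
```

A Justice with no earlier votes gets `None` rather than a guessed value, which keeps the feature honest about missing history.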
The next step is to take those features and build these trees. Are you familiar with decision or classification trees? Have you seen some of those memes that go around where there’s a big flow chart where you're supposed to answer "Yes," or "No," to a bunch of questions and there’s a sassy response at the bottom?
Decision trees or classification trees work in exactly that way, but they have a statistical basis for each of those branches or questions that they ask.
So a question you might ask is in this context, "is it a case about employment law?" If the answer is yes, then go left. If the answer is no, then go right. Now down in that left branch, you might have something like, "Is the Justice that you're trying to predict on average liberal or conservative with respect to a petitioner or respondent of this type?" Maybe you say if it’s an employment case, in which the government is the respondent, then on average we should see this Justice vote conservatively.
These trees are often sixteen to thirty-two branches deep. You're asking and answering somewhere between sixteen and thirty-two questions to get each prediction.
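To make the flowchart picture concrete, here is a hand-written toy tree that is only two questions deep. It is a sketch under stated assumptions: the questions, field names, and the 0.5 cutoff are invented for illustration, whereas the trees in the actual model are learned from data and run far deeper.

```python
def toy_tree(case, justice_avg_direction):
    """Predict a vote direction by answering yes/no questions in order.

    `justice_avg_direction` is assumed to run from 0.0 (always liberal)
    to 1.0 (always conservative).
    """
    # Question 1: is it a case about employment law?
    if case["issue"] == "employment law":
        # Question 2 (left branch): is the government the respondent?
        if case["respondent"] == "government":
            return "conservative"
        return "liberal"
    # Right branch: fall back on the Justice's average past direction.
    if justice_avg_direction >= 0.5:
        return "conservative"
    return "liberal"


employment_case = {"issue": "employment law", "respondent": "government"}
other_case = {"issue": "antitrust", "respondent": "corporation"}

first = toy_tree(employment_case, 0.3)
second = toy_tree(other_case, 0.3)
print(first, second)
```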
Now, a forest is exactly what it sounds like. It’s a combination of trees. Whereas the prototypical example like the sassy flowchart that I referenced earlier is a single decision tree or classification tree. Forests and the extremely randomized trees are many, many of these trees. That means that each of these classification or decision trees is like a person voting.
So we construct actually four thousand trees per model and all four thousand of those trees individually ask and answer sixteen to thirty-two questions and then they vote. Some of them say this Justice is going to vote conservatively, some of them say the Justice is going to vote liberally. We tally up all of the votes. We take the more likely vote based on the sum of them and that’s our prediction.
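The voting step can be sketched as a simple tally. This is an illustration only: the three rule functions stand in for learned trees, and their names and the case fields are hypothetical. The sketch also reports the share of trees behind the winning vote, which gives a rough sense of how confident the ensemble is.

```python
from collections import Counter


def forest_predict(trees, case):
    """Let every tree vote and return the most common vote and its share."""
    votes = Counter(tree(case) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)


# Three stand-in "trees"; real ones would be learned, not hand-written.
def by_issue(case):
    return "conservative" if case["issue"] == "employment law" else "liberal"


def by_respondent(case):
    return "conservative" if case["respondent"] == "government" else "liberal"


def by_lower_court(case):
    # Assumes a vote to reverse a conservative ruling is coded as liberal.
    if case["lower_court_direction"] == "conservative":
        return "liberal"
    return "conservative"


rules = [by_issue, by_respondent, by_lower_court]
trees = [rules[i % 3] for i in range(4000)]  # 4,000 voters, as in the interview

case = {
    "issue": "employment law",
    "respondent": "government",
    "lower_court_direction": "conservative",
}
label, share = forest_predict(trees, case)
print(label, share)
```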
In understanding these trees, does the ordering matter? Is it somewhat similar to when you do a multiple regression, and there are bunch of different factors but it doesn’t really matter when do you factor each of them in? Or does the placement of different variables along different points up or down the tree make a difference?
No, it absolutely does, and that’s a really good insight. One of the differences between random forests, which is one of the more common industry terms that you may have heard of if you've seen much machine learning, and extremely randomized trees is that extremely randomized trees take a slightly different approach to deciding the way in which those features are selected.
To come back to our previous example, you might get a different answer if you asked, "is this matter about employment law?" first or second. If you start out by asking "is it employment law?" and only then you ask is the respondent the government, you might get a different answer than if you switched the order.
The two mitigating or improving pieces of the method are that, one, there’s not a single decision tree. There are four thousand of these trees. On average there will be some that start with that question first and some that put that question second.
So in multiple regression you get one model, one set of coefficients. Then you take those coefficients and if you want to predict, you only run one equation. With the forest, or the extremely randomized tree set, you're seeing a much more granular or gray answer.
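For technically minded readers, the difference in how splits are chosen can be sketched as follows. In the usual textbook description of extremely randomized trees, each node draws one random cut-point per candidate feature and keeps the best of those random cuts, instead of searching every possible threshold. The code below is a minimal sketch of that single step under those assumptions; it is not the authors' implementation, and the function names and toy data are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def random_split(rows, labels, n_candidates, rng):
    """Pick random features and random thresholds; keep the best one.

    Returns (weighted_impurity, feature_index, threshold), or None if
    every sampled feature is constant.
    """
    best = None
    for feature in rng.sample(range(len(rows[0])), n_candidates):
        values = [row[feature] for row in rows]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # a constant feature cannot split the data
        threshold = rng.uniform(lo, hi)  # random cut, not an optimized one
        left = [y for row, y in zip(rows, labels) if row[feature] < threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, feature, threshold)
    return best


# Toy data: feature 0 separates the classes, feature 1 is constant.
rows = [[0.0, 5.0], [0.0, 5.0], [1.0, 5.0], [1.0, 5.0]]
labels = [0, 0, 1, 1]
score, feature, threshold = random_split(rows, labels, 2, random.Random(0))
print(score, feature, threshold)
```

Because the cut-points are random, two trees built on the same data will generally ask their questions in different orders, which is why averaging over thousands of them smooths out the ordering effect.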
That makes sense. In interpreting the model you got out at the end, given the sheer quantity of trees and the sheer quantity of variables, it seems like it’d be tough to interpret it and see what it tells us about what matters to Justices. Have you done any analyses like that, and is there anything that has stood out as surprisingly influential or surprisingly non-influential?
If there were an argument ongoing between political scientists and lawyers as to what mattered, as to whether judges are really independent judicial reasoning machines on high, or whether they're just political animals like anyone else, then in terms of the features that the model uses to successfully predict, it appears they're just political animals.
I'll caveat that a little bit. Ideology was one of the more determining factors, but it wasn’t everything.
For a large majority of the cases, the set of "case information variables" (which includes the lower court where the case originated, the issue, who the petitioner and respondent are, etc.) contributed 23% of predictive power. These were among the most predictive factors, and are factors that most people in the press don’t think about.
While Mike is exactly right, that ideology is an important factor for a number of cases, a large majority of the cases are 9-0 reversals. On average, the court reverses almost seventy percent of the cases it takes each year, and a huge chunk of those are 9-0 reversals. So that’s it. We perform very well on 9-0 reversals; we have something like seventy-seven percent accuracy.
This is where it gets a little bit more complicated. The nature of the court since 1953 has changed dramatically. In the 40’s and 50’s, pre-Warren, the balance of the affirms and reversals was very different. Back then it was almost a coin flip as to whether or not the court was going to reverse, which doesn't make sense now. Why would the court take a case if it didn't want to reverse?
In the past, it was much different, and we do have some statistics in the paper that show once upon a time it was much more of a coin flip than it is today. Today, you would do a much better job just assuming they're going to reverse, even if you knew literally nothing about the matter at hand.
So it makes a difference as far as what decade you're talking about, what period of the court.
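The "just assume they reverse" baseline is easy to put in numbers. The sketch below uses made-up outcome lists that only mirror the rough proportions mentioned in the interview (about seventy percent reversals today, close to a coin flip in the earlier era); they are not real term data.

```python
def always_reverse_accuracy(outcomes):
    """Accuracy of a predictor that guesses 'reverse' for every case."""
    return sum(outcome == "reverse" for outcome in outcomes) / len(outcomes)


# Illustrative proportions only, not actual Supreme Court statistics.
modern_era = ["reverse"] * 7 + ["affirm"] * 3
coin_flip_era = ["reverse"] * 5 + ["affirm"] * 5

modern_baseline = always_reverse_accuracy(modern_era)
early_baseline = always_reverse_accuracy(coin_flip_era)
print(modern_baseline, early_baseline)
```

Any model covering the modern era has to beat this naive baseline to show it has learned something, while in a coin-flip era the same naive guess is worth much less.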
I do think the large number of 9-0 reversals, where the court unanimously agrees, indicates that often the ideology of the court might be at odds with the ideology of the lower courts. If all nine Justices, all along the political spectrum, unanimously agree that the lower court was wrong, that may also be a commentary on how the lower courts decide cases.
Mike has a degree in political science, I'm a law professor, so maybe we have slightly different perspectives on this. I think we both agree that ideology is a significant factor, but based on our variables, a lot of other factors are quite influential on the outcome of the case.
This will probably annoy a lot of law professors who take more of a formalist approach. It does put ideology at a fairly high peg in our order of variables.
Yes that’s exactly fair. Here’s how I would try to refine what I said. The null hypothesis for legal academia is that ideology doesn’t matter; we've rejected that hypothesis.
I don’t think anyone actually holds that hypothesis, clearly, but we're able to actually quantify it. We can put some teeth into it.
That makes sense. I don't know if you've gotten this granular yet, but I'm curious whether you’ve identified some really seminal decisions that the model either got right or whiffed on. Were there ones that popped up when you were looking through the results that you found striking?
We actually have not done as much of that granular case-study work yet. That’s going to be a part of the rest of the research plan here. At this point there are still many thousands of cases, and tens of thousands of individual votes, that the model predicts. It does them in a really complicated way, and we have not yet gone through and analyzed some of these seminal decisions for every model.
Definitely, a part of translating this into the law review market, where we're going to have to go, is doing these case studies, because that’s honestly what we care about day to day. But we have not done that in this paper.
What I will add is that this year we're running a tournament. We're going to do a man versus machine competition. Let me explain that a little bit.
What we want to do is have the FantasySCOTUS league, roughly twenty thousand people, some who don’t play it all that much but some who are very engaged in the league, make their predictions. At the same time, our algorithm will generate predictions, and we're going to compete and see who wins. Now we fully expect the humans to win the first year. We actually are looking forward to the humans winning, because what we really want is lots of data about how humans predict to contribute to our ensemble analysis. Where did the humans perform better and where did the algorithms perform better?
With that knowledge we can actually tweak our variables and make our trees even smarter. This year we will be predicting all of the hot issues: gay marriage, Obamacare, national security, etc. We're going to have real-time predictions for every single case. That way, for every big case this year, we will have insight into what the Justices will do and all of our data will be open. We're making our prediction and we will be right or we will be wrong. I mean that’s the best way of testing how good the model is in the most controversial cases of the day.
Josh made a point there that I think would be great for technical readers and gives everybody warm, fuzzy feelings. Josh mentioned the word "ensemble," which is a technical term for models that are built out of multiple models underneath. The random-forest and extremely randomized trees methods are ensembles of decision trees or classification trees. When we talked about that, all of them together voting, that’s an ensemble.
The idea behind this tournament is to create a man plus machine ensemble where we're not just a bunch of these machine learning models and we're not just a bunch of humans, we're both machines and humans in this ensemble trying to come up with the best model possible through both this structured human intelligence and knowledge and the raw brute force of the computing models.
I should add, we're offering a significant grand prize. The grand prize, we haven’t announced this officially yet, but will be in the neighborhood of ten thousand dollars.
You might remember Netflix did a competition a couple of years ago to predict what videos people would like. We're doing a two-year tournament in the same vein, where this year we're going to have man versus machine competition, and next year we want to invite data scientists to take our code and improve upon it and try to beat us.
We're going to try and beat data scientists. We have talked to a couple of the legal tech startups in the Silicon Valley area who are interested in playing. We want this actually to become a collaborative effort to figure out the best way to predict outcomes. We hope that next year, the numbers go above seventy, whatever they will be, but they’ll be even better.
It’s like a Kaggle. We’re starting a little bit of a legal walled garden Kaggle, if you will.
Where do you see this being used outside of academia, if at all? Do you think this is at a point where it’s used by litigation consultants?
Yes. To be honest this is a part of a larger thrust. One of the reasons we picked the Supreme Court is because studying it is the favorite pastime of legal academia, and Josh and Dan are legal academics. This is one of the best ways for them to speak and communicate and have their work impact their peers.
The other reason is that most of the Supreme Court's decisions are already priced into markets. There are some decisions here and there that have a very large impact on specific stocks, like Aereo for example, or in biotech cases, intellectual property cases. Sometimes they do have real impact, but for the most part they're very public, very widely predicted. There’s not that much arbitrage, if you will, for any market participants.
We do have a consulting entity separate from any of our academic ventures, where we do commercialize and consult on this. In large part, it is exactly what you said. It’s litigation consulting, it’s insurance consulting, it’s risk management for firms. When you look at the average law firm, they're swimming in data that’s not very well collected and not very well structured. It’s not big data by any stretch of the imagination, with the exception of a few insurers out there, and even the larger law firms have far less data than the smallest startup in big data.
There’s a huge sea of opportunity for all of the buzzwords that have been used in the data world in the past five to ten years to be applied to legal.
Part of what we're doing is trying to bring the actual purchasers of these services into the fold. There’s no way you can talk to a managing partner at a law firm about big data if he doesn’t understand what it means for him. We want to start this conversation and raise public awareness in the legal community. There’s no way to improve the delivery of legal services if the purchasers of the services don’t fully understand what’s capable of being done.
Mike’s exactly right. Starting with the Supreme Court is useful because everyone knows about the Supreme Court. But it’s not particularly valuable. There are only eighty-something cases before the court each year. It’s usually the same handful of law firms (the Supreme Court bar, as it's known) that litigate them.
Where real potential for growth exists is in the lower courts: the trial courts and the courts of appeal. Some of the work we're doing now will lay the foundation for offering analytics of how lower courts decide cases and there is significant potential there.
Imagine if you're litigating a case. You have a decision. You can either push it to trial or you can settle it, and this is a decision lawyers make thousands of times every day.
Invariably, they have to make decisions: "What’s the court going to do on some issue of law? How does the court resolve a particular issue?" That’s a variable. Perhaps a lawyer can make a prediction based on his own knowledge. Perhaps he can ask his friend, "How has this judge ruled in the past?" Or you can have an algorithm that analyzes every single opinion that judge has ever issued, breaks down the particular issue you're working on, and generates the likelihood that a certain judge will rule a certain way.
Human beings can't do that effectively; it would take so many hours that it wouldn't be productive and probably wouldn't be accurate. But you could use a natural language processing program to break down many opinions and pick out certain relevant factors, and then figure out how a judge is likely to rule.
That’s not to say that you follow precisely what the computer says. That’s only to say that now you have one more piece of information, so that when you're advising your client on how to litigate her case, you can say, "Well, here’s how the judge has ruled in the past. We have some numbers on this," and this is something that clients will love. This is the type of data that can be drilled down to assist decision-making in the courtroom.
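As a very rough sketch of the last step described above, the snippet below turns a judge's past rulings on one issue into a likelihood estimate. It assumes the rulings have already been extracted and coded (the natural language processing step is not shown), that outcomes are binary, and that light smoothing is appropriate for small samples. The function name, field names, and data are hypothetical.

```python
def ruling_likelihood(past_rulings, issue, outcome, smoothing=1.0):
    """Estimate how likely a judge is to reach `outcome` on `issue`.

    Uses the judge's past frequency on that issue with add-one style
    smoothing, assuming exactly two possible outcomes.
    """
    relevant = [r for r in past_rulings if r["issue"] == issue]
    hits = sum(r["outcome"] == outcome for r in relevant)
    return (hits + smoothing) / (len(relevant) + 2 * smoothing)


# Hypothetical coded rulings for one made-up judge.
past_rulings = [
    {"issue": "summary judgment", "outcome": "granted"},
    {"issue": "summary judgment", "outcome": "granted"},
    {"issue": "summary judgment", "outcome": "denied"},
    {"issue": "motion to dismiss", "outcome": "granted"},
]

estimate = ruling_likelihood(past_rulings, "summary judgment", "granted")
print(round(estimate, 2))
```

With no past rulings on an issue the estimate falls back to one half, which reflects the point that the number is one more piece of information and not a substitute for judgment.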
The data collection there is much more intensive, because I imagine there aren’t databases as rich as SCDB reaching down to lower courts.
That’s exactly right. That’s why we're starting with the Supreme Court Database, which is really highly curated and very well-maintained. Once you start going into the lower courts, it’s something of a mess but the data is there. If the market is there, then it will make sense.
Got it. Well, this has been really great and, like I said, covers most of what I wanted to ask about. Are there aspects that any of you wanted to touch on before we wrap up?
Two things. One, we want this to be emblematic of open science. We don’t want this to be locked up in walled gardens, we don’t want this to be incapable of reproduction, or very difficult to reproduce. We want this to really embody science in the open source, reproducible world.
Two, we don’t want this to be man versus machine forever. We want this to be about what really hard industries, what really difficult industries to crack, like law, can do when they work side by side or together with some of the other currents that are running around us in the modern world.
I think that the analogy that I used is assisted decision-making. We're trying to help lawyers make more informed decisions. There are certain things that machines do very well, and there are certain things that machines do terribly. There are certain things that humans do terribly, and there are certain things that humans do very well. I think that going forward into the future the goal is to try and figure out what is it that humans do best? What is it that machines do best? And we can create an ensemble that combines these strengths to our comparative advantage.
That’s what we're trying to do with this competition. I've been running FantasySCOTUS for almost six years now. I know there are certain cases which my players are going to get better than the computer. I know that. I know there are also certain cases that my computer will get better than players. How can we find a bridge between those two? This would help to build a deeper understanding of the psychology of a lawyer’s work, how lawyers affect cases.
We're very excited about this website. We're very excited about the tournament. We'll be posting feeds throughout the entire year about every single case. We'll also be offering analytics, breaking down argument transcripts. There’s literature that says if a Justice asks one side more questions, that Justice is more likely to vote against that side. Well, we can count the number of questions being asked. If you do that manually it’s very difficult, but with machine learning technology that’s not so cumbersome.
So, there are a lot of different things we can do that will lend themselves well to gaining a deeper insight into how the justices work and how the courts work more broadly.
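The question-counting idea mentioned in the interview is straightforward to prototype. The sketch below assumes a transcript has already been split into (speaker, side being questioned, text) tuples and treats each question mark as one question; both the format and the toy exchange are hypothetical, and real transcripts would need more careful parsing.

```python
from collections import Counter


def count_questions(transcript):
    """Count question marks per (Justice, side being questioned)."""
    counts = Counter()
    for speaker, side, text in transcript:
        # Only turns by Justices count; counsel answers are skipped.
        if speaker.startswith("JUSTICE"):
            counts[(speaker, side)] += text.count("?")
    return counts


# Made-up exchange for illustration only.
transcript = [
    ("JUSTICE A", "petitioner", "Why read the statute that way? Is there precedent?"),
    ("COUNSEL", "petitioner", "Yes, Your Honor."),
    ("JUSTICE A", "respondent", "Do you agree?"),
    ("JUSTICE B", "petitioner", "What is the limiting principle?"),
]

counts = count_questions(transcript)
print(counts)
```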