Tuesday, July 24, 2018

Baseball Analytics - A Knowi Case Study


Baseball Stats and Analytics

How do Runners on Base Impact a Batter's Ability to get a Hit?

Often, what we perceive as logical contradicts reality. In psychology, the tendency to believe we know more than we actually do is known as overconfidence. Too frequently, people rely solely on intuition and logic to make decisions rather than basing them on data and analysis. During an observational study, I discovered a contradiction between my own reasoning and what the data actually suggested about baseball.

When considering the relationship between runners on base and batting average, you might expect batters to have a lower batting average with runners on base than with the bases empty.

Quick Summary of American Baseball

For those who are unfamiliar with baseball, there are three bases a runner can occupy: first, second, and third. The goal of the offense is for the batter to hit the ball and get to first base before the defense can get them out. Once on first base, the batter becomes a runner and attempts to reach each base before running home to score. The goal of the defense is to get three outs by striking out the batter, catching the ball in the air, tagging a runner between bases, or beating a runner to the base with the ball. Batting average is a stat that measures how often a batter gets a hit, calculated by dividing a player’s number of Hits by their At-bats.

A Hit is recorded when a batter hits the ball into fair territory and then reaches first base without the defense making an error or getting a runner out. If the batter hits the ball and is safe, but the defense gets a runner out, the play is called a Fielder’s Choice and is not counted as a hit. At-bats differ from Plate Appearances: a Plate Appearance is any time a batter is up to bat, whereas an At-bat is only counted if the player gets a hit, makes an out, or reaches on a fielder’s choice.
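As a quick illustration of the definition above, here is a minimal Python sketch of the batting average calculation. The function names and the sample season of 150 Hits in 550 At-bats are made up for illustration; only the Hits divided by At-bats formula comes from the article. A "batting average point" is one thousandth, so .275 versus .261 is a 14-point gap.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Batting average = Hits / At-bats (a proportion between 0 and 1)."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return hits / at_bats


def format_average(avg: float) -> str:
    """Format the way averages are usually quoted, e.g. 0.2727 -> '.273'."""
    return f"{avg:.3f}".lstrip("0")


def points_difference(avg_a: float, avg_b: float) -> int:
    """Difference between two averages in batting average points (thousandths)."""
    return round((avg_a - avg_b) * 1000)


# Hypothetical season line: 150 Hits in 550 At-bats.
example = batting_average(150, 550)
print(format_average(example))          # .273
print(points_difference(0.275, 0.261))  # 14
```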

The Question

Logically, it would make sense that if a runner is on base, the batter would have a lower chance of getting a hit because the defense could get the runner or the batter out. This would maximize the defense’s odds of making an out and preventing the batter from obtaining a hit. As a result, this would lead to lower batting averages when there is a runner on base, and higher batting averages when there is nobody on base.

The Data

I chose a sample of 270 MLB players: the nine players from each team with the most plate appearances in 2017. Using Baseball Reference, I was able to collect stats on each player and created a database of 216 stats per player in Google Sheets, totaling over 62,000 data points. The sample of 270 players is only 28% of the players who had an At-bat in 2017, but it accounts for 74% of the season’s At-bats.

All my charts and calculations were derived exclusively from the data I collected; team statistics use only the data from the 9 players sampled from that team. I then accessed the data in Google Sheets via an API, using the Knowi REST-API integration, in order to build queries and visualize my data. One advantage of connecting to my data through the Knowi API was that I could edit my spreadsheet to fix errors and add new statistics, and the changes were immediately reflected in my visualizations. Not having to upload my data every time I changed it was a huge time saver.

What the Data Shows

After collecting all of my data, I created a dashboard (click to play around with it) and visualizations using Knowi in order to make observations and analyze the data. The first visualization I created compared team batting averages when there are runners on base and when there are no runners on base. 

The results were surprising: only two teams, Philadelphia and Kansas City, had a higher team average when the bases were empty than when there were runners on base, and Philadelphia's margin was only 1 batting average point. This means that only 6% of the teams fit my hypothesis of having a higher batting average when the bases are empty.

Next, I decided to look at a graph of the players’ averages. Only 37% of the sample of players had a higher batting average when the bases were empty compared to when there were runners on base. 

Again, the data did not fit my hypothesis. I also observed that most players had a large deviation between their two averages. In other words, some players had a really high average when runners were on base and a really low average when the bases were empty, while other players had the exact opposite result.

I then created two new charts: one that only contained players who had higher averages when runners were on base, which I called Group A, and one that only included players who had higher averages when the bases were empty, which I called Group B. By comparing the two groups to one another, I observed that players in Group A had greater deviations in their averages than players in Group B.

To model this, I created another chart where I calculated the difference between each player’s highest average and their lowest average. The top eleven players with the highest differential were all in Group A. The average deviation of Group A players was 41 batting average points and the average for Group B players was 30.
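The grouping and differential described above can be sketched in a few lines of Python. This is only an illustration, not the queries I built in Knowi: the Castellanos figures are the ones quoted later in this article, while the two "Hypothetical" players and all function names are invented for the example.

```python
from statistics import mean

# (name, average with runners on base, average with bases empty)
players = [
    ("Nicholas Castellanos", 0.341, 0.223),   # figures quoted in this article
    ("Hypothetical Player B", 0.250, 0.280),  # made-up example row
    ("Hypothetical Player C", 0.300, 0.270),  # made-up example row
]


def differential_points(on_base: float, empty: float) -> int:
    """Gap between a player's higher and lower average, in batting average points."""
    return round(abs(on_base - empty) * 1000)


# Group A: higher average with runners on base. Group B: higher with bases empty.
# A player with two identical averages would fall into neither group here.
group_a = [p for p in players if p[1] > p[2]]
group_b = [p for p in players if p[1] < p[2]]


def mean_differential(group) -> float:
    """Average differential (in points) across a group of players."""
    return mean(differential_points(on, empty) for _, on, empty in group)


# Rank everyone by differential, largest first.
ranked = sorted(players, key=lambda p: differential_points(p[1], p[2]), reverse=True)

print(ranked[0][0], differential_points(ranked[0][1], ranked[0][2]))
print("Group A mean:", mean_differential(group_a))
print("Group B mean:", mean_differential(group_b))
```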

My Conclusion

It turns out that the results did not support my hypothesis whatsoever; if anything, they showed the opposite. Only 6% of the teams and 37% of players fit my hypothesis. Not only did more players have a higher average with runners on base, but the players who did also had a larger deviation between their averages than the players who did not.

Overall, the league average was .275 with runners on base and .261 with the bases empty. A difference of 14 points may seem insignificant, but to put it into context, given the sample size of more than 123,000 At-bats, it amounts to a difference of about 1,800 hits. I also tested the statistical significance of the data. Although the name can be misleading, Batting Average is actually a proportion rather than an average, so in order to conduct hypothesis testing, I needed to use the formula for the difference between two proportions.

Using the two-proportion test, I concluded with over 99% confidence that batting averages with runners on base are greater than averages with the bases empty. Additionally, with 97% confidence, averages with runners on base are at least 10 points higher than averages with the bases empty.
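For readers who want to see the test written out, the standard textbook form of the two-proportion z statistic is given here. The notation is an assumption on my part for this write-up: $x_1, n_1$ are the Hits and At-bats with runners on base, $x_2, n_2$ are the Hits and At-bats with the bases empty, and $\delta$ is the margin being tested (0 for "any difference", 0.010 for "at least 10 points").

```latex
\hat{p}_1 = \frac{x_1}{n_1}, \qquad
\hat{p}_2 = \frac{x_2}{n_2}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}

% Test of no difference (pooled standard error):
z = \frac{\hat{p}_1 - \hat{p}_2}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}

% Test against a margin \delta (unpooled standard error):
z_\delta = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}
                {\sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}}
```

A large positive $z$ corresponds to a small one-sided p-value, which is what the confidence levels quoted above express.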

The data clearly rejects my original hypothesis that batting averages would be higher when the bases are empty and goes as far as to provide solid evidence towards the exact opposite.
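The same kind of test can be sketched in Python using only the standard library. The split of At-bats between the two situations is not stated in this article, so the 55,000 / 68,000 split used here is purely hypothetical, chosen to total 123,000; only the .275 and .261 averages come from the data. The output therefore illustrates the method and will not reproduce my exact confidence figures.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def two_proportion_z(hits_on, ab_on, hits_empty, ab_empty, margin=0.0):
    """One-sided z-test that p_on - p_empty exceeds `margin`.

    Uses the pooled standard error when margin is 0 and the unpooled
    standard error otherwise. Returns (z, confidence), where confidence
    is 1 minus the one-sided p-value.
    """
    p_on = hits_on / ab_on
    p_empty = hits_empty / ab_empty
    if margin == 0.0:
        pooled = (hits_on + hits_empty) / (ab_on + ab_empty)
        se = sqrt(pooled * (1 - pooled) * (1 / ab_on + 1 / ab_empty))
    else:
        se = sqrt(p_on * (1 - p_on) / ab_on + p_empty * (1 - p_empty) / ab_empty)
    z = (p_on - p_empty - margin) / se
    return z, normal_cdf(z)


# HYPOTHETICAL At-bat split; the real split is not given in the article.
AB_ON, AB_EMPTY = 55_000, 68_000
HITS_ON = round(0.275 * AB_ON)        # league average with runners on base
HITS_EMPTY = round(0.261 * AB_EMPTY)  # league average with bases empty

z_any, conf_any = two_proportion_z(HITS_ON, AB_ON, HITS_EMPTY, AB_EMPTY)
z_10, conf_10 = two_proportion_z(HITS_ON, AB_ON, HITS_EMPTY, AB_EMPTY, margin=0.010)

print(f"any difference:  z = {z_any:.2f}, confidence = {conf_any:.4f}")
print(f"10-point margin: z = {z_10:.2f}, confidence = {conf_10:.4f}")
```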

A Real-World Application

Without applying it to real life, data is useless. Using players’ situational batting averages when setting a team’s batting order can help increase the number of hits the team gets in a game, in turn leading to more runs. The leadoff hitter, the first batter in the lineup, is most likely to come up to bat with the bases empty, roughly 65% of the time. The second batter is the second most likely to come up to bat with the bases empty. By batting players with the highest average when the bases are empty in the 1 or 2 spot, those players will have a greater chance of getting a hit. However, only 47% of leadoff hitters and 24% of batters in the 2 spot have higher batting averages when the bases are empty. 

Additionally, the 3 and 4 spots in the lineup have the highest chance of coming to bat with a runner on base, so a team should put its best hitters with runners on base in those spots. A really good example of a team that bats its players in the best spots is the World Champion Houston Astros. Jose Altuve and Carlos Correa bat 3rd and 4th, and with runners on base they have batting averages of .350 and .340, respectively. To put those averages into perspective, let me remind you that the overall league batting average is .267. The Astros also bat either Alex Bregman or Josh Reddick 2nd, the players with the two highest averages when the bases are empty.

Now compare that to a really bad example, the last-place Detroit Tigers. The Tigers’ 4th hitter is Victor Martinez, which makes him the most likely player on the team to come up to bat with runners on base. However, Martinez is the worst hitter on the team with runners on base and the second best when the bases are empty. The Tigers’ best hitter with runners on base is Nicholas Castellanos, with an average of .341. Despite his high average with runners on base, Castellanos only has an average of .223 when the bases are empty. Castellanos’ average drops by a whole 118 batting average points when the bases are empty, yet he bats 2nd and is the second most likely player on the team to come to bat with the bases empty.

Being able to make data-driven decisions like these is the most important reason to use data analytics.
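To make the lineup idea concrete, here is a rough Python sketch of one way to fill the top four spots from situational averages. The rule it applies (fill spots 3 and 4 first with the best runners-on hitters, then spots 1 and 2 with the best bases-empty hitters among whoever is left) is my simplification, and the five players and their averages are entirely hypothetical. A real lineup would weigh many other factors.

```python
def suggest_top_four(players: dict) -> dict:
    """Suggest lineup spots 1-4 from situational batting averages.

    `players` maps a name to (average with runners on, average with bases empty).
    Requires at least four players.
    """
    if len(players) < 4:
        raise ValueError("need at least four players")

    # Spots 3 and 4: the two best hitters with runners on base.
    by_runners_on = sorted(players, key=lambda n: players[n][0], reverse=True)
    middle = by_runners_on[:2]

    # Spots 1 and 2: the two best bases-empty hitters among the rest.
    remaining = [n for n in players if n not in middle]
    by_empty = sorted(remaining, key=lambda n: players[n][1], reverse=True)
    top = by_empty[:2]

    return {1: top[0], 2: top[1], 3: middle[0], 4: middle[1]}


# Hypothetical roster: name -> (runners on base, bases empty)
roster = {
    "Player A": (0.350, 0.280),
    "Player B": (0.340, 0.270),
    "Player C": (0.260, 0.310),
    "Player D": (0.255, 0.300),
    "Player E": (0.270, 0.250),
}

lineup = suggest_top_four(roster)
for spot in sorted(lineup):
    print(spot, lineup[spot])
```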

Through this observational study and visualizing my data using Knowi, I was able to reshape the way I think about baseball and use real data to disprove my own logical reasoning. All it takes to bridge the gap between what you know and what you think you know is investigation, observation, and analysis. It can be done with baseball and it can be done with anything. How can data analysis transform your perception and more importantly your decision making?

Sports Reference LLC. Baseball-Reference.com - Major League Statistics and Information. https://www.baseball-reference.com/. (5/21/18)

