Author: Nate Reed
Date: August 18, 2016
The Lahman Baseball Database (available here: http://www.seanlahman.com/baseball-archive/statistics/) provides an extensive historical database of baseball statistics which I will use to answer the following interesting questions:
This report uses informal techniques (e.g., exploratory data analysis) and basic statistics to find answers to these questions.
import urllib.request as request
request.urlretrieve('http://seanlahman.com/files/database/baseballdatabank-master_2016-03-02.zip', "baseballdatabank-master_2016-03-02.zip")
from zipfile import ZipFile
# Use a context manager and avoid shadowing the built-in zip()
with ZipFile('baseballdatabank-master_2016-03-02.zip') as archive:
    archive.extractall()
import pandas as pd
# Forward slashes work on every platform and avoid invalid escape sequences such as "\c"
batting_df = pd.read_csv("baseballdatabank-master/core/Batting.csv")
print("%d observations" % len(batting_df))
batting_df.head()
The data is indexed by playerID, yearID and teamID. Each entry contains the aggregate statistics for a player's stint on a team. A player can have multiple stints per year.
To understand batting, we will look at one of the most common measures of batting ability, the batting average, which is calculated as hits divided by at bats (H / AB). I will select the stint and the team, in addition to hits and at bats:
# Take a copy so that adding a column later does not write to a view of the original frame
batting_df = batting_df[['playerID', 'yearID', 'AB', 'H', 'stint', 'teamID']].copy()
batting_df.head(5)
batting_df['batting_average'] = batting_df['H'] / batting_df['AB']
batting_df.head(5)
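As a small illustration of the calculation above, the sketch below computes batting averages on a made-up table (the player IDs and counts are invented, not taken from the Lahman data). It also shows two things that matter later: a stint with zero at bats yields NaN rather than an error, and an average over several stints is best computed from summed hits and at bats rather than by averaging the per-stint averages.

```python
import pandas as pd

# Hypothetical toy data, not real Lahman records
toy_df = pd.DataFrame({
    'playerID': ['aaa01', 'aaa01', 'bbb01'],
    'stint': [1, 2, 1],
    'AB': [100, 10, 0],
    'H': [30, 1, 0],
})

# Per-stint batting average; 0 / 0 evaluates to NaN in pandas
toy_df['batting_average'] = toy_df['H'] / toy_df['AB']

# Average across stints: sum hits and at bats first, then divide
totals = toy_df.groupby('playerID')[['H', 'AB']].sum()
combined_average = totals['H'] / totals['AB']

print(toy_df)
print(combined_average)
```

For the invented player `aaa01`, the combined average is 31 / 110, which differs from the simple mean of the two stint averages because the stints have very different numbers of at bats.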
We will look at the distribution of the data to see if we notice anything interesting. First, we look at the distribution of batting averages, shown in the density plot below.
batting_df = batting_df.dropna() # Drop missing values
batting_df['batting_average'].describe()  # Missing values were already dropped above
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline
# Plot the distribution
batting_df['batting_average'].plot.kde()
plt.title("Batting Avg. Distribution")
plt.xlabel("Avg.")
plt.show()
This is an interesting distribution. It does not look normal: quite a few batters have an average of zero, and a few have unusually high averages (0.500 and 1.000).
batting_df.groupby('yearID')['batting_average'].mean().plot()
plt.title("Batting Average Time Series")
plt.xlabel("Year")
plt.ylabel("Avg.")
plt.show()
There are some interesting changes in batting average over time. There was a spike leading up to the 1890s, followed by a steep decline, then a couple of spikes leading up to a peak in the 1920s, followed by a long, slow decline into the 1960s. After that, batting averages trended up through the 1980s to the year 2000. We might want to explore why this happened using other data in our data set, but that is outside the scope of this initial investigation.
Pitching skill is measured by several statistics, but one of the most common is the Earned Run Average, or ERA. The ERA is the number of earned runs allowed per nine innings pitched, calculated as 9 times the number of earned runs divided by the number of innings pitched. The definition of "Earned Run" is rather long. In short, it is a run for which the pitcher is held accountable (see http://www.baseball-almanac.com/rule10.shtml#anchor11198 for the full definition).
The variable "IPouts" is the number of innings pitched times 3. We can get the number of innings pitched simply by dividing this number by 3.
For batting, we refer to the batting average, although there are other statistics we could consider. For brevity, I use only the batting average calculated above.
import pandas as pd
# Forward slashes work on every platform and avoid invalid escape sequences such as "\c"
pitching_df = pd.read_csv("baseballdatabank-master/core/Pitching.csv")
pitching_df['innings_pitched'] = pitching_df['IPouts'] / 3
pitching_df['ERA'] = pitching_df['ER'] / pitching_df['innings_pitched'] * 9
pitching_df = pitching_df[['playerID', 'yearID', 'stint', 'teamID', 'IPouts', 'ER', 'ERA', 'innings_pitched']]
pitching_df.head()
pitching_df = pitching_df.dropna() # Drop missing values
pitching_df.describe()
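The ERA arithmetic can be checked by hand with a small helper. This is a minimal sketch: the function name `earned_run_average` and the sample numbers are hypothetical, and returning NaN when no outs were recorded is a choice made here for illustration (the pandas division above instead produces inf or NaN in that case).

```python
import math

def earned_run_average(earned_runs, ip_outs):
    """Return 9 * ER / IP, where IP = ip_outs / 3; NaN if no outs were recorded."""
    if ip_outs == 0:
        return math.nan
    innings_pitched = ip_outs / 3
    return 9 * earned_runs / innings_pitched

# Hypothetical pitching line: 20 earned runs over 180 outs (60 innings)
print(earned_run_average(20, 180))  # 3.0
```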
import numpy as np
pitching_df[np.isfinite(pitching_df['ERA'])]['ERA'].plot.kde()
plt.title('Distribution of Earned Run Average')
plt.xlabel('ERA')
plt.show()
In the above plot, there is a wide range of values, but most observations are between 0 and 5. A "good" ERA is below 4. Between 4 and 5 is OK, but not great. Over 5 is considered unsustainable, as a pitcher with this ERA will likely be replaced.
For this analysis, I am interested in all skill levels. Like the batters, some of these pitchers have exceptionally good or bad metrics, mostly due to a small number of games pitched. For the scatterplot below, I've included those outliers. In addition, I will re-load the batters and include those batters with few at bats. Skills are likely correlated with the number of attempts, as players who perform poorly will be given fewer opportunities.
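# Inner join on playerID only: each pitching row is paired with every batting row for the same player,
# and the shared columns (yearID, stint, teamID) receive _x / _y suffixes
players_df = pd.merge(pitching_df, batting_df, on="playerID", how="inner")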
players_df = players_df[players_df['AB'] > 0] # Require at least 1 at bat
players_df = players_df[players_df['innings_pitched'] > 0] # Eliminate no innings pitched
players_df.describe()
plt.scatter(players_df['ERA'], players_df['batting_average'])
plt.title('Pitching vs. Batting Skills')
plt.xlabel('ERA')
plt.ylabel('Batting Average')
plt.show()
Most of the values are clustered between 0 and 10.0 ERA, and between 0 and 0.4 batting average. It is hard to discern any clear relationship between batting average and pitching skill from this plot. We also see quite a few outliers.
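One way to put a number on what the scatterplot shows is a correlation coefficient. This is not part of the original analysis; the sketch below uses invented values and the `Series.corr` method from pandas, which computes the Pearson coefficient by default.

```python
import pandas as pd

# Hypothetical values for five players, invented for illustration
toy_players = pd.DataFrame({
    'ERA': [2.5, 3.1, 4.0, 4.8, 6.2],
    'batting_average': [0.180, 0.150, 0.210, 0.120, 0.160],
})

# Pearson correlation coefficient between the two columns, always in [-1, 1]
r = toy_players['ERA'].corr(toy_players['batting_average'])
print("Pearson r = %.3f" % r)
```

A value near zero would agree with the visual impression of no clear linear relationship, although a single coefficient is sensitive to the outliers noted above.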
The "Designated Hitter" rule allows a team to use one non-fielding player as a batter, typically in place of the pitcher. This rule was introduced in 1973 in the American League. In the Lahman database, Appearances.G_dh is the number of games played as designated hitter.
The common understanding is that pitchers are not typically good at batting. This makes me curious: can we quantify how poorly pitchers perform in comparison to other players at batting?
To compare their respective batting skills, we will divide the batters into two groups -- those who have pitched, and those who haven't.
Some of these values should not be included in analysis because they're not representative of the populations we want to compare. For example, some players that batted relatively few times had unusually low or high averages. There is a disproportionate number of such entries in the Batting data, as we saw in section 1.5, which skews the distribution. We will throw those values out:
players_df = players_df[players_df['AB'] > 20]
batting_df = batting_df[batting_df['AB'] > 20]
# Population 1: Pitchers that also bat
pitchers_that_bat = players_df[players_df['innings_pitched'] > 0.0]
pitchers_that_bat_grouped_by_playerID = pitchers_that_bat.groupby('playerID', as_index=False)
# Sum only the columns we need, once, instead of summing every column twice
pitcher_totals = pitchers_that_bat_grouped_by_playerID[['H', 'AB']].sum()
batting_averages_for_pitchers = pitcher_totals['H'] / pitcher_totals['AB']
print("Batting pitchers: %d" % len(batting_averages_for_pitchers))
# Population 2: All other batters
non_pitching_batters = batting_df[~batting_df['playerID'].isin(pitchers_that_bat['playerID'])]
non_pitching_batters = non_pitching_batters[non_pitching_batters['AB'] > 20]
non_pitching_batters_grouped_by_playerID = non_pitching_batters.groupby('playerID', as_index=False)
# Sum only the columns we need, once, instead of summing every column twice
non_pitcher_totals = non_pitching_batters_grouped_by_playerID[['H', 'AB']].sum()
batting_averages_for_non_pitchers = non_pitcher_totals['H'] / non_pitcher_totals['AB']
print("All other batters: %d" % len(batting_averages_for_non_pitchers))
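The split into two populations relies on a boolean mask built with `isin` and negated with `~`. The sketch below shows the same pattern on a toy table; the player IDs, the counts, and the helper `career_average` are all hypothetical.

```python
import pandas as pd

# Hypothetical batting rows; 'p01' is the only player who has also pitched
toy_batting = pd.DataFrame({
    'playerID': ['p01', 'p01', 'b01', 'b02'],
    'AB': [30, 25, 400, 350],
    'H': [5, 4, 120, 90],
})
toy_pitcher_ids = pd.Series(['p01'])

# Boolean mask: True where the batter appears among the pitchers
is_pitcher = toy_batting['playerID'].isin(toy_pitcher_ids)

pitcher_rows = toy_batting[is_pitcher]
other_rows = toy_batting[~is_pitcher]  # ~ negates the mask

def career_average(rows):
    """Sum hits and at bats per player, then divide."""
    totals = rows.groupby('playerID')[['H', 'AB']].sum()
    return totals['H'] / totals['AB']

print(career_average(pitcher_rows))
print(career_average(other_rows))
```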
We look at the mean batting average for pitchers that bat vs. all other batters:
batting_averages_for_pitchers.mean()
batting_averages_for_non_pitchers.mean()
It does appear, at first blush, that pitchers are worse at batting based on the means we calculated for both groups. We can use a t-test to test the null hypothesis that batting pitchers bat no better or worse than all other batters.
The t-test assumes a normal distribution. Recall that we observed outliers in the batting averages due to players who batted infrequently and had either very low or very high batting averages. Similarly, some pitchers pitched few innings. We removed those outliers in section 3.1 in order to get a more accurate confidence interval for the t-test:
import scipy.stats as st  # Needed for st.sem below
stderr = st.sem(batting_averages_for_pitchers)
interval1 = (batting_averages_for_pitchers.mean() - stderr * 1.96, batting_averages_for_pitchers.mean() + stderr * 1.96)
stderr = st.sem(batting_averages_for_non_pitchers)
interval2 = (batting_averages_for_non_pitchers.mean() - stderr * 1.96, batting_averages_for_non_pitchers.mean() + stderr * 1.96)
batting_averages_for_pitchers.plot.kde(label='Avg. for Pitchers')
ax = batting_averages_for_non_pitchers.plot.kde(label='Avg. for Non-Pitchers')
ax.vlines(x=batting_averages_for_pitchers.mean(), ymin=-1, ymax=15, color='red', label='95% CI (Pitchers)')
ax.vlines(x=interval1[0], ymin=-1, ymax=15, color='red')
ax.vlines(x=interval1[1], ymin=-1, ymax=15, color='red')
ax.vlines(x=batting_averages_for_non_pitchers.mean(), ymin=-1, ymax=15, color='purple', label='95% CI (Non-Pitchers)')
ax.vlines(x=interval2[0], ymin=-1, ymax=15, color='purple')
ax.vlines(x=interval2[1], ymin=-1, ymax=15, color='purple')
ax.set_ylim([-1,12])
ax.legend()
plt.title("Batting Average for Pitchers vs. Non-Pitchers")
plt.xlabel("Avg.")
plt.show()
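The 1.96 multiplier used above is the normal approximation for a 95% interval, which is reasonable for samples as large as these. For smaller samples, the critical value from the t distribution is the usual alternative. The sketch below compares the two on an invented sample; the data and variable names are assumptions for illustration only.

```python
import numpy as np
import scipy.stats as st

# Hypothetical sample of batting averages, invented for illustration
sample = np.array([0.150, 0.180, 0.120, 0.200, 0.160, 0.140, 0.170, 0.190])

mean = sample.mean()
stderr = st.sem(sample)            # standard error of the mean (uses ddof=1)
dof = len(sample) - 1              # degrees of freedom
t_crit = st.t.ppf(0.975, dof)      # two-sided 95% critical value

normal_ci = (mean - 1.96 * stderr, mean + 1.96 * stderr)
t_ci = (mean - t_crit * stderr, mean + t_crit * stderr)
print(normal_ci)
print(t_ci)
```

With only eight observations the t-based interval is wider than the normal one; as the sample grows, the two converge.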
We see that the confidence intervals for the mean batting average of each population do not overlap. We will confirm this with a t-test below:
import scipy.stats
scipy.stats.ttest_ind(batting_averages_for_pitchers, batting_averages_for_non_pitchers, equal_var=False)
Since the p-value is very small, we can reject the null hypothesis. Given the sample means computed above, the mean batting average for pitchers is significantly lower than the mean batting average for all other players.
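To make the test more concrete, the sketch below runs the same `ttest_ind` call with `equal_var=False` (Welch's t-test) on synthetic samples and recomputes the statistic by hand. The samples are randomly generated with invented means and spreads, so the numbers say nothing about real players; the sign of the statistic indicates which group has the lower mean.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples, invented for illustration (not Lahman data)
rng = np.random.default_rng(0)
toy_pitchers = rng.normal(loc=0.15, scale=0.04, size=200)
toy_others = rng.normal(loc=0.25, scale=0.03, size=300)

# Welch's t-test: does not assume equal variances in the two groups
result = st.ttest_ind(toy_pitchers, toy_others, equal_var=False)

# The same statistic computed by hand from the Welch formula
var_p = toy_pitchers.var(ddof=1) / len(toy_pitchers)
var_o = toy_others.var(ddof=1) / len(toy_others)
manual_t = (toy_pitchers.mean() - toy_others.mean()) / np.sqrt(var_p + var_o)

print(result.statistic, result.pvalue, manual_t)
```

A negative statistic together with a small p-value corresponds to the conclusion drawn above: the first group's mean is lower, and the difference is unlikely to be due to chance.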