Predicting an MLB Player's Performance In Fantasy Baseball

Michael Eisenberg

Fantrax Fantasy Baseball

Introduction

As the National Pastime, baseball has been a staple of American culture since the 1840s. As technology advanced over the years, baseball became, and remains, the center of statistics in sports. Its emergence as the leader of the statistical revolution began slowly, in 1913 with the establishment of the record-keeping company Elias Sports Bureau, and in 1947 with the hiring of Allan Roth by the then Brooklyn Dodgers (who later moved to their current home in Los Angeles). While the Elias Sports Bureau and Roth were invaluable in promoting the use of statistics in the sport, baseball started using traditional and advanced statistics in earnest in the 1970s, with the establishment of the Society for American Baseball Research (SABR) in 1971 and Bill James's first publication of The Bill James Baseball Abstract in 1977. Major League Baseball's use of statistics accelerated even more with the success of the Oakland Athletics in the early 2000s, who used player data to help them adhere to the Moneyball philosophy.

Around the same time as the rapid increase in the use of statistics in MLB, fantasy sports, including fantasy baseball, have grown in popularity. In fantasy baseball, any number of people can form a fantasy league together, with a commissioner to oversee the league operations. Each member of the league, including the commissioner, assumes control of their own team and attempts to select, through a league draft, as many of the best players in baseball as possible. In fantasy baseball, "best" or "good" players can be determined in any number of ways, with the most common being the points system. In the points system, players earn points for a fantasy team through positive individual actions in a game, such as getting a hit or scoring a run, and lose points through negative individual actions in a game, such as a batter striking out or a pitcher giving up a hit. Throughout the season, a team must accumulate as many points as possible. An owner can improve a team by trading players with other league members (team owners) or by adding a player from the league free agent pool and dropping a player currently on the team, which gives each owner a feel as close as possible to running a real-life team. At the end of the season, one team and its owner are crowned the fantasy league champion.

Many fantasy baseball leagues are single-season leagues, meaning that every year the team owners draft new teams from all of the players in MLB, with some players from the minors also being available in the draft. However, an increasing trend in fantasy sports is the dynasty league, which provides an even closer feel of controlling a real-life team. A dynasty baseball league is long-term: there is an initial draft at the inception of the league, and at the end of each season, each owner selects a certain number of players on their team to keep for the following season. There is then a draft of the remaining players in the player pool, and teams can even trade current and future draft picks, just like teams do in professional sports. Fantasy baseball is perfect for anyone interested in baseball and statistics, as it relies on statistics to determine the point values of players, which helps determine the strongest fantasy league team at the end of the season. The point settings used to calculate the number of points produced by a player are listed under the section "Points System".

As someone interested in statistics and a fan of the Baltimore Orioles, I recently joined a 14-member dynasty baseball league with my twin brother and our friends. This league, called We Are Sabermetrics, runs on the Fantrax website and relies on the aforementioned points system to determine the "fantasy value" of players. Throughout the season, teams compete in weekly matchups, and the team that scores the most points in a matchup wins. At the end of the season, the top 4 teams compete in the playoffs for the league championship. Although my team made the playoffs in the shortened 2020 MLB season, much can be done to improve the team, as it struggled to score points at times throughout the season. This tutorial will provide valuable insight not just for improving my own team, but for anybody looking to build a dominant fantasy baseball team. It will look at traditional statistics such as Home Runs (HR) and Runs (R) as well as more advanced statistics such as Runs Created (RC) and On Base Plus Slugging (OPS). Through an analysis of batter data from the 2015 through 2020 seasons, the tutorial will attempt to use statistics of a player's performance in previous seasons to predict the expected performance for the season as well as the performance for the following season, which would then help determine the baseball players most likely to produce the most points for a fantasy team owner. At the end of the tutorial, you should have a better understanding of how to predict a batter's future performance and hopefully a desire to join your own fantasy baseball league, if you are not currently in one.

Points System

The following table lists the ways in which a player can earn or lose points as well as the values. This is the points system that will be used throughout the tutorial.

Scoring Group    Scoring Category          Points
Hitting          Doubles (2B)              2
Hitting          Errors (E)                -2
Hitting          Hit By Pitches (HBP)      0.5
Hitting          Home Runs (HR)            4
Hitting          Runs Batted In (RBI)      1
Hitting          Runs Scored (R)           1
Hitting          Singles (1B)              1
Hitting          Stolen Bases (SB)         2
Hitting          Strikeouts (SO)           -0.5
Hitting          Triples (3B)              3
Hitting          Walks (BB)                1
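To make the scoring concrete, the table can be turned into a small lookup that totals a batter's points. This is only a sketch: the dictionary keys are hypothetical column names chosen for illustration (not necessarily the ones in the Fantrax export), and the sample game line is made up.

```python
# Point values copied from the table above. The keys are hypothetical
# column names and may not match the league export.
POINTS = {
    "1B": 1.0, "2B": 2.0, "3B": 3.0, "HR": 4.0,
    "RBI": 1.0, "R": 1.0, "BB": 1.0, "SB": 2.0,
    "HBP": 0.5, "SO": -0.5, "E": -2.0,
}

def fantasy_points(stat_line):
    """Return the total fantasy points for a dict of counting statistics."""
    return sum(
        POINTS[stat] * count
        for stat, count in stat_line.items()
        if stat in POINTS  # ignore statistics that are not scored
    )

# Made-up game: a two-run home run, a walk, and a strikeout.
game = {"HR": 1, "RBI": 2, "R": 1, "BB": 1, "SO": 1}
total = fantasy_points(game)
print(total)
```

Dividing a season total computed this way by games played gives the points per game figure used in the rest of the tutorial.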

The data dictionary for the dataframe containing all of the statistics used in the tutorial is provided with this link. The csv files contain some other data, but it is not included in the data dictionary because it is immediately dropped from the encompassing dataframe and never used in the tutorial.

Python Libraries Used

The following libraries, along with Python 3, will be used in the tutorial. Each library is linked to its documentation.

Data Import and Processing

First, we must obtain the data, which consists of fantasy baseball statistics for batters from the 2015 season to the 2020 season. This tutorial does not analyze data prior to 2015 due to the limited amount of player data contained in the pre-2015 files. Since the tutorial is in the context of my dynasty baseball league, We Are Sabermetrics, the statistics and points data must come from the league website. When accessing the player data, there were three categories of statistics that were of interest: Standard, Extra, and Sabermetric. Data on pitchers is also available, and the process followed in this project to analyze batter performance could also be used to analyze pitcher performance. To access and manipulate the data, the proper csv files for each year were downloaded from the league website and then read into Pandas DataFrames.

The following image shows an example of the league interface for viewing and downloading the data.

[Image: 2020 Batters Standard]

After uploading the data, we first had to combine the data for batters as appropriate. For each year, there were three csv files for batters, with each file providing some different set of player statistics. For ease of future analysis and to avoid storing duplicate data, the dataframes containing batter data must be merged together and the duplicate columns must be eliminated. At the end of this step, we have five (key, value) pairs in the stats dictionary, where each key is a unique year and each value is a list containing the merged batter data for that year.
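The merge step described above might look like the following sketch. It assumes the three tables share a `Player` column that can serve as the join key, and the helper `merge_batter_tables` and the tiny example frames are hypothetical; the real export may need a more robust key (for example, player plus team) to distinguish players with the same name.

```python
import pandas as pd

def merge_batter_tables(standard, extra, sabermetric, key="Player"):
    """Merge three per-year batter tables, keeping each column only once."""
    merged = standard
    for other in (extra, sabermetric):
        # Keep the join key plus any columns not already in the merged frame.
        new_cols = [c for c in other.columns if c == key or c not in merged.columns]
        merged = merged.merge(other[new_cols], on=key, how="inner")
    return merged

# Made-up stand-ins for the Standard, Extra, and Sabermetric downloads.
standard = pd.DataFrame({"Player": ["A", "B"], "AB": [500, 120], "H": [150, 30]})
extra = pd.DataFrame({"Player": ["A", "B"], "AB": [500, 120], "2B": [30, 5]})
saber = pd.DataFrame({"Player": ["A", "B"], "H": [150, 30], "wOBA": [0.360, 0.290]})

# One entry per season, keyed by year.
stats = {2020: merge_batter_tables(standard, extra, saber)}
print(list(stats[2020].columns))
```

Selecting only the new columns before each merge avoids the `_x`/`_y` suffixed duplicates that pandas would otherwise create for overlapping column names.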

Now, we must tidy and clean the data. First, we add a column to each dataframe representing the year from which the data came. Then, we filter out the batters who did not receive enough at-bats in a season. For the years 2015-2019, any player with fewer than 125 at-bats was removed. However, the 2020 season was shortened due to the coronavirus pandemic, so the threshold for at-bats was lowered to 50 for that year. After this, we modify the age column. Initially, the age column represented the current age of the player, regardless of the value in the year column. Thus, a quick modification changes the values in the age column to represent the true age of a player during the season from which the data was taken. Finally, we combine all of the batter data into one dataframe, all_batters, and convert each column to a type that better represents the data in the column.
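A minimal sketch of these cleaning steps follows. The column names `AB`, `Age`, and `Year`, the helper `clean_season`, and the assumption that the downloaded age reflects the player's age in 2020 are all illustrative guesses, and the example rows are made up.

```python
import pandas as pd

DEFAULT_MIN_AT_BATS = 125   # threshold for 2015-2019
MIN_AT_BATS = {2020: 50}    # lowered threshold for the shortened season
CURRENT_YEAR = 2020         # assumption: year that the raw "Age" column refers to

def clean_season(df, year):
    """Tag a season, drop low-playing-time batters, and back-date ages."""
    out = df.assign(Year=year)
    threshold = MIN_AT_BATS.get(year, DEFAULT_MIN_AT_BATS)
    out = out.loc[out["AB"] >= threshold].copy()
    # Convert "current age" into the age during the season in question.
    out["Age"] = out["Age"] - (CURRENT_YEAR - year)
    return out

# Made-up raw data: the same two players in two seasons.
raw = {
    2019: pd.DataFrame({"Player": ["A", "B"], "AB": [400, 100], "Age": [30, 25]}),
    2020: pd.DataFrame({"Player": ["A", "B"], "AB": [200, 60], "Age": [30, 25]}),
}

all_batters = pd.concat(
    [clean_season(df, year) for year, df in raw.items()],
    ignore_index=True,
)
all_batters = all_batters.astype({"Year": "int64", "Age": "int64"})
print(all_batters)
```

Player B is dropped from 2019 (100 at-bats is below 125) but kept in 2020 (60 at-bats clears the lowered threshold of 50).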

Exploratory Data Analysis

To be able to predict which batters will perform well, it would be helpful to understand the distribution of points scored as well as what statistics most heavily influence the number of points scored. In this section, we will explore the relationships between many statistics and fantasy points per game to determine if each statistic analyzed plays a large role in determining the quality, determined by the amount of points per game scored, of a player. The statistics that are determined to impact the points per game of a batter will be considered in a linear regression model predicting the performance level of players.

We start the analysis with the following histogram that displays the distribution of fantasy points per game using the all_batters dataframe.

In the above histogram, we immediately notice that the distribution of fantasy points per game for batters has a slight right skew and a median of around 2.0 points per game, with the mean being higher due to the right skew. The right skew indicates that there are usually a few players who have excellent batting seasons and perform much better than the rest of the players.
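The link between right skew and a mean that exceeds the median can be checked directly. The values below are invented purely for illustration and are not taken from the league data.

```python
from statistics import mean, median

# Made-up points-per-game values: most batters cluster near 2.0,
# while a few standout seasons sit well above the rest.
fpg = [1.4, 1.7, 1.8, 1.9, 2.0, 2.0, 2.1, 2.3, 2.6, 3.4, 3.9]

fpg_mean = mean(fpg)
fpg_median = median(fpg)

# In a right-skewed sample the long upper tail pulls the mean above the median.
print(f"mean={fpg_mean:.2f} median={fpg_median:.2f}")
```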

The number of points per game seems rather low when evaluating the histogram, especially when taking into consideration the league's points system. However, one must realize that batters play almost every day and fantasy baseball matchups typically last one full week, so batters will quickly accumulate points. Additionally, an owner who has a good grasp of the players, their situations, and their abilities will be able to pick players who easily clear the average points per game for qualifying batters in a season. As a matter of fact, the team that won the championship had all but three of its batters score at least 2.47 points per game and five of its batters score more than 3 points per game. While this may be difficult to achieve consistently, it does highlight that a league-winning fantasy team has most of its players score well above the median of approximately 2.0 points per game shown by the histogram.

Even though the above distribution does not directly relate to the question of predicting a batter's expected or future performance, it does provide basic background on the fantasy points scored each game as well as the performance level required for a baseball player to be considered good for fantasy. This, as well as an understanding of positional value, allows fantasy owners to determine how much to invest in a given player, whether in draft capital or in a trade.

Another important thing to point out is the relationship between playing time and points per game. Playing time is considered by both games played and the number of at bats. When evaluating both scatter plots of points per game v. playing time, the relationship between points per game and playing time is unclear. In the plot of Points Per Game v. At Bats, there appears to be a moderately strong and positive linear relationship between the two statistics. However, the plot of Points Per Game v. Games Played shows what appears to be a non-linear relationship. While this relationship is not strong, it does have more of a quadratic appearance than a linear appearance.

The linear relationship between At Bats and Points Per Game should be expected. However, having more At Bats does not by itself significantly raise a player's points per game. While more At Bats does lead to more in-game opportunities for a batter, these numbers represent totals over the course of a season. A batter will likely earn more at bats if they are playing well, so we should say that a higher value of Points Per Game leads to an increase in opportunities, or At Bats, not the other way around. Thus, At Bats should not be relied upon to predict the points per game earned by a batter. The same idea applies to the number of games played, regardless of whether the relationship is quadratic or linear, as a player will get more opportunities to play if they play well, a basic idea of any sport.

Now, we must determine what factors best predict a player's performance, as we cannot use all of the statistics in our model to predict player performance. One of the assumptions of the linear regression model is that there is no perfect collinearity among the variables used in the model, and even strong (imperfect) collinearity makes the coefficient estimates unreliable. For example, if hits and batting average have a strong linear relationship with each other, then the linear regression model should not include both hits and batting average. Thus, it is important to explore what might impact a player's performance in fantasy baseball, determine which of those impactful variables have a linear relationship with each other, and select the variables that appear to most strongly impact performance.
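One simple way to screen candidate predictors for collinearity is to compute pairwise Pearson correlations and flag the pairs that exceed some cutoff. The sketch below uses made-up numbers; the helper names and the 0.8 cutoff are illustrative rules of thumb, not values taken from this analysis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Made-up season lines for five batters.
predictors = {
    "H": [150, 120, 90, 170, 60],
    "AVG": [0.300, 0.260, 0.240, 0.310, 0.220],
    "SB": [5, 20, 2, 10, 15],
}
flagged = collinear_pairs(predictors)
print(flagged)
```

In this toy data only hits and batting average are flagged, so a model would keep at most one of the two.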

We start with the age of a batter. A player's age, in any sport, could impact their performance because as players get older, they gain more experience and knowledge of the game. However, their bodies are expected to be less able to handle the rigors of a full 162-game season as they get older. With that in mind, we plotted the violin plots above. Contrary to expectations, there does not seem to be, with a few exceptions, much of an impact of age on points scored per game. Most of the ages have a mean points per game of around 2 except for players aged 20, 21, 39, or 40, the two youngest and two oldest ages. A likely explanation is that few baseball players earn significant playing time at those young ages, and the ones who do are typically expected to be among the best players in baseball over their careers, so the sample is small and biased towards those uber-talented players. Additionally, the distribution for players aged 39 and 40 contains so few players that we cannot properly determine the distribution of points per game for players of those ages.

It is also important to point out the distribution of the points per game as seen in the above plots. There appear to be two sets of ages, 22-28 and 29-34, with similar distributions, while the rest of the ages have their own unique distributions. Although we could attempt to quantify the changes in the distribution of fantasy points per game by age, the distributions do not change in a distinctive pattern, nor do many of the ages have a different mean points per game. Therefore, it is unlikely that age has much, if any, impact on a player's performance.

Now, we determine the impact of four related statistics, hits (H), home runs (HR), runs (R), and runs batted in (RBI), on a player's points per game. Immediately, we notice that there appears to be a positive relationship between each of these statistics and points per game. Intuitively, this makes sense, as each provides positive points to a fantasy team when produced by a player, and more productive players should have higher numbers of hits, runs, home runs, and RBIs. Since each of these statistics plays a large role in a player's points per game, each could potentially be used to predict the performance level of a player.

Another thing that jumps out from the above plots is that the scatter plots of FP/G v. each of H, R, and RBI all look very similar. This result is obvious once we realize that in order to score or to drive in a run, a batter will typically need to walk or get a hit. Since players get hits far more often than they walk, R and RBI are highly dependent on the amount of hits by a batter. Without the hits, a batter will score or drive in a run very rarely. This means that in a linear model, we cannot use all of hits, runs, and runs batted in to predict the points per game produced by a player.

For the above plots, each of H, R, HR, and RBI was standardized prior to being plotted because, as previously mentioned, the 2020 season was only 60 games instead of the typical 162. Without standardization, the much smaller 2020 totals could act as outliers or leverage points, which can distort the analysis. The standardization does not change the overall insights of the analysis. Instead, it, along with the plots, tells us how far above the mean a player typically needs to be in a certain statistic to reach some points per game level.
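As an illustration of why standardizing helps with the shortened season, the sketch below converts a counting statistic to z-scores within each season. Grouping by season is an assumption about how the standardization was done, the helper name is hypothetical, and the hit totals are made up.

```python
from statistics import mean, pstdev

def standardize_by_season(rows, stat):
    """Return z-scores of `stat`, computed separately within each season."""
    by_year = {}
    for row in rows:
        by_year.setdefault(row["Year"], []).append(row[stat])
    # Mean and population standard deviation per season
    # (a season with identical values would need special handling).
    params = {year: (mean(vals), pstdev(vals)) for year, vals in by_year.items()}
    return [
        (row[stat] - params[row["Year"]][0]) / params[row["Year"]][1]
        for row in rows
    ]

# Made-up hit totals: a full season and a 60-game season.
rows = [
    {"Year": 2019, "H": 180}, {"Year": 2019, "H": 140}, {"Year": 2019, "H": 100},
    {"Year": 2020, "H": 70}, {"Year": 2020, "H": 55}, {"Year": 2020, "H": 40},
]
z_scores = standardize_by_season(rows, "H")
print([round(z, 3) for z in z_scores])
```

The season leaders end up with the same z-score even though 180 hits and 70 hits are far apart in raw terms, which is what makes seasons of different lengths comparable.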

The impact of runs, RBI, home runs, and hits on a player's points per game as seen in the previous scatter plots is further confirmed here. The above chart plots the mean points per game generated by singles (1B), doubles, triples, home runs, runs, RBIs, walks, and stolen bases. Hits was broken up into 1B, 2B, 3B, and HR to visualize the impact of each type of hit on the points per game because otherwise most of the lines would appear at the bottom of the chart with a lot of space between those lines and the line representing points per game generated by hits. The amount of hits clearly has a significant impact on the points per game for players. What is interesting is that of the four hit categories, home runs and singles contribute in a similar amount and much more significantly than doubles and triples. As a matter of fact, triples contribute very little to the points per game for a batter, even though a player produces 3 fantasy points for a triple and only 1 for a single and 2 for a double, due to the rarity of the triple in modern baseball.

Another interesting part of the graph is how runs and RBIs, even though they differ in their overall contributions to points per game, increase and decrease by nearly the same amount every year. The scatter plots above for R v. FP/G and RBI v. FP/G also looked very similar, and the discussion of those plots mentions how both depend highly on a batter's number of hits. Both runs and RBIs can clearly help predict a player's performance, so they should be used, albeit separately, in a linear model predicting a batter's performance.

It is important to note that 1B represents singles and is not in the original dataset. However, singles can be calculated using the following formula:
1B = H - HR - 3B - 2B
This is not included in the original dataframe because singles can easily be calculated when needed and they are not used in any of the regression models.
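The formula is easy to apply in code, and combining it with the point values from the Points System shows why singles and home runs dominate the hit-related points. The season line below is made up for illustration, and the function name is hypothetical.

```python
def singles(hits, doubles, triples, home_runs):
    """1B = H - HR - 3B - 2B."""
    return hits - home_runs - triples - doubles

# Made-up season line.
hits, doubles, triples, home_runs = 150, 30, 3, 25
one_base_hits = singles(hits, doubles, triples, home_runs)

# Fantasy points contributed by each hit type, using the league's point values.
hit_points = {
    "1B": 1 * one_base_hits,
    "2B": 2 * doubles,
    "3B": 3 * triples,
    "HR": 4 * home_runs,
}
print(one_base_hits, hit_points)
```

Even though a triple is worth three times as much as a single, this batter's three triples contribute far fewer points than his singles or home runs.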

Now that we understand how counting statistics such as runs, hits, and home runs impact player performance, let's investigate how statistics built from those counting statistics predict performance. Even though these statistics (such as batting average) do not directly impact the number of points earned by a player, they incorporate many of the counting statistics and are commonly used in any discussion or evaluation of the quality of a batter. The top left plot above shows the relationship between points per game and batting average. There appears to be a moderate and positive linear relationship between the two variables, which is consistent with the above results. Since batting average is simply the number of hits per at-bat and there is a positive linear relationship between hits and points per game, the plot of batting average v. points per game should also show a positive linear relationship.

The subsequent plot, FP/G v. OBP, displays an even stronger linear relationship than the previous plot of FP/G v. AVG. As a matter of fact, the plots get increasingly linear and show less variability up to the last plot, FP/G v. OPS, which displays a powerful linear relationship between fantasy points per game and OPS. The graph using OBP displays a stronger relationship than the graph using AVG because OBP takes into account walks and other ways of reaching base beyond hits. The plot using SLG displays a stronger linear relationship than the plot using AVG because SLG takes into account the type of hit a batter had, accounting for whether the hits were singles, doubles, triples, or home runs. Since the different hit types are worth different numbers of points, SLG should generally be at least as good at predicting a player's performance as the player's batting average.

When creating a model, OPS will be the chosen statistic out of these four because of the strength of its linear relationship with points per game. FP/G v. OPS displays the strongest relationship because OPS is simply the sum of OBP and SLG. Since OPS incorporates both of those statistics and both OBP and SLG incorporate hits, OPS should provide stronger predictive ability for points per game than AVG. It is not initially clear, however, why OPS displays a stronger relationship than both OBP and SLG if it is just the sum of those two statistics. One explanation could be that by combining two statistics that evaluate slightly different aspects of a batter's performance, OPS accounts for more factors that could impact a batter's points per game, which leads to a stronger linear relationship.

Now, we evaluate the relationship between 6 different advanced statistics and points per game. The first two plots, of walks per plate appearance and strikeouts per plate appearance, display some rather surprising results. Since walks contribute positively to points per game and strikeouts contribute negatively to points per game, a strong linear relationship in the plots of both BB/PA v. FP/G and K/PA v. FP/G was expected. However, there appears to be almost no linear relationship between walks per plate appearance and points per game and a weak but negative relationship between strikeouts per plate appearance and points per game. Especially as the "Three True Outcomes" become even more common, a stronger relationship in the first two plots was expected. However, upon analysis of the two plots, we can see why the relationship is weak or nonexistent. In the BB/PA v. FP/G plot, most of the data points fall between 0.05 and 0.2, and in the K/PA v. FP/G plot, most of the data points fall between 0.1 and 0.4. Since the endpoints of the interval for K/PA are twice as large as the endpoints for BB/PA, strikeouts and walks have opposite effects on points, and the point value of a strikeout is half the magnitude of the point value of a walk, the relative lack of a linear relationship in the two plots is much more understandable.

Although the first two plots did not show much of a linear relationship, the plots of ISO v. FP/G and WOBA v. FP/G do show a linear relationship. This result was not surprising because ISO is calculated as (SLG - AVG) and WOBA is determined by walks, hit by pitches, and each of the four types of hits. Both are built from statistics previously determined to be related to a batter's performance, so ISO and WOBA should have at least a similar strength of relationship with points per game as statistics like average and hits. While the plot of ISO v. FP/G shows a clear relationship, the plot of FP/G v. WOBA shows a stronger relationship between the two statistics. As a matter of fact, the WOBA plot has a similar shape and relationship strength to the previous graph of FP/G v. OPS.
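For reference, the rate statistics discussed so far can all be derived from a batter's counting statistics using their standard definitions. In this tutorial the values come directly from the league export, so the function below is only a sketch with a hypothetical name and a made-up season line.

```python
def rate_stats(ab, h, doubles, triples, hr, bb, hbp, sf):
    """Compute AVG, OBP, SLG, OPS, and ISO from counting statistics."""
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    avg = h / ab                                   # hits per at-bat
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)    # times on base per opportunity
    slg = total_bases / ab                         # total bases per at-bat
    return {
        "AVG": avg,
        "OBP": obp,
        "SLG": slg,
        "OPS": obp + slg,   # on-base plus slugging
        "ISO": slg - avg,   # isolated power: extra bases per at-bat
    }

# Made-up season line.
line = rate_stats(ab=500, h=150, doubles=30, triples=3, hr=25, bb=60, hbp=5, sf=5)
print({name: round(value, 3) for name, value in line.items()})
```

Because ISO subtracts AVG from SLG, it is never negative: every hit counts for at least one base in SLG.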

The most interesting result is the plot of points per game v. runs created (RC). In this plot, there appears to be a hard lower bound for points per game given the number of runs created, and overall there appears to be a strong linear relationship. This result was expected because runs created attempts to estimate how much a player contributes to his professional team's ability to score and accounts for hits, walks, and total bases. There are many different forms of this statistic, and each is a powerful indicator of a batter's performance. Lastly, the graph of BABIP v. FP/G shows essentially no relationship between the two statistics, an unsurprising result given that BABIP varies wildly, even for the best-performing batters.

Here, we compare the relationships of OPS and Runs Created per Plate Appearance (RCPA) with fantasy points per game. RCPA was calculated by dividing a player's runs created by his number of plate appearances. This was done because the number of runs a batter can create is limited by his number of plate appearances, so raw runs created totals favor batters with more playing time. RCPA allows runs created data across different players and seasons to be compared more easily. When evaluating the above scatter plots, RCPA has a very strong linear relationship with points per game, as expected due to the relationship between runs created and statistics such as hits. We also notice that while both appear to have a very strong relationship with points per game, OPS appears to have a very slightly stronger relationship with fantasy points per game than RCPA. As a result, OPS should better predict a player's performance than RCPA.
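The effect of dividing by plate appearances can be seen with a small example. The league export already supplies runs created, and which version of the statistic it uses is not stated here, so the sketch below assumes the basic Bill James form, RC = (H + BB) x TB / (AB + BB), purely for illustration. The function names and both stat lines are made up.

```python
def runs_created_basic(h, bb, total_bases, ab):
    """Basic runs created: (H + BB) * TB / (AB + BB). One of several forms."""
    return (h + bb) * total_bases / (ab + bb)

def rc_per_pa(rc, pa):
    """Runs created per plate appearance (RCPA)."""
    return rc / pa

# Made-up lines: a full-time batter and a part-time batter of similar quality.
rc_full = runs_created_basic(h=150, bb=60, total_bases=261, ab=500)
rc_part = runs_created_basic(h=75, bb=30, total_bases=130, ab=250)

rcpa_full = rc_per_pa(rc_full, pa=570)
rcpa_part = rc_per_pa(rc_part, pa=285)

print(round(rc_full, 2), round(rc_part, 2))
print(round(rcpa_full, 3), round(rcpa_part, 3))
```

The raw totals differ by about a factor of two because of playing time alone, while the per-plate-appearance rates are nearly identical.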

Predicting Batter Performance

We have seen how many of the statistics play an important role in determining the fantasy baseball points per game produced by a batter, as well as the relationships between points per game and advanced statistics such as runs created. Now, we must combine the knowledge we have gained and create linear regression models that predict a player's expected fantasy output, keeping in mind that we cannot violate the multicollinearity assumption, which says that the model cannot include multiple independent variables that have a clear linear relationship with each other.

A total of 6 different multiple linear regression models will be created, each relying on a different subset of statistics to predict a player's expected points per game, with the independent variables in each model confirmed not to violate the multicollinearity assumption. Each model will be fit on a training set, which is 75% of the all_batters dataset, and then evaluated on the test set, which is the remaining 25% of the all_batters dataset. The subsets of independent variables were selected through an exploratory analysis of the impact of different sets of independent variables as well as an effort to include as many as possible of the above statistics that showed a linear relationship with points per game. This will allow for a comparison of the predictive power of each set of independent variables. The different sets of independent variables (statistics) used to predict points per game are as follows:

Above, we see the predictors as well as the corresponding coefficients for Model 1 and Model 3, the two strongest models.
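The fitting procedure can be sketched as follows. This is not the tutorial's actual model code: it uses NumPy's least-squares solver as a stand-in for whichever regression library was used, and it runs on synthetic data with invented coefficients in place of all_batters. The predictors RCPA, RBI, and K/PA are chosen only because the conclusion names that combination.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for all_batters (values and coefficients are invented).
n = 200
rcpa = rng.normal(0.13, 0.03, n)
rbi = rng.normal(60, 20, n)
k_pa = rng.normal(0.22, 0.05, n)
X = np.column_stack([rcpa, rbi, k_pa])
y = 0.5 + 10.0 * rcpa + 0.005 * rbi - 1.0 * k_pa + rng.normal(0, 0.1, n)

# Random 75% / 25% train-test split.
order = rng.permutation(n)
cut = int(0.75 * n)
train, test = order[:cut], order[cut:]

def fit_ols(features, target):
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_k]."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict(coef, features):
    """Apply fitted coefficients to a feature matrix."""
    return coef[0] + features @ coef[1:]

coef = fit_ols(X[train], y[train])
pred = predict(coef, X[test])
print(np.round(coef, 3))
```

Fitting on the training rows only and predicting on the held-out rows is what allows the test-set scores to reflect how a model would perform on players it has not seen.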

Conclusion

Once the models were created, it was important to determine how well each model predicted a player's performance based on certain statistics. These results, in terms of the r^2 score and the mean squared error, are plotted in separate bar charts, allowing for easy comparisons of the different models. When evaluating the bar charts, two models, Model 1 and Model 3, stand out as the most effective in predicting a player's points per game. These two models had higher r^2 scores than the rest of the models, and the difference between the two models in terms of r^2 score was less than 0.01, meaning that the share of variability explained by the two models differed by less than one percentage point, a very small amount. The high performance of Model 1 and Model 3 can further be seen in the bar plot for mean squared error, where both models have a clearly lower mean squared error than the others and again the difference between the two values was less than 0.01. For mean squared error, this means that the difference in prediction error for points per game between Model 1 and Model 3 was exceedingly low. Based on these results, it is difficult to truly say which of Model 1 or Model 3 is the best of the six total models. Even though Models 1 and 3 are clearly the two best models, the other four models still performed well, all having low mean squared error values and r^2 values greater than 0.7.
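For readers unfamiliar with the two metrics, the following sketch computes them from their standard definitions. The actual and predicted values are made up and are not results from any of the six models.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    """Share of the variance in `actual` explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up points-per-game values for four batters in a test set.
actual = [1.5, 2.0, 2.5, 3.0]
predicted = [1.6, 1.9, 2.6, 2.9]

mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(round(mse, 4), round(r2, 4))
```

A lower mean squared error and an r^2 closer to 1 both indicate predictions that track the actual points per game more closely.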

We can say that using a combination of RCPA, RBI, and K/PA or a combination of WOBA and R is best to determine a player's performance. The models predict a player's expected performance and can potentially identify where players underperformed or overperformed by a significant amount. By understanding which players underperformed expectations, fantasy owners can see which players may be undervalued and allocate some resources to target the players who should perform better the following season. By understanding which players performed as expected or close to it, fantasy owners can answer questions such as "Is this player's good or bad performance a true reflection of their abilities, or should my expectations of their production be adjusted?" Correctly answering this question, as well as others, is vital to the success of a fantasy team, as any fantasy owner needs to be able to identify which players will continue their higher or lower performance levels in the upcoming season. Finally, by determining which players overperformed expectations, a fantasy owner can allow other teams in the league to spend resources on acquiring those players, leaving better and potentially cheaper players available for acquisition. Overall, a combination of advanced and traditional statistics, with the specific statistics left to the discretion of each fantasy player, is required to determine the expected points per game generated by a batter.
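One way to act on this idea is to compare each batter's actual points per game with the model's prediction and label the difference. The sketch below uses hypothetical batters and an arbitrary tolerance of 0.25 points per game; both are illustrative assumptions rather than values from this analysis.

```python
def classify_performance(players, tolerance=0.25):
    """Label each player by the gap between actual and expected points per game."""
    labels = {}
    for name, actual, expected in players:
        residual = actual - expected
        if residual > tolerance:
            labels[name] = "overperformed"
        elif residual < -tolerance:
            labels[name] = "underperformed"
        else:
            labels[name] = "as expected"
    return labels

# Hypothetical batters: (name, actual FP/G, model-predicted FP/G).
players = [
    ("Batter A", 3.1, 2.6),
    ("Batter B", 1.8, 2.4),
    ("Batter C", 2.2, 2.3),
]
labels = classify_performance(players)
print(labels)
```

Under this scheme, Batter B is a possible buy-low target, while Batter A is a candidate to let other owners pay for.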

Final Remarks and Additional Reads

While much insight was gained throughout this tutorial, more analysis and data could be used to create more powerful models and provide fantasy baseball owners with even stronger tools to predict which players will perform the best. As mentioned previously, not all of the needed player data was available for each year analyzed, mostly for the earlier years used in the analysis. Additionally, the value in the Position column of the all_batters dataframe only represents the current position the batter plays in the field and is not a reflection of the position for which the batter qualified (played) in that year. In a subsequent analysis, we could pull an even larger amount of data from a site such as Statcast. While that data would have to be modified to ensure that the league settings and scoring system are properly followed and accounted for, it could provide information on more advanced statistics such as exit velocity or wins above replacement (WAR). Additionally, analysis and future prediction of performance can be made on pitchers, separating starting pitchers and relief pitchers, as well as by the fielding position of batters. Or, in a very powerful model, we could predict career trajectories in fantasy baseball based on minor league and major league statistics. There are many ways to go about this analysis, and the insight provided here is just a small portion of what can be done.

To read more about interesting topics or stories in baseball, click on any of the following links. While some require knowledge about baseball and fantasy baseball, each article can be read and understood by anyone.