I have a long-standing personal interest in the sport of baseball. To “pinch hit” in baseball is to replace the starting player at a position in the batting order with another player who, presumably, has a better chance at getting a hit against the current pitcher. A pinch-hitter is often called on in “high-stakes” situations, where the outcome of the game is on the line.
The specific question I want to explore:
“Do pinch hitters in Major League Baseball, as a group, have a greater ability to get a hit than the starting players they have replaced?”
An extension that I might pursue – does the answer to this question change when data from different spans of time in baseball history is analyzed?
A second extension I might pursue – is there any difference in results when we examine National League games as compared to American League games? It should be noted that in National League games, the player being replaced for an at-bat is usually the pitcher (who is often a poor hitter).
This question, and the possible extensions, would be useful to anyone who enjoys the sport of baseball: fans, players, coaches, and team staff. The population of interest is Major League Baseball players.
H0: Major League Baseball players during the 2014 baseball season who are pinch hitting have the same ability to get a hit as starting players.
Ha: Major League Baseball players during the 2014 baseball season who are pinch hitting have a greater ability to get a hit than starting players.
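Stated in symbols, and assuming that “ability to get a hit” is measured by batting average, the two hypotheses can be sketched as below. The symbols are introduced here only for illustration: p_PH stands for the batting average in pinch-hit at-bats and p_S for the batting average of starting players.

```latex
\begin{aligned}
H_0 &: p_{\mathrm{PH}} = p_{\mathrm{S}} \\
H_a &: p_{\mathrm{PH}} > p_{\mathrm{S}}
\end{aligned}
```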
Load functions that make it easier to create numerical and graphical summaries:
source('http://russellgordon.ca/rsgc/r/functions.r')
Next, load the packages that make it possible to automatically scrape data from the web.
# We need the 'XML' package. Install if not present.
if (!require(XML)) {
  install.packages("XML")
  require(XML)
}
## Loading required package: XML
# We need the 'stringr' package. Install if not present.
if (!require(stringr)) {
  install.packages("stringr")
  require(stringr)
}
## Loading required package: stringr
I have chosen to work with the data from Baseball Reference.
The data at Baseball Reference provides pinch-hit statistics with more detail than ESPN. Specifically, the following situational hitting stats are provided by Baseball Reference:
In my proposal, I had mentioned that the primary statistic I would use to evaluate pinch-hitting effectiveness was batting average. Although batting average is not provided at the Baseball Reference data source, Major League Baseball defines batting average as the “number of base hits divided by the total number of at-bats”. With the information provided by my Baseball Reference source I can calculate batting average myself, and then plot the results to explore whether any of my hypotheses are supported by the data.
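The definition quoted above translates directly into a small function. This is a minimal sketch; `batting_average` is a hypothetical helper name, not something provided by the sourced functions file, and the example figures are Arizona's 2014 pinch-hit totals (56 hits in 227 at-bats) from the source data.

```r
# Hypothetical helper: batting average = base hits / at-bats.
batting_average <- function(hits, at_bats) {
  # Return NA where there were no at-bats, rather than dividing by zero.
  ifelse(at_bats > 0, hits / at_bats, NA_real_)
}

# Arizona, 2014, pinch-hit situations: 56 hits in 227 at-bats.
ari_pinch_avg <- batting_average(56, 227)
round(ari_pinch_avg, 4)   # 0.2467
```

Because the function is vectorized, it can also be applied to whole columns of hits and at-bats at once.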
On the same Baseball Reference page, overall batting statistics (in any situation) are provided. This will permit me to make a comparison based on one of the categorical variables I identified in my proposal (at-bat type: “pinch hit” or “regular”). I plan to subtract the pinch-hit batting statistics from the overall batting statistics so that I can identify batting results in “regular” situations.
Here, I set the data source - this is data for the 2014 regular season:
url <- "http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml"
Now I read the data from all of the tables found at that URL (web site address):
tables <- readHTMLTable(url)
My Baseball Reference data source contains multiple tables of data.
The first table at the page appears to contain the data that I need, so I will load that into a data frame.
situational_batting_2014 <- tables[[1]]
First, I will review the data that I have found:
view(situational_batting_2014)
## Loading required package: knitr
Row | Tm | R/G | PA | Ptn% | H | Inf | Bnt | AB | H | HR | RBI | PHlev | All | GS | GSo | vRH | vLH | Hm | Rd | IP | Att | Suc | % | Opp | DP | % | Opp | Suc | % | BR | BRS | BRS% | <2,3B | Scr | % | 0,2B | Adv | % | PAu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ARI | 3.80 | 6089 | 50% | 1379 | 149 | 15 | 227 | 56 | 7 | 18 | 1.25 | 118 | 4 | 107 | 91 | 27 | 62 | 56 | 0 | 73 | 56 | 77% | 1113 | 115 | 10% | 586 | 187 | 32% | 3523 | 483 | 14% | 303 | 153 | 51% | 212 | 126 | 59% | 0 |
2 | ATL | 3.54 | 6064 | 44% | 1316 | 155 | 13 | 190 | 34 | 2 | 7 | 1.18 | 123 | 2 | 130 | 91 | 32 | 62 | 61 | 0 | 77 | 53 | 69% | 1061 | 121 | 11% | 563 | 155 | 28% | 3514 | 441 | 13% | 272 | 137 | 50% | 218 | 112 | 51% | 0 |
3 | BAL | 4.35 | 6130 | 46% | 1434 | 157 | 16 | 67 | 21 | 3 | 11 | 1.85 | 211 | 6 | 149 | 155 | 56 | 107 | 104 | 0 | 53 | 35 | 66% | 1108 | 112 | 10% | 504 | 143 | 28% | 3533 | 485 | 14% | 257 | 133 | 52% | 228 | 115 | 50% | 0 |
4 | BOS | 3.91 | 6226 | 50% | 1355 | 140 | 4 | 86 | 20 | 2 | 8 | 1.75 | 123 | 5 | 151 | 84 | 39 | 49 | 74 | 0 | 33 | 20 | 61% | 1227 | 138 | 11% | 563 | 165 | 29% | 3837 | 503 | 13% | 326 | 164 | 50% | 224 | 105 | 47% | 0 |
5 | CHC | 3.79 | 6102 | 57% | 1315 | 131 | 18 | 244 | 45 | 1 | 14 | 1.26 | 157 | 1 | 108 | 119 | 38 | 69 | 88 | 0 | 81 | 57 | 70% | 1076 | 94 | 9% | 568 | 169 | 30% | 3441 | 447 | 13% | 288 | 143 | 50% | 207 | 98 | 47% | 0 |
6 | CHW | 4.07 | 6077 | 51% | 1400 | 164 | 16 | 75 | 23 | 1 | 10 | 1.61 | 155 | 4 | 124 | 110 | 45 | 74 | 81 | 0 | 28 | 19 | 68% | 1096 | 127 | 12% | 528 | 141 | 27% | 3540 | 497 | 14% | 299 | 157 | 53% | 233 | 122 | 52% | 0 |
7 | CIN | 3.67 | 5978 | 52% | 1282 | 145 | 25 | 203 | 50 | 6 | 17 | 1.30 | 131 | 6 | 106 | 96 | 35 | 77 | 54 | 0 | 92 | 76 | 83% | 1003 | 88 | 9% | 551 | 186 | 34% | 3335 | 449 | 13% | 318 | 164 | 52% | 225 | 129 | 57% | 0 |
8 | CLE | 4.13 | 6222 | 74% | 1411 | 155 | 31 | 105 | 23 | 1 | 11 | 1.36 | 142 | 1 | 134 | 111 | 31 | 72 | 70 | 0 | 61 | 51 | 84% | 1117 | 126 | 11% | 542 | 180 | 33% | 3698 | 516 | 14% | 315 | 168 | 53% | 246 | 147 | 60% | 0 |
9 | COL | 4.66 | 6164 | 47% | 1551 | 199 | 20 | 233 | 61 | 5 | 30 | 1.19 | 186 | 4 | 121 | 126 | 60 | 119 | 67 | 2 | 78 | 59 | 76% | 1096 | 121 | 11% | 576 | 201 | 35% | 3608 | 556 | 15% | 334 | 189 | 57% | 268 | 145 | 54% | 0 |
10 | DET | 4.67 | 6202 | 45% | 1557 | 140 | 16 | 74 | 14 | 2 | 6 | 1.66 | 155 | 3 | 149 | 109 | 46 | 76 | 79 | 0 | 40 | 24 | 60% | 1181 | 137 | 12% | 598 | 212 | 35% | 3842 | 594 | 15% | 393 | 209 | 53% | 292 | 171 | 59% | 0 |
11 | HOU | 3.88 | 6055 | 61% | 1317 | 153 | 13 | 76 | 9 | 0 | 4 | 1.38 | 163 | 5 | 124 | 120 | 43 | 90 | 73 | 1 | 35 | 22 | 63% | 1118 | 122 | 11% | 506 | 141 | 28% | 3479 | 451 | 13% | 290 | 151 | 52% | 218 | 118 | 54% | 0 |
12 | KCR | 4.02 | 6058 | 48% | 1456 | 181 | 23 | 43 | 9 | 2 | 5 | 1.50 | 95 | 3 | 110 | 64 | 31 | 43 | 52 | 1 | 54 | 33 | 61% | 1119 | 131 | 12% | 560 | 186 | 33% | 3604 | 540 | 15% | 332 | 190 | 57% | 242 | 139 | 57% | 0 |
13 | LAA | 4.77 | 6285 | 52% | 1464 | 166 | 22 | 105 | 25 | 1 | 12 | 1.36 | 155 | 1 | 143 | 113 | 42 | 73 | 82 | 0 | 39 | 26 | 67% | 1175 | 112 | 10% | 569 | 189 | 33% | 3844 | 605 | 16% | 336 | 184 | 55% | 272 | 142 | 52% | 0 |
14 | LAD | 4.43 | 6231 | 46% | 1476 | 203 | 27 | 199 | 46 | 1 | 23 | 1.20 | 134 | 0 | 129 | 99 | 35 | 71 | 63 | 0 | 71 | 47 | 66% | 1128 | 119 | 11% | 615 | 187 | 30% | 3861 | 573 | 15% | 354 | 176 | 50% | 262 | 148 | 56% | 0 |
15 | MIA | 3.98 | 6185 | 46% | 1399 | 191 | 20 | 246 | 45 | 5 | 27 | 1.56 | 122 | 4 | 162 | 98 | 24 | 59 | 63 | 0 | 85 | 71 | 84% | 1186 | 143 | 12% | 602 | 181 | 30% | 3771 | 511 | 14% | 330 | 165 | 50% | 221 | 121 | 55% | 0 |
16 | MIL | 4.01 | 6065 | 40% | 1366 | 171 | 25 | 212 | 47 | 4 | 22 | 1.50 | 150 | 4 | 118 | 109 | 41 | 77 | 73 | 0 | 100 | 70 | 70% | 1026 | 135 | 13% | 558 | 178 | 32% | 3385 | 491 | 15% | 292 | 142 | 49% | 219 | 134 | 61% | 0 |
17 | MIN | 4.41 | 6233 | 60% | 1412 | 163 | 18 | 86 | 18 | 0 | 12 | 1.26 | 128 | 1 | 168 | 94 | 34 | 67 | 61 | 1 | 34 | 25 | 74% | 1177 | 97 | 8% | 553 | 159 | 29% | 3976 | 576 | 14% | 384 | 185 | 48% | 253 | 124 | 49% | 0 |
18 | NYM | 3.88 | 6145 | 52% | 1306 | 144 | 13 | 212 | 39 | 3 | 12 | 1.38 | 125 | 3 | 148 | 103 | 22 | 59 | 66 | 0 | 76 | 59 | 78% | 1129 | 112 | 10% | 582 | 180 | 31% | 3704 | 492 | 13% | 318 | 149 | 47% | 232 | 126 | 54% | 0 |
19 | NYY | 3.91 | 6082 | 70% | 1349 | 161 | 16 | 90 | 22 | 2 | 8 | 1.61 | 147 | 2 | 137 | 108 | 39 | 88 | 59 | 0 | 45 | 29 | 64% | 1085 | 111 | 10% | 539 | 186 | 35% | 3554 | 468 | 13% | 330 | 175 | 53% | 247 | 150 | 61% | 0 |
20 | OAK | 4.50 | 6245 | 72% | 1354 | 164 | 25 | 156 | 32 | 3 | 16 | 1.52 | 146 | 5 | 172 | 102 | 44 | 74 | 72 | 0 | 41 | 19 | 46% | 1207 | 118 | 10% | 551 | 159 | 29% | 3895 | 567 | 15% | 334 | 168 | 50% | 262 | 141 | 54% | 0 |
21 | PHI | 3.82 | 6198 | 65% | 1356 | 143 | 12 | 220 | 40 | 5 | 19 | 1.36 | 125 | 4 | 142 | 84 | 41 | 64 | 61 | 0 | 73 | 59 | 81% | 1103 | 94 | 9% | 563 | 174 | 31% | 3676 | 480 | 13% | 306 | 157 | 51% | 242 | 135 | 56% | 0 |
22 | PIT | 4.21 | 6224 | 48% | 1436 | 185 | 16 | 281 | 61 | 7 | 32 | 1.34 | 156 | 1 | 155 | 133 | 23 | 62 | 94 | 1 | 87 | 54 | 62% | 1166 | 127 | 11% | 584 | 174 | 30% | 3856 | 512 | 13% | 334 | 172 | 52% | 224 | 117 | 52% | 0 |
23 | SDP | 3.30 | 5905 | 67% | 1199 | 131 | 14 | 280 | 61 | 11 | 29 | 1.38 | 109 | 3 | 127 | 82 | 27 | 54 | 55 | 0 | 79 | 56 | 71% | 989 | 118 | 12% | 544 | 175 | 32% | 3310 | 411 | 12% | 308 | 141 | 46% | 222 | 118 | 53% | 0 |
24 | SEA | 3.91 | 5977 | 65% | 1328 | 162 | 20 | 82 | 16 | 1 | 7 | 1.72 | 136 | 0 | 102 | 108 | 28 | 73 | 63 | 0 | 56 | 35 | 63% | 1019 | 112 | 11% | 482 | 152 | 32% | 3292 | 484 | 15% | 284 | 151 | 53% | 177 | 98 | 55% | 0 |
25 | SFG | 4.10 | 6087 | 60% | 1407 | 142 | 12 | 207 | 46 | 4 | 29 | 1.43 | 132 | 4 | 145 | 92 | 40 | 53 | 79 | 0 | 58 | 45 | 78% | 1134 | 113 | 10% | 585 | 182 | 31% | 3681 | 522 | 14% | 326 | 152 | 47% | 243 | 129 | 53% | 0 |
26 | STL | 3.82 | 6086 | 51% | 1371 | 144 | 15 | 218 | 49 | 2 | 24 | 1.11 | 105 | 1 | 140 | 72 | 33 | 57 | 48 | 0 | 90 | 64 | 71% | 1201 | 140 | 12% | 583 | 181 | 31% | 3784 | 503 | 13% | 288 | 160 | 56% | 236 | 143 | 61% | 0 |
27 | TBR | 3.78 | 6205 | 54% | 1361 | 168 | 20 | 142 | 30 | 2 | 20 | 1.79 | 117 | 1 | 155 | 85 | 32 | 51 | 66 | 3 | 70 | 43 | 61% | 1211 | 135 | 11% | 562 | 183 | 33% | 3859 | 485 | 13% | 324 | 160 | 49% | 212 | 105 | 50% | 0 |
28 | TEX | 3.93 | 6026 | 49% | 1400 | 189 | 31 | 85 | 21 | 1 | 10 | 1.14 | 111 | 2 | 109 | 73 | 38 | 51 | 60 | 0 | 51 | 41 | 80% | 1061 | 148 | 14% | 554 | 209 | 38% | 3514 | 507 | 14% | 297 | 152 | 51% | 226 | 142 | 63% | 0 |
29 | TOR | 4.46 | 6167 | 68% | 1435 | 176 | 20 | 176 | 36 | 9 | 21 | 1.17 | 177 | 2 | 138 | 135 | 42 | 98 | 79 | 0 | 58 | 35 | 60% | 1197 | 128 | 11% | 561 | 177 | 32% | 3732 | 536 | 14% | 349 | 193 | 55% | 249 | 132 | 53% | 0 |
30 | WSN | 4.23 | 6216 | 52% | 1403 | 162 | 31 | 209 | 30 | 5 | 15 | 1.31 | 152 | 2 | 138 | 117 | 35 | 63 | 89 | 0 | 91 | 60 | 66% | 1169 | 115 | 10% | 601 | 171 | 28% | 3790 | 507 | 13% | 328 | 161 | 49% | 250 | 124 | 50% | 0 |
31 | LgAvg | 4.07 | 6131 | 55% | 1387 | 161 | 19 | 161 | 34 | 3 | 16 | 0 | 140 | 3 | 135 | 103 | 37 | 70 | 70 | 0 | 64 | 45 | 70% | 1123 | 120 | 11% | 561 | 175 | 31% | 3648 | 506 | 14% | 318 | 163 | 51% | 235 | 129 | 55% | 0 |
The data looks clean - for example, there are no repeated headers.
However, I do have one row that contains league average information (the final row). I will remove this row so that scatterplots are comparing data on a per-team basis only:
situational_batting_2014 <- subset(situational_batting_2014, Tm != "LgAvg")
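One side effect worth knowing about: `subset()` removes the row but keeps the unused factor level, which is why `Tm` is still reported with 31 levels in the `str()` output further down. The sketch below uses a small stand-in data frame (`toy` is a hypothetical name, with values copied from three rows of the table) to show how `droplevels()` could tidy this up, if desired.

```r
# Toy stand-in for the scraped table: two teams plus the league-average row.
toy <- data.frame(
  Tm = factor(c("ARI", "ATL", "LgAvg")),
  PA = factor(c("6089", "6064", "6131"))
)

teams_only <- subset(toy, Tm != "LgAvg")
nlevels(teams_only$Tm)    # 3: the unused "LgAvg" level is still present

# droplevels() removes unused levels from every factor column.
teams_only <- droplevels(teams_only)
nlevels(teams_only$Tm)    # 2
levels(teams_only$PA)     # "6064" "6089"
```

Leaving the extra level in place does no harm for the calculations that follow; dropping it mainly matters when a factor is later used for grouping or plotting.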
Now I have only data for the 30 teams in the MLB in 2014:
view(situational_batting_2014)
Row | Tm | R/G | PA | Ptn% | H | Inf | Bnt | AB | H.1 | HR | RBI | PHlev | All | GS | GSo | vRH | vLH | Hm | Rd | IP | Att | Suc | % | Opp | DP | %.1 | Opp.1 | Suc.1 | %.2 | BR | BRS | BRS% | <2,3B | Scr | %.3 | 0,2B | Adv | %.4 | PAu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ARI | 3.80 | 6089 | 50% | 1379 | 149 | 15 | 227 | 56 | 7 | 18 | 1.25 | 118 | 4 | 107 | 91 | 27 | 62 | 56 | 0 | 73 | 56 | 77% | 1113 | 115 | 10% | 586 | 187 | 32% | 3523 | 483 | 14% | 303 | 153 | 51% | 212 | 126 | 59% | 0 |
2 | ATL | 3.54 | 6064 | 44% | 1316 | 155 | 13 | 190 | 34 | 2 | 7 | 1.18 | 123 | 2 | 130 | 91 | 32 | 62 | 61 | 0 | 77 | 53 | 69% | 1061 | 121 | 11% | 563 | 155 | 28% | 3514 | 441 | 13% | 272 | 137 | 50% | 218 | 112 | 51% | 0 |
3 | BAL | 4.35 | 6130 | 46% | 1434 | 157 | 16 | 67 | 21 | 3 | 11 | 1.85 | 211 | 6 | 149 | 155 | 56 | 107 | 104 | 0 | 53 | 35 | 66% | 1108 | 112 | 10% | 504 | 143 | 28% | 3533 | 485 | 14% | 257 | 133 | 52% | 228 | 115 | 50% | 0 |
4 | BOS | 3.91 | 6226 | 50% | 1355 | 140 | 4 | 86 | 20 | 2 | 8 | 1.75 | 123 | 5 | 151 | 84 | 39 | 49 | 74 | 0 | 33 | 20 | 61% | 1227 | 138 | 11% | 563 | 165 | 29% | 3837 | 503 | 13% | 326 | 164 | 50% | 224 | 105 | 47% | 0 |
5 | CHC | 3.79 | 6102 | 57% | 1315 | 131 | 18 | 244 | 45 | 1 | 14 | 1.26 | 157 | 1 | 108 | 119 | 38 | 69 | 88 | 0 | 81 | 57 | 70% | 1076 | 94 | 9% | 568 | 169 | 30% | 3441 | 447 | 13% | 288 | 143 | 50% | 207 | 98 | 47% | 0 |
6 | CHW | 4.07 | 6077 | 51% | 1400 | 164 | 16 | 75 | 23 | 1 | 10 | 1.61 | 155 | 4 | 124 | 110 | 45 | 74 | 81 | 0 | 28 | 19 | 68% | 1096 | 127 | 12% | 528 | 141 | 27% | 3540 | 497 | 14% | 299 | 157 | 53% | 233 | 122 | 52% | 0 |
7 | CIN | 3.67 | 5978 | 52% | 1282 | 145 | 25 | 203 | 50 | 6 | 17 | 1.30 | 131 | 6 | 106 | 96 | 35 | 77 | 54 | 0 | 92 | 76 | 83% | 1003 | 88 | 9% | 551 | 186 | 34% | 3335 | 449 | 13% | 318 | 164 | 52% | 225 | 129 | 57% | 0 |
8 | CLE | 4.13 | 6222 | 74% | 1411 | 155 | 31 | 105 | 23 | 1 | 11 | 1.36 | 142 | 1 | 134 | 111 | 31 | 72 | 70 | 0 | 61 | 51 | 84% | 1117 | 126 | 11% | 542 | 180 | 33% | 3698 | 516 | 14% | 315 | 168 | 53% | 246 | 147 | 60% | 0 |
9 | COL | 4.66 | 6164 | 47% | 1551 | 199 | 20 | 233 | 61 | 5 | 30 | 1.19 | 186 | 4 | 121 | 126 | 60 | 119 | 67 | 2 | 78 | 59 | 76% | 1096 | 121 | 11% | 576 | 201 | 35% | 3608 | 556 | 15% | 334 | 189 | 57% | 268 | 145 | 54% | 0 |
10 | DET | 4.67 | 6202 | 45% | 1557 | 140 | 16 | 74 | 14 | 2 | 6 | 1.66 | 155 | 3 | 149 | 109 | 46 | 76 | 79 | 0 | 40 | 24 | 60% | 1181 | 137 | 12% | 598 | 212 | 35% | 3842 | 594 | 15% | 393 | 209 | 53% | 292 | 171 | 59% | 0 |
11 | HOU | 3.88 | 6055 | 61% | 1317 | 153 | 13 | 76 | 9 | 0 | 4 | 1.38 | 163 | 5 | 124 | 120 | 43 | 90 | 73 | 1 | 35 | 22 | 63% | 1118 | 122 | 11% | 506 | 141 | 28% | 3479 | 451 | 13% | 290 | 151 | 52% | 218 | 118 | 54% | 0 |
12 | KCR | 4.02 | 6058 | 48% | 1456 | 181 | 23 | 43 | 9 | 2 | 5 | 1.50 | 95 | 3 | 110 | 64 | 31 | 43 | 52 | 1 | 54 | 33 | 61% | 1119 | 131 | 12% | 560 | 186 | 33% | 3604 | 540 | 15% | 332 | 190 | 57% | 242 | 139 | 57% | 0 |
13 | LAA | 4.77 | 6285 | 52% | 1464 | 166 | 22 | 105 | 25 | 1 | 12 | 1.36 | 155 | 1 | 143 | 113 | 42 | 73 | 82 | 0 | 39 | 26 | 67% | 1175 | 112 | 10% | 569 | 189 | 33% | 3844 | 605 | 16% | 336 | 184 | 55% | 272 | 142 | 52% | 0 |
14 | LAD | 4.43 | 6231 | 46% | 1476 | 203 | 27 | 199 | 46 | 1 | 23 | 1.20 | 134 | 0 | 129 | 99 | 35 | 71 | 63 | 0 | 71 | 47 | 66% | 1128 | 119 | 11% | 615 | 187 | 30% | 3861 | 573 | 15% | 354 | 176 | 50% | 262 | 148 | 56% | 0 |
15 | MIA | 3.98 | 6185 | 46% | 1399 | 191 | 20 | 246 | 45 | 5 | 27 | 1.56 | 122 | 4 | 162 | 98 | 24 | 59 | 63 | 0 | 85 | 71 | 84% | 1186 | 143 | 12% | 602 | 181 | 30% | 3771 | 511 | 14% | 330 | 165 | 50% | 221 | 121 | 55% | 0 |
16 | MIL | 4.01 | 6065 | 40% | 1366 | 171 | 25 | 212 | 47 | 4 | 22 | 1.50 | 150 | 4 | 118 | 109 | 41 | 77 | 73 | 0 | 100 | 70 | 70% | 1026 | 135 | 13% | 558 | 178 | 32% | 3385 | 491 | 15% | 292 | 142 | 49% | 219 | 134 | 61% | 0 |
17 | MIN | 4.41 | 6233 | 60% | 1412 | 163 | 18 | 86 | 18 | 0 | 12 | 1.26 | 128 | 1 | 168 | 94 | 34 | 67 | 61 | 1 | 34 | 25 | 74% | 1177 | 97 | 8% | 553 | 159 | 29% | 3976 | 576 | 14% | 384 | 185 | 48% | 253 | 124 | 49% | 0 |
18 | NYM | 3.88 | 6145 | 52% | 1306 | 144 | 13 | 212 | 39 | 3 | 12 | 1.38 | 125 | 3 | 148 | 103 | 22 | 59 | 66 | 0 | 76 | 59 | 78% | 1129 | 112 | 10% | 582 | 180 | 31% | 3704 | 492 | 13% | 318 | 149 | 47% | 232 | 126 | 54% | 0 |
19 | NYY | 3.91 | 6082 | 70% | 1349 | 161 | 16 | 90 | 22 | 2 | 8 | 1.61 | 147 | 2 | 137 | 108 | 39 | 88 | 59 | 0 | 45 | 29 | 64% | 1085 | 111 | 10% | 539 | 186 | 35% | 3554 | 468 | 13% | 330 | 175 | 53% | 247 | 150 | 61% | 0 |
20 | OAK | 4.50 | 6245 | 72% | 1354 | 164 | 25 | 156 | 32 | 3 | 16 | 1.52 | 146 | 5 | 172 | 102 | 44 | 74 | 72 | 0 | 41 | 19 | 46% | 1207 | 118 | 10% | 551 | 159 | 29% | 3895 | 567 | 15% | 334 | 168 | 50% | 262 | 141 | 54% | 0 |
21 | PHI | 3.82 | 6198 | 65% | 1356 | 143 | 12 | 220 | 40 | 5 | 19 | 1.36 | 125 | 4 | 142 | 84 | 41 | 64 | 61 | 0 | 73 | 59 | 81% | 1103 | 94 | 9% | 563 | 174 | 31% | 3676 | 480 | 13% | 306 | 157 | 51% | 242 | 135 | 56% | 0 |
22 | PIT | 4.21 | 6224 | 48% | 1436 | 185 | 16 | 281 | 61 | 7 | 32 | 1.34 | 156 | 1 | 155 | 133 | 23 | 62 | 94 | 1 | 87 | 54 | 62% | 1166 | 127 | 11% | 584 | 174 | 30% | 3856 | 512 | 13% | 334 | 172 | 52% | 224 | 117 | 52% | 0 |
23 | SDP | 3.30 | 5905 | 67% | 1199 | 131 | 14 | 280 | 61 | 11 | 29 | 1.38 | 109 | 3 | 127 | 82 | 27 | 54 | 55 | 0 | 79 | 56 | 71% | 989 | 118 | 12% | 544 | 175 | 32% | 3310 | 411 | 12% | 308 | 141 | 46% | 222 | 118 | 53% | 0 |
24 | SEA | 3.91 | 5977 | 65% | 1328 | 162 | 20 | 82 | 16 | 1 | 7 | 1.72 | 136 | 0 | 102 | 108 | 28 | 73 | 63 | 0 | 56 | 35 | 63% | 1019 | 112 | 11% | 482 | 152 | 32% | 3292 | 484 | 15% | 284 | 151 | 53% | 177 | 98 | 55% | 0 |
25 | SFG | 4.10 | 6087 | 60% | 1407 | 142 | 12 | 207 | 46 | 4 | 29 | 1.43 | 132 | 4 | 145 | 92 | 40 | 53 | 79 | 0 | 58 | 45 | 78% | 1134 | 113 | 10% | 585 | 182 | 31% | 3681 | 522 | 14% | 326 | 152 | 47% | 243 | 129 | 53% | 0 |
26 | STL | 3.82 | 6086 | 51% | 1371 | 144 | 15 | 218 | 49 | 2 | 24 | 1.11 | 105 | 1 | 140 | 72 | 33 | 57 | 48 | 0 | 90 | 64 | 71% | 1201 | 140 | 12% | 583 | 181 | 31% | 3784 | 503 | 13% | 288 | 160 | 56% | 236 | 143 | 61% | 0 |
27 | TBR | 3.78 | 6205 | 54% | 1361 | 168 | 20 | 142 | 30 | 2 | 20 | 1.79 | 117 | 1 | 155 | 85 | 32 | 51 | 66 | 3 | 70 | 43 | 61% | 1211 | 135 | 11% | 562 | 183 | 33% | 3859 | 485 | 13% | 324 | 160 | 49% | 212 | 105 | 50% | 0 |
28 | TEX | 3.93 | 6026 | 49% | 1400 | 189 | 31 | 85 | 21 | 1 | 10 | 1.14 | 111 | 2 | 109 | 73 | 38 | 51 | 60 | 0 | 51 | 41 | 80% | 1061 | 148 | 14% | 554 | 209 | 38% | 3514 | 507 | 14% | 297 | 152 | 51% | 226 | 142 | 63% | 0 |
29 | TOR | 4.46 | 6167 | 68% | 1435 | 176 | 20 | 176 | 36 | 9 | 21 | 1.17 | 177 | 2 | 138 | 135 | 42 | 98 | 79 | 0 | 58 | 35 | 60% | 1197 | 128 | 11% | 561 | 177 | 32% | 3732 | 536 | 14% | 349 | 193 | 55% | 249 | 132 | 53% | 0 |
30 | WSN | 4.23 | 6216 | 52% | 1403 | 162 | 31 | 209 | 30 | 5 | 15 | 1.31 | 152 | 2 | 138 | 117 | 35 | 63 | 89 | 0 | 91 | 60 | 66% | 1169 | 115 | 10% | 601 | 171 | 28% | 3790 | 507 | 13% | 328 | 161 | 49% | 250 | 124 | 50% | 0 |
It’s also important to note what column headers contain the data I need.
By looking at my original data source I can see that the columns PA, H, Inf, and Bnt are overall batting statistics (regular and pinch-hit).
The columns that contain situational (pinch-hit) batting statistics are AB, H.1, HR, and RBI.
Only the first 11 columns contain data that I need for my project. I will trim the dataframe so that I keep only these columns:
situational_batting_2014 <- situational_batting_2014[,c(1:11)]
The result:
view(situational_batting_2014)
Row | Tm | R/G | PA | Ptn% | H | Inf | Bnt | AB | H.1 | HR | RBI |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ARI | 3.80 | 6089 | 50% | 1379 | 149 | 15 | 227 | 56 | 7 | 18 |
2 | ATL | 3.54 | 6064 | 44% | 1316 | 155 | 13 | 190 | 34 | 2 | 7 |
3 | BAL | 4.35 | 6130 | 46% | 1434 | 157 | 16 | 67 | 21 | 3 | 11 |
4 | BOS | 3.91 | 6226 | 50% | 1355 | 140 | 4 | 86 | 20 | 2 | 8 |
5 | CHC | 3.79 | 6102 | 57% | 1315 | 131 | 18 | 244 | 45 | 1 | 14 |
6 | CHW | 4.07 | 6077 | 51% | 1400 | 164 | 16 | 75 | 23 | 1 | 10 |
7 | CIN | 3.67 | 5978 | 52% | 1282 | 145 | 25 | 203 | 50 | 6 | 17 |
8 | CLE | 4.13 | 6222 | 74% | 1411 | 155 | 31 | 105 | 23 | 1 | 11 |
9 | COL | 4.66 | 6164 | 47% | 1551 | 199 | 20 | 233 | 61 | 5 | 30 |
10 | DET | 4.67 | 6202 | 45% | 1557 | 140 | 16 | 74 | 14 | 2 | 6 |
11 | HOU | 3.88 | 6055 | 61% | 1317 | 153 | 13 | 76 | 9 | 0 | 4 |
12 | KCR | 4.02 | 6058 | 48% | 1456 | 181 | 23 | 43 | 9 | 2 | 5 |
13 | LAA | 4.77 | 6285 | 52% | 1464 | 166 | 22 | 105 | 25 | 1 | 12 |
14 | LAD | 4.43 | 6231 | 46% | 1476 | 203 | 27 | 199 | 46 | 1 | 23 |
15 | MIA | 3.98 | 6185 | 46% | 1399 | 191 | 20 | 246 | 45 | 5 | 27 |
16 | MIL | 4.01 | 6065 | 40% | 1366 | 171 | 25 | 212 | 47 | 4 | 22 |
17 | MIN | 4.41 | 6233 | 60% | 1412 | 163 | 18 | 86 | 18 | 0 | 12 |
18 | NYM | 3.88 | 6145 | 52% | 1306 | 144 | 13 | 212 | 39 | 3 | 12 |
19 | NYY | 3.91 | 6082 | 70% | 1349 | 161 | 16 | 90 | 22 | 2 | 8 |
20 | OAK | 4.50 | 6245 | 72% | 1354 | 164 | 25 | 156 | 32 | 3 | 16 |
21 | PHI | 3.82 | 6198 | 65% | 1356 | 143 | 12 | 220 | 40 | 5 | 19 |
22 | PIT | 4.21 | 6224 | 48% | 1436 | 185 | 16 | 281 | 61 | 7 | 32 |
23 | SDP | 3.30 | 5905 | 67% | 1199 | 131 | 14 | 280 | 61 | 11 | 29 |
24 | SEA | 3.91 | 5977 | 65% | 1328 | 162 | 20 | 82 | 16 | 1 | 7 |
25 | SFG | 4.10 | 6087 | 60% | 1407 | 142 | 12 | 207 | 46 | 4 | 29 |
26 | STL | 3.82 | 6086 | 51% | 1371 | 144 | 15 | 218 | 49 | 2 | 24 |
27 | TBR | 3.78 | 6205 | 54% | 1361 | 168 | 20 | 142 | 30 | 2 | 20 |
28 | TEX | 3.93 | 6026 | 49% | 1400 | 189 | 31 | 85 | 21 | 1 | 10 |
29 | TOR | 4.46 | 6167 | 68% | 1435 | 176 | 20 | 176 | 36 | 9 | 21 |
30 | WSN | 4.23 | 6216 | 52% | 1403 | 162 | 31 | 209 | 30 | 5 | 15 |
I intend to analyze my data in R.
The scraped columns were all read into R as “factors” - that is, categorical variables whose values are stored as text labels rather than as numbers.
You can see this here:
str(situational_batting_2014)
## 'data.frame': 30 obs. of 11 variables:
## $ Tm : Factor w/ 31 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ R/G : Factor w/ 26 levels "3.30","3.54",..: 6 2 19 9 5 14 3 16 24 25 ...
## $ PA : Factor w/ 31 levels "5905","5977",..: 13 7 15 27 14 9 3 25 18 22 ...
## $ Ptn%: Factor w/ 21 levels "40%","44%","45%",..: 8 2 4 8 13 9 10 21 5 3 ...
## $ H : Factor w/ 30 levels "1199","1282",..: 15 5 23 10 4 18 2 21 29 30 ...
## $ Inf : Factor w/ 24 levels "131","140","142",..: 7 9 10 2 1 14 6 9 23 2 ...
## $ Bnt : Factor w/ 14 levels "12","13","14",..: 4 2 5 14 6 5 11 13 8 5 ...
## $ AB : Factor w/ 28 levels "105","142","156",..: 14 6 21 27 16 23 8 1 15 22 ...
## $ H.1 : Factor w/ 22 levels "14","16","18",..: 20 11 5 4 15 7 19 7 21 1 ...
## $ HR : Factor w/ 10 levels "0","1","11","2",..: 9 4 5 4 2 2 8 2 7 4 ...
## $ RBI : Factor w/ 23 levels "10","11","12",..: 8 22 2 23 4 1 7 2 17 21 ...
So, I will convert the columns that I intend to use in calculations to numeric data:
situational_batting_2014$nPA <- as.numeric(as.character(situational_batting_2014$PA))
situational_batting_2014$nH <- as.numeric(as.character(situational_batting_2014$H))
situational_batting_2014$nInf <- as.numeric(as.character(situational_batting_2014[["Inf"]]))
situational_batting_2014$nBnt <- as.numeric(as.character(situational_batting_2014$Bnt))
situational_batting_2014$nAB <- as.numeric(as.character(situational_batting_2014$AB))
situational_batting_2014$nH1 <- as.numeric(as.character(situational_batting_2014$H.1))
situational_batting_2014$nHR <- as.numeric(as.character(situational_batting_2014$HR))
situational_batting_2014$nRBI <- as.numeric(as.character(situational_batting_2014$RBI))
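The intermediate `as.character()` step matters. Calling `as.numeric()` directly on a factor returns the internal level codes, not the values shown in the table. A minimal sketch, using made-up stand-in vectors (`hr` and `toy` are hypothetical names), shows the difference and how the same conversion could be written as a loop:

```r
# A factor holding numbers as text labels, like the scraped HR column.
hr <- factor(c("7", "2", "11", "0"))

as.numeric(hr)                  # 4 3 2 1  (level codes, not the values)
as.numeric(as.character(hr))    # 7 2 11 0 (the actual values)

# The same conversion applied to several columns in a loop.
toy <- data.frame(PA = factor(c("6089", "6064")),
                  AB = factor(c("227", "190")))
for (col in c("PA", "AB")) {
  toy[[paste0("n", col)]] <- as.numeric(as.character(toy[[col]]))
}
str(toy)
```

The level codes come out as 4 3 2 1 because factor levels are sorted as text ("0", "11", "2", "7"), which is also why the `HR` levels in the `str()` output above are listed in that order.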
The new columns are numeric:
str(situational_batting_2014)
## 'data.frame': 30 obs. of 19 variables:
## $ Tm : Factor w/ 31 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ R/G : Factor w/ 26 levels "3.30","3.54",..: 6 2 19 9 5 14 3 16 24 25 ...
## $ PA : Factor w/ 31 levels "5905","5977",..: 13 7 15 27 14 9 3 25 18 22 ...
## $ Ptn%: Factor w/ 21 levels "40%","44%","45%",..: 8 2 4 8 13 9 10 21 5 3 ...
## $ H : Factor w/ 30 levels "1199","1282",..: 15 5 23 10 4 18 2 21 29 30 ...
## $ Inf : Factor w/ 24 levels "131","140","142",..: 7 9 10 2 1 14 6 9 23 2 ...
## $ Bnt : Factor w/ 14 levels "12","13","14",..: 4 2 5 14 6 5 11 13 8 5 ...
## $ AB : Factor w/ 28 levels "105","142","156",..: 14 6 21 27 16 23 8 1 15 22 ...
## $ H.1 : Factor w/ 22 levels "14","16","18",..: 20 11 5 4 15 7 19 7 21 1 ...
## $ HR : Factor w/ 10 levels "0","1","11","2",..: 9 4 5 4 2 2 8 2 7 4 ...
## $ RBI : Factor w/ 23 levels "10","11","12",..: 8 22 2 23 4 1 7 2 17 21 ...
## $ nPA : num 6089 6064 6130 6226 6102 ...
## $ nH : num 1379 1316 1434 1355 1315 ...
## $ nInf: num 149 155 157 140 131 164 145 155 199 140 ...
## $ nBnt: num 15 13 16 4 18 16 25 31 20 16 ...
## $ nAB : num 227 190 67 86 244 75 203 105 233 74 ...
## $ nH1 : num 56 34 21 20 45 23 50 23 61 14 ...
## $ nHR : num 7 2 3 2 1 1 6 1 5 2 ...
## $ nRBI: num 18 7 11 8 14 10 17 11 30 6 ...
Note that I placed the converted data into new columns. I will compare the converted data to the data in the original columns to be certain that no data values were changed unexpectedly during the conversion process from “factors” to “numeric” data.
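The comparison can also be done programmatically rather than by eye. The sketch below is one possible check, shown on a small stand-in data frame (`toy` and `conversion_ok` are hypothetical names). It assumes whole-number columns such as PA or AB; a decimal column such as R/G would need a different check, because "3.80" and 3.8 print differently.

```r
# Toy stand-in: an original factor column and its converted copy.
toy <- data.frame(PA = factor(c("6089", "6064", "6130")))
toy$nPA <- as.numeric(as.character(toy$PA))

# TRUE only if no value became NA during conversion and every converted
# value prints the same as its original text label.
conversion_ok <- !anyNA(toy$nPA) &&
  all(as.character(toy$PA) == as.character(toy$nPA))
conversion_ok   # TRUE
```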
Reviewing the converted data:
view(situational_batting_2014)
Row | Tm | R/G | PA | Ptn% | H | Inf | Bnt | AB | H.1 | HR | RBI | nPA | nH | nInf | nBnt | nAB | nH1 | nHR | nRBI |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ARI | 3.80 | 6089 | 50% | 1379 | 149 | 15 | 227 | 56 | 7 | 18 | 6089 | 1379 | 149 | 15 | 227 | 56 | 7 | 18 |
2 | ATL | 3.54 | 6064 | 44% | 1316 | 155 | 13 | 190 | 34 | 2 | 7 | 6064 | 1316 | 155 | 13 | 190 | 34 | 2 | 7 |
3 | BAL | 4.35 | 6130 | 46% | 1434 | 157 | 16 | 67 | 21 | 3 | 11 | 6130 | 1434 | 157 | 16 | 67 | 21 | 3 | 11 |
4 | BOS | 3.91 | 6226 | 50% | 1355 | 140 | 4 | 86 | 20 | 2 | 8 | 6226 | 1355 | 140 | 4 | 86 | 20 | 2 | 8 |
5 | CHC | 3.79 | 6102 | 57% | 1315 | 131 | 18 | 244 | 45 | 1 | 14 | 6102 | 1315 | 131 | 18 | 244 | 45 | 1 | 14 |
6 | CHW | 4.07 | 6077 | 51% | 1400 | 164 | 16 | 75 | 23 | 1 | 10 | 6077 | 1400 | 164 | 16 | 75 | 23 | 1 | 10 |
7 | CIN | 3.67 | 5978 | 52% | 1282 | 145 | 25 | 203 | 50 | 6 | 17 | 5978 | 1282 | 145 | 25 | 203 | 50 | 6 | 17 |
8 | CLE | 4.13 | 6222 | 74% | 1411 | 155 | 31 | 105 | 23 | 1 | 11 | 6222 | 1411 | 155 | 31 | 105 | 23 | 1 | 11 |
9 | COL | 4.66 | 6164 | 47% | 1551 | 199 | 20 | 233 | 61 | 5 | 30 | 6164 | 1551 | 199 | 20 | 233 | 61 | 5 | 30 |
10 | DET | 4.67 | 6202 | 45% | 1557 | 140 | 16 | 74 | 14 | 2 | 6 | 6202 | 1557 | 140 | 16 | 74 | 14 | 2 | 6 |
11 | HOU | 3.88 | 6055 | 61% | 1317 | 153 | 13 | 76 | 9 | 0 | 4 | 6055 | 1317 | 153 | 13 | 76 | 9 | 0 | 4 |
12 | KCR | 4.02 | 6058 | 48% | 1456 | 181 | 23 | 43 | 9 | 2 | 5 | 6058 | 1456 | 181 | 23 | 43 | 9 | 2 | 5 |
13 | LAA | 4.77 | 6285 | 52% | 1464 | 166 | 22 | 105 | 25 | 1 | 12 | 6285 | 1464 | 166 | 22 | 105 | 25 | 1 | 12 |
14 | LAD | 4.43 | 6231 | 46% | 1476 | 203 | 27 | 199 | 46 | 1 | 23 | 6231 | 1476 | 203 | 27 | 199 | 46 | 1 | 23 |
15 | MIA | 3.98 | 6185 | 46% | 1399 | 191 | 20 | 246 | 45 | 5 | 27 | 6185 | 1399 | 191 | 20 | 246 | 45 | 5 | 27 |
16 | MIL | 4.01 | 6065 | 40% | 1366 | 171 | 25 | 212 | 47 | 4 | 22 | 6065 | 1366 | 171 | 25 | 212 | 47 | 4 | 22 |
17 | MIN | 4.41 | 6233 | 60% | 1412 | 163 | 18 | 86 | 18 | 0 | 12 | 6233 | 1412 | 163 | 18 | 86 | 18 | 0 | 12 |
18 | NYM | 3.88 | 6145 | 52% | 1306 | 144 | 13 | 212 | 39 | 3 | 12 | 6145 | 1306 | 144 | 13 | 212 | 39 | 3 | 12 |
19 | NYY | 3.91 | 6082 | 70% | 1349 | 161 | 16 | 90 | 22 | 2 | 8 | 6082 | 1349 | 161 | 16 | 90 | 22 | 2 | 8 |
20 | OAK | 4.50 | 6245 | 72% | 1354 | 164 | 25 | 156 | 32 | 3 | 16 | 6245 | 1354 | 164 | 25 | 156 | 32 | 3 | 16 |
21 | PHI | 3.82 | 6198 | 65% | 1356 | 143 | 12 | 220 | 40 | 5 | 19 | 6198 | 1356 | 143 | 12 | 220 | 40 | 5 | 19 |
22 | PIT | 4.21 | 6224 | 48% | 1436 | 185 | 16 | 281 | 61 | 7 | 32 | 6224 | 1436 | 185 | 16 | 281 | 61 | 7 | 32 |
23 | SDP | 3.30 | 5905 | 67% | 1199 | 131 | 14 | 280 | 61 | 11 | 29 | 5905 | 1199 | 131 | 14 | 280 | 61 | 11 | 29 |
24 | SEA | 3.91 | 5977 | 65% | 1328 | 162 | 20 | 82 | 16 | 1 | 7 | 5977 | 1328 | 162 | 20 | 82 | 16 | 1 | 7 |
25 | SFG | 4.10 | 6087 | 60% | 1407 | 142 | 12 | 207 | 46 | 4 | 29 | 6087 | 1407 | 142 | 12 | 207 | 46 | 4 | 29 |
26 | STL | 3.82 | 6086 | 51% | 1371 | 144 | 15 | 218 | 49 | 2 | 24 | 6086 | 1371 | 144 | 15 | 218 | 49 | 2 | 24 |
27 | TBR | 3.78 | 6205 | 54% | 1361 | 168 | 20 | 142 | 30 | 2 | 20 | 6205 | 1361 | 168 | 20 | 142 | 30 | 2 | 20 |
28 | TEX | 3.93 | 6026 | 49% | 1400 | 189 | 31 | 85 | 21 | 1 | 10 | 6026 | 1400 | 189 | 31 | 85 | 21 | 1 | 10 |
29 | TOR | 4.46 | 6167 | 68% | 1435 | 176 | 20 | 176 | 36 | 9 | 21 | 6167 | 1435 | 176 | 20 | 176 | 36 | 9 | 21 |
30 | WSN | 4.23 | 6216 | 52% | 1403 | 162 | 31 | 209 | 30 | 5 | 15 | 6216 | 1403 | 162 | 31 | 209 | 30 | 5 | 15 |
The data appears to have been converted successfully (that is, no values have obviously been changed to something incorrect).
My data source did not include batting averages for regular vs. pinch-hitting situations.
In this section, I will use the data provided to directly calculate this information.
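First, I will determine how many at-bats occurred in “regular” batting situations.
To do this, I will subtract pinch-hit at-bat totals from overall plate appearances. (The source table reports overall plate appearances rather than overall at-bats, so this total also counts plate appearances, such as walks, that are not official at-bats.)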
situational_batting_2014$nRegularAB <- situational_batting_2014$nPA - situational_batting_2014$nAB
Next, I will determine batting averages for “regular” situations by dividing overall hits by the “regular” at-bat total:
situational_batting_2014$nRegularAVG <- situational_batting_2014$nH / situational_batting_2014$nRegularAB
Finally, I will determine batting averages for “pinch-hit” situations:
situational_batting_2014$nPinchHitAVG <- situational_batting_2014$nH1 / situational_batting_2014$nAB
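Since the H column counts all hits (regular and pinch-hit), the “regular” average as computed above still includes pinch hits in its numerator. The sketch below works through Arizona's 2014 totals to compare that calculation with a hypothetical variant that also removes pinch hits from the numerator. The variant is an illustration only; it is not the calculation used for the tables in this report.

```r
# Arizona's 2014 totals from the table above.
overall_pa   <- 6089   # overall plate appearances (PA)
overall_hits <- 1379   # overall hits (H)
pinch_ab     <- 227    # pinch-hit at-bats (AB)
pinch_hits   <- 56     # pinch-hit hits (H.1)

regular_ab <- overall_pa - pinch_ab   # 5862, matching nRegularAB

# As computed above: overall hits over the "regular" denominator.
avg_as_computed <- overall_hits / regular_ab

# Hypothetical variant: remove pinch hits from the numerator as well.
avg_hits_removed <- (overall_hits - pinch_hits) / regular_ab

round(c(as_computed = avg_as_computed, hits_removed = avg_hits_removed), 4)
# as_computed hits_removed
#      0.2352       0.2257
```

For Arizona the two versions differ by about ten points of batting average, so the choice could matter when the “regular” and pinch-hit averages are close.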
One of my possible extensions is to compare data by year.
Therefore, before I continue, I will add a column to the dataframe, identifying what year this data is from.
This also makes it possible to generate box-and-whisker plots (as those graphs require a “factor” that is used to group the data being plotted).
situational_batting_2014$Year <- "2014"
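To illustrate how the year column can act as the grouping variable, here is a minimal sketch on a three-row stand-in data frame (`toy` is a hypothetical name; the averages are the ARI, ATL, and BAL pinch-hit values). It stores the year explicitly as a factor and asks `boxplot()` for its summary statistics without drawing anything.

```r
# Toy stand-in with three pinch-hit averages.
toy <- data.frame(nPinchHitAVG = c(0.2467, 0.1789, 0.3134))

# Store the year explicitly as a factor so it can group the data.
toy$Year <- factor(rep("2014", nrow(toy)))

# plot = FALSE returns the box-and-whisker statistics instead of drawing.
box_stats <- boxplot(nPinchHitAVG ~ Year, data = toy, plot = FALSE)
box_stats$names   # "2014"
box_stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
```

With data from more than one season stacked in the same data frame, the same call would produce one box per year.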
Here is what the data now looks like, after calculating the batting averages (see last section) and tagging with the year that the data is from:
view(situational_batting_2014)
Row | Tm | R/G | PA | Ptn% | H | Inf | Bnt | AB | H.1 | HR | RBI | nPA | nH | nInf | nBnt | nAB | nH1 | nHR | nRBI | nRegularAB | nRegularAVG | nPinchHitAVG | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ARI | 3.80 | 6089 | 50% | 1379 | 149 | 15 | 227 | 56 | 7 | 18 | 6089 | 1379 | 149 | 15 | 227 | 56 | 7 | 18 | 5862 | 0.2352 | 0.2467 | 2014 |
2 | ATL | 3.54 | 6064 | 44% | 1316 | 155 | 13 | 190 | 34 | 2 | 7 | 6064 | 1316 | 155 | 13 | 190 | 34 | 2 | 7 | 5874 | 0.2240 | 0.1789 | 2014 |
3 | BAL | 4.35 | 6130 | 46% | 1434 | 157 | 16 | 67 | 21 | 3 | 11 | 6130 | 1434 | 157 | 16 | 67 | 21 | 3 | 11 | 6063 | 0.2365 | 0.3134 | 2014 |
4 | BOS | 3.91 | 6226 | 50% | 1355 | 140 | 4 | 86 | 20 | 2 | 8 | 6226 | 1355 | 140 | 4 | 86 | 20 | 2 | 8 | 6140 | 0.2207 | 0.2326 | 2014 |
5 | CHC | 3.79 | 6102 | 57% | 1315 | 131 | 18 | 244 | 45 | 1 | 14 | 6102 | 1315 | 131 | 18 | 244 | 45 | 1 | 14 | 5858 | 0.2245 | 0.1844 | 2014 |
6 | CHW | 4.07 | 6077 | 51% | 1400 | 164 | 16 | 75 | 23 | 1 | 10 | 6077 | 1400 | 164 | 16 | 75 | 23 | 1 | 10 | 6002 | 0.2333 | 0.3067 | 2014 |
7 | CIN | 3.67 | 5978 | 52% | 1282 | 145 | 25 | 203 | 50 | 6 | 17 | 5978 | 1282 | 145 | 25 | 203 | 50 | 6 | 17 | 5775 | 0.2220 | 0.2463 | 2014 |
8 | CLE | 4.13 | 6222 | 74% | 1411 | 155 | 31 | 105 | 23 | 1 | 11 | 6222 | 1411 | 155 | 31 | 105 | 23 | 1 | 11 | 6117 | 0.2307 | 0.2190 | 2014 |
9 | COL | 4.66 | 6164 | 47% | 1551 | 199 | 20 | 233 | 61 | 5 | 30 | 6164 | 1551 | 199 | 20 | 233 | 61 | 5 | 30 | 5931 | 0.2615 | 0.2618 | 2014 |
10 | DET | 4.67 | 6202 | 45% | 1557 | 140 | 16 | 74 | 14 | 2 | 6 | 6202 | 1557 | 140 | 16 | 74 | 14 | 2 | 6 | 6128 | 0.2541 | 0.1892 | 2014 |
11 | HOU | 3.88 | 6055 | 61% | 1317 | 153 | 13 | 76 | 9 | 0 | 4 | 6055 | 1317 | 153 | 13 | 76 | 9 | 0 | 4 | 5979 | 0.2203 | 0.1184 | 2014 |
12 | KCR | 4.02 | 6058 | 48% | 1456 | 181 | 23 | 43 | 9 | 2 | 5 | 6058 | 1456 | 181 | 23 | 43 | 9 | 2 | 5 | 6015 | 0.2421 | 0.2093 | 2014 |
13 | LAA | 4.77 | 6285 | 52% | 1464 | 166 | 22 | 105 | 25 | 1 | 12 | 6285 | 1464 | 166 | 22 | 105 | 25 | 1 | 12 | 6180 | 0.2369 | 0.2381 | 2014 |
14 | LAD | 4.43 | 6231 | 46% | 1476 | 203 | 27 | 199 | 46 | 1 | 23 | 6231 | 1476 | 203 | 27 | 199 | 46 | 1 | 23 | 6032 | 0.2447 | 0.2312 | 2014 |
15 | MIA | 3.98 | 6185 | 46% | 1399 | 191 | 20 | 246 | 45 | 5 | 27 | 6185 | 1399 | 191 | 20 | 246 | 45 | 5 | 27 | 5939 | 0.2356 | 0.1829 | 2014 |
16 | MIL | 4.01 | 6065 | 40% | 1366 | 171 | 25 | 212 | 47 | 4 | 22 | 6065 | 1366 | 171 | 25 | 212 | 47 | 4 | 22 | 5853 | 0.2334 | 0.2217 | 2014 |
17 | MIN | 4.41 | 6233 | 60% | 1412 | 163 | 18 | 86 | 18 | 0 | 12 | 6233 | 1412 | 163 | 18 | 86 | 18 | 0 | 12 | 6147 | 0.2297 | 0.2093 | 2014 |
18 | NYM | 3.88 | 6145 | 52% | 1306 | 144 | 13 | 212 | 39 | 3 | 12 | 6145 | 1306 | 144 | 13 | 212 | 39 | 3 | 12 | 5933 | 0.2201 | 0.1840 | 2014 |
19 | NYY | 3.91 | 6082 | 70% | 1349 | 161 | 16 | 90 | 22 | 2 | 8 | 6082 | 1349 | 161 | 16 | 90 | 22 | 2 | 8 | 5992 | 0.2251 | 0.2444 | 2014 |
20 | OAK | 4.50 | 6245 | 72% | 1354 | 164 | 25 | 156 | 32 | 3 | 16 | 6245 | 1354 | 164 | 25 | 156 | 32 | 3 | 16 | 6089 | 0.2224 | 0.2051 | 2014 |
21 | PHI | 3.82 | 6198 | 65% | 1356 | 143 | 12 | 220 | 40 | 5 | 19 | 6198 | 1356 | 143 | 12 | 220 | 40 | 5 | 19 | 5978 | 0.2268 | 0.1818 | 2014 |
22 | PIT | 4.21 | 6224 | 48% | 1436 | 185 | 16 | 281 | 61 | 7 | 32 | 6224 | 1436 | 185 | 16 | 281 | 61 | 7 | 32 | 5943 | 0.2416 | 0.2171 | 2014 |
23 | SDP | 3.30 | 5905 | 67% | 1199 | 131 | 14 | 280 | 61 | 11 | 29 | 5905 | 1199 | 131 | 14 | 280 | 61 | 11 | 29 | 5625 | 0.2132 | 0.2179 | 2014 |
24 | SEA | 3.91 | 5977 | 65% | 1328 | 162 | 20 | 82 | 16 | 1 | 7 | 5977 | 1328 | 162 | 20 | 82 | 16 | 1 | 7 | 5895 | 0.2253 | 0.1951 | 2014 |
25 | SFG | 4.10 | 6087 | 60% | 1407 | 142 | 12 | 207 | 46 | 4 | 29 | 6087 | 1407 | 142 | 12 | 207 | 46 | 4 | 29 | 5880 | 0.2393 | 0.2222 | 2014 |
26 | STL | 3.82 | 6086 | 51% | 1371 | 144 | 15 | 218 | 49 | 2 | 24 | 6086 | 1371 | 144 | 15 | 218 | 49 | 2 | 24 | 5868 | 0.2336 | 0.2248 | 2014 |
27 | TBR | 3.78 | 6205 | 54% | 1361 | 168 | 20 | 142 | 30 | 2 | 20 | 6205 | 1361 | 168 | 20 | 142 | 30 | 2 | 20 | 6063 | 0.2245 | 0.2113 | 2014 |
28 | TEX | 3.93 | 6026 | 49% | 1400 | 189 | 31 | 85 | 21 | 1 | 10 | 6026 | 1400 | 189 | 31 | 85 | 21 | 1 | 10 | 5941 | 0.2357 | 0.2471 | 2014 |
29 | TOR | 4.46 | 6167 | 68% | 1435 | 176 | 20 | 176 | 36 | 9 | 21 | 6167 | 1435 | 176 | 20 | 176 | 36 | 9 | 21 | 5991 | 0.2395 | 0.2045 | 2014 |
30 | WSN | 4.23 | 6216 | 52% | 1403 | 162 | 31 | 209 | 30 | 5 | 15 | 6216 | 1403 | 162 | 31 | 209 | 30 | 5 | 15 | 6007 | 0.2336 | 0.1435 | 2014 |
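The three derived columns in the table can be checked by hand for a single row. The sketch below redoes the arithmetic for Arizona (ARI) using the counts shown above. The variable names are my own, chosen for illustration, and the column meanings (PA as plate appearances, H as all team hits, AB and H.1 as pinch-hit at-bats and pinch-hit hits) are my reading of the source table.

```r
# Counts for ARI, copied from the 2014 table
pa         <- 6089  # PA
hits       <- 1379  # H
pinch_ab   <- 227   # AB (pinch-hit at-bats)
pinch_hits <- 56    # H.1 (pinch-hit hits)

regular_ab  <- pa - pinch_ab           # nRegularAB
regular_avg <- hits / regular_ab       # nRegularAVG
pinch_avg   <- pinch_hits / pinch_ab   # nPinchHitAVG

print(regular_ab)             # 5862
print(round(regular_avg, 4))  # 0.2352
print(round(pinch_avg, 4))    # 0.2467
```

The results agree with the nRegularAB, nRegularAVG and nPinchHitAVG values in the ARI row.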
To analyze single-variable data, I remember from studies earlier this year that I must comment on the shape, centre, and spread of the data.
To illustrate the shape of the data, I will use a histogram:
h_pinch_hit <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014")
## Loading required package: ggplot2
h_regular <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014")
multiplot(h_pinch_hit, h_regular, cols=1)
## Loading required package: grid
The default binwidth for a histogram is 1 unit, which does not make sense for a batting average. I will specify the binwidth as 0.01:
h_pinch_hit <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014", binwidth = 0.01)
h_regular <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014", binwidth = 0.01)
multiplot(h_pinch_hit, h_regular, cols=1)
I have noticed that the horizontal scale for each plot is different. This makes comparison of the two plots difficult. I will generate the graphs again, this time specifying the minimum and maximum values on the horizontal scale. I will also specify the vertical scale min and max values, so that the scale does not show counts with decimal values (which is not very helpful, as counts are discrete values):
h_pinch_hit <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014", binwidth = 0.01, xmin = 0.100, xmax = 0.325, ymin = 0, ymax = 15)
h_regular <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014", binwidth = 0.01, xmin = 0.100, xmax = 0.325, ymin = 0, ymax = 15)
multiplot(h_pinch_hit, h_regular, cols=1)
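The histogram() and multiplot() helpers come from the functions file loaded at the start, so their internals are not shown here. As a rough base-R sketch of the same idea (shared bin edges and shared axis limits for both plots), assuming the 30 team averages are typed in from the 2014 table, one could write:

```r
# Team batting averages copied from the 2014 table (nPinchHitAVG, nRegularAVG)
pinch <- c(0.2467, 0.1789, 0.3134, 0.2326, 0.1844, 0.3067, 0.2463, 0.2190,
           0.2618, 0.1892, 0.1184, 0.2093, 0.2381, 0.2312, 0.1829, 0.2217,
           0.2093, 0.1840, 0.2444, 0.2051, 0.1818, 0.2171, 0.2179, 0.1951,
           0.2222, 0.2248, 0.2113, 0.2471, 0.2045, 0.1435)
regular <- c(0.2352, 0.2240, 0.2365, 0.2207, 0.2245, 0.2333, 0.2220, 0.2307,
             0.2615, 0.2541, 0.2203, 0.2421, 0.2369, 0.2447, 0.2356, 0.2334,
             0.2297, 0.2201, 0.2251, 0.2224, 0.2268, 0.2416, 0.2132, 0.2253,
             0.2393, 0.2336, 0.2245, 0.2357, 0.2395, 0.2336)

# One set of bin edges (width 0.01) shared by both histograms
bin_edges <- seq(0.10, 0.33, by = 0.01)
h_pinch_base   <- hist(pinch,   breaks = bin_edges, plot = FALSE)
h_regular_base <- hist(regular, breaks = bin_edges, plot = FALSE)

# One vertical limit shared by both panels
y_top <- max(h_pinch_base$counts, h_regular_base$counts)

# Stack the two panels, then restore the previous graphics settings
old_par <- par(mfrow = c(2, 1))
plot(h_pinch_base, xlim = range(bin_edges), ylim = c(0, y_top),
     main = "Pinch Hitting, 2014", xlab = "Batting average")
plot(h_regular_base, xlim = range(bin_edges), ylim = c(0, y_top),
     main = "Regular Hitting, 2014", xlab = "Batting average")
par(old_par)
```

Fixing the bin edges up front, rather than only the axis limits, means that a bar at a given position covers the same interval in both panels.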
Now I can see that the shape for batting average for “regular” at-bats is somewhat normal, with a very small spread (the range from minimum to maximum values is not large).
There is a much larger spread of values for “pinch-hitting” batting averages.
What this tells me is that batting averages are much more consistent for “regular” at-bats, whereas batting averages for “pinch-hit” at-bats are considerably more varied. Some pinch-hit batting averages are very good (greater than .300), some are very poor (less than .150).
Here is the five-number summary for “pinch-hit” at bats:
five_number(dataframe = situational_batting_2014, variable = "nPinchHitAVG")
## [1] "Five number summary"
## [1] "==================="
## [1] "min : 0.118"
## [1] "Q1 : 0.189"
## [1] "median: 0.217"
## [1] "Q3 : 0.238"
## [1] "max : 0.313"
And the mean for “pinch-hit” at bats:
mean(situational_batting_2014$nPinchHitAVG)
## [1] 0.2163
Here is the five-number summary for “regular” at bats:
five_number(dataframe = situational_batting_2014, variable = "nRegularAVG")
## [1] "Five number summary"
## [1] "==================="
## [1] "min : 0.213"
## [1] "Q1 : 0.224"
## [1] "median: 0.233"
## [1] "Q3 : 0.237"
## [1] "max : 0.262"
And the mean for “regular” at bats:
mean(situational_batting_2014$nRegularAVG)
## [1] 0.2322
What jumps out at me from these numerical summaries is that both measures of the centre (mean and median) are higher for “regular” at-bats than for “pinch-hit” at-bats.
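The five_number() helper is defined in the functions file loaded at the start. For readers without that file, a minimal base-R sketch using fivenum() is shown here. It assumes the pinch-hit averages are typed in from the table (rounded to four decimal places), so the results should land close to, but not necessarily exactly on, the values reported above.

```r
# Pinch-hit batting averages copied from the 2014 table (nPinchHitAVG)
pinch <- c(0.2467, 0.1789, 0.3134, 0.2326, 0.1844, 0.3067, 0.2463, 0.2190,
           0.2618, 0.1892, 0.1184, 0.2093, 0.2381, 0.2312, 0.1829, 0.2217,
           0.2093, 0.1840, 0.2444, 0.2051, 0.1818, 0.2171, 0.2179, 0.1951,
           0.2222, 0.2248, 0.2113, 0.2471, 0.2045, 0.1435)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
pinch_summary <- fivenum(pinch)
names(pinch_summary) <- c("min", "lower hinge", "median", "upper hinge", "max")

print(pinch_summary)
print(mean(pinch))
```

Note that fivenum() reports hinges, which can differ slightly from the quartiles returned by quantile(); whether five_number() uses one or the other is not something I can confirm from the text.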
Since I have data for 30 teams in 2014, it is appropriate to use a box-and-whisker plot. This is a good graphical summary because it visually illustrates the five-number summary.
The variable I am comparing is batting average, split by a categorical variable with two values: “pinch-hit” at-bats and “regular” at-bats.
The factor I am using is the year the batting averages are from.
I am being careful to use the same horizontal scale for each box-and-whisker plot so that I can make comparisons.
bw_pinch_hit <- box_and_whisker(dataframe = situational_batting_2014, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2014", min = 0.100, max = 0.325)
bw_regular <- box_and_whisker(dataframe = situational_batting_2014, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2014", min = 0.100, max = 0.325)
multiplot(bw_pinch_hit, bw_regular, cols=1)
Here is what I notice about the spread of the data:
I will calculate the standard deviation for pinch hitting vs. regular hitting batting averages. The box-and-whisker plots suggest that the standard deviation for pinch hitting will be considerably larger.
Standard deviation for pinch hitting:
sd(situational_batting_2014$nPinchHitAVG)
## [1] 0.04006
Standard deviation for regular hitting situations:
sd(situational_batting_2014$nRegularAVG)
## [1] 0.01045
The standard deviation for pinch-hitting situations is nearly four times the standard deviation for regular hitting situations.
This adds further weight to my earlier observation that pinch-hitting batting averages have a much greater range. One might say that pinch-hitting batting averages are considerably more volatile, or less predictable, than regular hitting batting averages.
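One plausible contributor to the larger spread (my own addition, not part of the original analysis) is sample size: each team's pinch-hit average rests on a few hundred at-bats at most, while its regular average rests on several thousand. The sketch below uses the standard error of a proportion, sqrt(p(1 - p)/n), with round illustrative numbers (p = 0.22, n = 150 and n = 6000) that I picked to be in the neighbourhood of the table, not taken from it.

```r
# Standard error of a proportion p estimated from n independent trials
proportion_se <- function(p, n) {
  sqrt(p * (1 - p) / n)
}

p_hit <- 0.22  # illustrative chance of a hit (assumption)

se_pinch   <- proportion_se(p_hit, n = 150)   # illustrative pinch-hit sample size
se_regular <- proportion_se(p_hit, n = 6000)  # illustrative regular sample size

print(round(se_pinch, 4))               # 0.0338
print(round(se_regular, 4))             # 0.0053
print(round(se_pinch / se_regular, 1))  # 6.3
```

Under these assumptions, chance alone would make pinch-hit averages vary several times more from team to team than regular averages, so some of the extra volatility may reflect small samples and not a real difference between teams.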
The graphical and numerical summaries illustrate that in 2014 in Major League Baseball, “pinch hitting”, as measured by the batting average statistic, was considerably less successful than hitting in “regular” situations (a mean of .216 compared to .232).
As observed earlier, the mean and median batting averages for pinch-hit at-bats were both lower than the mean and median batting averages for regular batting situations.
Further, the spread of the batting averages for pinch hitting situations was considerably larger. In other words, the results are less predictable. Some teams do very well with pinch hitting batting averages. Some teams do poorly. On the whole, pinch hitting batting averages are far more volatile than regular situation batting averages.
Neither of my original hypotheses is supported by the data from 2014. In fact, it seems that pinch-hitters have a lesser ability to get a hit as compared to players who are not in pinch-hit situations (typically, starting players).
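The original analysis stops at graphical and numerical summaries. As a hedged sketch of how the stated hypotheses could be put to a formal test, the code below runs a one-sided paired t-test on the 30 team-level averages from the 2014 table. Treating each team's pair of averages as one observation, and choosing a paired t-test at all, are simplifying assumptions of mine.

```r
# Team batting averages copied from the 2014 table, in the same team order
pinch <- c(0.2467, 0.1789, 0.3134, 0.2326, 0.1844, 0.3067, 0.2463, 0.2190,
           0.2618, 0.1892, 0.1184, 0.2093, 0.2381, 0.2312, 0.1829, 0.2217,
           0.2093, 0.1840, 0.2444, 0.2051, 0.1818, 0.2171, 0.2179, 0.1951,
           0.2222, 0.2248, 0.2113, 0.2471, 0.2045, 0.1435)
regular <- c(0.2352, 0.2240, 0.2365, 0.2207, 0.2245, 0.2333, 0.2220, 0.2307,
             0.2615, 0.2541, 0.2203, 0.2421, 0.2369, 0.2447, 0.2356, 0.2334,
             0.2297, 0.2201, 0.2251, 0.2224, 0.2268, 0.2416, 0.2132, 0.2253,
             0.2393, 0.2336, 0.2245, 0.2357, 0.2395, 0.2336)

# Ha: pinch-hit averages are greater than regular averages (one-sided)
test_result <- t.test(pinch, regular, paired = TRUE, alternative = "greater")

print(mean(pinch - regular))  # average team-level difference
print(test_result)
```

Because the average difference is negative, a test of "pinch greater than regular" cannot come out in favour of Ha here, which is consistent with the summaries above.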
I have only examined a single season’s worth of data. That is only 30 observations. Perhaps 2014 was an unusual year, and pinch-hitting was not a very successful strategy in 2014, but it may in fact be a very successful strategy in other years.
I would like to extend my analysis to collect data from additional years, and then examine the results.
I would also like to compare results by league (American League vs. National League) as I suspect that pinch hitters would be more successful in the National League (when the starting player being replaced is often the pitcher, who is usually not a good hitter).
However, I do not think it is useful to make a comparison between NL and AL batting averages for a single year, as that would mean each subset would have just 15 data values.
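Once several seasons are combined, each row would need a league label before such a comparison could be made. A minimal sketch of that tagging step follows, assuming the 2014 league alignment listed in the code (typed in by me, not scraped). With only 15 teams per league, the single-season means it prints are there purely to show the mechanics.

```r
# Team codes and pinch-hit averages copied from the 2014 table
team <- c("ARI", "ATL", "BAL", "BOS", "CHC", "CHW", "CIN", "CLE", "COL", "DET",
          "HOU", "KCR", "LAA", "LAD", "MIA", "MIL", "MIN", "NYM", "NYY", "OAK",
          "PHI", "PIT", "SDP", "SEA", "SFG", "STL", "TBR", "TEX", "TOR", "WSN")
pinch <- c(0.2467, 0.1789, 0.3134, 0.2326, 0.1844, 0.3067, 0.2463, 0.2190,
           0.2618, 0.1892, 0.1184, 0.2093, 0.2381, 0.2312, 0.1829, 0.2217,
           0.2093, 0.1840, 0.2444, 0.2051, 0.1818, 0.2171, 0.2179, 0.1951,
           0.2222, 0.2248, 0.2113, 0.2471, 0.2045, 0.1435)

# American League teams in 2014 (assumption: entered by hand)
al_teams <- c("BAL", "BOS", "CHW", "CLE", "DET", "HOU", "KCR", "LAA",
              "MIN", "NYY", "OAK", "SEA", "TBR", "TEX", "TOR")

# Every team not in the AL list is labelled NL
league <- ifelse(team %in% al_teams, "AL", "NL")

print(table(league))
print(tapply(pinch, league, mean))
```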
This is an addendum for any students who are doing a two-variable analysis.
There was no need, for the question I explored in the exemplar, to do a two-variable analysis.
However, R does have the ability to produce scatterplots.
Here is a basic example - from the data set used in my exemplar - plotting RBI vs. Home Runs in pinch-hit situations:
scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI")
Here is the same scatterplot, with a linear regression applied:
scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI", regression = TRUE)
You will notice that besides the best fit line, there is a shaded region.
This shaded region is called a 95% confidence band.
It shows the region that you can be 95% sure contains the true best-fit line. With more data points (a larger sample to work with) the 95% confidence band will typically be narrower and closer to the best-fit line. In other words, as usual, with more data, we have more confidence that our predictions (which are based on a sample) will be accurate.
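To see the numbers behind such a band, here is a base-R sketch that fits the same kind of line with lm() and asks predict() for a 95% confidence interval at each home-run count. The pinch-hit HR and RBI values are typed in from the 2014 table. Whether the scatterplot() helper computes its band in exactly this way is an assumption on my part.

```r
# Pinch-hit home runs and RBI copied from the 2014 table, in team order
hr  <- c(7, 2, 3, 2, 1, 1, 6, 1, 5, 2, 0, 2, 1, 1, 5,
         4, 0, 3, 2, 3, 5, 7, 11, 1, 4, 2, 2, 1, 9, 5)
rbi <- c(18, 7, 11, 8, 14, 10, 17, 11, 30, 6, 4, 5, 12, 23, 27,
         22, 12, 12, 8, 16, 19, 32, 29, 7, 29, 24, 20, 10, 21, 15)

# Least-squares line: RBI as a function of home runs
fit <- lm(rbi ~ hr)

# 95% confidence interval for the fitted line at each value of hr
hr_grid <- data.frame(hr = 0:11)
band <- predict(fit, newdata = hr_grid, interval = "confidence", level = 0.95)

# Columns: fit (line), lwr and upr (edges of the band)
print(cbind(hr_grid, round(band, 1)))
```

The band is narrowest near the average number of home runs and widens toward the extremes, where fewer teams anchor the line.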
Finally, here is a scatterplot with a title and labelled axes:
scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI", regression = TRUE, x_label = "Home Runs", y_label = "RBI", title = "RBI vs. Home Runs (Pinch-Hitting, 2014)")
In this part of my project, I will use the work I have already done to analyze data for a single season to extend my analysis and examine multiple seasons.
Most of the hard work is done. I know how to obtain and normalize the data for a single season.
I will use a function that obtains the data for a single season.
Here is how it works. First, here are the bare minimum commands I used earlier to obtain data for a single season:
# Set the website to retrieve data from
url <- "http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml"
# Read the data from that page
tables <- readHTMLTable(url)
# Save the data from the first table on that page
situational_batting_2014 <- tables[[1]]
# Remove league average information
situational_batting_2014 <- subset(situational_batting_2014, Tm != "LgAvg")
# Keep only the columns I really need
situational_batting_2014 <- situational_batting_2014[,c(1:11)]
# Convert the columns I _do_ need to numeric
situational_batting_2014$nPA <- as.numeric(as.character(situational_batting_2014$PA))
situational_batting_2014$nH <- as.numeric(as.character(situational_batting_2014$H))
situational_batting_2014$nInf <- as.numeric(as.character(situational_batting_2014[["Inf"]]))
situational_batting_2014$nBnt <- as.numeric(as.character(situational_batting_2014$Bnt))
situational_batting_2014$nAB <- as.numeric(as.character(situational_batting_2014$AB))
situational_batting_2014$nH1 <- as.numeric(as.character(situational_batting_2014$H.1))
situational_batting_2014$nHR <- as.numeric(as.character(situational_batting_2014$HR))
situational_batting_2014$nRBI <- as.numeric(as.character(situational_batting_2014$RBI))
# Calculate number of "regular" situation at bats
# (Caveat: PA counts plate appearances, which include walks, so this overstates at bats)
situational_batting_2014$nRegularAB <- situational_batting_2014$nPA - situational_batting_2014$nAB
# Calculate "regular" situation batting averages
# (Caveat: H counts all team hits, so pinch hits are included in the numerator)
situational_batting_2014$nRegularAVG <- situational_batting_2014$nH / situational_batting_2014$nRegularAB
# Calculate "pinch-hit" situation batting averages
situational_batting_2014$nPinchHitAVG <- situational_batting_2014$nH1 / situational_batting_2014$nAB
# Tag data with year
situational_batting_2014$Year <- "2014"
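The repeated as.numeric(as.character(...)) pattern is there because columns read from an HTML table may arrive as factors, and calling as.numeric() directly on a factor returns its internal level codes instead of the numbers it displays. The sketch below shows the difference on a small made-up data frame (the values echo the ARI row, but the frame itself is hypothetical) and one way to convert several columns in a single step.

```r
# Hypothetical two-row frame whose columns are factors, as scraped text can be
scraped <- data.frame(PA = factor(c("6089", "6064")),
                      AB = factor(c("227", "190")))

# Direct conversion gives level codes, not the displayed values
codes_only <- as.numeric(scraped$AB)

# Going through character first recovers the real numbers
real_values <- as.numeric(as.character(scraped$AB))

print(codes_only)
print(real_values)

# Convert several columns at once, storing each under an "n"-prefixed name
cols <- c("PA", "AB")
scraped[paste0("n", cols)] <- lapply(scraped[cols],
                                     function(x) as.numeric(as.character(x)))
str(scraped)
```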
Now, I will take these commands and put them in a function. By doing this, I can re-use this function to get data for many years.
# situationalStats
# Purpose: Gets situational batting statistics for a single year
# Returns: Data frame containing desired data
situationalStats <- function(URL, year) {
  # Read the data from provided page
  tables <- readHTMLTable(URL)
  # Save the data from the first table on that page
  stats <- tables[[1]]
  # Remove league average information
  stats <- subset(stats, Tm != "LgAvg")
  # Keep only the columns I really need
  stats <- stats[, 1:11]
  # Convert the columns I _do_ need to numeric
  stats$nPA <- as.numeric(as.character(stats$PA))
  stats$nH <- as.numeric(as.character(stats$H))
  stats$nInf <- as.numeric(as.character(stats[["Inf"]]))
  stats$nBnt <- as.numeric(as.character(stats$Bnt))
  stats$nAB <- as.numeric(as.character(stats$AB))
  stats$nH1 <- as.numeric(as.character(stats$H.1))
  stats$nHR <- as.numeric(as.character(stats$HR))
  stats$nRBI <- as.numeric(as.character(stats$RBI))
  # Calculate number of "regular" situation at bats
  # (Caveat: PA counts plate appearances, which include walks, so this overstates at bats)
  stats$nRegularAB <- stats$nPA - stats$nAB
  # Calculate "regular" situation batting averages
  # (Caveat: H counts all team hits, so pinch hits are included in the numerator)
  stats$nRegularAVG <- stats$nH / stats$nRegularAB
  # Calculate "pinch-hit" situation batting averages
  stats$nPinchHitAVG <- stats$nH1 / stats$nAB
  # Tag data with year
  stats$Year <- year
  # Return the desired data
  return(stats)
}
Here is how I use the function. I only need a one-line command. As the first argument, I pass in the website address that has the source data. As the second argument, I pass in the year the data is from.
Here, I will use the function to get data from the last 10 years and then combine that data into a single dataframe:
# Get data for last 10 years
stats_2014 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml", "2014")
stats_2013 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2013-situational-batting.shtml", "2013")
stats_2012 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2012-situational-batting.shtml", "2012")
stats_2011 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2011-situational-batting.shtml", "2011")
stats_2010 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2010-situational-batting.shtml", "2010")
stats_2009 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2009-situational-batting.shtml", "2009")
stats_2008 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2008-situational-batting.shtml", "2008")
stats_2007 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2007-situational-batting.shtml", "2007")
stats_2006 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2006-situational-batting.shtml", "2006")
stats_2005 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2005-situational-batting.shtml", "2005")
# Combine the 10 separate data frames into a single data frame
stats <- rbind(stats_2014, stats_2013, stats_2012, stats_2011, stats_2010, stats_2009, stats_2008, stats_2007, stats_2006, stats_2005)
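The ten calls differ only in the year, so the same result could be produced with a loop. The sketch below builds the ten addresses with sprintf() and combines the per-year data frames with do.call(rbind, ...). So that it runs without a network connection, it uses a hypothetical stand-in named fetch_stub() in place of situationalStats(); the stub and its placeholder team codes are my invention.

```r
years <- 2005:2014

# Build one address per season from the common pattern
urls <- sprintf(
  "http://www.baseball-reference.com/leagues/MLB/%d-situational-batting.shtml",
  years
)

# Hypothetical offline stand-in for situationalStats(URL, year)
fetch_stub <- function(url, year) {
  data.frame(Tm = c("AAA", "BBB"), Year = year, stringsAsFactors = FALSE)
}

# Call the function once per (url, year) pair, then stack the results
per_year  <- Map(fetch_stub, urls, as.character(years))
stats_all <- do.call(rbind, per_year)
rownames(stats_all) <- NULL

print(urls[1])
print(nrow(stats_all))
```

Swapping fetch_stub for situationalStats would, under the same assumptions, yield one combined data frame covering 2005 to 2014.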
Now that I have data for the last 10 years, I can repeat prior analyses, but have greater confidence that the results are accurate.
Basic histograms:
h_pinch_hit <- histogram(dataframe = stats, variable = "nPinchHitAVG", title = "Pinch Hitting, 2005-2014", binwidth = 0.005)
h_regular <- histogram(dataframe = stats, variable = "nRegularAVG", title = "Regular Hitting, 2005-2014", binwidth = 0.005)
multiplot(h_pinch_hit, h_regular, cols=1)
## Warning: position_stack requires constant width: output may be incorrect
Ensuring same horizontal and vertical scale for accurate comparisons:
h_pinch_hit <- histogram(dataframe = stats, variable = "nPinchHitAVG", title = "Pinch Hitting, 2005-2014", binwidth = 0.005, xmin = 0.075, xmax = 0.350, ymin = 0, ymax = 60)
h_regular <- histogram(dataframe = stats, variable = "nRegularAVG", title = "Regular Hitting, 2005-2014", binwidth = 0.005, xmin = 0.075, xmax = 0.350, ymin = 0, ymax = 60)
multiplot(h_pinch_hit, h_regular, cols=1)
## Warning: position_stack requires constant width: output may be incorrect
There is still a much greater spread for pinch-hit batting averages vs. regular situation batting averages.
As one might expect, with 300 data points (instead of just 30) both distributions now appear more Normal.
Here is the five-number summary for “pinch-hit” at bats:
five_number(dataframe = stats, variable = "nPinchHitAVG")
## [1] "Five number summary"
## [1] "==================="
## [1] "min : 0.092"
## [1] "Q1 : 0.200"
## [1] "median: 0.222"
## [1] "Q3 : 0.244"
## [1] "max : 0.338"
And the mean for “pinch-hit” at bats:
mean(stats$nPinchHitAVG)
## [1] 0.2203
Here is the five-number summary for “regular” at bats:
five_number(dataframe = stats, variable = "nRegularAVG")
## [1] "Five number summary"
## [1] "==================="
## [1] "min : 0.213"
## [1] "Q1 : 0.231"
## [1] "median: 0.238"
## [1] "Q3 : 0.246"
## [1] "max : 0.266"
And the mean for “regular” at bats:
mean(stats$nRegularAVG)
## [1] 0.2382
Median and mean batting averages for pinch hit situations are both lower than median and mean batting averages for regular situations.
Finally, we will explore the spread of the data:
bw_pinch_hit <- box_and_whisker(dataframe = stats, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2005-2014")
bw_regular <- box_and_whisker(dataframe = stats, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2005-2014")
multiplot(bw_pinch_hit, bw_regular, cols=1)
One of the great things about having data for more than one year is that we can now make a comparison of box-and-whisker plots by year. This helps us see if a trend in the spread of data continues year after year.
One important note, however: these comparisons are meaningless unless we look at the data with the same horizontal scale. Let’s make that adjustment:
bw_pinch_hit <- box_and_whisker(dataframe = stats, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2005-2014", min = 0.075, max = 0.350)
bw_regular <- box_and_whisker(dataframe = stats, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2005-2014", min = 0.075, max = 0.350)
multiplot(bw_pinch_hit, bw_regular, cols=1)