Introduction

I have a long-standing personal interest in the sport of baseball. To “pinch hit” in baseball is to replace the starting player at a position in the batting order with another player who, presumably, has a better chance at getting a hit against the current pitcher. A pinch-hitter is often called on in “high-stakes” situations, where the outcome of the game is on the line.

The specific question I want to explore:

“Do pinch hitters in Major League Baseball, as a group, have a greater ability to get a hit than the starting players they have replaced?”

An extension that I might pursue – does the answer to this question change when data from different spans of time in baseball history is analyzed?

A second extension I might pursue – is there any difference in results when we examine National League games as compared to American League games? It should be noted that in National League games, the player being replaced for an at-bat is usually the pitcher (who is often a poor hitter).

This question, and the possible extensions, would be useful to anyone who enjoys the sport of baseball: fans, players, coaches, and team staff. The population of interest is Major League Baseball players.

Hypotheses

H0: Major League Baseball players during the 2014 baseball season who are pinch hitting have the same ability to get a hit as starting players.

Ha: Major League Baseball players during the 2014 baseball season who are pinch hitting have a greater ability to get a hit than starting players.

Load helper functions

Load functions that make it easier to create numerical and graphical summaries:

source('http://russellgordon.ca/rsgc/r/functions.r')

Obtaining the data

Load required packages

First, load packages that make it possible to automatically scrape data from the web.

# We need the 'XML' package.  Install if not present.
if (!require(XML)) {
  install.packages("XML")
  require(XML)
}
## Loading required package: XML
# We need the 'stringr' package.  Install if not present.
if (!require(stringr)) {
  install.packages("stringr")
  require(stringr)
}
## Loading required package: stringr

Rationale for selecting data source

I have chosen to work with the data from Baseball Reference.

The data at Baseball Reference provides pinch-hit statistics with more detail than ESPN. Specifically, the following situational hitting stats are provided by Baseball Reference:

  • at bats
  • hits
  • home runs
  • RBI

In my proposal, I had mentioned that the primary statistic I would use to evaluate pinch-hitting effectiveness was batting average. Although batting average is not provided at the Baseball Reference data source, Major League Baseball defines batting average as the “number of base hits divided by the total number of at-bats”. With the information provided by my Baseball Reference source I can calculate batting average myself, and then plot the results to explore whether any of my hypotheses are supported by the data.

On the same Baseball Reference page, overall batting statistics (in any situation) are provided. This will permit me to make a comparison based on one of the categorical variables I identified in my proposal (at-bat type: “pinch hit” or “regular”). I plan to subtract the pinch-hit batting statistics from the overall batting statistics so that I can identify batting results in “regular” situations.
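As a quick worked example of the batting average definition quoted above, here is a minimal sketch. The helper name `batting_average` is my own (it is not part of the helper functions loaded earlier), and the figures are Arizona's 2014 pinch-hit at-bats and hits as they appear in the Baseball Reference table.

```r
# Batting average = base hits divided by at-bats (the MLB definition).
# `batting_average` is a hypothetical helper written for this illustration.
batting_average <- function(hits, at_bats) {
  hits / at_bats
}

# Arizona, 2014, pinch-hit situations: 227 at-bats, 56 hits.
ari_pinch_avg <- batting_average(hits = 56, at_bats = 227)
round(ari_pinch_avg, 4)  # 0.2467
```

The same function works on whole columns, because division in R is applied element by element.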

Here, I set the data source - this is data for the 2014 regular season:

url <- "http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml"

Retrieve the data

Now I read the data from every table found at that URL (web site address):

tables <- readHTMLTable(url)

Identify the table that has the data I need

My Baseball Reference data source contains multiple tables of data.

The first table at the page appears to contain the data that I need, so I will load that into a data frame.

situational_batting_2014 <- tables[[1]]
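Rather than relying on how the page looks in a browser, the list of tables can be inspected directly. The sketch below uses a small stand-in list (the names `first` and `second` and their contents are made up for illustration), since the real list depends on whatever the page contains at the time it is scraped.

```r
# Stand-in for the list returned by readHTMLTable(); contents are invented
# purely to show the inspection steps.
tables <- list(
  first  = data.frame(Tm = c("ARI", "ATL"), PA = c("6089", "6064")),
  second = data.frame(note = "some other table")
)

# Number of rows and columns in each table (one column per table).
table_sizes <- sapply(tables, dim)
rownames(table_sizes) <- c("rows", "cols")
table_sizes

# Column names of the first table, to confirm it is the one I want.
names(tables[[1]])
```

With the real list, the table with one row per team (plus a league average row) is the one to keep.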

Normalize the data

Review what we have

First, I will review the data that I have found:

view(situational_batting_2014)
## Loading required package: knitr
Tm R/G PA Ptn% H Inf Bnt AB H HR RBI PHlev All GS GSo vRH vLH Hm Rd IP Att Suc % Opp DP % Opp Suc % BR BRS BRS% <2,3B Scr % 0,2B Adv % PAu
1 ARI 3.80 6089 50% 1379 149 15 227 56 7 18 1.25 118 4 107 91 27 62 56 0 73 56 77% 1113 115 10% 586 187 32% 3523 483 14% 303 153 51% 212 126 59% 0
2 ATL 3.54 6064 44% 1316 155 13 190 34 2 7 1.18 123 2 130 91 32 62 61 0 77 53 69% 1061 121 11% 563 155 28% 3514 441 13% 272 137 50% 218 112 51% 0
3 BAL 4.35 6130 46% 1434 157 16 67 21 3 11 1.85 211 6 149 155 56 107 104 0 53 35 66% 1108 112 10% 504 143 28% 3533 485 14% 257 133 52% 228 115 50% 0
4 BOS 3.91 6226 50% 1355 140 4 86 20 2 8 1.75 123 5 151 84 39 49 74 0 33 20 61% 1227 138 11% 563 165 29% 3837 503 13% 326 164 50% 224 105 47% 0
5 CHC 3.79 6102 57% 1315 131 18 244 45 1 14 1.26 157 1 108 119 38 69 88 0 81 57 70% 1076 94 9% 568 169 30% 3441 447 13% 288 143 50% 207 98 47% 0
6 CHW 4.07 6077 51% 1400 164 16 75 23 1 10 1.61 155 4 124 110 45 74 81 0 28 19 68% 1096 127 12% 528 141 27% 3540 497 14% 299 157 53% 233 122 52% 0
7 CIN 3.67 5978 52% 1282 145 25 203 50 6 17 1.30 131 6 106 96 35 77 54 0 92 76 83% 1003 88 9% 551 186 34% 3335 449 13% 318 164 52% 225 129 57% 0
8 CLE 4.13 6222 74% 1411 155 31 105 23 1 11 1.36 142 1 134 111 31 72 70 0 61 51 84% 1117 126 11% 542 180 33% 3698 516 14% 315 168 53% 246 147 60% 0
9 COL 4.66 6164 47% 1551 199 20 233 61 5 30 1.19 186 4 121 126 60 119 67 2 78 59 76% 1096 121 11% 576 201 35% 3608 556 15% 334 189 57% 268 145 54% 0
10 DET 4.67 6202 45% 1557 140 16 74 14 2 6 1.66 155 3 149 109 46 76 79 0 40 24 60% 1181 137 12% 598 212 35% 3842 594 15% 393 209 53% 292 171 59% 0
11 HOU 3.88 6055 61% 1317 153 13 76 9 0 4 1.38 163 5 124 120 43 90 73 1 35 22 63% 1118 122 11% 506 141 28% 3479 451 13% 290 151 52% 218 118 54% 0
12 KCR 4.02 6058 48% 1456 181 23 43 9 2 5 1.50 95 3 110 64 31 43 52 1 54 33 61% 1119 131 12% 560 186 33% 3604 540 15% 332 190 57% 242 139 57% 0
13 LAA 4.77 6285 52% 1464 166 22 105 25 1 12 1.36 155 1 143 113 42 73 82 0 39 26 67% 1175 112 10% 569 189 33% 3844 605 16% 336 184 55% 272 142 52% 0
14 LAD 4.43 6231 46% 1476 203 27 199 46 1 23 1.20 134 0 129 99 35 71 63 0 71 47 66% 1128 119 11% 615 187 30% 3861 573 15% 354 176 50% 262 148 56% 0
15 MIA 3.98 6185 46% 1399 191 20 246 45 5 27 1.56 122 4 162 98 24 59 63 0 85 71 84% 1186 143 12% 602 181 30% 3771 511 14% 330 165 50% 221 121 55% 0
16 MIL 4.01 6065 40% 1366 171 25 212 47 4 22 1.50 150 4 118 109 41 77 73 0 100 70 70% 1026 135 13% 558 178 32% 3385 491 15% 292 142 49% 219 134 61% 0
17 MIN 4.41 6233 60% 1412 163 18 86 18 0 12 1.26 128 1 168 94 34 67 61 1 34 25 74% 1177 97 8% 553 159 29% 3976 576 14% 384 185 48% 253 124 49% 0
18 NYM 3.88 6145 52% 1306 144 13 212 39 3 12 1.38 125 3 148 103 22 59 66 0 76 59 78% 1129 112 10% 582 180 31% 3704 492 13% 318 149 47% 232 126 54% 0
19 NYY 3.91 6082 70% 1349 161 16 90 22 2 8 1.61 147 2 137 108 39 88 59 0 45 29 64% 1085 111 10% 539 186 35% 3554 468 13% 330 175 53% 247 150 61% 0
20 OAK 4.50 6245 72% 1354 164 25 156 32 3 16 1.52 146 5 172 102 44 74 72 0 41 19 46% 1207 118 10% 551 159 29% 3895 567 15% 334 168 50% 262 141 54% 0
21 PHI 3.82 6198 65% 1356 143 12 220 40 5 19 1.36 125 4 142 84 41 64 61 0 73 59 81% 1103 94 9% 563 174 31% 3676 480 13% 306 157 51% 242 135 56% 0
22 PIT 4.21 6224 48% 1436 185 16 281 61 7 32 1.34 156 1 155 133 23 62 94 1 87 54 62% 1166 127 11% 584 174 30% 3856 512 13% 334 172 52% 224 117 52% 0
23 SDP 3.30 5905 67% 1199 131 14 280 61 11 29 1.38 109 3 127 82 27 54 55 0 79 56 71% 989 118 12% 544 175 32% 3310 411 12% 308 141 46% 222 118 53% 0
24 SEA 3.91 5977 65% 1328 162 20 82 16 1 7 1.72 136 0 102 108 28 73 63 0 56 35 63% 1019 112 11% 482 152 32% 3292 484 15% 284 151 53% 177 98 55% 0
25 SFG 4.10 6087 60% 1407 142 12 207 46 4 29 1.43 132 4 145 92 40 53 79 0 58 45 78% 1134 113 10% 585 182 31% 3681 522 14% 326 152 47% 243 129 53% 0
26 STL 3.82 6086 51% 1371 144 15 218 49 2 24 1.11 105 1 140 72 33 57 48 0 90 64 71% 1201 140 12% 583 181 31% 3784 503 13% 288 160 56% 236 143 61% 0
27 TBR 3.78 6205 54% 1361 168 20 142 30 2 20 1.79 117 1 155 85 32 51 66 3 70 43 61% 1211 135 11% 562 183 33% 3859 485 13% 324 160 49% 212 105 50% 0
28 TEX 3.93 6026 49% 1400 189 31 85 21 1 10 1.14 111 2 109 73 38 51 60 0 51 41 80% 1061 148 14% 554 209 38% 3514 507 14% 297 152 51% 226 142 63% 0
29 TOR 4.46 6167 68% 1435 176 20 176 36 9 21 1.17 177 2 138 135 42 98 79 0 58 35 60% 1197 128 11% 561 177 32% 3732 536 14% 349 193 55% 249 132 53% 0
30 WSN 4.23 6216 52% 1403 162 31 209 30 5 15 1.31 152 2 138 117 35 63 89 0 91 60 66% 1169 115 10% 601 171 28% 3790 507 13% 328 161 49% 250 124 50% 0
31 LgAvg 4.07 6131 55% 1387 161 19 161 34 3 16 0 140 3 135 103 37 70 70 0 64 45 70% 1123 120 11% 561 175 31% 3648 506 14% 318 163 51% 235 129 55% 0

The data looks clean - for example, there are no repeated headers.

However, I do have one row that contains league average information (the final row). I will remove this row so that scatterplots are comparing data on a per-team basis only:

situational_batting_2014 <- subset(situational_batting_2014, Tm != "LgAvg")

Now I have only data for the 30 teams in the MLB in 2014:

view(situational_batting_2014)
Tm R/G PA Ptn% H Inf Bnt AB H.1 HR RBI PHlev All GS GSo vRH vLH Hm Rd IP Att Suc % Opp DP %.1 Opp.1 Suc.1 %.2 BR BRS BRS% <2,3B Scr %.3 0,2B Adv %.4 PAu
1 ARI 3.80 6089 50% 1379 149 15 227 56 7 18 1.25 118 4 107 91 27 62 56 0 73 56 77% 1113 115 10% 586 187 32% 3523 483 14% 303 153 51% 212 126 59% 0
2 ATL 3.54 6064 44% 1316 155 13 190 34 2 7 1.18 123 2 130 91 32 62 61 0 77 53 69% 1061 121 11% 563 155 28% 3514 441 13% 272 137 50% 218 112 51% 0
3 BAL 4.35 6130 46% 1434 157 16 67 21 3 11 1.85 211 6 149 155 56 107 104 0 53 35 66% 1108 112 10% 504 143 28% 3533 485 14% 257 133 52% 228 115 50% 0
4 BOS 3.91 6226 50% 1355 140 4 86 20 2 8 1.75 123 5 151 84 39 49 74 0 33 20 61% 1227 138 11% 563 165 29% 3837 503 13% 326 164 50% 224 105 47% 0
5 CHC 3.79 6102 57% 1315 131 18 244 45 1 14 1.26 157 1 108 119 38 69 88 0 81 57 70% 1076 94 9% 568 169 30% 3441 447 13% 288 143 50% 207 98 47% 0
6 CHW 4.07 6077 51% 1400 164 16 75 23 1 10 1.61 155 4 124 110 45 74 81 0 28 19 68% 1096 127 12% 528 141 27% 3540 497 14% 299 157 53% 233 122 52% 0
7 CIN 3.67 5978 52% 1282 145 25 203 50 6 17 1.30 131 6 106 96 35 77 54 0 92 76 83% 1003 88 9% 551 186 34% 3335 449 13% 318 164 52% 225 129 57% 0
8 CLE 4.13 6222 74% 1411 155 31 105 23 1 11 1.36 142 1 134 111 31 72 70 0 61 51 84% 1117 126 11% 542 180 33% 3698 516 14% 315 168 53% 246 147 60% 0
9 COL 4.66 6164 47% 1551 199 20 233 61 5 30 1.19 186 4 121 126 60 119 67 2 78 59 76% 1096 121 11% 576 201 35% 3608 556 15% 334 189 57% 268 145 54% 0
10 DET 4.67 6202 45% 1557 140 16 74 14 2 6 1.66 155 3 149 109 46 76 79 0 40 24 60% 1181 137 12% 598 212 35% 3842 594 15% 393 209 53% 292 171 59% 0
11 HOU 3.88 6055 61% 1317 153 13 76 9 0 4 1.38 163 5 124 120 43 90 73 1 35 22 63% 1118 122 11% 506 141 28% 3479 451 13% 290 151 52% 218 118 54% 0
12 KCR 4.02 6058 48% 1456 181 23 43 9 2 5 1.50 95 3 110 64 31 43 52 1 54 33 61% 1119 131 12% 560 186 33% 3604 540 15% 332 190 57% 242 139 57% 0
13 LAA 4.77 6285 52% 1464 166 22 105 25 1 12 1.36 155 1 143 113 42 73 82 0 39 26 67% 1175 112 10% 569 189 33% 3844 605 16% 336 184 55% 272 142 52% 0
14 LAD 4.43 6231 46% 1476 203 27 199 46 1 23 1.20 134 0 129 99 35 71 63 0 71 47 66% 1128 119 11% 615 187 30% 3861 573 15% 354 176 50% 262 148 56% 0
15 MIA 3.98 6185 46% 1399 191 20 246 45 5 27 1.56 122 4 162 98 24 59 63 0 85 71 84% 1186 143 12% 602 181 30% 3771 511 14% 330 165 50% 221 121 55% 0
16 MIL 4.01 6065 40% 1366 171 25 212 47 4 22 1.50 150 4 118 109 41 77 73 0 100 70 70% 1026 135 13% 558 178 32% 3385 491 15% 292 142 49% 219 134 61% 0
17 MIN 4.41 6233 60% 1412 163 18 86 18 0 12 1.26 128 1 168 94 34 67 61 1 34 25 74% 1177 97 8% 553 159 29% 3976 576 14% 384 185 48% 253 124 49% 0
18 NYM 3.88 6145 52% 1306 144 13 212 39 3 12 1.38 125 3 148 103 22 59 66 0 76 59 78% 1129 112 10% 582 180 31% 3704 492 13% 318 149 47% 232 126 54% 0
19 NYY 3.91 6082 70% 1349 161 16 90 22 2 8 1.61 147 2 137 108 39 88 59 0 45 29 64% 1085 111 10% 539 186 35% 3554 468 13% 330 175 53% 247 150 61% 0
20 OAK 4.50 6245 72% 1354 164 25 156 32 3 16 1.52 146 5 172 102 44 74 72 0 41 19 46% 1207 118 10% 551 159 29% 3895 567 15% 334 168 50% 262 141 54% 0
21 PHI 3.82 6198 65% 1356 143 12 220 40 5 19 1.36 125 4 142 84 41 64 61 0 73 59 81% 1103 94 9% 563 174 31% 3676 480 13% 306 157 51% 242 135 56% 0
22 PIT 4.21 6224 48% 1436 185 16 281 61 7 32 1.34 156 1 155 133 23 62 94 1 87 54 62% 1166 127 11% 584 174 30% 3856 512 13% 334 172 52% 224 117 52% 0
23 SDP 3.30 5905 67% 1199 131 14 280 61 11 29 1.38 109 3 127 82 27 54 55 0 79 56 71% 989 118 12% 544 175 32% 3310 411 12% 308 141 46% 222 118 53% 0
24 SEA 3.91 5977 65% 1328 162 20 82 16 1 7 1.72 136 0 102 108 28 73 63 0 56 35 63% 1019 112 11% 482 152 32% 3292 484 15% 284 151 53% 177 98 55% 0
25 SFG 4.10 6087 60% 1407 142 12 207 46 4 29 1.43 132 4 145 92 40 53 79 0 58 45 78% 1134 113 10% 585 182 31% 3681 522 14% 326 152 47% 243 129 53% 0
26 STL 3.82 6086 51% 1371 144 15 218 49 2 24 1.11 105 1 140 72 33 57 48 0 90 64 71% 1201 140 12% 583 181 31% 3784 503 13% 288 160 56% 236 143 61% 0
27 TBR 3.78 6205 54% 1361 168 20 142 30 2 20 1.79 117 1 155 85 32 51 66 3 70 43 61% 1211 135 11% 562 183 33% 3859 485 13% 324 160 49% 212 105 50% 0
28 TEX 3.93 6026 49% 1400 189 31 85 21 1 10 1.14 111 2 109 73 38 51 60 0 51 41 80% 1061 148 14% 554 209 38% 3514 507 14% 297 152 51% 226 142 63% 0
29 TOR 4.46 6167 68% 1435 176 20 176 36 9 21 1.17 177 2 138 135 42 98 79 0 58 35 60% 1197 128 11% 561 177 32% 3732 536 14% 349 193 55% 249 132 53% 0
30 WSN 4.23 6216 52% 1403 162 31 209 30 5 15 1.31 152 2 138 117 35 63 89 0 91 60 66% 1169 115 10% 601 171 28% 3790 507 13% 328 161 49% 250 124 50% 0
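One side effect of removing a row with subset() is that the removed value stays on as an unused factor level. A minimal sketch with a three-row stand-in data frame (not the real data) shows this, and shows how droplevels() tidies it up. Dropping the level is optional and is not a step the analysis itself takes.

```r
# Small stand-in for the scraped table: two teams plus the league average row.
teams <- data.frame(
  Tm = factor(c("ARI", "ATL", "LgAvg")),
  PA = c(6089, 6064, 6131)
)

# Remove the league average row, as in the report.
teams_only <- subset(teams, Tm != "LgAvg")
levels_before <- nlevels(teams_only$Tm)  # 3: "LgAvg" is still a level

# Drop factor levels that no longer occur in the data.
teams_only <- droplevels(teams_only)
levels_after <- nlevels(teams_only$Tm)   # 2
```

This is why a structure listing can report one more level for the team column than there are rows.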

It’s also important to note what column headers contain the data I need.

By looking at my original data source I can see that the columns:

  • PA
  • H
  • Inf
  • Bnt

… are overall batting statistics (regular and pinch-hit).

The column headers that contain situational batting statistics are:

  • AB
  • H.1
  • HR
  • RBI

Only the first 11 columns contain data that I need for my project. I will trim the dataframe so that I keep only these columns:

situational_batting_2014 <- situational_batting_2014[,c(1:11)]

The result:

view(situational_batting_2014)
Tm R/G PA Ptn% H Inf Bnt AB H.1 HR RBI
1 ARI 3.80 6089 50% 1379 149 15 227 56 7 18
2 ATL 3.54 6064 44% 1316 155 13 190 34 2 7
3 BAL 4.35 6130 46% 1434 157 16 67 21 3 11
4 BOS 3.91 6226 50% 1355 140 4 86 20 2 8
5 CHC 3.79 6102 57% 1315 131 18 244 45 1 14
6 CHW 4.07 6077 51% 1400 164 16 75 23 1 10
7 CIN 3.67 5978 52% 1282 145 25 203 50 6 17
8 CLE 4.13 6222 74% 1411 155 31 105 23 1 11
9 COL 4.66 6164 47% 1551 199 20 233 61 5 30
10 DET 4.67 6202 45% 1557 140 16 74 14 2 6
11 HOU 3.88 6055 61% 1317 153 13 76 9 0 4
12 KCR 4.02 6058 48% 1456 181 23 43 9 2 5
13 LAA 4.77 6285 52% 1464 166 22 105 25 1 12
14 LAD 4.43 6231 46% 1476 203 27 199 46 1 23
15 MIA 3.98 6185 46% 1399 191 20 246 45 5 27
16 MIL 4.01 6065 40% 1366 171 25 212 47 4 22
17 MIN 4.41 6233 60% 1412 163 18 86 18 0 12
18 NYM 3.88 6145 52% 1306 144 13 212 39 3 12
19 NYY 3.91 6082 70% 1349 161 16 90 22 2 8
20 OAK 4.50 6245 72% 1354 164 25 156 32 3 16
21 PHI 3.82 6198 65% 1356 143 12 220 40 5 19
22 PIT 4.21 6224 48% 1436 185 16 281 61 7 32
23 SDP 3.30 5905 67% 1199 131 14 280 61 11 29
24 SEA 3.91 5977 65% 1328 162 20 82 16 1 7
25 SFG 4.10 6087 60% 1407 142 12 207 46 4 29
26 STL 3.82 6086 51% 1371 144 15 218 49 2 24
27 TBR 3.78 6205 54% 1361 168 20 142 30 2 20
28 TEX 3.93 6026 49% 1400 189 31 85 21 1 10
29 TOR 4.46 6167 68% 1435 176 20 176 36 9 21
30 WSN 4.23 6216 52% 1403 162 31 209 30 5 15

Convert data to numeric

I intend to analyze my data in R.

When a table is scraped this way, R stores every column as a “factor” - that is, a categorical variable whose values are text labels rather than numbers.

You can see this here:

str(situational_batting_2014)
## 'data.frame':    30 obs. of  11 variables:
##  $ Tm  : Factor w/ 31 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ R/G : Factor w/ 26 levels "3.30","3.54",..: 6 2 19 9 5 14 3 16 24 25 ...
##  $ PA  : Factor w/ 31 levels "5905","5977",..: 13 7 15 27 14 9 3 25 18 22 ...
##  $ Ptn%: Factor w/ 21 levels "40%","44%","45%",..: 8 2 4 8 13 9 10 21 5 3 ...
##  $ H   : Factor w/ 30 levels "1199","1282",..: 15 5 23 10 4 18 2 21 29 30 ...
##  $ Inf : Factor w/ 24 levels "131","140","142",..: 7 9 10 2 1 14 6 9 23 2 ...
##  $ Bnt : Factor w/ 14 levels "12","13","14",..: 4 2 5 14 6 5 11 13 8 5 ...
##  $ AB  : Factor w/ 28 levels "105","142","156",..: 14 6 21 27 16 23 8 1 15 22 ...
##  $ H.1 : Factor w/ 22 levels "14","16","18",..: 20 11 5 4 15 7 19 7 21 1 ...
##  $ HR  : Factor w/ 10 levels "0","1","11","2",..: 9 4 5 4 2 2 8 2 7 4 ...
##  $ RBI : Factor w/ 23 levels "10","11","12",..: 8 22 2 23 4 1 7 2 17 21 ...

So, I will convert the columns that I intend to use in calculations from factors to numeric data:

situational_batting_2014$nPA  <- as.numeric(as.character(situational_batting_2014$PA))
situational_batting_2014$nH  <- as.numeric(as.character(situational_batting_2014$H))
situational_batting_2014$nInf  <- as.numeric(as.character(situational_batting_2014[["Inf"]]))
situational_batting_2014$nBnt  <- as.numeric(as.character(situational_batting_2014$Bnt))
situational_batting_2014$nAB  <- as.numeric(as.character(situational_batting_2014$AB))
situational_batting_2014$nH1  <- as.numeric(as.character(situational_batting_2014$H.1))
situational_batting_2014$nHR  <- as.numeric(as.character(situational_batting_2014$HR))
situational_batting_2014$nRBI  <- as.numeric(as.character(situational_batting_2014$RBI))
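The as.character() step in each line above matters. Calling as.numeric() directly on a factor returns the internal level codes, not the numbers shown on screen. A minimal sketch with three plate appearance values illustrates the difference:

```r
# Three plate appearance totals stored as a factor, as they are after scraping.
pa <- factor(c("6089", "6064", "6130"))

# Direct conversion returns the level codes (levels sort as text:
# "6064", "6089", "6130"), which is not what I want.
codes <- as.numeric(pa)                 # 2 1 3

# Converting to text first recovers the actual values.
values <- as.numeric(as.character(pa))  # 6089 6064 6130
```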

The new columns are numeric:

str(situational_batting_2014)
## 'data.frame':    30 obs. of  19 variables:
##  $ Tm  : Factor w/ 31 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ R/G : Factor w/ 26 levels "3.30","3.54",..: 6 2 19 9 5 14 3 16 24 25 ...
##  $ PA  : Factor w/ 31 levels "5905","5977",..: 13 7 15 27 14 9 3 25 18 22 ...
##  $ Ptn%: Factor w/ 21 levels "40%","44%","45%",..: 8 2 4 8 13 9 10 21 5 3 ...
##  $ H   : Factor w/ 30 levels "1199","1282",..: 15 5 23 10 4 18 2 21 29 30 ...
##  $ Inf : Factor w/ 24 levels "131","140","142",..: 7 9 10 2 1 14 6 9 23 2 ...
##  $ Bnt : Factor w/ 14 levels "12","13","14",..: 4 2 5 14 6 5 11 13 8 5 ...
##  $ AB  : Factor w/ 28 levels "105","142","156",..: 14 6 21 27 16 23 8 1 15 22 ...
##  $ H.1 : Factor w/ 22 levels "14","16","18",..: 20 11 5 4 15 7 19 7 21 1 ...
##  $ HR  : Factor w/ 10 levels "0","1","11","2",..: 9 4 5 4 2 2 8 2 7 4 ...
##  $ RBI : Factor w/ 23 levels "10","11","12",..: 8 22 2 23 4 1 7 2 17 21 ...
##  $ nPA : num  6089 6064 6130 6226 6102 ...
##  $ nH  : num  1379 1316 1434 1355 1315 ...
##  $ nInf: num  149 155 157 140 131 164 145 155 199 140 ...
##  $ nBnt: num  15 13 16 4 18 16 25 31 20 16 ...
##  $ nAB : num  227 190 67 86 244 75 203 105 233 74 ...
##  $ nH1 : num  56 34 21 20 45 23 50 23 61 14 ...
##  $ nHR : num  7 2 3 2 1 1 6 1 5 2 ...
##  $ nRBI: num  18 7 11 8 14 10 17 11 30 6 ...

Note that I placed the converted data into new columns. I will compare the converted data to the data in the original columns to be certain that no data values were changed unexpectedly during the conversion process from “factors” to “numeric” data.

Reviewing the converted data:

view(situational_batting_2014)
Tm R/G PA Ptn% H Inf Bnt AB H.1 HR RBI nPA nH nInf nBnt nAB nH1 nHR nRBI
1 ARI 3.80 6089 50% 1379 149 15 227 56 7 18 6089 1379 149 15 227 56 7 18
2 ATL 3.54 6064 44% 1316 155 13 190 34 2 7 6064 1316 155 13 190 34 2 7
3 BAL 4.35 6130 46% 1434 157 16 67 21 3 11 6130 1434 157 16 67 21 3 11
4 BOS 3.91 6226 50% 1355 140 4 86 20 2 8 6226 1355 140 4 86 20 2 8
5 CHC 3.79 6102 57% 1315 131 18 244 45 1 14 6102 1315 131 18 244 45 1 14
6 CHW 4.07 6077 51% 1400 164 16 75 23 1 10 6077 1400 164 16 75 23 1 10
7 CIN 3.67 5978 52% 1282 145 25 203 50 6 17 5978 1282 145 25 203 50 6 17
8 CLE 4.13 6222 74% 1411 155 31 105 23 1 11 6222 1411 155 31 105 23 1 11
9 COL 4.66 6164 47% 1551 199 20 233 61 5 30 6164 1551 199 20 233 61 5 30
10 DET 4.67 6202 45% 1557 140 16 74 14 2 6 6202 1557 140 16 74 14 2 6
11 HOU 3.88 6055 61% 1317 153 13 76 9 0 4 6055 1317 153 13 76 9 0 4
12 KCR 4.02 6058 48% 1456 181 23 43 9 2 5 6058 1456 181 23 43 9 2 5
13 LAA 4.77 6285 52% 1464 166 22 105 25 1 12 6285 1464 166 22 105 25 1 12
14 LAD 4.43 6231 46% 1476 203 27 199 46 1 23 6231 1476 203 27 199 46 1 23
15 MIA 3.98 6185 46% 1399 191 20 246 45 5 27 6185 1399 191 20 246 45 5 27
16 MIL 4.01 6065 40% 1366 171 25 212 47 4 22 6065 1366 171 25 212 47 4 22
17 MIN 4.41 6233 60% 1412 163 18 86 18 0 12 6233 1412 163 18 86 18 0 12
18 NYM 3.88 6145 52% 1306 144 13 212 39 3 12 6145 1306 144 13 212 39 3 12
19 NYY 3.91 6082 70% 1349 161 16 90 22 2 8 6082 1349 161 16 90 22 2 8
20 OAK 4.50 6245 72% 1354 164 25 156 32 3 16 6245 1354 164 25 156 32 3 16
21 PHI 3.82 6198 65% 1356 143 12 220 40 5 19 6198 1356 143 12 220 40 5 19
22 PIT 4.21 6224 48% 1436 185 16 281 61 7 32 6224 1436 185 16 281 61 7 32
23 SDP 3.30 5905 67% 1199 131 14 280 61 11 29 5905 1199 131 14 280 61 11 29
24 SEA 3.91 5977 65% 1328 162 20 82 16 1 7 5977 1328 162 20 82 16 1 7
25 SFG 4.10 6087 60% 1407 142 12 207 46 4 29 6087 1407 142 12 207 46 4 29
26 STL 3.82 6086 51% 1371 144 15 218 49 2 24 6086 1371 144 15 218 49 2 24
27 TBR 3.78 6205 54% 1361 168 20 142 30 2 20 6205 1361 168 20 142 30 2 20
28 TEX 3.93 6026 49% 1400 189 31 85 21 1 10 6026 1400 189 31 85 21 1 10
29 TOR 4.46 6167 68% 1435 176 20 176 36 9 21 6167 1435 176 20 176 36 9 21
30 WSN 4.23 6216 52% 1403 162 31 209 30 5 15 6216 1403 162 31 209 30 5 15

The data appears to have been converted successfully (that is, no values have obviously been changed to something incorrect).
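The visual comparison can be backed up with a programmatic check. This is a sketch on a two-row stand-in data frame; the same idea could be applied to each pair of original and converted columns in the real data frame.

```r
# Stand-in data: two factor columns and their numeric conversions.
df <- data.frame(
  PA = factor(c("6089", "6064")),
  AB = factor(c("227", "190"))
)
df$nPA <- as.numeric(as.character(df$PA))
df$nAB <- as.numeric(as.character(df$AB))

original  <- c("PA", "AB")
converted <- c("nPA", "nAB")

# For each pair, compare the text of the original with the text of the
# converted value. This works for whole-number columns like these.
matches <- mapply(
  function(o, n) all(as.character(df[[o]]) == as.character(df[[n]])),
  original, converted
)
matches

# A failed conversion would also show up as a missing value.
any_missing <- anyNA(df[converted])
```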

Calculate batting averages

My data source did not include batting averages for regular vs. pinch-hitting situations.

In this section, I will use the data provided to directly calculate this information.

First, I will determine how many at-bats occurred in “regular” batting situations.

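To do this, I will subtract pinch-hit at-bat totals from overall plate appearances. The table does not list overall at-bats, so plate appearances (which also count walks, sacrifices, and times hit by a pitch) stand in for them here.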

situational_batting_2014$nRegularAB  <- situational_batting_2014$nPA - situational_batting_2014$nAB

Next, I will determine batting averages for “regular” situations:

situational_batting_2014$nRegularAVG <- situational_batting_2014$nH / situational_batting_2014$nRegularAB

Finally, I will determine batting averages for “pinch-hit” situations:

situational_batting_2014$nPinchHitAVG <- situational_batting_2014$nH1 / situational_batting_2014$nAB
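One possible refinement, which the calculation above does not make, is to remove pinch hits from the hit total as well, since the H column counts hits in all situations. The sketch below uses Arizona's 2014 figures to compare the two versions. It is an illustration only, not the method used for the results in this report, and the denominator is still based on plate appearances.

```r
# Arizona, 2014, from the table above.
total_pa   <- 6089  # overall plate appearances
total_hits <- 1379  # overall hits (regular and pinch-hit)
pinch_ab   <- 227   # pinch-hit at-bats
pinch_hits <- 56    # pinch-hit hits

# Same denominator as in the report.
regular_ab_estimate <- total_pa - pinch_ab

# As computed in the report: all hits over the estimated regular at-bats.
as_computed <- total_hits / regular_ab_estimate

# Refinement (an assumption on my part): count only non-pinch hits.
regular_hits <- total_hits - pinch_hits
adjusted <- regular_hits / regular_ab_estimate

round(c(as_computed = as_computed, adjusted = adjusted), 4)
```

For Arizona the adjustment lowers the “regular” figure by roughly one hundredth, so it would be worth checking whether it changes the overall comparison.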

Tag data with year

One of my possible extensions is to compare data by year.

Therefore, before I continue, I will add a column to the dataframe, identifying what year this data is from.

This also makes it possible to generate box-and-whisker plots (as those graphs require a “factor” that is used to group the data being plotted).

situational_batting_2014$Year <- "2014"

Here is what the data now looks like, after calculating the batting averages (see last section) and tagging with the year that the data is from:

view(situational_batting_2014)
Tm R/G PA Ptn% H Inf Bnt AB H.1 HR RBI nPA nH nInf nBnt nAB nH1 nHR nRBI nRegularAB nRegularAVG nPinchHitAVG Year
1 ARI 3.80 6089 50% 1379 149 15 227 56 7 18 6089 1379 149 15 227 56 7 18 5862 0.2352 0.2467 2014
2 ATL 3.54 6064 44% 1316 155 13 190 34 2 7 6064 1316 155 13 190 34 2 7 5874 0.2240 0.1789 2014
3 BAL 4.35 6130 46% 1434 157 16 67 21 3 11 6130 1434 157 16 67 21 3 11 6063 0.2365 0.3134 2014
4 BOS 3.91 6226 50% 1355 140 4 86 20 2 8 6226 1355 140 4 86 20 2 8 6140 0.2207 0.2326 2014
5 CHC 3.79 6102 57% 1315 131 18 244 45 1 14 6102 1315 131 18 244 45 1 14 5858 0.2245 0.1844 2014
6 CHW 4.07 6077 51% 1400 164 16 75 23 1 10 6077 1400 164 16 75 23 1 10 6002 0.2333 0.3067 2014
7 CIN 3.67 5978 52% 1282 145 25 203 50 6 17 5978 1282 145 25 203 50 6 17 5775 0.2220 0.2463 2014
8 CLE 4.13 6222 74% 1411 155 31 105 23 1 11 6222 1411 155 31 105 23 1 11 6117 0.2307 0.2190 2014
9 COL 4.66 6164 47% 1551 199 20 233 61 5 30 6164 1551 199 20 233 61 5 30 5931 0.2615 0.2618 2014
10 DET 4.67 6202 45% 1557 140 16 74 14 2 6 6202 1557 140 16 74 14 2 6 6128 0.2541 0.1892 2014
11 HOU 3.88 6055 61% 1317 153 13 76 9 0 4 6055 1317 153 13 76 9 0 4 5979 0.2203 0.1184 2014
12 KCR 4.02 6058 48% 1456 181 23 43 9 2 5 6058 1456 181 23 43 9 2 5 6015 0.2421 0.2093 2014
13 LAA 4.77 6285 52% 1464 166 22 105 25 1 12 6285 1464 166 22 105 25 1 12 6180 0.2369 0.2381 2014
14 LAD 4.43 6231 46% 1476 203 27 199 46 1 23 6231 1476 203 27 199 46 1 23 6032 0.2447 0.2312 2014
15 MIA 3.98 6185 46% 1399 191 20 246 45 5 27 6185 1399 191 20 246 45 5 27 5939 0.2356 0.1829 2014
16 MIL 4.01 6065 40% 1366 171 25 212 47 4 22 6065 1366 171 25 212 47 4 22 5853 0.2334 0.2217 2014
17 MIN 4.41 6233 60% 1412 163 18 86 18 0 12 6233 1412 163 18 86 18 0 12 6147 0.2297 0.2093 2014
18 NYM 3.88 6145 52% 1306 144 13 212 39 3 12 6145 1306 144 13 212 39 3 12 5933 0.2201 0.1840 2014
19 NYY 3.91 6082 70% 1349 161 16 90 22 2 8 6082 1349 161 16 90 22 2 8 5992 0.2251 0.2444 2014
20 OAK 4.50 6245 72% 1354 164 25 156 32 3 16 6245 1354 164 25 156 32 3 16 6089 0.2224 0.2051 2014
21 PHI 3.82 6198 65% 1356 143 12 220 40 5 19 6198 1356 143 12 220 40 5 19 5978 0.2268 0.1818 2014
22 PIT 4.21 6224 48% 1436 185 16 281 61 7 32 6224 1436 185 16 281 61 7 32 5943 0.2416 0.2171 2014
23 SDP 3.30 5905 67% 1199 131 14 280 61 11 29 5905 1199 131 14 280 61 11 29 5625 0.2132 0.2179 2014
24 SEA 3.91 5977 65% 1328 162 20 82 16 1 7 5977 1328 162 20 82 16 1 7 5895 0.2253 0.1951 2014
25 SFG 4.10 6087 60% 1407 142 12 207 46 4 29 6087 1407 142 12 207 46 4 29 5880 0.2393 0.2222 2014
26 STL 3.82 6086 51% 1371 144 15 218 49 2 24 6086 1371 144 15 218 49 2 24 5868 0.2336 0.2248 2014
27 TBR 3.78 6205 54% 1361 168 20 142 30 2 20 6205 1361 168 20 142 30 2 20 6063 0.2245 0.2113 2014
28 TEX 3.93 6026 49% 1400 189 31 85 21 1 10 6026 1400 189 31 85 21 1 10 5941 0.2357 0.2471 2014
29 TOR 4.46 6167 68% 1435 176 20 176 36 9 21 6167 1435 176 20 176 36 9 21 5991 0.2395 0.2045 2014
30 WSN 4.23 6216 52% 1403 162 31 209 30 5 15 6216 1403 162 31 209 30 5 15 6007 0.2336 0.1435 2014

Analyze the data

To analyze single-variable data, I remember from studies earlier this year that I must comment on three things: the shape, the centre, and the spread of the data.

Shape of the data

To illustrate the shape of the data, I will use a histogram:

h_pinch_hit  <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014")
## Loading required package: ggplot2
h_regular  <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014")

multiplot(h_pinch_hit, h_regular, cols=1)
## Loading required package: grid

[Plot: histograms of pinch-hit and regular batting averages, 2014, default bin width]

The default binwidth for a histogram is 1 unit, which does not make sense for a batting average. I will specify the binwidth as 0.01:

h_pinch_hit  <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014", binwidth = 0.01)

h_regular  <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014", binwidth = 0.01)

multiplot(h_pinch_hit, h_regular, cols=1)

[Plot: histograms of pinch-hit and regular batting averages, 2014, bin width 0.01]

I have noticed that the horizontal scale for each plot is different. This makes comparison of the two plots difficult. I will generate the graphs again, this time specifying the minimum and maximum values on the horizontal scale. I will also specify the vertical scale min and max values, so that the scale does not show counts with decimal values (which is not very helpful, as counts are discrete values):

h_pinch_hit  <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014", binwidth = 0.01, xmin = 0.100, xmax = 0.325, ymin = 0, ymax = 15)

h_regular  <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014", binwidth = 0.01, xmin = 0.100, xmax = 0.325, ymin = 0, ymax = 15)

multiplot(h_pinch_hit, h_regular, cols=1)

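[Plot: histograms of pinch-hit and regular batting averages, 2014, bin width 0.01, common scales (horizontal 0.100 to 0.325, vertical 0 to 15)]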

Now I can see that the shape for batting average for “regular” at-bats is somewhat normal, with a very small spread (the range from minimum to maximum values is not large).

There is a much larger spread of values for “pinch-hitting” batting averages.

What this tells me is that batting averages are much more consistent for “regular” at-bats, whereas batting averages for “pinch-hit” at-bats are considerably more varied. Some pinch-hit batting averages are very good (greater than .300), and some are very poor (less than .150).
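For readers without access to the helper functions, the same comparison can be sketched in base R. This is not how the plots above were produced. It assumes the rounded batting averages from the table, typed in by hand in team order, and it uses hist() with shared bin edges so that both panels are directly comparable.

```r
# Batting averages by team (ARI ... WSN), rounded as shown in the table above.
pinch_hit_avg <- c(0.2467, 0.1789, 0.3134, 0.2326, 0.1844, 0.3067, 0.2463,
                   0.2190, 0.2618, 0.1892, 0.1184, 0.2093, 0.2381, 0.2312,
                   0.1829, 0.2217, 0.2093, 0.1840, 0.2444, 0.2051, 0.1818,
                   0.2171, 0.2179, 0.1951, 0.2222, 0.2248, 0.2113, 0.2471,
                   0.2045, 0.1435)
regular_avg <- c(0.2352, 0.2240, 0.2365, 0.2207, 0.2245, 0.2333, 0.2220,
                 0.2307, 0.2615, 0.2541, 0.2203, 0.2421, 0.2369, 0.2447,
                 0.2356, 0.2334, 0.2297, 0.2201, 0.2251, 0.2224, 0.2268,
                 0.2416, 0.2132, 0.2253, 0.2393, 0.2336, 0.2245, 0.2357,
                 0.2395, 0.2336)

# Shared bin edges, 0.01 wide, covering every value in both groups.
bin_edges <- seq(0.10, 0.33, by = 0.01)

# Stack the two histograms and give them identical axes.
old_par <- par(mfrow = c(2, 1))
h_pinch <- hist(pinch_hit_avg, breaks = bin_edges,
                xlim = c(0.100, 0.325), ylim = c(0, 15),
                main = "Pinch Hitting, 2014", xlab = "Batting average")
h_regular <- hist(regular_avg, breaks = bin_edges,
                  xlim = c(0.100, 0.325), ylim = c(0, 15),
                  main = "Regular Hitting, 2014", xlab = "Batting average")
par(old_par)
```

Fixing the bin edges, rather than only the bin width, guarantees that the bars in the two panels line up.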

Centre of the data

Here is the five-number summary for “pinch-hit” at bats:

five_number(dataframe = situational_batting_2014, variable = "nPinchHitAVG")
## [1] "Five number summary"
## [1] "==================="
## [1] "min   : 0.118"
## [1] "Q1    : 0.189"
## [1] "median: 0.217"
## [1] "Q3    : 0.238"
## [1] "max   : 0.313"

And the mean for “pinch-hit” at bats:

mean(situational_batting_2014$nPinchHitAVG)
## [1] 0.2163

Here is the five-number summary for “regular” at bats:

five_number(dataframe = situational_batting_2014, variable = "nRegularAVG")
## [1] "Five number summary"
## [1] "==================="
## [1] "min   : 0.213"
## [1] "Q1    : 0.224"
## [1] "median: 0.233"
## [1] "Q3    : 0.237"
## [1] "max   : 0.262"

And the mean for “regular” at bats:

mean(situational_batting_2014$nRegularAVG)
## [1] 0.2322

What jumps out at me from these numerical summaries is that both measures of the centre (mean and median) are higher for “regular” at-bats than for “pinch-hit” at-bats.
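The pinch-hit summary can be cross-checked with base R. The sketch below assumes the rounded pinch-hit averages from the table, typed in by hand, and uses fivenum(), which reports Tukey's five numbers. I am assuming, not asserting, that the five_number helper works the same way. The results agree to three decimal places for this data.

```r
# Pinch-hit batting averages by team (ARI ... WSN), rounded as in the table.
pinch_hit_avg <- c(0.2467, 0.1789, 0.3134, 0.2326, 0.1844, 0.3067, 0.2463,
                   0.2190, 0.2618, 0.1892, 0.1184, 0.2093, 0.2381, 0.2312,
                   0.1829, 0.2217, 0.2093, 0.1840, 0.2444, 0.2051, 0.1818,
                   0.2171, 0.2179, 0.1951, 0.2222, 0.2248, 0.2113, 0.2471,
                   0.2045, 0.1435)

# Minimum, lower hinge, median, upper hinge, maximum.
pinch_five <- fivenum(pinch_hit_avg)
pinch_five  # 0.1184 0.1892 0.2175 0.2381 0.3134

# Mean, for comparison with the median.
pinch_mean <- mean(pinch_hit_avg)
```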

Spread of the data

Since I have data for 30 teams in 2014, it is appropriate to use a box-and-whisker plot. This is a good graphical summary because it visually illustrates the five-number summary.

The variable I am comparing is batting averages, based on a categorical variable with two values:

  • regular at-bats
  • pinch-hit at-bats

The factor I am using is the year the batting averages are from.

I am being careful to use the same horizontal scale for each box-and-whisker plot so that I can make comparisons.

bw_pinch_hit <- box_and_whisker(dataframe = situational_batting_2014, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2014", min = 0.100, max = 0.325)

bw_regular <- box_and_whisker(dataframe = situational_batting_2014, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2014", min = 0.100, max = 0.325)

multiplot(bw_pinch_hit, bw_regular, cols=1)

[Plot: box-and-whisker plots of pinch-hit and regular batting averages, 2014, on a common scale from 0.100 to 0.325]

Here is what I notice about the spread of the data:

  • more than 75% of the data from “regular” at-bats are above the median value for “pinch hitting” batting average
  • the range of the “pinch-hitting” data is considerably larger than that of the “regular” at-bats
  • there are several outliers in the “pinch-hitting” at-bats data (this does not mean they can be discarded but it is worth noting)
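To see which teams the outliers are, the usual 1.5 × IQR rule can be applied by hand. This sketch assumes the rounded pinch-hit averages from the table and the default quartile method of R's quantile() function. Whether the box_and_whisker helper uses exactly this rule is an assumption on my part.

```r
# Pinch-hit batting averages, rounded as in the table, named by team.
pinch_hit_avg <- c(ARI = 0.2467, ATL = 0.1789, BAL = 0.3134, BOS = 0.2326,
                   CHC = 0.1844, CHW = 0.3067, CIN = 0.2463, CLE = 0.2190,
                   COL = 0.2618, DET = 0.1892, HOU = 0.1184, KCR = 0.2093,
                   LAA = 0.2381, LAD = 0.2312, MIA = 0.1829, MIL = 0.2217,
                   MIN = 0.2093, NYM = 0.1840, NYY = 0.2444, OAK = 0.2051,
                   PHI = 0.1818, PIT = 0.2171, SDP = 0.2179, SEA = 0.1951,
                   SFG = 0.2222, STL = 0.2248, TBR = 0.2113, TEX = 0.2471,
                   TOR = 0.2045, WSN = 0.1435)

# First and third quartiles, and the interquartile range.
q <- quantile(pinch_hit_avg, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Points more than 1.5 IQRs beyond the quartiles count as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- pinch_hit_avg[pinch_hit_avg < lower_fence |
                          pinch_hit_avg > upper_fence]
outliers
```

Under these assumptions the rule flags Baltimore and the Chicago White Sox on the high side and Houston on the low side.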

I will calculate the standard deviation for pinch hitting vs. regular hitting batting averages. The box-and-whisker plots suggest that the standard deviation for pinch hitting will be considerably larger.

Standard deviation for pinch hitting:

sd(situational_batting_2014$nPinchHitAVG)
## [1] 0.04006

Standard deviation for regular hitting situations:

sd(situational_batting_2014$nRegularAVG)
## [1] 0.01045

The standard deviation for pinch-hitting situations is nearly four times the standard deviation for regular hitting situations.
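The “nearly four times” figure comes from dividing one standard deviation by the other. A short sketch, using the values printed above and in the five-number summaries, also compares the ranges:

```r
# Standard deviations as printed above.
sd_pinch   <- 0.04006
sd_regular <- 0.01045
sd_ratio <- sd_pinch / sd_regular

# Ranges (max minus min) from the five-number summaries.
range_pinch   <- 0.313 - 0.118
range_regular <- 0.262 - 0.213
range_ratio <- range_pinch / range_regular

round(c(sd_ratio = sd_ratio, range_ratio = range_ratio), 2)
```

Both ratios come out a little under four, so the two measures of spread tell the same story.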

This adds further weight to my earlier observation that pinch-hitting batting averages have a much greater range. One might say that pinch-hitting batting averages are considerably more volatile, or less predictable, than regular hitting batting averages.

Conclusions and Limitations

Conclusions

The graphical and numerical summaries illustrate that in 2014 in Major League Baseball, “pinch hitting”, as measured by the batting average statistic, was considerably less successful than “regular hitting”.

As observed earlier, the mean and median batting averages for pinch-hit at-bats were both lower than the mean and median batting averages for regular batting situations.

Further, the spread of the batting averages for pinch hitting situations was considerably larger. In other words, the results are less predictable. Some teams do very well with pinch hitting batting averages. Some teams do poorly. On the whole, pinch hitting batting averages are far more volatile than regular situation batting averages.

Neither of my original hypotheses is supported by the data from 2014. In fact, it seems that pinch hitters have a lesser ability to get a hit than players who are not in pinch-hit situations (typically, starting players).

Limitations

I have only examined a single season’s worth of data. That is only 30 observations. Perhaps 2014 was an unusual year, and pinch-hitting was not a very successful strategy in 2014, but it may in fact be a very successful strategy in other years.

I would like to extend my analysis to collect data from additional years, and then examine the results.

I would also like to compare results by league (American League vs. National League) as I suspect that pinch hitters would be more successful in the National League (when the starting player being replaced is often the pitcher, who is usually not a good hitter).

However, I do not think it is useful to make a comparison between NL and AL batting averages for a single year, as that would mean each subset would have just 15 data values.

Addendum: Scatterplots

This is an addendum for any students who are doing a two-variable analysis.

There was no need, for the question I explored in the exemplar, to do a two-variable analysis.

However, R does have the ability to produce scatterplots.

Here is a basic example - from the data set used in my exemplar - plotting RBI vs. Home Runs in pinch-hit situations:

scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI")

plot of chunk unnamed-chunk-30

Here is the same scatterplot, with a linear regression applied:

scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI", regression = TRUE)

plot of chunk unnamed-chunk-31

You will notice that besides the best fit line, there is a shaded region.

This shaded region is called a 95% confidence band.

It shows the region that you can be 95% sure contains the true best-fit line. With more data points (a larger sample to work with), the 95% confidence band will typically be narrower and closer to the best-fit line. In other words, as usual, with more data, we have more confidence that our predictions (which are based on a sample) will be accurate.
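
To see the effect of sample size on the confidence band, here is a minimal sketch that uses base R's lm() and predict() rather than the scatterplot() helper. The data are simulated (they are not baseball statistics), and the function name band_width, the "true" line, and the noise level are all assumptions made up for illustration.

```r
# Simulated data only: these are NOT the baseball figures.
set.seed(1)

# Width of the 95% confidence band at x = 5 for a sample of size n
band_width <- function(n) {
  x <- runif(n, min = 0, max = 10)
  y <- 2 + 3 * x + rnorm(n, sd = 2)   # assumed true line plus random noise
  fit <- lm(y ~ x)
  # Confidence interval for the fitted line (not for a single new point)
  ci <- predict(fit, newdata = data.frame(x = 5),
                interval = "confidence", level = 0.95)
  unname(ci[, "upr"] - ci[, "lwr"])
}

width_small <- band_width(15)    # a small sample
width_large <- band_width(300)   # a much larger sample

print(c(small_sample = width_small, large_sample = width_large))
```

With the larger sample, the band around the fitted line should come out clearly narrower, which is the behaviour described above.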

Finally, here is a scatterplot with a title and labelled axes:

scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI", regression = TRUE, x_label = "Home Runs", y_label = "RBI", title = "RBI vs. Home Runs (Pinch-Hitting, 2014)")

plot of chunk unnamed-chunk-32

Extension: More Data

In this part of my project, I will build on the work I have already done for a single season and extend my analysis to multiple seasons.

Most of the hard work is done. I know how to obtain and normalize the data for a single season.

I will use a function that obtains the data for a single season.

Identify key commands

Here is how it works. First, here are the bare minimum commands I used earlier to obtain data for a single season:

# Set the website to retrieve data from
url <- "http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml"
# Read the data from that page
tables <- readHTMLTable(url)
# Save the data from the first table on that page
situational_batting_2014 <- tables[[1]]
# Remove league average information
situational_batting_2014 <- subset(situational_batting_2014, Tm != "LgAvg")
# Keep only the columns I really need
situational_batting_2014 <- situational_batting_2014[,c(1:11)]
# Convert the columns I _do_ need to numeric
situational_batting_2014$nPA  <- as.numeric(as.character(situational_batting_2014$PA))
situational_batting_2014$nH  <- as.numeric(as.character(situational_batting_2014$H))
situational_batting_2014$nInf  <- as.numeric(as.character(situational_batting_2014[["Inf"]]))
situational_batting_2014$nBnt  <- as.numeric(as.character(situational_batting_2014$Bnt))
situational_batting_2014$nAB  <- as.numeric(as.character(situational_batting_2014$AB))
situational_batting_2014$nH1  <- as.numeric(as.character(situational_batting_2014$H.1))
situational_batting_2014$nHR  <- as.numeric(as.character(situational_batting_2014$HR))
situational_batting_2014$nRBI  <- as.numeric(as.character(situational_batting_2014$RBI))
# Calculate number of "regular" situation at bats
situational_batting_2014$nRegularAB  <- situational_batting_2014$nPA - situational_batting_2014$nAB
# Calculate "regular" situation batting averages
situational_batting_2014$nRegularAVG <- situational_batting_2014$nH / situational_batting_2014$nRegularAB
# Calculate "pinch-hit" situation batting averages
situational_batting_2014$nPinchHitAVG <- situational_batting_2014$nH1 / situational_batting_2014$nAB
# Tag data with year
situational_batting_2014$Year = "2014"
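
The as.numeric(as.character(...)) pattern in the commands above matters when a column is stored as a factor, which I am assuming is how the scraped table arrives here. A small sketch with made-up values shows what goes wrong if as.character() is skipped:

```r
# Made-up values stored as a factor (an assumption about how scraped text columns arrive)
hits <- factor(c("10", "5", "20"))

# Converting the factor directly returns the internal level codes, not the values
wrong <- as.numeric(hits)

# Converting to character first recovers the actual numbers
right <- as.numeric(as.character(hits))

print(wrong)
print(right)
```

Only the second conversion gives back the original numbers, so any later arithmetic (such as a batting average) depends on it.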

Write a function

Now, I will take these commands and put them in a function. By doing this, I can re-use this function to get data for many years.

# situationalStats
# Purpose: Gets situational batting statistics for a single year
# Returns: Data frame containing desired data
situationalStats <- function(URL, year) {
  # Read the data from provided page
  tables <- readHTMLTable(URL)
  # Save the data from the first table on that page
  stats <- tables[[1]]
  # Remove league average information
  stats <- subset(stats, Tm != "LgAvg")
  # Keep only the columns I really need
  stats <- stats[,c(1:11)]
  # Convert the columns I _do_ need to numeric
  stats$nPA  <- as.numeric(as.character(stats$PA))
  stats$nH  <- as.numeric(as.character(stats$H))
  stats$nInf  <- as.numeric(as.character(stats[["Inf"]]))
  stats$nBnt  <- as.numeric(as.character(stats$Bnt))
  stats$nAB  <- as.numeric(as.character(stats$AB))
  stats$nH1  <- as.numeric(as.character(stats$H.1))
  stats$nHR  <- as.numeric(as.character(stats$HR))
  stats$nRBI  <- as.numeric(as.character(stats$RBI))
  # Calculate number of "regular" situation at bats
  stats$nRegularAB  <- stats$nPA - stats$nAB
  # Calculate "regular" situation batting averages
  stats$nRegularAVG <- stats$nH / stats$nRegularAB
  # Calculate "pinch-hit" situation batting averages
  stats$nPinchHitAVG <- stats$nH1 / stats$nAB
  # Tag data with year
  stats$Year = year
  # Return the desired data
  return(stats)
}

Apply the function to get more data

Here is how I use the function. I only need a one-line command. I pass in the website address that has the source data as the first argument. As the second argument, I pass in the year the data is from.

Here, I will use the function to get data from the last 10 years and then combine that data into a single dataframe:

# Get data for last 10 years
stats_2014 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml", "2014")
stats_2013 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2013-situational-batting.shtml", "2013")
stats_2012 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2012-situational-batting.shtml", "2012")
stats_2011 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2011-situational-batting.shtml", "2011")
stats_2010 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2010-situational-batting.shtml", "2010")
stats_2009 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2009-situational-batting.shtml", "2009")
stats_2008 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2008-situational-batting.shtml", "2008")
stats_2007 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2007-situational-batting.shtml", "2007")
stats_2006 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2006-situational-batting.shtml", "2006")
stats_2005 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2005-situational-batting.shtml", "2005")

# Combine the 10 separate data frames into a single data frame
stats <- rbind(stats_2014, stats_2013, stats_2012, stats_2011, stats_2010, stats_2009, stats_2008, stats_2007, stats_2006, stats_2005)
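
Since the ten calls above differ only in the year, the same data frame could also be built in a loop. The sketch below shows the pattern with sprintf(), lapply() and do.call(rbind, ...). So that it runs without a network connection, it uses a stand-in function, stub_stats (hypothetical, not part of my project), in place of situationalStats; with the real function swapped back in, the rest should work the same way.

```r
# Seasons to collect
years <- 2005:2014

# Stand-in for situationalStats() so this sketch runs offline (hypothetical).
# It returns a tiny data frame with the same kind of "Year" tag.
stub_stats <- function(URL, year) {
  data.frame(Tm = c("AAA", "BBB"), URL = URL, Year = year,
             stringsAsFactors = FALSE)
}

# Build one data frame per season, using the URL pattern from the calls above
season_list <- lapply(years, function(yr) {
  url <- sprintf("http://www.baseball-reference.com/leagues/MLB/%d-situational-batting.shtml", yr)
  stub_stats(url, as.character(yr))
})

# Combine the list of data frames into a single data frame
all_seasons <- do.call(rbind, season_list)
print(table(all_seasons$Year))
```

The advantage of this form is that adding more seasons only requires changing the years vector.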

Repeat analyses, but with more data

Now that I have data for the last 10 years, I can repeat prior analyses, but have greater confidence that the results are accurate.

Shape of the data

Basic histograms:

h_pinch_hit  <- histogram(dataframe = stats, variable = "nPinchHitAVG", title = "Pinch Hitting, 2005-2014", binwidth = 0.005)

h_regular  <- histogram(dataframe = stats, variable = "nRegularAVG", title = "Regular Hitting, 2005-2014", binwidth = 0.005)

multiplot(h_pinch_hit, h_regular, cols=1)
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk unnamed-chunk-36

Ensuring same horizontal and vertical scale for accurate comparisons:

h_pinch_hit  <- histogram(dataframe = stats, variable = "nPinchHitAVG", title = "Pinch Hitting, 2005-2014", binwidth = 0.005, xmin = 0.075, xmax = 0.350, ymin = 0, ymax = 60)

h_regular  <- histogram(dataframe = stats, variable = "nRegularAVG", title = "Regular Hitting, 2005-2014", binwidth = 0.005, xmin = 0.075, xmax = 0.350, ymin = 0, ymax = 60)

multiplot(h_pinch_hit, h_regular, cols=1)
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk unnamed-chunk-37

There is still a much greater spread for pinch-hit batting averages vs. regular situation batting averages.

As one might expect, with 300 data points (instead of just 30), both distributions now appear more Normal.

Centre of the data

Here is the five-number summary for “pinch-hit” at bats:

five_number(dataframe = stats, variable = "nPinchHitAVG")
## [1] "Five number summary"
## [1] "==================="
## [1] "min   : 0.092"
## [1] "Q1    : 0.200"
## [1] "median: 0.222"
## [1] "Q3    : 0.244"
## [1] "max   : 0.338"

And the mean for “pinch-hit” at bats:

mean(stats$nPinchHitAVG)
## [1] 0.2203

Here is the five-number summary for “regular” at bats:

five_number(dataframe = stats, variable = "nRegularAVG")
## [1] "Five number summary"
## [1] "==================="
## [1] "min   : 0.213"
## [1] "Q1    : 0.231"
## [1] "median: 0.238"
## [1] "Q3    : 0.246"
## [1] "max   : 0.266"

And the mean for “regular” at bats:

mean(stats$nRegularAVG)
## [1] 0.2382

Median and mean batting averages for pinch hit situations are both lower than median and mean batting averages for regular situations.
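
To put a number on that comparison, the gaps can be worked out from the summaries printed above. This is a sketch only: the values are copied by hand from the output, and the variable names are my own.

```r
# Values copied from the ten-season output above
pinch   <- c(q1 = 0.200, median = 0.222, mean = 0.2203)
regular <- c(q1 = 0.231, median = 0.238, mean = 0.2382)

# Gap between regular and pinch-hit batting averages
gap_median <- regular[["median"]] - pinch[["median"]]
gap_mean   <- regular[["mean"]] - pinch[["mean"]]

# Is the pinch-hit median below the first quartile of the regular averages?
pinch_median_below_regular_q1 <- pinch[["median"]] < regular[["q1"]]

print(round(c(gap_median = gap_median, gap_mean = gap_mean), 4))
print(pinch_median_below_regular_q1)
```

Both gaps are roughly 0.016 to 0.018, and the pinch-hit median sits below the first quartile for regular situations, so at least three quarters of the regular-situation values are above the typical pinch-hit value.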

Spread of the data

Finally, we will explore the spread of the data:

bw_pinch_hit <- box_and_whisker(dataframe = stats, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2005-2014")

bw_regular <- box_and_whisker(dataframe = stats, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2005-2014")

multiplot(bw_pinch_hit, bw_regular, cols=1)

plot of chunk unnamed-chunk-42

One of the great things about having data for more than one year is that we can now make a comparison of box-and-whisker plots by year. This helps us see if a trend in the spread of data continues year after year.

One important note, however: these comparisons are meaningless unless we look at the data with the same horizontal scale. Let’s make that adjustment:

bw_pinch_hit <- box_and_whisker(dataframe = stats, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2005-2014", min = 0.075, max = 0.325)

bw_regular <- box_and_whisker(dataframe = stats, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2005-2014", min = 0.075, max = 0.325)

multiplot(bw_pinch_hit, bw_regular, cols=1)

plot of chunk unnamed-chunk-43
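
One way to choose shared limits without guessing is to derive them from the data. The sketch below uses the minimum and maximum values from the five-number summaries above and rounds outward to the nearest 0.025; the step size and the function name shared_limits are my own assumptions. It is worth noting that the largest pinch-hit value (0.338) is above the upper limit of 0.325 used in the plots above, so depending on how box_and_whisker treats out-of-range values, that point may not be drawn.

```r
# Minimum and maximum values copied from the five-number summaries above
pinch_range   <- c(0.092, 0.338)
regular_range <- c(0.213, 0.266)

# Round outward to a multiple of `step` so that no value is cut off
shared_limits <- function(values, step = 0.025) {
  c(floor(min(values) / step) * step,
    ceiling(max(values) / step) * step)
}

limits <- shared_limits(c(pinch_range, regular_range))
print(limits)
```

The result, about 0.075 and 0.350, matches the limits used for the histograms earlier and could be passed as the min and max arguments.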