Statistics Project

Introduction

I have a long-standing personal interest in the sport of baseball. To “pinch hit” in baseball is to replace the starting player at a position in the batting order with another player who, presumably, has a better chance at getting a hit against the current pitcher. A pinch-hitter is often called on in “high-stakes” situations, where the outcome of the game is on the line.

The specific question I want to explore:

“Do pinch hitters in Major League Baseball, as a group, have a greater ability to get a hit than the starting players they have replaced?”

An extension that I might pursue – does the answer to this question change when data from different spans of time in baseball history is analyzed?

A second extension I might pursue – is there any difference in results when we examine National League games as compared to American League games? It should be noted that in National League games, the player being replaced for an at-bat is usually the pitcher (who is often a poor hitter).

The answer to this question, and to the possible extensions, would be of interest to anyone who enjoys the sport of baseball: fans, players, coaches, and team staff. The population of interest is Major League Baseball players.

Hypotheses

H0: Major League Baseball players during the 2014 baseball season who are pinch hitting have the same ability to get a hit as starting players.

Ha: Major League Baseball players during the 2014 baseball season who are pinch hitting have a greater ability to get a hit than starting players.

Load helper functions

Load functions that make it easier to create numerical and graphical summaries:

source('http://russellgordon.ca/rsgc/r/functions.r')

Obtaining the data

Load required packages

First, load packages that make it possible to automatically scrape data from the web.

# We need the 'XML' package.  Install if not present.
if (!require(XML)) {
  install.packages("XML")
  require(XML)
}

## Loading required package: XML

# We need the 'stringr' package.  Install if not present.
if (!require(stringr)) {
  install.packages("stringr")
  require(stringr)
}

## Loading required package: stringr

Rationale for selecting data source

I have chosen to work with the data from Baseball Reference.

The data at Baseball Reference provides pinch-hit statistics with more detail than ESPN. Specifically, the following situational hitting stats are provided by Baseball Reference:

at bats
hits
home runs
RBI

In my proposal, I had mentioned that the primary statistic I would use to evaluate pinch-hitting effectiveness was batting average. Although batting average is not provided at the Baseball Reference data source, Major League Baseball defines batting average as the “number of base hits divided by the total number of at-bats”. With the information provided by my Baseball Reference source I can calculate batting average myself, and then plot the results to explore whether any of my hypotheses are supported by the data.

On the same Baseball Reference page, overall batting statistics (in any situation) are provided. This will permit me to make a comparison based on one of the categorical variables I identified in my proposal (at-bat type: “pinch hit” or “regular”). I plan to subtract the pinch-hit batting statistics from the overall batting statistics so that I can identify batting results in “regular” situations.

Here, I set the data source - this is data for the 2014 regular season:

url <- "http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml"

Retrieve the data

Now I read the data from every table found at that URL (web site address):

tables <- readHTMLTable(url)

Identify the table that has the data I need

My Baseball Reference data source contains multiple tables of data.

The first table at the page appears to contain the data that I need, so I will load that into a data frame.

situational_batting_2014 <- tables[[1]]
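
When it is not obvious which table is the right one, the list returned by readHTMLTable can be inspected before choosing. The following is a minimal sketch that uses a small stand-in list instead of the live page. The object name tables_demo and its contents are made up for illustration, and the NULL entry is only an assumption about what an unparsed table might look like:

```r
# Stand-in for the list returned by readHTMLTable(); contents are illustrative only.
tables_demo <- list(
  first  = data.frame(Tm = c("ARI", "ATL"), PA = c("6089", "6064")),
  second = NULL,                      # assumed: an entry that could not be parsed
  third  = data.frame(x = 1:3)
)

# Drop empty entries, then report the size of each remaining table.
non_empty  <- Filter(Negate(is.null), tables_demo)
table_rows <- vapply(non_empty, nrow, integer(1))
table_cols <- vapply(non_empty, ncol, integer(1))
print(data.frame(rows = table_rows, cols = table_cols))
```

A table with roughly 31 rows (30 teams plus a league average) would be the likely candidate on the real page.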

Normalize the data

Review what we have

First, I will review the data that I have found:

view(situational_batting_2014)

## Loading required package: knitr

	Tm	R/G	PA	Ptn%	H	Inf	Bnt	AB	H	HR	RBI	PHlev	All	GS	GSo	vRH	vLH	Hm	Rd	IP	Att	Suc	%	Opp	DP	%	Opp	Suc	%	BR	BRS	BRS%	<2,3B	Scr	%	0,2B	Adv	%
1	ARI	3.80	6089	50%	1379	149	15	227	56	7	18	1.25	118	4	107	91	27	62	56	0	73	56	77%	1113	115	10%	586	187	32%	3523	483	14%	303	153	51%	212	126	59%
2	ATL	3.54	6064	44%	1316	155	13	190	34	2	7	1.18	123	2	130	91	32	62	61	0	77	53	69%	1061	121	11%	563	155	28%	3514	441	13%	272	137	50%	218	112	51%
3	BAL	4.35	6130	46%	1434	157	16	67	21	3	11	1.85	211	6	149	155	56	107	104	0	53	35	66%	1108	112	10%	504	143	28%	3533	485	14%	257	133	52%	228	115	50%
4	BOS	3.91	6226	50%	1355	140	4	86	20	2	8	1.75	123	5	151	84	39	49	74	0	33	20	61%	1227	138	11%	563	165	29%	3837	503	13%	326	164	50%	224	105	47%
5	CHC	3.79	6102	57%	1315	131	18	244	45	1	14	1.26	157	1	108	119	38	69	88	0	81	57	70%	1076	94	9%	568	169	30%	3441	447	13%	288	143	50%	207	98	47%
6	CHW	4.07	6077	51%	1400	164	16	75	23	1	10	1.61	155	4	124	110	45	74	81	0	28	19	68%	1096	127	12%	528	141	27%	3540	497	14%	299	157	53%	233	122	52%
7	CIN	3.67	5978	52%	1282	145	25	203	50	6	17	1.30	131	6	106	96	35	77	54	0	92	76	83%	1003	88	9%	551	186	34%	3335	449	13%	318	164	52%	225	129	57%
8	CLE	4.13	6222	74%	1411	155	31	105	23	1	11	1.36	142	1	134	111	31	72	70	0	61	51	84%	1117	126	11%	542	180	33%	3698	516	14%	315	168	53%	246	147	60%
9	COL	4.66	6164	47%	1551	199	20	233	61	5	30	1.19	186	4	121	126	60	119	67	2	78	59	76%	1096	121	11%	576	201	35%	3608	556	15%	334	189	57%	268	145	54%
10	DET	4.67	6202	45%	1557	140	16	74	14	2	6	1.66	155	3	149	109	46	76	79	0	40	24	60%	1181	137	12%	598	212	35%	3842	594	15%	393	209	53%	292	171	59%
11	HOU	3.88	6055	61%	1317	153	13	76	9	0	4	1.38	163	5	124	120	43	90	73	1	35	22	63%	1118	122	11%	506	141	28%	3479	451	13%	290	151	52%	218	118	54%
12	KCR	4.02	6058	48%	1456	181	23	43	9	2	5	1.50	95	3	110	64	31	43	52	1	54	33	61%	1119	131	12%	560	186	33%	3604	540	15%	332	190	57%	242	139	57%
13	LAA	4.77	6285	52%	1464	166	22	105	25	1	12	1.36	155	1	143	113	42	73	82	0	39	26	67%	1175	112	10%	569	189	33%	3844	605	16%	336	184	55%	272	142	52%
14	LAD	4.43	6231	46%	1476	203	27	199	46	1	23	1.20	134	0	129	99	35	71	63	0	71	47	66%	1128	119	11%	615	187	30%	3861	573	15%	354	176	50%	262	148	56%
15	MIA	3.98	6185	46%	1399	191	20	246	45	5	27	1.56	122	4	162	98	24	59	63	0	85	71	84%	1186	143	12%	602	181	30%	3771	511	14%	330	165	50%	221	121	55%
16	MIL	4.01	6065	40%	1366	171	25	212	47	4	22	1.50	150	4	118	109	41	77	73	0	100	70	70%	1026	135	13%	558	178	32%	3385	491	15%	292	142	49%	219	134	61%
17	MIN	4.41	6233	60%	1412	163	18	86	18	0	12	1.26	128	1	168	94	34	67	61	1	34	25	74%	1177	97	8%	553	159	29%	3976	576	14%	384	185	48%	253	124	49%
18	NYM	3.88	6145	52%	1306	144	13	212	39	3	12	1.38	125	3	148	103	22	59	66	0	76	59	78%	1129	112	10%	582	180	31%	3704	492	13%	318	149	47%	232	126	54%
19	NYY	3.91	6082	70%	1349	161	16	90	22	2	8	1.61	147	2	137	108	39	88	59	0	45	29	64%	1085	111	10%	539	186	35%	3554	468	13%	330	175	53%	247	150	61%
20	OAK	4.50	6245	72%	1354	164	25	156	32	3	16	1.52	146	5	172	102	44	74	72	0	41	19	46%	1207	118	10%	551	159	29%	3895	567	15%	334	168	50%	262	141	54%
21	PHI	3.82	6198	65%	1356	143	12	220	40	5	19	1.36	125	4	142	84	41	64	61	0	73	59	81%	1103	94	9%	563	174	31%	3676	480	13%	306	157	51%	242	135	56%
22	PIT	4.21	6224	48%	1436	185	16	281	61	7	32	1.34	156	1	155	133	23	62	94	1	87	54	62%	1166	127	11%	584	174	30%	3856	512	13%	334	172	52%	224	117	52%
23	SDP	3.30	5905	67%	1199	131	14	280	61	11	29	1.38	109	3	127	82	27	54	55	0	79	56	71%	989	118	12%	544	175	32%	3310	411	12%	308	141	46%	222	118	53%
24	SEA	3.91	5977	65%	1328	162	20	82	16	1	7	1.72	136	0	102	108	28	73	63	0	56	35	63%	1019	112	11%	482	152	32%	3292	484	15%	284	151	53%	177	98	55%
25	SFG	4.10	6087	60%	1407	142	12	207	46	4	29	1.43	132	4	145	92	40	53	79	0	58	45	78%	1134	113	10%	585	182	31%	3681	522	14%	326	152	47%	243	129	53%
26	STL	3.82	6086	51%	1371	144	15	218	49	2	24	1.11	105	1	140	72	33	57	48	0	90	64	71%	1201	140	12%	583	181	31%	3784	503	13%	288	160	56%	236	143	61%
27	TBR	3.78	6205	54%	1361	168	20	142	30	2	20	1.79	117	1	155	85	32	51	66	3	70	43	61%	1211	135	11%	562	183	33%	3859	485	13%	324	160	49%	212	105	50%
28	TEX	3.93	6026	49%	1400	189	31	85	21	1	10	1.14	111	2	109	73	38	51	60	0	51	41	80%	1061	148	14%	554	209	38%	3514	507	14%	297	152	51%	226	142	63%
29	TOR	4.46	6167	68%	1435	176	20	176	36	9	21	1.17	177	2	138	135	42	98	79	0	58	35	60%	1197	128	11%	561	177	32%	3732	536	14%	349	193	55%	249	132	53%
30	WSN	4.23	6216	52%	1403	162	31	209	30	5	15	1.31	152	2	138	117	35	63	89	0	91	60	66%	1169	115	10%	601	171	28%	3790	507	13%	328	161	49%	250	124	50%
31	LgAvg	4.07	6131	55%	1387	161	19	161	34	3	16	0	140	3	135	103	37	70	70	0	64	45	70%	1123	120	11%	561	175	31%	3648	506	14%	318	163	51%	235	129	55%

The data looks clean - for example, there are no repeated headers.

However, I do have one row that contains league average information (the final row). I will remove this row so that scatterplots are comparing data on a per-team basis only:

situational_batting_2014 <- subset(situational_batting_2014, Tm != "LgAvg")
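
Note that subset() removes the row but not the factor level, which is why the str() output later in this report still lists 31 levels for Tm even though only 30 rows remain. This does not affect the calculations in this report. If a tidy factor is wanted, droplevels() is one option. A minimal sketch on a made-up three-row data frame (the name teams is hypothetical):

```r
# Small stand-in data frame; Tm is a factor, as in the scraped table.
teams <- data.frame(Tm = factor(c("ARI", "ATL", "LgAvg")),
                    PA = c(6089, 6064, 6131))

teams <- subset(teams, Tm != "LgAvg")
levels_before <- nlevels(teams$Tm)   # "LgAvg" is still listed as a level

teams <- droplevels(teams)
levels_after <- nlevels(teams$Tm)    # the unused level is gone

print(c(before = levels_before, after = levels_after))
```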

Now I have only data for the 30 teams in the MLB in 2014:

view(situational_batting_2014)

	Tm	R/G	PA	Ptn%	H	Inf	Bnt	AB	H.1	HR	RBI	PHlev	All	GS	GSo	vRH	vLH	Hm	Rd	IP	Att	Suc	%	Opp	DP	%.1	Opp.1	Suc.1	%.2	BR	BRS	BRS%	<2,3B	Scr	%.3	0,2B	Adv	%.4
1	ARI	3.80	6089	50%	1379	149	15	227	56	7	18	1.25	118	4	107	91	27	62	56	0	73	56	77%	1113	115	10%	586	187	32%	3523	483	14%	303	153	51%	212	126	59%
2	ATL	3.54	6064	44%	1316	155	13	190	34	2	7	1.18	123	2	130	91	32	62	61	0	77	53	69%	1061	121	11%	563	155	28%	3514	441	13%	272	137	50%	218	112	51%
3	BAL	4.35	6130	46%	1434	157	16	67	21	3	11	1.85	211	6	149	155	56	107	104	0	53	35	66%	1108	112	10%	504	143	28%	3533	485	14%	257	133	52%	228	115	50%
4	BOS	3.91	6226	50%	1355	140	4	86	20	2	8	1.75	123	5	151	84	39	49	74	0	33	20	61%	1227	138	11%	563	165	29%	3837	503	13%	326	164	50%	224	105	47%
5	CHC	3.79	6102	57%	1315	131	18	244	45	1	14	1.26	157	1	108	119	38	69	88	0	81	57	70%	1076	94	9%	568	169	30%	3441	447	13%	288	143	50%	207	98	47%
6	CHW	4.07	6077	51%	1400	164	16	75	23	1	10	1.61	155	4	124	110	45	74	81	0	28	19	68%	1096	127	12%	528	141	27%	3540	497	14%	299	157	53%	233	122	52%
7	CIN	3.67	5978	52%	1282	145	25	203	50	6	17	1.30	131	6	106	96	35	77	54	0	92	76	83%	1003	88	9%	551	186	34%	3335	449	13%	318	164	52%	225	129	57%
8	CLE	4.13	6222	74%	1411	155	31	105	23	1	11	1.36	142	1	134	111	31	72	70	0	61	51	84%	1117	126	11%	542	180	33%	3698	516	14%	315	168	53%	246	147	60%
9	COL	4.66	6164	47%	1551	199	20	233	61	5	30	1.19	186	4	121	126	60	119	67	2	78	59	76%	1096	121	11%	576	201	35%	3608	556	15%	334	189	57%	268	145	54%
10	DET	4.67	6202	45%	1557	140	16	74	14	2	6	1.66	155	3	149	109	46	76	79	0	40	24	60%	1181	137	12%	598	212	35%	3842	594	15%	393	209	53%	292	171	59%
11	HOU	3.88	6055	61%	1317	153	13	76	9	0	4	1.38	163	5	124	120	43	90	73	1	35	22	63%	1118	122	11%	506	141	28%	3479	451	13%	290	151	52%	218	118	54%
12	KCR	4.02	6058	48%	1456	181	23	43	9	2	5	1.50	95	3	110	64	31	43	52	1	54	33	61%	1119	131	12%	560	186	33%	3604	540	15%	332	190	57%	242	139	57%
13	LAA	4.77	6285	52%	1464	166	22	105	25	1	12	1.36	155	1	143	113	42	73	82	0	39	26	67%	1175	112	10%	569	189	33%	3844	605	16%	336	184	55%	272	142	52%
14	LAD	4.43	6231	46%	1476	203	27	199	46	1	23	1.20	134	0	129	99	35	71	63	0	71	47	66%	1128	119	11%	615	187	30%	3861	573	15%	354	176	50%	262	148	56%
15	MIA	3.98	6185	46%	1399	191	20	246	45	5	27	1.56	122	4	162	98	24	59	63	0	85	71	84%	1186	143	12%	602	181	30%	3771	511	14%	330	165	50%	221	121	55%
16	MIL	4.01	6065	40%	1366	171	25	212	47	4	22	1.50	150	4	118	109	41	77	73	0	100	70	70%	1026	135	13%	558	178	32%	3385	491	15%	292	142	49%	219	134	61%
17	MIN	4.41	6233	60%	1412	163	18	86	18	0	12	1.26	128	1	168	94	34	67	61	1	34	25	74%	1177	97	8%	553	159	29%	3976	576	14%	384	185	48%	253	124	49%
18	NYM	3.88	6145	52%	1306	144	13	212	39	3	12	1.38	125	3	148	103	22	59	66	0	76	59	78%	1129	112	10%	582	180	31%	3704	492	13%	318	149	47%	232	126	54%
19	NYY	3.91	6082	70%	1349	161	16	90	22	2	8	1.61	147	2	137	108	39	88	59	0	45	29	64%	1085	111	10%	539	186	35%	3554	468	13%	330	175	53%	247	150	61%
20	OAK	4.50	6245	72%	1354	164	25	156	32	3	16	1.52	146	5	172	102	44	74	72	0	41	19	46%	1207	118	10%	551	159	29%	3895	567	15%	334	168	50%	262	141	54%
21	PHI	3.82	6198	65%	1356	143	12	220	40	5	19	1.36	125	4	142	84	41	64	61	0	73	59	81%	1103	94	9%	563	174	31%	3676	480	13%	306	157	51%	242	135	56%
22	PIT	4.21	6224	48%	1436	185	16	281	61	7	32	1.34	156	1	155	133	23	62	94	1	87	54	62%	1166	127	11%	584	174	30%	3856	512	13%	334	172	52%	224	117	52%
23	SDP	3.30	5905	67%	1199	131	14	280	61	11	29	1.38	109	3	127	82	27	54	55	0	79	56	71%	989	118	12%	544	175	32%	3310	411	12%	308	141	46%	222	118	53%
24	SEA	3.91	5977	65%	1328	162	20	82	16	1	7	1.72	136	0	102	108	28	73	63	0	56	35	63%	1019	112	11%	482	152	32%	3292	484	15%	284	151	53%	177	98	55%
25	SFG	4.10	6087	60%	1407	142	12	207	46	4	29	1.43	132	4	145	92	40	53	79	0	58	45	78%	1134	113	10%	585	182	31%	3681	522	14%	326	152	47%	243	129	53%
26	STL	3.82	6086	51%	1371	144	15	218	49	2	24	1.11	105	1	140	72	33	57	48	0	90	64	71%	1201	140	12%	583	181	31%	3784	503	13%	288	160	56%	236	143	61%
27	TBR	3.78	6205	54%	1361	168	20	142	30	2	20	1.79	117	1	155	85	32	51	66	3	70	43	61%	1211	135	11%	562	183	33%	3859	485	13%	324	160	49%	212	105	50%
28	TEX	3.93	6026	49%	1400	189	31	85	21	1	10	1.14	111	2	109	73	38	51	60	0	51	41	80%	1061	148	14%	554	209	38%	3514	507	14%	297	152	51%	226	142	63%
29	TOR	4.46	6167	68%	1435	176	20	176	36	9	21	1.17	177	2	138	135	42	98	79	0	58	35	60%	1197	128	11%	561	177	32%	3732	536	14%	349	193	55%	249	132	53%
30	WSN	4.23	6216	52%	1403	162	31	209	30	5	15	1.31	152	2	138	117	35	63	89	0	91	60	66%	1169	115	10%	601	171	28%	3790	507	13%	328	161	49%	250	124	50%

It’s also important to note what column headers contain the data I need.

By looking at my original data source I can see that the columns:

PA
H
Inf
Bnt

… are overall batting statistics (regular and pinch-hit).

The column headers that contain situational batting statistics are:

AB
H.1
HR
RBI
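
The name H.1 arises because the page has two columns headed H (overall hits and pinch-hit hits), and a data frame needs distinct column names. The suffixes seen in the table above (H.1, %.1, %.2) are consistent with base R's make.unique(). Whether readHTMLTable calls that function internally is an assumption on my part. A small illustration with a shortened, made-up header row:

```r
# Repeated header names, in the order they might appear in a scraped table.
raw_names    <- c("Tm", "H", "AB", "H", "%", "%", "%")
unique_names <- make.unique(raw_names)
print(unique_names)
```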

Only the first 11 columns contain data that I need for my project. I will trim the dataframe so that I keep only these columns:

situational_batting_2014 <- situational_batting_2014[, 1:11]

The result:

view(situational_batting_2014)

	Tm	R/G	PA	Ptn%	H	Inf	Bnt	AB	H.1	HR	RBI
1	ARI	3.80	6089	50%	1379	149	15	227	56	7	18
2	ATL	3.54	6064	44%	1316	155	13	190	34	2	7
3	BAL	4.35	6130	46%	1434	157	16	67	21	3	11
4	BOS	3.91	6226	50%	1355	140	4	86	20	2	8
5	CHC	3.79	6102	57%	1315	131	18	244	45	1	14
6	CHW	4.07	6077	51%	1400	164	16	75	23	1	10
7	CIN	3.67	5978	52%	1282	145	25	203	50	6	17
8	CLE	4.13	6222	74%	1411	155	31	105	23	1	11
9	COL	4.66	6164	47%	1551	199	20	233	61	5	30
10	DET	4.67	6202	45%	1557	140	16	74	14	2	6
11	HOU	3.88	6055	61%	1317	153	13	76	9	0	4
12	KCR	4.02	6058	48%	1456	181	23	43	9	2	5
13	LAA	4.77	6285	52%	1464	166	22	105	25	1	12
14	LAD	4.43	6231	46%	1476	203	27	199	46	1	23
15	MIA	3.98	6185	46%	1399	191	20	246	45	5	27
16	MIL	4.01	6065	40%	1366	171	25	212	47	4	22
17	MIN	4.41	6233	60%	1412	163	18	86	18	0	12
18	NYM	3.88	6145	52%	1306	144	13	212	39	3	12
19	NYY	3.91	6082	70%	1349	161	16	90	22	2	8
20	OAK	4.50	6245	72%	1354	164	25	156	32	3	16
21	PHI	3.82	6198	65%	1356	143	12	220	40	5	19
22	PIT	4.21	6224	48%	1436	185	16	281	61	7	32
23	SDP	3.30	5905	67%	1199	131	14	280	61	11	29
24	SEA	3.91	5977	65%	1328	162	20	82	16	1	7
25	SFG	4.10	6087	60%	1407	142	12	207	46	4	29
26	STL	3.82	6086	51%	1371	144	15	218	49	2	24
27	TBR	3.78	6205	54%	1361	168	20	142	30	2	20
28	TEX	3.93	6026	49%	1400	189	31	85	21	1	10
29	TOR	4.46	6167	68%	1435	176	20	176	36	9	21
30	WSN	4.23	6216	52%	1403	162	31	209	30	5	15

Convert data to numeric

I intend to analyze my data in R.

By default, the scraped columns arrive in R as “factors” - that is, as categorical labels rather than numbers, even where the labels look like numbers.

You can see this here:

str(situational_batting_2014)

## 'data.frame':    30 obs. of  11 variables:
##  $ Tm  : Factor w/ 31 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ R/G : Factor w/ 26 levels "3.30","3.54",..: 6 2 19 9 5 14 3 16 24 25 ...
##  $ PA  : Factor w/ 31 levels "5905","5977",..: 13 7 15 27 14 9 3 25 18 22 ...
##  $ Ptn%: Factor w/ 21 levels "40%","44%","45%",..: 8 2 4 8 13 9 10 21 5 3 ...
##  $ H   : Factor w/ 30 levels "1199","1282",..: 15 5 23 10 4 18 2 21 29 30 ...
##  $ Inf : Factor w/ 24 levels "131","140","142",..: 7 9 10 2 1 14 6 9 23 2 ...
##  $ Bnt : Factor w/ 14 levels "12","13","14",..: 4 2 5 14 6 5 11 13 8 5 ...
##  $ AB  : Factor w/ 28 levels "105","142","156",..: 14 6 21 27 16 23 8 1 15 22 ...
##  $ H.1 : Factor w/ 22 levels "14","16","18",..: 20 11 5 4 15 7 19 7 21 1 ...
##  $ HR  : Factor w/ 10 levels "0","1","11","2",..: 9 4 5 4 2 2 8 2 7 4 ...
##  $ RBI : Factor w/ 23 levels "10","11","12",..: 8 22 2 23 4 1 7 2 17 21 ...

So, I will convert the columns that I intend to use in calculations from factors to numeric data:

situational_batting_2014$nPA  <- as.numeric(as.character(situational_batting_2014$PA))
situational_batting_2014$nH  <- as.numeric(as.character(situational_batting_2014$H))
situational_batting_2014$nInf  <- as.numeric(as.character(situational_batting_2014[["Inf"]]))
situational_batting_2014$nBnt  <- as.numeric(as.character(situational_batting_2014$Bnt))
situational_batting_2014$nAB  <- as.numeric(as.character(situational_batting_2014$AB))
situational_batting_2014$nH1  <- as.numeric(as.character(situational_batting_2014$H.1))
situational_batting_2014$nHR  <- as.numeric(as.character(situational_batting_2014$HR))
situational_batting_2014$nRBI  <- as.numeric(as.character(situational_batting_2014$RBI))
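
The as.character() step in each line above matters. Calling as.numeric() directly on a factor returns the internal level codes, not the numbers shown in the table. A short sketch using three PA values from the table (the variable names are my own):

```r
pa <- factor(c("6089", "6064", "6130"))

codes  <- as.numeric(pa)                 # internal level codes, not the values
values <- as.numeric(as.character(pa))   # the numbers actually shown in the table

print(codes)
print(values)
```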

The new columns are numeric:

str(situational_batting_2014)

## 'data.frame':    30 obs. of  19 variables:
##  $ Tm  : Factor w/ 31 levels "ARI","ATL","BAL",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ R/G : Factor w/ 26 levels "3.30","3.54",..: 6 2 19 9 5 14 3 16 24 25 ...
##  $ PA  : Factor w/ 31 levels "5905","5977",..: 13 7 15 27 14 9 3 25 18 22 ...
##  $ Ptn%: Factor w/ 21 levels "40%","44%","45%",..: 8 2 4 8 13 9 10 21 5 3 ...
##  $ H   : Factor w/ 30 levels "1199","1282",..: 15 5 23 10 4 18 2 21 29 30 ...
##  $ Inf : Factor w/ 24 levels "131","140","142",..: 7 9 10 2 1 14 6 9 23 2 ...
##  $ Bnt : Factor w/ 14 levels "12","13","14",..: 4 2 5 14 6 5 11 13 8 5 ...
##  $ AB  : Factor w/ 28 levels "105","142","156",..: 14 6 21 27 16 23 8 1 15 22 ...
##  $ H.1 : Factor w/ 22 levels "14","16","18",..: 20 11 5 4 15 7 19 7 21 1 ...
##  $ HR  : Factor w/ 10 levels "0","1","11","2",..: 9 4 5 4 2 2 8 2 7 4 ...
##  $ RBI : Factor w/ 23 levels "10","11","12",..: 8 22 2 23 4 1 7 2 17 21 ...
##  $ nPA : num  6089 6064 6130 6226 6102 ...
##  $ nH  : num  1379 1316 1434 1355 1315 ...
##  $ nInf: num  149 155 157 140 131 164 145 155 199 140 ...
##  $ nBnt: num  15 13 16 4 18 16 25 31 20 16 ...
##  $ nAB : num  227 190 67 86 244 75 203 105 233 74 ...
##  $ nH1 : num  56 34 21 20 45 23 50 23 61 14 ...
##  $ nHR : num  7 2 3 2 1 1 6 1 5 2 ...
##  $ nRBI: num  18 7 11 8 14 10 17 11 30 6 ...

Note that I placed the converted data into new columns. I will compare the converted data to the data in the original columns to be certain that no data values were changed unexpectedly during the conversion process from “factors” to “numeric” data.

Reviewing the converted data:

view(situational_batting_2014)

	Tm	R/G	PA	Ptn%	H	Inf	Bnt	AB	H.1	HR	RBI	nPA	nH	nInf	nBnt	nAB	nH1	nHR	nRBI
1	ARI	3.80	6089	50%	1379	149	15	227	56	7	18	6089	1379	149	15	227	56	7	18
2	ATL	3.54	6064	44%	1316	155	13	190	34	2	7	6064	1316	155	13	190	34	2	7
3	BAL	4.35	6130	46%	1434	157	16	67	21	3	11	6130	1434	157	16	67	21	3	11
4	BOS	3.91	6226	50%	1355	140	4	86	20	2	8	6226	1355	140	4	86	20	2	8
5	CHC	3.79	6102	57%	1315	131	18	244	45	1	14	6102	1315	131	18	244	45	1	14
6	CHW	4.07	6077	51%	1400	164	16	75	23	1	10	6077	1400	164	16	75	23	1	10
7	CIN	3.67	5978	52%	1282	145	25	203	50	6	17	5978	1282	145	25	203	50	6	17
8	CLE	4.13	6222	74%	1411	155	31	105	23	1	11	6222	1411	155	31	105	23	1	11
9	COL	4.66	6164	47%	1551	199	20	233	61	5	30	6164	1551	199	20	233	61	5	30
10	DET	4.67	6202	45%	1557	140	16	74	14	2	6	6202	1557	140	16	74	14	2	6
11	HOU	3.88	6055	61%	1317	153	13	76	9	0	4	6055	1317	153	13	76	9	0	4
12	KCR	4.02	6058	48%	1456	181	23	43	9	2	5	6058	1456	181	23	43	9	2	5
13	LAA	4.77	6285	52%	1464	166	22	105	25	1	12	6285	1464	166	22	105	25	1	12
14	LAD	4.43	6231	46%	1476	203	27	199	46	1	23	6231	1476	203	27	199	46	1	23
15	MIA	3.98	6185	46%	1399	191	20	246	45	5	27	6185	1399	191	20	246	45	5	27
16	MIL	4.01	6065	40%	1366	171	25	212	47	4	22	6065	1366	171	25	212	47	4	22
17	MIN	4.41	6233	60%	1412	163	18	86	18	0	12	6233	1412	163	18	86	18	0	12
18	NYM	3.88	6145	52%	1306	144	13	212	39	3	12	6145	1306	144	13	212	39	3	12
19	NYY	3.91	6082	70%	1349	161	16	90	22	2	8	6082	1349	161	16	90	22	2	8
20	OAK	4.50	6245	72%	1354	164	25	156	32	3	16	6245	1354	164	25	156	32	3	16
21	PHI	3.82	6198	65%	1356	143	12	220	40	5	19	6198	1356	143	12	220	40	5	19
22	PIT	4.21	6224	48%	1436	185	16	281	61	7	32	6224	1436	185	16	281	61	7	32
23	SDP	3.30	5905	67%	1199	131	14	280	61	11	29	5905	1199	131	14	280	61	11	29
24	SEA	3.91	5977	65%	1328	162	20	82	16	1	7	5977	1328	162	20	82	16	1	7
25	SFG	4.10	6087	60%	1407	142	12	207	46	4	29	6087	1407	142	12	207	46	4	29
26	STL	3.82	6086	51%	1371	144	15	218	49	2	24	6086	1371	144	15	218	49	2	24
27	TBR	3.78	6205	54%	1361	168	20	142	30	2	20	6205	1361	168	20	142	30	2	20
28	TEX	3.93	6026	49%	1400	189	31	85	21	1	10	6026	1400	189	31	85	21	1	10
29	TOR	4.46	6167	68%	1435	176	20	176	36	9	21	6167	1435	176	20	176	36	9	21
30	WSN	4.23	6216	52%	1403	162	31	209	30	5	15	6216	1403	162	31	209	30	5	15

The data appears to have been converted successfully (that is, no values have obviously been changed to something incorrect).
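
The visual comparison can also be backed by a programmatic check. The sketch below uses a stand-in data frame with one original column and its converted counterpart. The names check and conversion_ok are hypothetical:

```r
# Stand-in with one original (factor) column and its converted counterpart.
check <- data.frame(PA = factor(c("6089", "6064", "6130")))
check$nPA <- as.numeric(as.character(check$PA))

# TRUE only if nothing became NA and every converted value
# prints the same as the original text.
conversion_ok <- !anyNA(check$nPA) &&
  all(as.character(check$PA) == as.character(check$nPA))
print(conversion_ok)
```

This text comparison suits whole-number columns such as the ones converted here. A value such as "3.80" would print back as "3.8" and fail the check even though the conversion was correct.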

Calculate batting averages

My data source did not include batting averages for regular vs. pinch-hitting situations.

In this section, I will use the data provided to directly calculate this information.

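First, I will determine how many at-bats occurred in “regular” batting situations.

To do this, I will subtract pinch-hit at-bat totals from overall plate appearances. Plate appearances also count walks, hit-by-pitches, and sacrifices, so this figure overstates true at-bats, and the resulting “regular” batting averages will be somewhat understated. The columns I kept do not include an overall at-bat total.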

situational_batting_2014$nRegularAB  <- situational_batting_2014$nPA - situational_batting_2014$nAB

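Next, I will determine batting averages for “regular” situations. Note that nH is the overall hit total, so the comparatively small number of pinch-hit hits remains in the numerator: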

situational_batting_2014$nRegularAVG <- situational_batting_2014$nH / situational_batting_2014$nRegularAB

Finally, I will determine batting averages for “pinch-hit” situations:

situational_batting_2014$nPinchHitAVG <- situational_batting_2014$nH1 / situational_batting_2014$nAB
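
As a check on the three calculations above, here is a sketch that repeats them on three rows copied from the table, so that the results can be compared with the full output shown later. The data frame name teams is my own. The formulas are the same ones used above:

```r
# Three rows copied from the 2014 table above.
teams <- data.frame(
  Tm  = c("ARI", "ATL", "BAL"),
  nPA = c(6089, 6064, 6130),   # overall plate appearances
  nH  = c(1379, 1316, 1434),   # overall hits
  nAB = c(227, 190, 67),       # pinch-hit at-bats
  nH1 = c(56, 34, 21)          # pinch-hit hits
)

# transform() evaluates against the original columns only, so the average
# that depends on nRegularAB is added in a separate step.
teams <- transform(teams,
                   nRegularAB   = nPA - nAB,
                   nPinchHitAVG = nH1 / nAB)
teams$nRegularAVG <- teams$nH / teams$nRegularAB

print(round(teams[, c("nRegularAVG", "nPinchHitAVG")], 4))
```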

Tag data with year

One of my possible extensions is to compare data by year.

Therefore, before I continue, I will add a column to the dataframe, identifying what year this data is from.

This also makes it possible to generate box-and-whisker plots (as those graphs require a “factor” that is used to group the data being plotted).

situational_batting_2014$Year <- "2014"

Here is what the data now looks like, after calculating the batting averages (see the previous section) and tagging each row with the year that the data is from:

view(situational_batting_2014)

	Tm	R/G	PA	Ptn%	H	Inf	Bnt	AB	H.1	HR	RBI	nPA	nH	nInf	nBnt	nAB	nH1	nHR	nRBI	nRegularAB	nRegularAVG	nPinchHitAVG	Year
1	ARI	3.80	6089	50%	1379	149	15	227	56	7	18	6089	1379	149	15	227	56	7	18	5862	0.2352	0.2467	2014
2	ATL	3.54	6064	44%	1316	155	13	190	34	2	7	6064	1316	155	13	190	34	2	7	5874	0.2240	0.1789	2014
3	BAL	4.35	6130	46%	1434	157	16	67	21	3	11	6130	1434	157	16	67	21	3	11	6063	0.2365	0.3134	2014
4	BOS	3.91	6226	50%	1355	140	4	86	20	2	8	6226	1355	140	4	86	20	2	8	6140	0.2207	0.2326	2014
5	CHC	3.79	6102	57%	1315	131	18	244	45	1	14	6102	1315	131	18	244	45	1	14	5858	0.2245	0.1844	2014
6	CHW	4.07	6077	51%	1400	164	16	75	23	1	10	6077	1400	164	16	75	23	1	10	6002	0.2333	0.3067	2014
7	CIN	3.67	5978	52%	1282	145	25	203	50	6	17	5978	1282	145	25	203	50	6	17	5775	0.2220	0.2463	2014
8	CLE	4.13	6222	74%	1411	155	31	105	23	1	11	6222	1411	155	31	105	23	1	11	6117	0.2307	0.2190	2014
9	COL	4.66	6164	47%	1551	199	20	233	61	5	30	6164	1551	199	20	233	61	5	30	5931	0.2615	0.2618	2014
10	DET	4.67	6202	45%	1557	140	16	74	14	2	6	6202	1557	140	16	74	14	2	6	6128	0.2541	0.1892	2014
11	HOU	3.88	6055	61%	1317	153	13	76	9	0	4	6055	1317	153	13	76	9	0	4	5979	0.2203	0.1184	2014
12	KCR	4.02	6058	48%	1456	181	23	43	9	2	5	6058	1456	181	23	43	9	2	5	6015	0.2421	0.2093	2014
13	LAA	4.77	6285	52%	1464	166	22	105	25	1	12	6285	1464	166	22	105	25	1	12	6180	0.2369	0.2381	2014
14	LAD	4.43	6231	46%	1476	203	27	199	46	1	23	6231	1476	203	27	199	46	1	23	6032	0.2447	0.2312	2014
15	MIA	3.98	6185	46%	1399	191	20	246	45	5	27	6185	1399	191	20	246	45	5	27	5939	0.2356	0.1829	2014
16	MIL	4.01	6065	40%	1366	171	25	212	47	4	22	6065	1366	171	25	212	47	4	22	5853	0.2334	0.2217	2014
17	MIN	4.41	6233	60%	1412	163	18	86	18	0	12	6233	1412	163	18	86	18	0	12	6147	0.2297	0.2093	2014
18	NYM	3.88	6145	52%	1306	144	13	212	39	3	12	6145	1306	144	13	212	39	3	12	5933	0.2201	0.1840	2014
19	NYY	3.91	6082	70%	1349	161	16	90	22	2	8	6082	1349	161	16	90	22	2	8	5992	0.2251	0.2444	2014
20	OAK	4.50	6245	72%	1354	164	25	156	32	3	16	6245	1354	164	25	156	32	3	16	6089	0.2224	0.2051	2014
21	PHI	3.82	6198	65%	1356	143	12	220	40	5	19	6198	1356	143	12	220	40	5	19	5978	0.2268	0.1818	2014
22	PIT	4.21	6224	48%	1436	185	16	281	61	7	32	6224	1436	185	16	281	61	7	32	5943	0.2416	0.2171	2014
23	SDP	3.30	5905	67%	1199	131	14	280	61	11	29	5905	1199	131	14	280	61	11	29	5625	0.2132	0.2179	2014
24	SEA	3.91	5977	65%	1328	162	20	82	16	1	7	5977	1328	162	20	82	16	1	7	5895	0.2253	0.1951	2014
25	SFG	4.10	6087	60%	1407	142	12	207	46	4	29	6087	1407	142	12	207	46	4	29	5880	0.2393	0.2222	2014
26	STL	3.82	6086	51%	1371	144	15	218	49	2	24	6086	1371	144	15	218	49	2	24	5868	0.2336	0.2248	2014
27	TBR	3.78	6205	54%	1361	168	20	142	30	2	20	6205	1361	168	20	142	30	2	20	6063	0.2245	0.2113	2014
28	TEX	3.93	6026	49%	1400	189	31	85	21	1	10	6026	1400	189	31	85	21	1	10	5941	0.2357	0.2471	2014
29	TOR	4.46	6167	68%	1435	176	20	176	36	9	21	6167	1435	176	20	176	36	9	21	5991	0.2395	0.2045	2014
30	WSN	4.23	6216	52%	1403	162	31	209	30	5	15	6216	1403	162	31	209	30	5	15	6007	0.2336	0.1435	2014

Analyze the data

To analyze single variable data, I remember from studies earlier this year that I must comment on:

shape of the data (draw at least one appropriate graphical summary)
centre of the data (mean, median, or five-number summary, when appropriate - e.g.: at least 10-20 data values)
spread of the data (standard deviation)

Shape of the data

To illustrate the shape of the data, I will use a histogram:

h_pinch_hit  <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014")

## Loading required package: ggplot2

h_regular  <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014")

multiplot(h_pinch_hit, h_regular, cols=1)

## Loading required package: grid

Figure: histograms of pinch-hit and regular batting averages, 2014 (default bin width).

The default binwidth for a histogram is 1 unit, which does not make sense for a batting average. I will specify the binwidth as 0.01:

h_pinch_hit  <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014", binwidth = 0.01)

h_regular  <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014", binwidth = 0.01)

multiplot(h_pinch_hit, h_regular, cols=1)

Figure: histograms of pinch-hit and regular batting averages, 2014 (bin width 0.01).

I have noticed that the horizontal scale for each plot is different. This makes comparison of the two plots difficult. I will generate the graphs again, this time specifying the minimum and maximum values on the horizontal scale. I will also specify the vertical scale min and max values, so that the scale does not show counts with decimal values (which is not very helpful, as counts are discrete values):

h_pinch_hit  <- histogram(dataframe = situational_batting_2014, variable = "nPinchHitAVG", title = "Pinch Hitting, 2014", binwidth = 0.01, xmin = 0.100, xmax = 0.325, ymin = 0, ymax = 15)

h_regular  <- histogram(dataframe = situational_batting_2014, variable = "nRegularAVG", title = "Regular Hitting, 2014", binwidth = 0.01, xmin = 0.100, xmax = 0.325, ymin = 0, ymax = 15)

multiplot(h_pinch_hit, h_regular, cols=1)

Figure: histograms of pinch-hit and regular batting averages, 2014 (bin width 0.01, shared axis limits).
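
The histogram helper used above comes from the sourced functions file. For readers without it, the same idea (a fixed bin width and identical bins for both plots) can be sketched in base R with hist(). The values below are a handful of averages taken from the table, not the full data set, and the object names are my own:

```r
# A few batting averages taken from the table above (not the full data set).
pinch_avg   <- c(0.2467, 0.1789, 0.3134, 0.2326, 0.1844, 0.1184)
regular_avg <- c(0.2352, 0.2240, 0.2365, 0.2207, 0.2245, 0.2203)

# One set of bin edges, 0.01 wide, shared by both histograms.
bin_edges <- seq(0.10, 0.33, by = 0.01)

# plot = FALSE returns the bin counts without drawing; drop it to see the plots.
h_pinch_demo   <- hist(pinch_avg,   breaks = bin_edges, plot = FALSE)
h_regular_demo <- hist(regular_avg, breaks = bin_edges, plot = FALSE)

print(sum(h_pinch_demo$counts))
print(sum(h_regular_demo$counts))
```

Because both calls use the same bin_edges, the two sets of counts line up bin for bin, which is what makes the side-by-side comparison fair.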

Now I can see that the shape for batting average for “regular” at-bats is somewhat normal, with a very small spread (the range from minimum to maximum values is not large).

There is a much larger spread of values for “pinch-hitting” batting averages.

What this tells me is that batting averages are much more consistent for “regular” at-bats, whereas batting averages for “pinch-hit” at-bats are considerably more varied. Some pinch-hit batting averages are very good (greater than .300), some are very poor (less than .150).

Centre of the data

Here is the five-number summary for “pinch-hit” at bats:

five_number(dataframe = situational_batting_2014, variable = "nPinchHitAVG")

## [1] "Five number summary"
## [1] "==================="
## [1] "min   : 0.118"
## [1] "Q1    : 0.189"
## [1] "median: 0.217"
## [1] "Q3    : 0.238"
## [1] "max   : 0.313"

And the mean for “pinch-hit” at bats:

mean(situational_batting_2014$nPinchHitAVG)

## [1] 0.2163

Here is the five-number summary for “regular” at bats:

five_number(dataframe = situational_batting_2014, variable = "nRegularAVG")

## [1] "Five number summary"
## [1] "==================="
## [1] "min   : 0.213"
## [1] "Q1    : 0.224"
## [1] "median: 0.233"
## [1] "Q3    : 0.237"
## [1] "max   : 0.262"

And the mean for “regular” at bats:

mean(situational_batting_2014$nRegularAVG)

## [1] 0.2322

What jumps out at me from these numerical summaries is that both measures of the centre (mean and median) are higher for “regular” at-bats than for “pinch-hit” at-bats.
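
The five_number helper comes from the sourced functions file, and its code is not reproduced here. For readers without that file, base R's fivenum() and mean() appear to give the same figures for the pinch-hit averages. The sketch below rebuilds those averages from the hits and at-bats in the table (the vector names are my own):

```r
# Pinch-hit hits and at-bats for the 30 teams, copied from the table above.
ph_hits <- c(56, 34, 21, 20, 45, 23, 50, 23, 61, 14, 9, 9, 25, 46, 45,
             47, 18, 39, 22, 32, 40, 61, 61, 16, 46, 49, 30, 21, 36, 30)
ph_ab   <- c(227, 190, 67, 86, 244, 75, 203, 105, 233, 74, 76, 43, 105, 199, 246,
             212, 86, 212, 90, 156, 220, 281, 280, 82, 207, 218, 142, 85, 176, 209)
ph_avg  <- ph_hits / ph_ab

# fivenum() returns: min, lower hinge, median, upper hinge, max.
print(round(fivenum(ph_avg), 3))
print(round(mean(ph_avg), 4))
```

quantile() uses a different default rule for quartiles, so its Q1 and Q3 differ slightly from the hinges for these data.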

Spread of the data

Since I have data for 30 teams in 2014, it is appropriate to use a box-and-whisker plot. This is a good graphical summary because it visually illustrates the five-number summary.

The variable I am comparing is batting averages, based on a categorical variable with two values:

regular at-bats
pinch-hit at-bats

The factor I am using is the year the batting averages are from.

I am being careful to use the same horizontal scale for each box-and-whisker plot so that I can make comparisons.

bw_pinch_hit <- box_and_whisker(dataframe = situational_batting_2014, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2014", min = 0.100, max = 0.325)

bw_regular <- box_and_whisker(dataframe = situational_batting_2014, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2014", min = 0.100, max = 0.325)

multiplot(bw_pinch_hit, bw_regular, cols=1)

Figure: box-and-whisker plots of pinch-hit and regular batting averages, 2014.
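
An alternative layout, sketched here in base R and not what the helper functions above do, is to reshape the two average columns into "long" form so that at-bat type itself becomes the grouping factor and both boxes share one axis. Only four rows from the table are used, and the names avgs, long, and bp are my own:

```r
# Four rows of averages copied from the table above.
avgs <- data.frame(
  nRegularAVG  = c(0.2352, 0.2240, 0.2365, 0.2207),
  nPinchHitAVG = c(0.2467, 0.1789, 0.3134, 0.2326)
)

# stack() returns two columns: 'values' (the averages) and 'ind' (the source column).
long <- stack(avgs)

# One plot, one box per at-bat type, on a shared axis.
bp <- boxplot(values ~ ind, data = long,
              ylim = c(0.100, 0.325),
              xlab = "At-bat type", ylab = "Batting Average",
              main = "Batting averages by at-bat type (sample rows)")
print(bp$n)   # number of observations in each box
```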

Here is what I notice about the spread of the data:

more than 75% of the data from “regular” at-bats are above the median value for “pinch hitting” batting average
the range of the “pinch-hitting” data is considerably larger than that of the “regular” at-bats
there are several outliers in the “pinch-hitting” at-bats data (this does not mean they can be discarded but it is worth noting)

I will calculate the standard deviation for pinch hitting vs. regular hitting batting averages. The box-and-whisker plots suggest that the standard deviation for pinch hitting will be considerably larger.

Standard deviation for pinch hitting:

sd(situational_batting_2014$nPinchHitAVG)

## [1] 0.04006

Standard deviation for regular hitting situations:

sd(situational_batting_2014$nRegularAVG)

## [1] 0.01045

The standard deviation for pinch-hitting situations is nearly four times the standard deviation for regular hitting situations.

This adds further weight to my earlier observation that pinch-hitting batting averages have a much greater range. One might say that pinch-hitting batting averages are considerably more volatile, or less predictable, than regular hitting batting averages.
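
One factor that may contribute to this volatility, which I have not tested formally, is sample size: a team has at most a few hundred pinch-hit at-bats in a season, against roughly 6,000 regular ones. Under the simplifying assumption that each at-bat is an independent trial with a fixed chance of a hit, the sketch below estimates how much a team's average would vary by chance alone. The function name avg_standard_error is my own, and the sample sizes are approximations taken from the LgAvg row of the table:

```r
# Rough standard error of a batting average over n at-bats,
# ASSUMING each at-bat is an independent trial with hit probability p.
avg_standard_error <- function(p, n) sqrt(p * (1 - p) / n)

# p values are the means reported above; n values are approximate per-team
# counts (161 pinch-hit at-bats from the LgAvg row; 6131 - 161 for regular).
se_pinch   <- avg_standard_error(p = 0.2163, n = 161)
se_regular <- avg_standard_error(p = 0.2322, n = 6131 - 161)

print(round(c(pinch = se_pinch, regular = se_regular), 4))
```

If the pinch-hit figure comes out several times larger than the regular one, much of the extra spread could reflect small samples and not real differences between teams.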

Conclusions and Limitations

Conclusions

The graphical and numerical summaries illustrate that in 2014 in Major League Baseball, “pinch hitting”, as measured by the batting average statistic, was considerably less successful than “regular hitting”.

As observed earlier, the mean and median batting averages for pinch-hit at-bats were both lower than the mean and median batting averages for regular batting situations.

Further, the spread of the batting averages for pinch hitting situations was considerably larger. In other words, the results are less predictable. Some teams do very well with pinch hitting batting averages. Some teams do poorly. On the whole, pinch hitting batting averages are far more volatile than regular situation batting averages.

Neither of my original hypotheses is supported by the data from 2014. In fact, it seems that pinch-hitters have a lesser ability to get a hit as compared to players who are not in pinch-hit situations (typically, starting players).

Limitations

I have only examined a single season’s worth of data. That is only 30 observations. Perhaps 2014 was an unusual year, and pinch-hitting was not a very successful strategy in 2014, but it may in fact be a very successful strategy in other years.

I would like to extend my analysis to collect data from additional years, and then examine the results.

I would also like to compare results by league (American League vs. National League) as I suspect that pinch hitters would be more successful in the National League (when the starting player being replaced is often the pitcher, who is usually not a good hitter).

However, I do not think it is useful to make a comparison between NL and AL batting averages for a single year, as that would mean each subset would have just 15 data values.
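
If I did pursue the comparison by league, one possible way to tag each team is sketched below. It assumes the Tm column holds three-letter team abbreviations like the ones listed; the vector al_teams_2014 and the data frame example_teams are hypothetical names introduced only for this illustration, and example_teams holds four made-up rows rather than real data.

```r
# American League teams in 2014
# (assumption: these abbreviations match the values in the Tm column)
al_teams_2014 <- c("BAL", "BOS", "NYY", "TBR", "TOR",
                   "CHW", "CLE", "DET", "KCR", "MIN",
                   "HOU", "LAA", "OAK", "SEA", "TEX")

# Tiny stand-in for situational_batting_2014 (illustration only)
example_teams <- data.frame(Tm = c("BOS", "LAD", "HOU", "STL"),
                            stringsAsFactors = FALSE)

# Tag each team with its league: AL if it is in the list, NL otherwise
example_teams$League <- ifelse(example_teams$Tm %in% al_teams_2014, "AL", "NL")

# Count the teams in each league
table(example_teams$League)
```

The list above is only valid for 2014. Houston moved from the National League to the American League in 2013, so a multi-year comparison would need a league list for each season.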

Addendum: Scatterplots

This is an addendum for any students who are doing a two-variable analysis.

There was no need, for the question I explored in the exemplar, to do a two-variable analysis.

However, R does have the ability to produce scatterplots.

Here is a basic example - from the data set used in my exemplar - plotting RBI vs. Home Runs in pinch-hit situations:

scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI")

[Plot: scatterplot of RBI vs. home runs in pinch-hit situations, 2014]

Here is the same scatterplot, with a linear regression applied:

scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI", regression = TRUE)

[Plot: the same scatterplot with a best-fit line and a shaded 95% confidence band]

You will notice that besides the best fit line, there is a shaded region.

This shaded region is called a 95% confidence band.

It shows the region that you can be 95% sure contains the true best-fit line. With more data points (a larger sample to work with), the 95% confidence band will typically be narrower and closer to the best-fit line. In other words, as usual, with more data, we have more confidence that our predictions (which are based on a sample) will be accurate.
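
To see the effect of sample size on the band, here is a rough sketch in base R. It uses simulated data, not baseball data, and it assumes that the shaded region drawn by the scatterplot helper is the usual confidence interval for a fitted line, which is what predict() returns with interval = "confidence". The function band_width and its variables are hypothetical names of my own.

```r
# Simulated data for illustration only (not baseball data)
set.seed(1)

# Average width of the 95% confidence band for a fitted line,
# based on a simulated sample of size n
band_width <- function(n) {
  x <- runif(n, min = 0, max = 10)
  y <- 2 + 3 * x + rnorm(n, sd = 2)
  fit <- lm(y ~ x)
  # With interval = "confidence", predict() returns columns fit, lwr and upr
  band <- predict(fit, newdata = data.frame(x = seq(0, 10, by = 1)),
                  interval = "confidence", level = 0.95)
  mean(band[, "upr"] - band[, "lwr"])
}

width_small <- band_width(30)    # 30 data points
width_large <- band_width(300)   # 300 data points

# The band from the larger sample should be clearly narrower
c(small = width_small, large = width_large)
```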

Finally, here is a scatterplot with a title and labelled axes:

scatterplot(dataframe = situational_batting_2014, x = "nHR", y = "nRBI", regression = TRUE, x_label = "Home Runs", y_label = "RBI", title = "RBI vs. Home Runs (Pinch-Hitting, 2014)")

[Plot: the same scatterplot with a title and labelled axes]

Extension: More Data

In this part of my project, I will use the work I have already done to analyze data for a single season to extend my analysis and examine multiple seasons.

Most of the hard work is done. I know how to obtain and normalize the data for a single season.

I will use a function that obtains the data for a single season.

Identify key commands

Here is how it works. First, here are the bare minimum commands I used earlier to obtain data for a single season:

# Set the website to retrieve data from
url <- "http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml"
# Read the data from that page
tables <- readHTMLTable(url)
# Save the data from the first table on that page
situational_batting_2014 <- tables[[1]]
# Remove league average information
situational_batting_2014 <- subset(situational_batting_2014, Tm != "LgAvg")
# Keep only the columns I really need
situational_batting_2014 <- situational_batting_2014[,c(1:11)]
# Convert the columns I _do_ need to numeric
situational_batting_2014$nPA  <- as.numeric(as.character(situational_batting_2014$PA))
situational_batting_2014$nH  <- as.numeric(as.character(situational_batting_2014$H))
situational_batting_2014$nInf  <- as.numeric(as.character(situational_batting_2014[["Inf"]]))
situational_batting_2014$nBnt  <- as.numeric(as.character(situational_batting_2014$Bnt))
situational_batting_2014$nAB  <- as.numeric(as.character(situational_batting_2014$AB))
situational_batting_2014$nH1  <- as.numeric(as.character(situational_batting_2014$H.1))
situational_batting_2014$nHR  <- as.numeric(as.character(situational_batting_2014$HR))
situational_batting_2014$nRBI  <- as.numeric(as.character(situational_batting_2014$RBI))
# Calculate number of "regular" situation at bats
situational_batting_2014$nRegularAB  <- situational_batting_2014$nPA - situational_batting_2014$nAB
# Calculate "regular" situation batting averages
situational_batting_2014$nRegularAVG <- situational_batting_2014$nH / situational_batting_2014$nRegularAB
# Calculate "pinch-hit" situation batting averages
situational_batting_2014$nPinchHitAVG <- situational_batting_2014$nH1 / situational_batting_2014$nAB
# Tag data with year
situational_batting_2014$Year = "2014"
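
A note on the as.numeric(as.character(...)) pattern used above: I am assuming the scraped columns arrive as factors, which is why the conversion goes through as.character() first. The short sketch below, with made-up values in a hypothetical vector called hits, shows what goes wrong if that step is skipped.

```r
# A factor holding numbers as text (made-up values for illustration)
hits <- factor(c("1400", "1350", "1420"))

# Converting the factor directly gives the internal level codes, not the values
as.numeric(hits)
# [1] 2 1 3

# Converting to character first recovers the actual numbers
as.numeric(as.character(hits))
# [1] 1400 1350 1420
```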

Write a function

Now, I will take these commands and put them in a function. By doing this, I can re-use this function to get data for many years.

# situationalStats
# Purpose: Gets situational batting statistics for a single year
# Returns: Data frame containing desired data
situationalStats <- function(URL, year) {
  # Read the data from provided page
  tables <- readHTMLTable(URL)
  # Save the data from the first table on that page
  stats <- tables[[1]]
  # Remove league average information
  stats <- subset(stats, Tm != "LgAvg")
  # Keep only the columns I really need
  stats <- stats[,c(1:11)]
  # Convert the columns I _do_ need to numeric
  stats$nPA  <- as.numeric(as.character(stats$PA))
  stats$nH  <- as.numeric(as.character(stats$H))
  stats$nInf  <- as.numeric(as.character(stats[["Inf"]]))
  stats$nBnt  <- as.numeric(as.character(stats$Bnt))
  stats$nAB  <- as.numeric(as.character(stats$AB))
  stats$nH1  <- as.numeric(as.character(stats$H.1))
  stats$nHR  <- as.numeric(as.character(stats$HR))
  stats$nRBI  <- as.numeric(as.character(stats$RBI))
  # Calculate number of "regular" situation at bats
  stats$nRegularAB  <- stats$nPA - stats$nAB
  # Calculate "regular" situation batting averages
  stats$nRegularAVG <- stats$nH / stats$nRegularAB
  # Calculate "pinch-hit" situation batting averages
  stats$nPinchHitAVG <- stats$nH1 / stats$nAB
  # Tag data with year
  stats$Year = year
  # Return the desired data
  return(stats)
}

Apply the function to get more data

Here is how I use the function. I only need a one-line command. As the first argument, I pass in the website address that has the source data. As the second argument, I pass in the year the data is from.

Here, I will use the function to get data from the last 10 years and then combine that data into a single dataframe:

# Get data for last 10 years
stats_2014 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2014-situational-batting.shtml", "2014")
stats_2013 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2013-situational-batting.shtml", "2013")
stats_2012 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2012-situational-batting.shtml", "2012")
stats_2011 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2011-situational-batting.shtml", "2011")
stats_2010 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2010-situational-batting.shtml", "2010")
stats_2009 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2009-situational-batting.shtml", "2009")
stats_2008 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2008-situational-batting.shtml", "2008")
stats_2007 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2007-situational-batting.shtml", "2007")
stats_2006 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2006-situational-batting.shtml", "2006")
stats_2005 <- situationalStats("http://www.baseball-reference.com/leagues/MLB/2005-situational-batting.shtml", "2005")

# Combine the 10 separate data frames into a single data frame
stats <- rbind(stats_2014, stats_2013, stats_2012, stats_2011, stats_2010, stats_2009, stats_2008, stats_2007, stats_2006, stats_2005)
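
The ten nearly identical calls above could also be written as a loop over the years. The sketch below is one possible way to do that. So that it runs without a network connection, it uses fakeSituationalStats, a hypothetical stand-in of my own that returns a one-row data frame; with the real function, situationalStats would be called in its place. The names years, stats_list and stats_all are also mine.

```r
# Years to collect
years <- 2005:2014

# Hypothetical stand-in for situationalStats() so this sketch runs offline;
# it returns a one-row data frame instead of scraping the page
fakeSituationalStats <- function(URL, year) {
  data.frame(URL = URL, Year = year, stringsAsFactors = FALSE)
}

# Build the URL for each year, following the pattern used above,
# and call the function once per year
stats_list <- lapply(years, function(year) {
  url <- sprintf("http://www.baseball-reference.com/leagues/MLB/%d-situational-batting.shtml", year)
  fakeSituationalStats(url, as.character(year))
})

# Stack the list of data frames into a single data frame
stats_all <- do.call(rbind, stats_list)
nrow(stats_all)
# [1] 10
```

The rows come out ordered from 2005 to 2014 rather than 2014 to 2005, which makes no difference to the analyses that follow.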

Repeat analyses, but with more data

Now that I have data for the last 10 years, I can repeat prior analyses, but have greater confidence that the results are accurate.

Shape of the data

Basic histograms:

h_pinch_hit  <- histogram(dataframe = stats, variable = "nPinchHitAVG", title = "Pinch Hitting, 2005-2014", binwidth = 0.005)

h_regular  <- histogram(dataframe = stats, variable = "nRegularAVG", title = "Regular Hitting, 2005-2014", binwidth = 0.005)

multiplot(h_pinch_hit, h_regular, cols=1)

## Warning: position_stack requires constant width: output may be incorrect

[Plot: histograms of pinch-hitting and regular batting averages, 2005-2014]

Ensuring same horizontal and vertical scale for accurate comparisons:

h_pinch_hit  <- histogram(dataframe = stats, variable = "nPinchHitAVG", title = "Pinch Hitting, 2005-2014", binwidth = 0.005, xmin = 0.075, xmax = 0.350, ymin = 0, ymax = 60)

h_regular  <- histogram(dataframe = stats, variable = "nRegularAVG", title = "Regular Hitting, 2005-2014", binwidth = 0.005, xmin = 0.075, xmax = 0.350, ymin = 0, ymax = 60)

multiplot(h_pinch_hit, h_regular, cols=1)

## Warning: position_stack requires constant width: output may be incorrect

[Plot: histograms of pinch-hitting and regular batting averages, 2005-2014, on matching scales]

There is still a much greater spread for pinch-hit batting averages vs. regular situation batting averages.

As one might expect, with 300 data points (instead of just 30), both distributions now appear more Normal.

Centre of the data

Here is the five-number summary for “pinch-hit” at bats:

five_number(dataframe = stats, variable = "nPinchHitAVG")

## [1] "Five number summary"
## [1] "==================="
## [1] "min   : 0.092"
## [1] "Q1    : 0.200"
## [1] "median: 0.222"
## [1] "Q3    : 0.244"
## [1] "max   : 0.338"

And the mean for “pinch-hit” at bats:

mean(stats$nPinchHitAVG)

## [1] 0.2203

Here is the five-number summary for “regular” at bats:

five_number(dataframe = stats, variable = "nRegularAVG")

## [1] "Five number summary"
## [1] "==================="
## [1] "min   : 0.213"
## [1] "Q1    : 0.231"
## [1] "median: 0.238"
## [1] "Q3    : 0.246"
## [1] "max   : 0.266"

And the mean for “regular” at bats:

mean(stats$nRegularAVG)

## [1] 0.2382

Median and mean batting averages for pinch hit situations are both lower than median and mean batting averages for regular situations.
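
The size of the gap can be worked out from the values printed above. This is a small sketch of the arithmetic only; the variable names are my own.

```r
# Values printed above for 2005-2014
mean_pinch_hit   <- 0.2203
mean_regular     <- 0.2382
median_pinch_hit <- 0.222
median_regular   <- 0.238

# Gap between regular and pinch-hit batting averages
mean_gap   <- mean_regular - mean_pinch_hit
median_gap <- median_regular - median_pinch_hit

round(c(mean = mean_gap, median = median_gap), 4)
```

The mean gap is about 0.018 and the median gap is 0.016, so over these ten seasons pinch-hit batting averages trail regular batting averages by roughly 16 to 18 points.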

Spread of the data

Finally, we will explore the spread of the data:

bw_pinch_hit <- box_and_whisker(dataframe = stats, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2005-2014")

bw_regular <- box_and_whisker(dataframe = stats, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2005-2014")

multiplot(bw_pinch_hit, bw_regular, cols=1)

[Plot: box-and-whisker plots of pinch-hitting and regular batting averages by year, 2005-2014]

One of the great things about having data for more than one year is that we can now make a comparison of box-and-whisker plots by year. This helps us see if a trend in the spread of data continues year after year.

One important note, however: these comparisons are meaningless unless we look at the data with the same horizontal scale. Let’s make that adjustment:

bw_pinch_hit <- box_and_whisker(dataframe = stats, variable = "nPinchHitAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Pinch Hitting, 2005-2014", min = 0.075, max = 0.325)

bw_regular <- box_and_whisker(dataframe = stats, variable = "nRegularAVG", variable_label = "Batting Average", factor = "Year", factor_label = "Year", title = "Regular Hitting, 2005-2014", min = 0.075, max = 0.325)

multiplot(bw_pinch_hit, bw_regular, cols=1)
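
The year-by-year comparison of spread can also be made numerically, by computing the standard deviation separately for each year. The sketch below shows one way to do that with tapply(). It uses example_stats, a simulated stand-in of my own with the same column names as the combined data frame, so the numbers it prints are not real results.

```r
# Simulated stand-in for the combined data frame (illustration only, not real data)
set.seed(42)
example_stats <- data.frame(
  Year         = rep(c("2012", "2013", "2014"), each = 30),
  nPinchHitAVG = rnorm(90, mean = 0.22, sd = 0.04),
  nRegularAVG  = rnorm(90, mean = 0.24, sd = 0.01)
)

# Standard deviation of each batting average, computed separately for each year
sd_pinch_hit_by_year <- tapply(example_stats$nPinchHitAVG, example_stats$Year, sd)
sd_regular_by_year   <- tapply(example_stats$nRegularAVG,  example_stats$Year, sd)

# One row per year, one column per situation
round(cbind(pinch_hit = sd_pinch_hit_by_year, regular = sd_regular_by_year), 3)
```

To apply this to the real data, example_stats would be replaced with the combined data frame built earlier.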

[Plot: box-and-whisker plots of pinch-hitting and regular batting averages by year, 2005-2014, on matching scales]

Statistics Project

Russell Gordon

Saturday, May 9, 2015
