Introduction to Probability and Statistics

Outline

Give a brief overview of the probability and statistics needed for data analytics.

Summary Statistics

A few common summary statistics for interpreting data are:

mean: Average value
median: 50th percentile (middle) value
mode: Most frequently occurring value
minimum: Minimum value
maximum: Maximum value
range: Distance from largest to smallest value
standard deviation: Measures spread of data

Example with Passing Yards of Aaron Rodgers

Below we create a vector in which each element is the seasonal passing yards of Aaron Rodgers of the Green Bay Packers over his 12-year career, that is, up to the 2017-2018 NFL season.

passing.yards.ar <- c(65,46,218,4038,4434,3922,4643,4295,2536,4381,3821,4428)
passing.yards.ar

##  [1]   65   46  218 4038 4434 3922 4643 4295 2536 4381 3821 4428

mean(passing.yards.ar)    # Mean

## [1] 3068.917

median(passing.yards.ar)  # Median

## [1] 3980

mode(passing.yards.ar)    # Storage mode of the vector, not the statistical mode

## [1] "numeric"

min(passing.yards.ar)     # Min

## [1] 46

max(passing.yards.ar)     # Max

## [1] 4643

range(passing.yards.ar)   # Range (returns the minimum and maximum)

## [1]   46 4643

sd(passing.yards.ar)      # Standard deviation

## [1] 1863.987
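
Two of the calls above deserve a caution. In base R, mode() reports how a vector is stored (here "numeric") rather than its most frequent value, and range() returns the minimum and maximum rather than the distance between them. The sketch below shows one way to get both statistics. Note that stat.mode is a hypothetical helper name introduced here for illustration, not a base R function.

```r
passing.yards.ar <- c(65,46,218,4038,4434,3922,4643,4295,2536,4381,3821,4428)

# Hypothetical helper (not in base R): returns every value tied for the
# highest frequency in x.
stat.mode <- function(x) {
  counts <- table(x)                                  # frequency of each distinct value
  top <- names(counts)[as.vector(counts) == max(counts)]
  as.numeric(top)                                     # table labels back to numbers
}

stat.mode(c(1, 2, 2, 3))             # 2
length(stat.mode(passing.yards.ar))  # 12: every season total occurs once

diff(range(passing.yards.ar))        # 4597, the same as max minus min
```

Because no passing-yard total repeats, every value ties for the mode, so the mode is not an informative summary of this data.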

Median vs. Mean

Both are statistical measures that describe the central tendency of a set of data points. In some cases, one is a better choice than the other.

Example Median vs. Mean

Let's take Aaron Rodgers' first 4 years in the NFL. Which value, the median or the mean, would be a more accurate indicator of his passing ability in his first 4 years?

passing.yards.ar.4 <- passing.yards.ar[1:4] # Selects first 4 elements

passing.yards.ar.4

## [1]   65   46  218 4038

median(passing.yards.ar.4) # Median

## [1] 141.5

mean(passing.yards.ar.4)   # Mean

## [1] 1091.75

The mean is the most popular and natural measure of a midpoint. However, a single value that is much higher or lower than the other data points can shift it substantially. Here, the one season of 4038 yards pulls the mean up to 1091.75, while the median stays at 141.5, much closer to the other three seasons. This is an example of why one may choose the median over the mean.
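
To see how strongly one extreme value moves each statistic, a small sketch can compare the first three seasons with and without the fourth. The variable names first.three and first.four are introduced here only for illustration.

```r
first.three <- c(65, 46, 218)          # first three seasons from the vector above
first.four  <- c(first.three, 4038)    # adds the much larger fourth season

mean(first.three)     # about 109.67
median(first.three)   # 65
mean(first.four)      # 1091.75
median(first.four)    # 141.5
```

Adding the single large value raises the mean by almost 1000 yards, while the median moves by less than 100.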

Your Turn

Apply the summary() function to Aaron Rodgers' passing yards over 12 years.
Based on Aaron Rodgers' passing yard values, would you take the mean or the median to be a better estimate of his career passing yards?

Answers

1.

summary(passing.yards.ar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      46    1956    3980    3069    4393    4643

2.

From a statistical point of view, I would take his median value of 3980. His first 3 years are not a fair representation of his passing yardage because of a lack of playing time, and they pull the mean down to 3069. In most seasons since then he has been around 4000 passing yards.
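
One way to check this reasoning is to recompute both statistics without the first three seasons. This is only a sketch: dropping those seasons is an assumption made here for illustration, and full.seasons is a name introduced for it.

```r
passing.yards.ar <- c(65,46,218,4038,4434,3922,4643,4295,2536,4381,3821,4428)

# Assumption for illustration: treat seasons 4 through 12 as the
# representative ones and drop the first three.
full.seasons <- passing.yards.ar[4:12]

mean(full.seasons)     # about 4055.33
median(full.seasons)   # 4295
```

With the low early seasons removed, the mean and median are much closer to each other, and both sit near the 4000-yard level.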

Confidence Intervals

A confidence interval is a range of values, computed from a sample, that is expected to contain the true mean with a stated level of confidence (95% by default in R's t.test() function).

Example of Confidence Interval with Passing Yards of Brett Favre

Below we create a vector where each element consists of seasonal passing yards of Brett Favre over his entire career.

# Create vector of Brett Favre's passing yards 
passing.yards.bf <- c(0,3227,3303,3882,4413,3899,3867,4212,4091,
                      3812,3921,3658,3361,4088,3881,3885,4155,
                      3472,4202,2509)

passing.yards.bf

##  [1]    0 3227 3303 3882 4413 3899 3867 4212 4091 3812 3921 3658 3361 4088
## [15] 3881 3885 4155 3472 4202 2509

Computing the confidence interval along with other values:

t.test(passing.yards.bf)

## 
##  One Sample t-test
## 
## data:  passing.yards.bf
## t = 16.89, df = 19, p-value = 6.712e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3146.778 4037.022
## sample estimates:
## mean of x 
##    3591.9

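Keeping our eyes on the important part, we see that the 95% confidence interval is (3146.78, 4037.02). From this, we can say that we are 95% confident that the true mean of Brett Favre's seasonal passing yards lies in this interval. The p-value in the output, \(p<.05\), belongs to a test of whether the true mean equals 0; it is not what sets the 95% level.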
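
For readers who want to see where the interval comes from, the sketch below rebuilds it by hand using the usual one-sample formula \(\bar{x} \pm t_{0.975,\,n-1} \cdot s/\sqrt{n}\). The variable names (n, x.bar, std.err, t.crit, ci) are introduced here for illustration.

```r
passing.yards.bf <- c(0,3227,3303,3882,4413,3899,3867,4212,4091,
                      3812,3921,3658,3361,4088,3881,3885,4155,
                      3472,4202,2509)

n       <- length(passing.yards.bf)        # number of seasons
x.bar   <- mean(passing.yards.bf)          # sample mean
std.err <- sd(passing.yards.bf) / sqrt(n)  # standard error of the mean
t.crit  <- qt(0.975, df = n - 1)           # critical value for a 95% interval

ci <- c(x.bar - t.crit * std.err, x.bar + t.crit * std.err)
ci   # should agree with t.test(passing.yards.bf)$conf.int
```

Changing 0.975 to 0.995 would give a 99% interval, which is wider because more confidence requires a larger range.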

Comparing Brett Favre and Aaron Rodgers

We can use the t.test(x,y) function with two inputs to quickly compare the means of two vectors.

t.test(passing.yards.bf,passing.yards.ar)

## 
##  Welch Two Sample t-test
## 
## data:  passing.yards.bf and passing.yards.ar
## t = 0.90389, df = 14.5, p-value = 0.3808
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -713.9632 1759.9299
## sample estimates:
## mean of x mean of y 
##  3591.900  3068.917

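Without going into too much detail about the statistics, look at the means of \(x\), Brett Favre, and \(y\), Aaron Rodgers: the sample means differ by about 523 yards. However, the 95% confidence interval for the difference, (-713.96, 1759.93), contains 0 and \(p = 0.38 > .05\), so this test does not show a statistically significant difference between the two.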
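
Rather than reading the numbers off the printed output, the pieces of a t.test() result can be pulled out by name. The sketch below is one way to check whether a difference in means is statistically significant at the 5% level; result and ci are names introduced here for illustration.

```r
passing.yards.ar <- c(65,46,218,4038,4434,3922,4643,4295,2536,4381,3821,4428)
passing.yards.bf <- c(0,3227,3303,3882,4413,3899,3867,4212,4091,
                      3812,3921,3658,3361,4088,3881,3885,4155,
                      3472,4202,2509)

result <- t.test(passing.yards.bf, passing.yards.ar)

result$p.value            # p-value of the Welch two-sample test
ci <- result$conf.int     # 95% interval for the difference in means

result$p.value < 0.05     # is the difference significant at the 5% level?
ci[1] < 0 && ci[2] > 0    # does the interval contain 0?
```

An interval that contains 0 and a p-value above 0.05 tell the same story: the data do not rule out a true difference of zero.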

Your Turn

Number of interceptions from Aaron Rodgers and Brett Favre by season:

intercepts.ar <- c(1,0,0,13,7,11,6,8,6,5,8,7) # Aaron Rodgers
intercepts.bf <- c(2,13,24,14,13,13,16,23,23, # Brett Favre
                   16,15,16,21,17,29,18,15,22,7,19)

What is the 95% confidence interval for Aaron Rodgers' and Brett Favre's mean interceptions per season?
Is there a clear difference in the number of interceptions between the two?

Answers

1.

t.test(intercepts.ar)

## 
##  One Sample t-test
## 
## data:  intercepts.ar
## t = 5.1098, df = 11, p-value = 0.0003388
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.415564 8.584436
## sample estimates:
## mean of x 
##         6

t.test(intercepts.bf)

## 
##  One Sample t-test
## 
## data:  intercepts.bf
## t = 12.35, df = 19, p-value = 1.592e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  13.95277 19.64723
## sample estimates:
## mean of x 
##      16.8

The 95% confidence interval for Aaron Rodgers is (3.4, 8.6). For Brett Favre it is (14.0, 19.6).

2.

t.test(intercepts.ar,intercepts.bf)

## 
##  Welch Two Sample t-test
## 
## data:  intercepts.ar and intercepts.bf
## t = -6.0099, df = 29.538, p-value = 1.442e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.472434  -7.127566
## sample estimates:
## mean of x mean of y 
##       6.0      16.8

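There is a clear distinction between their average number of interceptions: the 95% confidence interval for the difference, (-14.47, -7.13), does not contain 0, and the p-value is far below .05.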