- Give a brief overview of the probability and statistics needed for data analytics

A few common summary statistics for interpreting data are:

- **mean**: Average value
- **median**: 50th percentile (middle) value
- **mode**: Most frequently occurring value
- **minimum**: Minimum value
- **maximum**: Maximum value
- **range**: Distance from largest to smallest value
- **standard deviation**: Measures spread of data

Below we create a vector in which each element is Aaron Rodgers' passing yards for one season with the Green Bay Packers, covering his 12-year career up to the 2017-2018 NFL season.

passing.yards.ar <- c(65,46,218,4038,4434,3922,4643,4295,2536,4381,3821,4428)
passing.yards.ar

## [1] 65 46 218 4038 4434 3922 4643 4295 2536 4381 3821 4428

mean(passing.yards.ar) # Mean

## [1] 3068.917

median(passing.yards.ar) # Median

## [1] 3980

mode(passing.yards.ar) # Not the statistical mode: R's mode() returns the storage mode of the object

## [1] "numeric"
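Base R's `mode()` reports how an object is stored, not its most frequent value. As a minimal sketch, a statistical mode can be computed with `table()`. The helper name `stat_mode` is our own (it is not a base R function), and the example vector is made up for illustration.

```r
# Hypothetical helper (not part of base R): returns the most frequent value(s).
stat_mode <- function(x) {
  counts <- table(x)                                 # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])   # every value tied for the top count
}

toy <- c(1, 2, 2, 3, 3, 3)   # made-up data: 3 occurs most often
stat_mode(toy)
```

Every season total in `passing.yards.ar` is distinct, so each value occurs exactly once and the vector has no single mode; the helper would simply return all 12 values.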

min(passing.yards.ar) # Min

## [1] 46

max(passing.yards.ar) # Max

## [1] 4643

range(passing.yards.ar) # Range: returns the minimum and maximum, not their difference

## [1] 46 4643

sd(passing.yards.ar) # Standard deviation

## [1] 1863.987
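To connect the last two functions to the definitions in the list above, the sketch below turns the range into a single distance and recomputes the sample standard deviation by hand. The names `spread` and `sd.manual` are our own, chosen for illustration.

```r
# Same data as above, repeated so the block runs on its own
passing.yards.ar <- c(65,46,218,4038,4434,3922,4643,4295,2536,4381,3821,4428)

# Range as a single distance: largest value minus smallest value
spread <- diff(range(passing.yards.ar))   # 4643 - 46

# Sample standard deviation by hand: square root of the summed squared
# deviations from the mean, divided by n - 1
n <- length(passing.yards.ar)
sd.manual <- sqrt(sum((passing.yards.ar - mean(passing.yards.ar))^2) / (n - 1))

spread
sd.manual
all.equal(sd.manual, sd(passing.yards.ar))   # checks agreement with sd()
```

Dividing by `n - 1` rather than `n` is what `sd()` does, which is why the two results agree.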

The mean and the median are both statistical measures of the central tendency of a set of data points. In some cases, one is a better choice than the other.

Let's take Aaron Rodgers' first 4 years in the NFL. Which value would be a more accurate indicator of his passing ability in those years?

passing.yards.ar.4 <- passing.yards.ar[1:4] # Selects first 4 elements
passing.yards.ar.4

## [1] 65 46 218 4038

median(passing.yards.ar.4) # Median

## [1] 141.5

mean(passing.yards.ar.4) # Mean

## [1] 1091.75

The mean is certainly the most popular and natural measure of a midpoint. However, a single value that is much higher or lower than the other data points can shift it considerably. Here the 4038-yard season pulls the mean up to 1091.75, while the median of 141.5 stays close to the other three seasons. This is an example of why one may choose the median over the mean.
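As a rough illustration of that sensitivity, the sketch below recomputes both statistics with the one large season removed. The 1000-yard cutoff and the variable names are our own choices for this example, not part of the original analysis.

```r
first.four <- c(65, 46, 218, 4038)                 # Rodgers' first 4 seasons, from above
without.outlier <- first.four[first.four < 1000]   # drop the 4038-yard season (illustrative cutoff)

mean(first.four)          # 1091.75
mean(without.outlier)     # about 109.67
median(first.four)        # 141.5
median(without.outlier)   # 65
```

Removing one season moves the mean by nearly 1000 yards but the median by fewer than 100.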

Apply the `summary()` function to Aaron Rodgers' passing yards over 12 years. Based on these values, would you take the mean or the median to be a better estimate of his career passing yards?

summary(passing.yards.ar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      46    1956    3980    3069    4393    4643

From a statistical point of view, I would take his median value of 3980. His first 3 years are not a real representation of his passing yardage because of a lack of playing time. In most seasons since then, he has been around 4000 passing yards.
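One way to check this reasoning is to drop the first three seasons and look at the remaining nine. This is only a sketch; the name `later.seasons` is our own.

```r
# Same data as above, repeated so the block runs on its own
passing.yards.ar <- c(65,46,218,4038,4434,3922,4643,4295,2536,4381,3821,4428)

later.seasons <- passing.yards.ar[-(1:3)]   # drop the first 3 seasons

mean(later.seasons)     # about 4055.33
median(later.seasons)   # 4295
```

With the low-playing-time seasons removed, the mean lands near 4000, much closer to the full-career median of 3980 than to the full-career mean of 3069.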

A confidence interval is a range of values, computed from a sample, that is constructed to contain the true mean with a chosen level of confidence (commonly 95%).

Below we create a vector where each element consists of seasonal passing yards of Brett Favre over his entire career.

# Create vector of Brett Favre's passing yards
passing.yards.bf <- c(0,3227,3303,3882,4413,3899,3867,4212,4091,
                      3812,3921,3658,3361,4088,3881,3885,4155,
                      3472,4202,2509)
passing.yards.bf

##  [1]    0 3227 3303 3882 4413 3899 3867 4212 4091 3812 3921 3658 3361 4088
## [15] 3881 3885 4155 3472 4202 2509

Computing the confidence interval along with other values:

t.test(passing.yards.bf)

##
##  One Sample t-test
##
## data:  passing.yards.bf
## t = 16.89, df = 19, p-value = 6.712e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3146.778 4037.022
## sample estimates:
## mean of x
##    3591.9

Keeping our eyes on the important part, we see our 95% confidence interval is `(3146.78,4037.02)`. From this, we can say with 95% confidence that the true mean of Brett Favre's seasonal passing yards lies in this interval. The result \(p<.05\) answers a separate question: it tells us that the mean is significantly different from 0.
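To see where the interval comes from, the sketch below rebuilds it by hand as the sample mean plus or minus a critical t value times the standard error. The variable names (`xbar`, `se`, `t.crit`, `ci.manual`) are our own; the result should match the interval that `t.test()` reports.

```r
# Same data as above, repeated so the block runs on its own
passing.yards.bf <- c(0,3227,3303,3882,4413,3899,3867,4212,4091,
                      3812,3921,3658,3361,4088,3881,3885,4155,
                      3472,4202,2509)

n      <- length(passing.yards.bf)
xbar   <- mean(passing.yards.bf)
se     <- sd(passing.yards.bf) / sqrt(n)   # standard error of the mean
t.crit <- qt(0.975, df = n - 1)            # two-sided 95% critical value

ci.manual <- c(xbar - t.crit * se, xbar + t.crit * se)
ci.manual
```

Using `qt(0.975, ...)` leaves 2.5% in each tail, which is what makes the interval a two-sided 95% interval.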

We can use the `t.test(x,y)` function with two inputs to quickly compare the means of two vectors.

t.test(passing.yards.bf,passing.yards.ar)

##
##  Welch Two Sample t-test
##
## data:  passing.yards.bf and passing.yards.ar
## t = 0.90389, df = 14.5, p-value = 0.3808
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -713.9632 1759.9299
## sample estimates:
## mean of x mean of y
##  3591.900  3068.917

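Without going into too much detail about the statistics: the sample means of \(x\), Brett Favre, and \(y\), Aaron Rodgers, differ by about 523 yards (3591.9 versus 3068.9). However, the p-value of 0.38 is well above 0.05 and the 95% confidence interval for the difference, `(-713.96,1759.93)`, contains 0, so this test does not show a statistically significant difference between the two.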
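When judging whether two means really differ, it can help to pull the individual numbers out of the object that `t.test()` returns instead of reading the printed report. The sketch below uses the `estimate`, `p.value`, and `conf.int` components of that object; the names `res` and `ci.contains.zero` are our own.

```r
# Same data as above, repeated so the block runs on its own
passing.yards.bf <- c(0,3227,3303,3882,4413,3899,3867,4212,4091,
                      3812,3921,3658,3361,4088,3881,3885,4155,
                      3472,4202,2509)
passing.yards.ar <- c(65,46,218,4038,4434,3922,4643,4295,2536,4381,3821,4428)

res <- t.test(passing.yards.bf, passing.yards.ar)   # Welch two-sample t-test (R's default)

res$estimate   # the two sample means
res$p.value    # p-value for the difference in means
res$conf.int   # 95% confidence interval for the difference in means

# TRUE when the interval contains 0, i.e. no significant difference at the 5% level
ci.contains.zero <- res$conf.int[1] < 0 && res$conf.int[2] > 0
ci.contains.zero
```

An interval for the difference that contains 0 means the data are compatible with the two true means being equal.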

Number of interceptions from Aaron Rodgers and Brett Favre by season:

intercepts.ar <- c(1,0,0,13,7,11,6,8,6,5,8,7)     # Aaron Rodgers
intercepts.bf <- c(2,13,24,14,13,13,16,23,23,     # Brett Favre
                   16,15,16,21,17,29,18,15,22,7,19)

What is the 95% confidence interval for Aaron Rodgers' and Brett Favre's mean interceptions per season?

Is there a clear difference in the number of interceptions between the two?

t.test(intercepts.ar)

##
##  One Sample t-test
##
## data:  intercepts.ar
## t = 5.1098, df = 11, p-value = 0.0003388
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3.415564 8.584436
## sample estimates:
## mean of x
##         6

t.test(intercepts.bf)

##
##  One Sample t-test
##
## data:  intercepts.bf
## t = 12.35, df = 19, p-value = 1.592e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  13.95277 19.64723
## sample estimates:
## mean of x
##      16.8

The 95% confidence interval for Aaron Rodgers is `(3.4,8.6)`. For Brett Favre it is `(14.0,19.6)`.

t.test(intercepts.ar,intercepts.bf)

##
##  Welch Two Sample t-test
##
## data:  intercepts.ar and intercepts.bf
## t = -6.0099, df = 29.538, p-value = 1.442e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.472434  -7.127566
## sample estimates:
## mean of x mean of y
##       6.0      16.8

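There is a clear distinction between their average number of interceptions! The p-value is far below 0.05, and the 95% confidence interval for the difference, `(-14.5,-7.1)`, does not contain 0.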