Lecture 2

Measures of center and variability

Published

August 31, 2023

Introduction

Why do we need numerical measures?
Parameters - numerical measures associated with a population
Statistics - numerical measures associated with a sample from the population

2.1 Measures of Center

Mean

Table 1: Birthweights of 30 Full-Term Newborn Babies
7.2	7.8	6.8	6.2	8.2
8.0	8.2	5.6	8.6	7.1
8.2	7.7	7.5	7.2	7.7
5.8	6.8	6.8	8.5	7.5
6.1	7.9	9.4	9.0	7.8
8.5	9.0	7.7	6.7	7.7

\[ \sum_{i = 1}^n{x_i} = x_1 + x_2 + x_3 + \dotsb + x_n \]

Sample Mean

\[ \bar{x} = \frac{\sum{x_i}}{n} \]

Population Mean

\[ \mu = \frac{\sum{x_i}}{N} \]

Example 2.1

[1]  2  9 11  5  6

\[ \bar{x} = \frac{\sum{x_i}}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = 6.6 \]

(2 + 9 + 11 + 5 + 6)/5

Mean of birth weights:

[1] 7.57
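
The sample mean formula can be checked with a few lines of code. The sketch below assumes Python and its standard `statistics` module (the lecture does not say which tool produced the output above), and the variable names `example` and `birth_weights` are made up for illustration.

```python
from statistics import mean

# Example 2.1 data
example = [2, 9, 11, 5, 6]

# Table 1: birth weights of 30 full-term newborn babies
birth_weights = [
    7.2, 7.8, 6.8, 6.2, 8.2,
    8.0, 8.2, 5.6, 8.6, 7.1,
    8.2, 7.7, 7.5, 7.2, 7.7,
    5.8, 6.8, 6.8, 8.5, 7.5,
    6.1, 7.9, 9.4, 9.0, 7.8,
    8.5, 9.0, 7.7, 6.7, 7.7,
]

# Sample mean by the formula: sum of the measurements divided by n
example_mean = sum(example) / len(example)
print(example_mean)  # 6.6

# statistics.mean applies the same formula
print(round(mean(birth_weights), 2))  # 7.57
```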

Median

The median of a set of n measurements is the value of x that falls in the middle position when the measurements are ordered from smallest to largest.

[1]  2  9 11  5  6

Sort the data and find the median

[1]  2  5  6  9 11

[1] 6

What if you have an even number of measurements?

[1]  2  9 11  5  6 27

[1]  2  5  6  9 11 27

[1] 7.5
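
Both cases can be reproduced in code. This is a minimal sketch assuming Python's standard `statistics` module; the names `odd` and `even` are made up for illustration. With an even number of measurements there is no single middle position, so the median is the average of the two middle values (here 6 and 9).

```python
from statistics import median

odd = [2, 9, 11, 5, 6]
even = [2, 9, 11, 5, 6, 27]

# Odd n: the middle value of the sorted data
print(sorted(odd))   # [2, 5, 6, 9, 11]
print(median(odd))   # 6

# Even n: the average of the two middle values, (6 + 9) / 2
print(sorted(even))  # [2, 5, 6, 9, 11, 27]
print(median(even))  # 7.5
```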

Comparing

Median - less sensitive to outliers

Add a new measurement to birth weights (50 lbs!)

data	mean	median
original	7.6	7.7
with added	8.9	7.7

If the distribution is skewed right, the mean shifts to the right and is greater than the median.
The opposite holds if the distribution is skewed left.
If the distribution is symmetric, the mean and the median are equal.
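
The effect of the 50 lb measurement can be reproduced directly. This is a sketch assuming Python's standard `statistics` module; `with_outlier` is a made-up name for the birth weights plus the added value.

```python
from statistics import mean, median

# Table 1: birth weights of 30 full-term newborn babies
birth_weights = [
    7.2, 7.8, 6.8, 6.2, 8.2,
    8.0, 8.2, 5.6, 8.6, 7.1,
    8.2, 7.7, 7.5, 7.2, 7.7,
    5.8, 6.8, 6.8, 8.5, 7.5,
    6.1, 7.9, 9.4, 9.0, 7.8,
    8.5, 9.0, 7.7, 6.7, 7.7,
]

# Add the single extreme measurement (50 lbs)
with_outlier = birth_weights + [50]

for label, data in [("original", birth_weights), ("with added", with_outlier)]:
    # Round to one decimal place, as in the comparison table
    print(label, round(mean(data), 1), round(median(data), 1))
# original 7.6 7.7
# with added 8.9 7.7
```

One extreme value pulls the mean up by more than a pound, while the median does not move.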

Mode

The category that occurs most frequently. It’s possible to apply it to the classes created in a histogram, but I haven’t really seen it in the wild.

It’s possible to have more than one mode (ties). In this case, every tied value is a mode.
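
As a rough illustration, the mode can be found by counting occurrences. The sketch below assumes Python 3.8 or later, where `statistics.multimode` returns every value tied for the highest count; the list `tied` is made-up data, not from the lecture.

```python
from statistics import multimode

# Table 1: birth weights of 30 full-term newborn babies
birth_weights = [
    7.2, 7.8, 6.8, 6.2, 8.2,
    8.0, 8.2, 5.6, 8.6, 7.1,
    8.2, 7.7, 7.5, 7.2, 7.7,
    5.8, 6.8, 6.8, 8.5, 7.5,
    6.1, 7.9, 9.4, 9.0, 7.8,
    8.5, 9.0, 7.7, 6.7, 7.7,
]

# 7.7 appears four times, more than any other weight
print(multimode(birth_weights))  # [7.7]

# Made-up data with a tie: 2 and 5 each appear twice
tied = [2, 2, 5, 5, 9]
print(multimode(tied))  # [2, 5]
```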

Homework

Homework asks you to create dot plots. These are basically histograms where each number is plotted individually and stacked on top of each other as dots. See figure 2.2 on page 56 as an example.
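
To get a feel for the idea, here is a rough text version of a dot plot. It is only a sketch, assuming Python and made-up data; it is rotated relative to the textbook figure, with one row per value and one `o` per occurrence instead of vertical stacks.

```python
from collections import Counter

# Made-up measurements, used only to illustrate the layout
data = [2, 9, 11, 5, 6, 9, 5, 9]

# Count how many times each value occurs
counts = Counter(data)

# One row per value between the minimum and maximum; one "o" per occurrence
rows = [
    f"{value:>2} | " + "o" * counts[value]
    for value in range(min(data), max(data) + 1)
]
print("\n".join(rows))
```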

[1] "2.1.1, 2.1.2, 2.1.3, 2.1.9-10, 2.1.13, 2.1.20"

Answers: Chapter 2 - Section 1

Excel File

Download a sample Excel file here

2.2 Measures of Variability

Motivation

Now we can describe data using a single number to represent the center.

We also need to be able to describe the variability in the data.

Range

The range is the difference between the largest and smallest measurements.

[1]  2  9 11  5  6

[1]  2 11

\[ 11 - 2 = 9 \]

Very sensitive to outliers
Does not tell you what’s going on between the two extremes
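
The range calculation, and its sensitivity to a single extreme value, can be sketched as follows. This assumes Python; the added value of 50 is borrowed from the birth weight example purely as an illustration.

```python
data = [2, 9, 11, 5, 6]

# Range = largest measurement minus smallest measurement
smallest, largest = min(data), max(data)
data_range = largest - smallest
print(smallest, largest)  # 2 11
print(data_range)         # 9

# One extreme value changes the range dramatically
with_outlier = data + [50]
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)      # 48
```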

Variance

The deviation of a point is the difference between the point and the mean.

Variance of population \[ \sigma^2 = \frac{\sum\left(x_i - \mu\right)^2}{N} \]

Variance of sample \[ s^2 = \frac{\sum\left(x_i - \bar{x}\right)^2}{n - 1} \]

[1]  2  9 11  5  6

Sample Variance

[1] 12.3

Sample standard deviation

[1] 3.5
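
The two values above can be reproduced step by step from the sample variance formula. This is a minimal sketch assuming Python; the cross-check uses `variance` and `stdev` from the standard `statistics` module, which both divide by n - 1.

```python
from math import sqrt
from statistics import stdev, variance

data = [2, 9, 11, 5, 6]
n = len(data)

# Deviation of each point from the sample mean
x_bar = sum(data) / n
deviations = [x - x_bar for x in data]

# Sample variance: sum of squared deviations divided by n - 1
s2 = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: square root of the sample variance
s = sqrt(s2)

print(round(s2, 1))  # 12.3
print(round(s, 1))   # 3.5

# Cross-check against the library functions
print(round(variance(data), 1), round(stdev(data), 1))  # 12.3 3.5
```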

What’s the deal with \(n-1\)?
Why do we need to square the deviations?
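
One way to explore the second question is to add up the raw deviations. The sketch below assumes Python and addresses only the squaring question: the positive and negative deviations cancel, so their sum is zero (up to floating-point rounding) for any data set, while the squared deviations do not cancel.

```python
from math import isclose

data = [2, 9, 11, 5, 6]
x_bar = sum(data) / len(data)
deviations = [x - x_bar for x in data]

# Raw deviations cancel out: their sum is zero up to rounding error
sum_dev = sum(deviations)
print(isclose(sum_dev, 0.0, abs_tol=1e-9))  # True

# Squared deviations are all non-negative, so they accumulate
sum_sq = sum(d ** 2 for d in deviations)
print(round(sum_sq, 1))  # 49.2
```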

Standard deviation

\[ \sigma = \sqrt{\sigma^2} \] \[ s = \sqrt{s^2} \]

Computing formula (shortcut formula) for the sample variance

\[ s^2 = \frac{\sum{x_i^2} - \frac{(\sum{x_i})^2}{n}}{n-1} \]
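
As a quick check, the shortcut formula gives the same result as the definition for the example data. This is a sketch assuming Python; the variable names are made up for illustration.

```python
from math import isclose

data = [2, 9, 11, 5, 6]
n = len(data)

sum_x = sum(data)                    # 33
sum_x2 = sum(x ** 2 for x in data)   # 267

# Shortcut formula: (sum of squares - (sum)^2 / n) / (n - 1)
shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definition: sum of squared deviations from the mean / (n - 1)
x_bar = sum_x / n
definition = sum((x - x_bar) ** 2 for x in data) / (n - 1)

print(round(shortcut, 1), round(definition, 1))  # 12.3 12.3
```

The shortcut only needs the running totals of x and x squared, which is why it is convenient for hand calculation.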

Going back to the two histograms from before: