
STAT2800U: Lecture 18

CHAPTER 7: CORRELATION AND SIMPLE
LINEAR REGRESSION
Correlation (Section 7.1, page 505)
BIVARIATE DATA
Most of the data sets we have encountered up to now deal with
measurements of one variable on each of several "individuals" (e.g.,
incomes of individuals, test scores, etc.). As discussed in our first
lecture, this type of data is called univariate data.
In many situations it may be of interest to take measurements of two
variables on each of several individuals. For example, we may be
interested in both the number of hours an individual spends studying
for a midterm and the midterm grade. In such a case each
observation will consist of two values (a pair of measurements):
(number of hours studying, midterm grade)
This type of data is called bivariate data and is often denoted in the
form (x, y): the "x-observation" and the "y-observation". A bivariate
sample of n observations would then be written as

$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
The reason we take observations on two variables is that we are
interested in exploring the relationship between the variables. For
example: "Do A-average students tend to study more?"
The simplest possible relationship between two variables x and y is
that of a straight line.
This leads us to scatter plots:
Since we seem to be interested in both the number of hours an
individual spends studying for a midterm and the midterm grade, the
following is a data set based on 5 individuals' midterm grades (in
percent) and time spent studying per week (in hours), where
x = studying and y = grade: (14, 95), (3, 53), (7, 76), (9, 88), (0, 28)
Draw a scatter plot of this data:
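As a sketch of what the plot should look like (assuming matplotlib is available; variable and file names are mine):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# x = hours spent studying, y = midterm grade (%), from the data set above
x = [14, 3, 7, 9, 0]
y = [95, 53, 76, 88, 28]

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("Hours spent studying per week")
ax.set_ylabel("Midterm grade (%)")
ax.set_title("Midterm grade vs. hours studied")
fig.savefig("scatter.png")
```

The points rise from lower left to upper right, which already suggests a positive, roughly linear relationship.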

What does this show?
1)
2)
A scatter plot of bivariate numerical data gives a visual impression of
how strongly x values and y values are related.
The following scatter plots display different types of relationships
between the x and y values:

However, to make precise statements about a data set, we must go
beyond just a scatter plot. A correlation coefficient is a quantitative
assessment of the strength of the relationship between x and y.
For example, we know that scatter plot (b) above is linear and
positive, but the question is, how positive is it? This is where the
correlation coefficient comes in: this coefficient will tell you whether
the relationship is a "strong", "moderate", or "weak" positive/negative
relationship.
PEARSON'S SAMPLE CORRELATION COEFFICIENT
Definition:
Pearson's sample correlation r is given by

$$r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}$$

Computing formulas for the three summation quantities are

$$S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \qquad
S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}, \qquad
S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$$
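A minimal check of these computing formulas, using the five (hours studying, grade) pairs from the earlier example (the variable names are mine):

```python
# Data from the studying/grades example: x = hours studying, y = grade (%)
x = [14, 3, 7, 9, 0]
y = [95, 53, 76, 88, 28]
n = len(x)

# Computing formulas for the three summation quantities
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n

# Pearson's sample correlation
r = Sxy / (Sxx**0.5 * Syy**0.5)
print(Sxx, Syy, Sxy, round(r, 3))  # Sxx=117.2, Syy=3018.0, Sxy=569.0, r≈0.957
```

An r this close to +1 matches the visual impression from the scatter plot: hours studied and midterm grade are strongly positively related.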

Example: Consider the following bivariate data based on 2 class
quizzes taken by 8 different individuals in Paula's Stats class:

Observation:   1  2  3  4  5  6  7  8
x (quiz one):  1  1  2  4  5  5  2  4
y (quiz two):  5  4  4  2  1  2  2  4


Scatter plot:

Pearson's Correlation Coefficient:

PROPERTIES AND INTERPRETATION OF r
Note: The value of r is always between -1 and +1. A value near the
upper limit, +1, tells us there is a positive relationship, whereas an r
close to the lower limit, -1, suggests there is a negative relationship.
Also, the value of r does not depend on which of the two variables is
labeled x.

One more note: the value of r does not depend on the unit of
measurement for either variable.
Describing the strength of the relationship:
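The exact cutoffs are filled in during lecture; as a placeholder, here is one common textbook rule of thumb (the 0.8 and 0.5 cutoffs on |r| are my assumption, not taken from these notes):

```python
def strength(r):
    """Classify |r| using one common rule of thumb.
    The cutoffs (0.8, 0.5) are a convention, not a universal standard."""
    a = abs(r)
    if a >= 0.8:
        return "strong"
    elif a >= 0.5:
        return "moderate"
    else:
        return "weak"

# For the quiz example above, r is about -0.717
print(strength(-0.717))  # moderate (negative) relationship
```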

So, for our above example:

Inference on the Population Correlation
By this point, we should all have a good understanding of a
population and a sample. Recall the difference between a population
mean and a sample mean:
The following data represents a population of 10 different weights (lb)
of college students in a certain small program:

145 167 158 140 135 177 149 158 189 193
The population mean is therefore:

Now, the following data represents a sample of 3 different weights
(lb) of college students in a certain small program:
158 167 140
The sample mean is therefore:
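The two computations above can be sketched in code (variable names are mine):

```python
# Population of 10 weights (lb) from the small program
population = [145, 167, 158, 140, 135, 177, 149, 158, 189, 193]
mu = sum(population) / len(population)   # population mean
print(mu)  # 161.1

# Sample of 3 weights (lb) drawn from that program
sample = [158, 167, 140]
xbar = sum(sample) / len(sample)         # sample mean
print(xbar)  # 155.0
```

The formulas are identical; only the interpretation differs: the first averages over the whole population, the second over a sample and serves as an estimate of the population mean.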

When the points (x_i, y_i) are a random sample from a population of
ordered pairs, each point can be thought of as an observation of
an ordered pair of random variables (X, Y). The correlation
coefficient, or sample correlation, r is then an estimate of the
population correlation $\rho_{X,Y}$.
In practice, if X and Y are both normally distributed, then it is a virtual
certainty that X and Y will be bivariate normal, so the confidence
intervals and tests described next will be valid.
Confidence intervals, and most tests, on $\rho_{X,Y}$ are based on the
following result:


Let X and Y be random variables with the bivariate normal
distribution. Let $\rho$ denote the population correlation between X
and Y. Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a random sample from the joint
distribution of X and Y. Let r be the sample correlation of the n
points. Then the quantity

$$W = \frac{1}{2}\ln\frac{1+r}{1-r}$$

is approximately normally distributed, with mean given by

$$\mu_W = \frac{1}{2}\ln\frac{1+\rho}{1-\rho}$$

and variance given by

$$\sigma_W^2 = \frac{1}{n-3}$$

Note that $\mu_W$ is a function of the population correlation $\rho$. To
construct confidence intervals, we will need to solve the expression for $\mu_W$ for $\rho$:

$$\rho = \frac{e^{2\mu_W} - 1}{e^{2\mu_W} + 1} = \tanh(\mu_W) \qquad (*)$$
Example: The accompanying data on y = glucose concentration
(g/L) and x = fermentation time (days) for a particular brand of malt
liquor are given:
x: 1 2 3 4 5 6 7 8
y: 44 54 52 55 53 57 58 71
Find a 95% confidence interval for the correlation between x and y.

Note: the formula for the confidence interval for $\mu_W$ is

$$W \pm z_{\alpha/2}\,\frac{1}{\sqrt{n-3}}$$

To obtain a 95% confidence interval for $\rho$, we transform the inequality
using (*), obtaining
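A sketch of the whole calculation for the malt-liquor data (z = 1.96 for a 95% interval; tanh is the inverse of the W-transformation, and all variable names are mine):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]            # fermentation time (days)
y = [44, 54, 52, 55, 53, 57, 58, 71]    # glucose concentration (g/L)
n = len(x)

# Summation quantities and sample correlation
Sxx = sum(v*v for v in x) - sum(x)**2 / n
Syy = sum(v*v for v in y) - sum(y)**2 / n
Sxy = sum(a*b for a, b in zip(x, y)) - sum(x)*sum(y) / n
r = Sxx and Sxy / math.sqrt(Sxx * Syy)  # ≈ 0.854

W = 0.5 * math.log((1 + r) / (1 - r))   # Fisher transformation of r
se = 1 / math.sqrt(n - 3)               # standard deviation of W
z = 1.96                                # z_{0.025} for 95% confidence

# Interval for mu_W, then transformed back to rho with tanh
lo, hi = math.tanh(W - z*se), math.tanh(W + z*se)
print(round(lo, 3), round(hi, 3))  # ≈ 0.376, 0.973
```

Note how asymmetric the interval is around r: the tanh transform squeezes the upper end toward 1.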

For testing null hypotheses of the form $\rho \le \rho_0$, $\rho \ge \rho_0$, and $\rho = \rho_0$,
where $\rho_0$ is a constant not equal to 0, the quantity W forms the basis
of a test. Refer to the above example: find the p-value for testing
$H_0\!: \rho \le 0.4$ versus $H_a\!: \rho > 0.4$.


For testing null hypotheses of the form $\rho \le 0$, $\rho \ge 0$, or $\rho = 0$, a
somewhat simpler procedure is available. When $\rho = 0$, the quantity

$$U = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

has a Student's t distribution with n - 2 degrees of freedom. Refer to
the same example above: test the hypothesis $H_0\!: \rho \le 0$ versus
$H_a\!: \rho > 0$.
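Continuing with the same glucose data, a sketch of computing U (the p-value itself would come from a t table with n - 2 = 6 degrees of freedom, which I leave out here):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]            # fermentation time (days)
y = [44, 54, 52, 55, 53, 57, 58, 71]    # glucose concentration (g/L)
n = len(x)

Sxx = sum(v*v for v in x) - sum(x)**2 / n
Syy = sum(v*v for v in y) - sum(y)**2 / n
Sxy = sum(a*b for a, b in zip(x, y)) - sum(x)*sum(y) / n
r = Sxy / math.sqrt(Sxx * Syy)

# Test statistic for H0: rho <= 0, compared against t with n-2 df
U = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(U, 3))  # ≈ 4.025, far in the upper tail of t_6, so reject H0
```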

CORRELATION AND CAUSATION
Just because the value of r is close to 1, this does not mean that x
"causes" y.
A strong (negative or positive) correlation does not necessarily imply
a cause-and-effect relationship between x and y. Often there is a
third, hidden variable which creates an apparent relationship between
x and y. Consider the example below.
Example: A sample of students from a grade school was given a
vocabulary test. A high positive correlation was found between
x = "student's height" and y = "student's score on the test".


Should one infer that growing taller will increase one's vocabulary?
Explain. Also, indicate a third variable which could offer a plausible
explanation for this apparent relationship.

Note: A cause-and-effect relationship is best established by an
experiment in which other variables that influence x and y are
controlled.
Question: In the example above, how might a more controlled
experiment be performed?

The Least-Squares Line (Section 7.2, page 523)
Given two numerical variables x and y , the general objective of
regression analysis is to use information about x to draw some
type of conclusion concerning y .
Sometimes investigators would like to "predict" the y-value that
would result from making a single observation at a specified x-value.
Terminology: y is called the dependent or response variable and
x is referred to as the independent, predictor, or explanatory
variable.

We now know how to create a scatter plot of y versus x. Note that
when we create a scatter plot, we often draw a line through the
points. This line summarizes the relationship between the
variables and is called the least-squares regression line. Linear
models with only one independent variable are known as simple
linear regression models. Linear models with more than one
independent variable are called multiple regression models. We
will only be looking at simple linear regression.
Computing the Equation of the Least-Squares Line
A set of points (bivariate observations) may or may not have a linear
relationship. If we want to draw a straight line which "fits" these
points, which line fits best? Our choice depends on what we use to
define a "good" fit.
Given a sample of bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the least-
squares line L is the line fitted to the points in such a way that
$e_1^2 + e_2^2 + \cdots + e_n^2$ (the sum of the squared vertical distances, or
residuals, from the points to the line) is as small as possible:

The linear model is:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where $y_i$ is called the dependent variable, $x_i$ is called the
independent variable, $\beta_0$ and $\beta_1$ are the regression coefficients,
and $\varepsilon_i$ is called the error.


To compute the equation of the least-squares line

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

we must determine the values for the slope $\hat{\beta}_1$ and the intercept $\hat{\beta}_0$

that minimize the sum of the squared residuals $\sum_{i=1}^{n} e_i^2$.

The formulas for the slope and intercept are:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, \qquad \text{where} \qquad
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \qquad \text{and} \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}$$
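As a sketch, applying these slope and intercept formulas to the studying-and-grades data from the start of the lecture (variable names are mine):

```python
x = [14, 3, 7, 9, 0]      # hours studying per week
y = [95, 53, 76, 88, 28]  # midterm grade (%)
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n

b1 = Sxy / Sxx            # slope estimate
b0 = ybar - b1 * xbar     # intercept estimate
print(round(b0, 2), round(b1, 2))  # fitted line: y-hat ≈ 35.96 + 4.85x
```

The fitted slope says each extra hour of weekly studying is associated with roughly a 4.85-point increase in the predicted midterm grade for this data set.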