Hi, I need someone to do an online quiz for 3 people. It is 25 minutes long. The lecture is provided as well, and I am also posting one example of the quiz.

STAT2800U: Lecture 18

CHAPTER 7: CORRELATION AND SIMPLE LINEAR REGRESSION

Correlation (Section 7.1, page 505)

BIVARIATE DATA

Most of the data sets we have encountered up to now deal with measurements of one variable on each of several "individuals" (e.g. incomes of individuals, test scores, etc.). As discussed in our first lecture, this type of data is called univariate data.

In many situations it may be of interest to take measurements of two variables on each of several individuals. For example, we may be interested in both the number of hours an individual spends studying for a midterm and the midterm grade. In such a case each observation will consist of two values (a pair of measurements):

(number of hours studying, midterm grade)

This type of data is called bivariate data and is often denoted in the form (x, y), the "x-observation" and "y-observation". A bivariate sample of n observations would then be written as

$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

The reason we take observations on two variables is that we are interested in exploring the relationship between the variables. For example, "Do A-average students tend to study more?"

The simplest possible relationship between two variables x and y is

that of a straight line.

This leads us to scatter plots:

Since we are interested in both the number of hours an individual spends studying for a midterm and the midterm grade, the following data set gives, for 5 individuals, the midterm grade (in percent) and the amount of time spent studying per week (in hours), where x = studying and y = grade: (14, 95), (3, 53), (7, 76), (9, 88), (0, 28)

Draw a scatter plot of this data:

What does this show?

1)

2)

A scatter plot of bivariate numerical data gives a visual impression of

how strongly x values and y values are related.

The following scatter plots display different types of relationships

between the x and y values:


However, to make precise statements about a data set, we must go

beyond just a scatter plot. A correlation coefficient is a quantitative

assessment of the strength of the relationship between x and y.

For example, we know that scatter plot (b) above is linear and positive, but the question is, how positive is it? This is where the correlation coefficient comes in: this coefficient will tell you whether the relationship is a "strong", "moderate" or "weak" positive or negative relationship.

PEARSON'S SAMPLE CORRELATION COEFFICIENT

Definition:

Pearson's sample correlation r is given by

$$r = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}.$$

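Computing formulas for the three summation quantities are

$$S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \qquad S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}, \qquad S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}.$$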
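As an illustration, the computing formulas can be applied to the five (hours studied, grade) pairs given earlier. The following Python sketch is not part of the lecture; the function name `pearson_r` and the variable names are chosen here purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's sample correlation using the computing formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # S_xx, S_yy and S_xy in their "computing formula" form
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Data from the studying example: x = hours studied, y = midterm grade (%)
hours = [14, 3, 7, 9, 0]
grades = [95, 53, 76, 88, 28]

r = pearson_r(hours, grades)
print(round(r, 4))
```

For these five points the intermediate quantities are S_xx = 117.2, S_yy = 3018 and S_xy = 569, so r comes out at roughly 0.957.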

Example: Consider the following bivariate data based on 2 class quizzes taken by 8 different individuals in Paula's Stats class:

Observation:    1  2  3  4  5  6  7  8
x (quiz one):   1  1  2  4  5  5  2  4
y (quiz two):   5  4  4  2  1  2  2  4


Scatter plot:

Pearson's Correlation Coefficient:
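One possible worked computation for the quiz data, offered as a check rather than as the lecture's own solution. It assumes the eight (x, y) pairs are read as (1, 5), (1, 4), (2, 4), (4, 2), (5, 1), (5, 2), (2, 2), (4, 4).

```latex
\[
\sum x_i = 24, \quad \sum y_i = 24, \quad
\sum x_i^2 = 92, \quad \sum y_i^2 = 86, \quad \sum x_i y_i = 60
\]
\[
S_{xx} = 92 - \frac{24^2}{8} = 20, \qquad
S_{yy} = 86 - \frac{24^2}{8} = 14, \qquad
S_{xy} = 60 - \frac{(24)(24)}{8} = -12
\]
\[
r = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}}
  = \frac{-12}{\sqrt{(20)(14)}} \approx -0.717
\]
```

Under that reading of the table, the two quiz scores are negatively related.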

PROPERTIES AND INTERPRETATION OF r

Note: The value of r is always between -1 and +1. A value near the upper limit, +1, tells us there is a strong positive linear relationship, whereas an r close to the lower limit, -1, suggests there is a strong negative linear relationship. Also, the value of r does not depend on which of the two variables is labeled x.


One more note: the value of r does not depend on the unit of measurement for either variable.
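These two properties can be checked numerically. The sketch below reuses the studying data from earlier in the lecture; converting hours to minutes is an illustrative choice made here, not something taken from the notes, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson's sample correlation from the definition (deviations from the means)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

hours = [14, 3, 7, 9, 0]
grades = [95, 53, 76, 88, 28]
minutes = [60 * h for h in hours]  # same studying times in a different unit

r_original = pearson_r(hours, grades)
r_swapped = pearson_r(grades, hours)    # roles of x and y exchanged
r_minutes = pearson_r(minutes, grades)  # x measured in minutes instead of hours

# Both comparisons should hold up to floating-point rounding
print(math.isclose(r_original, r_swapped), math.isclose(r_original, r_minutes))
```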

Describing the strength of the relationship:

So, for our above example:

Inference on the Population Correlation

By this point, we should all have a good understanding of a

population and a sample. The difference between population mean

and a sample mean:

The following data represent a population of 10 weights (lb) of college students in a certain small program:

145 167 158 140 135 177 149 158 189 193

The population mean is therefore:

Now, the following data represents a sample of 3 different weights

(lb) of college students in a certain small program:

158 167 140

The sample mean is therefore:
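The two means asked for above can be computed directly. This is a minimal sketch using the weights as listed; the variable names are illustrative.

```python
# Weights (lb) as listed in the notes
population = [145, 167, 158, 140, 135, 177, 149, 158, 189, 193]
sample = [158, 167, 140]

# The population mean uses every member of the population;
# the sample mean uses only the three sampled weights.
population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(population_mean, sample_mean)
```

The population mean is 1611/10 = 161.1 lb, while the sample mean is 465/3 = 155 lb. The sample mean estimates the population mean but need not equal it.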

When the points $(x_i, y_i)$ are a random sample from a population of ordered pairs, then each point can be thought of as an observation of an ordered pair of random variables (X, Y). The correlation coefficient, or sample correlation, r is then an estimate of the population correlation $\rho_{X,Y}$.

In practice, if X and Y are both normally distributed, then it is a virtual

certainty that X and Y will be bivariate normal, so the confidence

intervals and tests described next will be valid.

Confidence intervals, and most tests, on $\rho_{X,Y}$ are based on the following result:

Let X and Y be random variables with the bivariate normal distribution. Let $\rho$ denote the population correlation between X and Y. Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a random sample from the joint distribution of X and Y. Let r be the sample correlation of the n points. Then the quantity

$$W = \frac{1}{2} \ln \frac{1 + r}{1 - r}$$

is approximately normally distributed, with mean given by

$$\mu_W = \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$$

and variance given by

$$\sigma_W^2 = \frac{1}{n - 3}.$$

Note that $\mu_W$ is a function of the population correlation $\rho$. To construct confidence intervals, we will need to solve the equation for $\mu_W$ for $\rho$:
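The algebra for this step can be sketched as follows. Labeling the final line (*) is an assumption made here; it is presumably the relation that the notes later refer to as (*).

```latex
\begin{align*}
\mu_W = \frac{1}{2}\ln\frac{1+\rho}{1-\rho}
  &\;\Longrightarrow\; e^{2\mu_W} = \frac{1+\rho}{1-\rho} \\
  &\;\Longrightarrow\; e^{2\mu_W}\,(1-\rho) = 1+\rho \\
  &\;\Longrightarrow\; \rho = \frac{e^{2\mu_W}-1}{e^{2\mu_W}+1} \tag{*}
\end{align*}
```

Because this function is increasing in $\mu_W$, applying it to both endpoints of an interval for $\mu_W$ gives an interval for $\rho$.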

Example: The accompanying data on y = glucose concentration

(g/L) and x = fermentation time (days) for a particular brand of malt

liquor is given:

x: 1 2 3 4 5 6 7 8

y: 44 54 52 55 53 57 58 71

Find a 95% confidence interval for the correlation between x and y.


Note: the formula for the confidence interval is


To obtain a 95% confidence interval for $\rho$, we transform the inequality using (*), obtaining
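One way to carry out this computation for the fermentation example is sketched below. It assumes the usual two-sided normal critical value 1.96 for 95% confidence and inverts $\mu_W$ with $\rho = (e^{2\mu_W} - 1)/(e^{2\mu_W} + 1)$. The helper name `to_rho` is illustrative, not from the lecture.

```python
import math

# Fermentation example: x = time (days), y = glucose concentration (g/L)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [44, 54, 52, 55, 53, 57, 58, 71]
n = len(x)

# Sample correlation via the computing formulas
s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
r = s_xy / math.sqrt(s_xx * s_yy)

# W = (1/2) ln((1 + r)/(1 - r)) is approximately normal with variance 1/(n - 3)
w = 0.5 * math.log((1 + r) / (1 - r))
sigma_w = 1 / math.sqrt(n - 3)
z = 1.96  # assumed critical value for a 95% two-sided interval

# Interval for mu_W, then transformed back to the rho scale
mu_low, mu_high = w - z * sigma_w, w + z * sigma_w

def to_rho(mu):
    """Invert mu_W = (1/2) ln((1 + rho)/(1 - rho))."""
    return (math.exp(2 * mu) - 1) / (math.exp(2 * mu) + 1)

rho_low, rho_high = to_rho(mu_low), to_rho(mu_high)
print(round(r, 3), round(rho_low, 3), round(rho_high, 3))
```

With these data r is about 0.854 and the resulting interval is roughly (0.38, 0.97). The interval is wide because n = 8 is small.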

For testing null hypotheses of the form $\rho = \rho_0$, $\rho \le \rho_0$, and $\rho \ge \rho_0$, where $\rho_0$ is a constant not equal to 0, the quantity W forms the basis of a test. Refer to the above example. Find the p-value for testing $H_0: \rho \le 0.4$ versus $H_a: \rho > 0.4$.
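A sketch of this test follows, assuming it is one-sided with the alternative that the correlation exceeds 0.4, so that large values of W count against the null hypothesis. The standard normal CDF is built from `math.erf`; the name `normal_cdf` is illustrative.

```python
import math

# Fermentation example data
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [44, 54, 52, 55, 53, 57, 58, 71]
n = len(x)

s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
r = s_xy / math.sqrt(s_xx * s_yy)

# Observed W and its approximate null distribution at rho_0 = 0.4
w = 0.5 * math.log((1 + r) / (1 - r))
rho_0 = 0.4
mu_0 = 0.5 * math.log((1 + rho_0) / (1 - rho_0))
sigma_w = 1 / math.sqrt(n - 3)

z = (w - mu_0) / sigma_w

def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

# Assumed upper-tailed test: p-value = P(Z >= z)
p_value = 1 - normal_cdf(z)
print(round(z, 2), round(p_value, 3))
```

Under these assumptions z is close to 1.90 and the p-value is about 0.03.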


For testing null hypotheses of the form $\rho = 0$, $\rho \le 0$, or $\rho \ge 0$, a somewhat simpler procedure is available. When $\rho = 0$, the quantity

$$U = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

has a Student's t distribution with n - 2 degrees of freedom. Refer to the same example above and test the hypothesis $H_0: \rho \le 0$ versus $H_a: \rho > 0$.
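The statistic U for the fermentation data can be computed as below. Two assumptions are made that are not stated in the notes: the test is upper-tailed (alternative: positive correlation), and the significance level is 0.05, for which the tabled critical value of Student's t with 6 degrees of freedom is 1.943.

```python
import math

# Fermentation example data
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [44, 54, 52, 55, 53, 57, 58, 71]
n = len(x)

s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
r = s_xy / math.sqrt(s_xx * s_yy)

# U = r * sqrt(n - 2) / sqrt(1 - r^2), compared with t on n - 2 = 6 df
u = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_critical = 1.943  # assumed: upper 5% point of t with 6 df, from a t table
reject_h0 = u > t_critical
print(round(u, 2), reject_h0)
```

Here U is about 4.03, well above 1.943, so under these assumptions the null hypothesis would be rejected at the 5% level.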

CORRELATION AND CAUSATION

Just because the value of r is close to 1, this does not mean that x "causes" y.

A strong (negative or positive) correlation does not necessarily imply

a cause and effect relationship between x and y. Often there is a

third hidden variable which creates an apparent relationship between

x and y. Consider an example below.

Example: A sample of students from a grade school were given a vocabulary test. A high positive correlation was found between x = "student's height" and y = "student's score on the test".

Should one infer that growing taller will increase one's vocabulary? Explain. Also, indicate a third variable which could offer a plausible explanation for this apparent relationship.

Note: A cause and effect relationship is best established by an

experiment in which other variables which influence x and y are

controlled.

Question: In the example above, how might a more controlled

experiment be performed?

Answer:

The Least-Squares Line (Section 7.2, page 523)

Given two numerical variables x and y , the general objective of

regression analysis is to use information about x to draw some

type of conclusion concerning y .

Sometimes investigators would like to "predict" the y-value that would result from making a single observation at a specified x-value.

Terminology: y is called the dependent or response variable and

x is referred to as the independent, predictor, or explanatory

variable.


We now know how to create a scatter plot of y versus x. Note that when we create a scatter plot, we often draw a line through the points. This line summarizes the relationship between the variables, and is called the least squares regression line. Linear models with only one independent variable are known as simple linear regression models. Linear models with more than one independent variable are called multiple regression models. We will only be looking at simple linear regression.

Computing the Equation of the Least-Squares Line

A set of points (bivariate observations) may or may not have a linear relationship. If we want to draw a straight line which "fits" these points, which line fits best? Our choice depends on what we use to define a "good" fit.

Given a sample of bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the least squares line L is the line fitted to the points in such a way that $e_1^2 + e_2^2 + \cdots + e_n^2$ is as small as possible, where each residual $e_i$ is the vertical distance from the point $(x_i, y_i)$ to the line:

The linear model is:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where $y_i$ is called the dependent variable, $x_i$ is called the independent variable, $\beta_0$ and $\beta_1$ are the regression coefficients, and $\varepsilon_i$ is called the error.


To compute the equation of the least-squares line

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

we must determine the values for the slope $\hat{\beta}_1$ and the intercept $\hat{\beta}_0$ that minimize the sum of the squared residuals $\sum_{i=1}^{n} e_i^2$.

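The formulas for the slope and intercept are:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, \quad \text{where} \quad \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$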
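As a sketch of how the slope and intercept formulas are used, the code below fits the least-squares line to the five (hours studied, grade) pairs from the start of the lecture. The function name `predict` and the value 10 hours used in the final line are illustrative choices, not part of the notes.

```python
# Studying example: x = hours studied, y = midterm grade (%)
hours = [14, 3, 7, 9, 0]
grades = [95, 53, 76, 88, 28]
n = len(hours)

x_bar = sum(hours) / n
y_bar = sum(grades) / n

# S_xx and S_xy via the computing formulas
s_xx = sum(x * x for x in hours) - sum(hours) ** 2 / n
s_xy = sum(x * y for x, y in zip(hours, grades)) - sum(hours) * sum(grades) / n

# Least-squares estimates: slope = S_xy / S_xx, intercept = y_bar - slope * x_bar
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def predict(x):
    """Predicted grade on the fitted line for x hours of studying."""
    return intercept + slope * x

print(round(slope, 3), round(intercept, 3), round(predict(10), 1))
```

For these data the slope is about 4.855 and the intercept about 35.96, so each additional hour of studying is associated with a predicted grade roughly 4.9 percentage points higher. The fitted line always passes through the point of means, here (6.6, 68).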