
Summer 2003

Appleglo first-year data by region
(x = first-year advertising expenditures, $ millions; y = first-year sales, $ millions)

Region            x      y
Maine             1.8    104
New Hampshire     1.2     68
Vermont           0.4     39
Massachusetts     0.5     43
Connecticut       2.5    127
Rhode Island      2.5    134
New York          1.5     87
New Jersey        1.2     77
Pennsylvania      1.6    102
Delaware          1.0     65
Maryland          1.5    101
West Virginia     0.7     46
Virginia          1.0     52
Ohio              0.8     33

Does Advertising Increase Sales?

[Figure: scatter plot of first-year sales ($ millions, 0 to 160) against advertising expenditures ($ millions, 0 to 2.5), with one data point labeled (x3, y3).]

Questions:
i) How should we relate advertising expenditure to sales?
ii) What are the expected first-year sales if advertising expenditure is $2.2 million?
iii) How confident are you in your estimate?

15.063 Summer 2003

Regression Analysis

GOAL: Develop a formula that relates two quantities.

x: "independent" (also called "explanatory") variable
• quantity typically under managerial control

Y: "dependent" variable
• magnitude is determined (to some degree) by the value of x
• quantity to be predicted

Examples:

Y (dependent variable)    X (independent variable)
College GPA               SAT score
Lung cancer rate          Amount of cigarette smoking
Stock return              Spending in R&D
First-year sales          Advertising expenditures

Outline
• Simple Linear Regression
• Multiple Regression
• Understanding Regression Output
• Coefficient of Determination $R^2$
• Validating the Regression Model

The Basic Model: Simple Linear Regression

Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ (a sample of size n taken from the population of all (X, Y) values)

Model of the population*:

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Comments:
• The model assumes a linear relationship between x and Y, with y-intercept $\beta_0$ and slope $\beta_1$.
• $\beta_0$ and $\beta_1$ are the parameters for the whole population. We do not know them and will estimate them using $b_0$ and $b_1$, calculated from the data (i.e., from the sample of size n).
• $\varepsilon_i$ is called the error term. Since the Y's do not fall precisely on the line (i.e., they are random variables), we need to add an error term to obtain an equality.
• $\varepsilon_i$ is $N(0, \sigma)$. Thus $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are i.i.d. normally distributed random variables with mean 0 and standard deviation $\sigma$.
• $E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i$ is the expected value of Y for a given x value. It is just the value on the line, as that is where on average the $Y_i$ value would fall for a given $x_i$ value.
• $SD(Y_i \mid x_i) = \sigma$. Notice that the SD of $Y_i$ is equal to the SD of $\varepsilon_i$ and is a constant independent of the value of x.

How do we choose the line that "best" fits the data?

[Figure: scatter plot of first-year sales ($M) against advertising expenditures ($M) with the fitted line drawn through it. The line has intercept $b_0 = 13.82$ and slope $b_1 = 48.60$. For a data point $(x_i, y_i)$, the residual $e_i$ is the vertical gap to the fitted point $(x_i, \hat{y}_i)$.]

Best choices: $b_0 = 13.82$, $b_1 = 48.60$

Regression coefficients: $b_0$ and $b_1$ are estimates of $\beta_0$ and $\beta_1$.
Regression estimate for Y at $x_i$ (prediction): $\hat{y}_i = b_0 + b_1 x_i$
Value of Y at $x_i$: $y_i = b_0 + b_1 x_i + e_i$ (use the error to obtain equality)
Residual (error): $e_i = y_i - \hat{y}_i$

The "best" regression line is the one that chooses $b_0$ and $b_1$ to minimize the total squared error:

$$SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

SSR is the residual sum of squares, analogous to a variance calculation.
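As a minimal sketch of the least-squares idea, the snippet below fits the line to the Appleglo data using the standard closed-form solution for simple regression (slope = sum of cross-deviations divided by sum of squared x-deviations). The slides do not spell out this formula, so treat it as an assumption about how the numbers were obtained; the function name `fit_simple_regression` is made up for illustration.

```python
# Appleglo data (x = advertising, y = first-year sales, both in $ millions).
x = [1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8]
y = [104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33]


def fit_simple_regression(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


b0, b1 = fit_simple_regression(x, y)

# Residual sum of squares for the fitted line.
ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSR = {ssr:.1f}")
```

Rounded to two decimals, the fitted intercept and slope agree with the 13.82 and 48.60 quoted on the slide.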

How Good a Fit to the Line?

The standard error s estimates $\sigma$, the standard deviation of the error $\varepsilon_i$.

[Figure: two scatter plots around a line, with x from 0 to 20 and y from 0 to 9. The lower panel has 10 times the error of the upper panel.]

Coefficient of Determination: $R^2$

• It is a measure of the overall quality of the regression. Specifically, it is the percentage of total variation exhibited in the $y_i$ data that is accounted for or predicted by the sample regression line.

- The sample mean of Y: $\bar{y} = (y_1 + y_2 + \ldots + y_n)/n$
- Total variation in Y: $\sum_{i=1}^{n} (y_i - \bar{y})^2$
- Residual (unaccounted) variation in Y: $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (even the linear model, $\hat{y}_i$, does not explain all the variability in $y_i$)

$$R^2 = \frac{\text{variation accounted for by x variables}}{\text{total variation}} = 1 - \frac{\text{variation not accounted for by x variables}}{\text{total variation}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$R^2$ takes values between 0 and 1 (it is a proportion, usually quoted as a percentage).

[Figure: three scatter plots illustrating the cases below. The Appleglo panel plots first-year sales ($ millions) against advertising expenditures ($ millions); the other two panels plot Y against X on axes from 0 to 30.]

• $R^2 = 0.930$ in our Appleglo example (simple regression of sales on advertising, using the 14 data points given earlier).
• $R^2 = 1$: the x values account for all variation in the Y values.
• $R^2 = 0$: the x values account for no variation in the Y values.
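The definition of $R^2$ as one minus the ratio of residual variation to total variation can be checked numerically. The sketch below is illustrative only: `r_squared` is a made-up helper, and the two four-point data sets are invented to hit the extreme cases $R^2 = 1$ and $R^2 = 0$.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    residual_variation = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))
    total_variation = sum((yi - y_bar) ** 2 for yi in ys)
    return 1 - residual_variation / total_variation


# Appleglo data (advertising and first-year sales, $ millions).
x = [1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8]
y = [104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33]
r2_appleglo = r_squared(x, y)

# Invented extremes: points exactly on y = 2x + 1, and a pattern with zero fitted slope.
r2_perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])
r2_none = r_squared([1, 2, 3, 4], [1, 2, 2, 1])

print(f"Appleglo R^2 = {r2_appleglo:.3f}")
print(f"perfect line R^2 = {r2_perfect:.3f}, no linear trend R^2 = {r2_none:.3f}")
```

In the zero-slope case the fitted line is just the horizontal line at $\bar{y}$, so the residual variation equals the total variation and nothing is explained.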

Correlation and Regression

• Simple regression is correlation in disguise.
• Coefficient of determination = squared correlation coefficient.
• Regression coefficient: $b_1 = \text{correlation} \times s_y / s_x$
• Appleglo: Sales = 13.82 + 48.60 × Advertising
• The coefficients carry units: $b_0$ is in $ millions of sales, and $b_1$ is in $ millions of sales per $ million of advertising.
• If advertising is $2.2 million, then predicted sales are 13.82 + 48.60 × 2.2 = $120.74 million.

What if there is more than one predictor variable?
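The link between correlation and the regression slope can be verified on the Appleglo data. This is a sketch under the usual definitions (sample standard deviations with an n − 1 divisor and the Pearson correlation coefficient); the variable names are illustrative.

```python
import math

# Appleglo data (advertising and first-year sales, $ millions).
x = [1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8]
y = [104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations of x and y.
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Pearson correlation coefficient.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Slope from the correlation, then the intercept and a prediction at x = 2.2.
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar
predicted_sales = b0 + b1 * 2.2

print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}, predicted sales at $2.2M = {predicted_sales:.2f}")
```

The prediction uses unrounded coefficients, so it can differ from the slide's hand calculation in the third decimal place.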

Sales of Nature-Bar ($ millions)

                 Y        x1            x2           x3
Region           Sales    Advertising   Promotions   Competitor's sales
Selkirk          101.8    1.3           0.2          20.40
Susquehanna       44.4    0.7           0.2          30.50
Kittery          108.3    1.4           0.3          24.60
Acton             85.1    0.5           0.4          19.60
Finger Lakes      77.1    0.5           0.6          25.50
Berkshire        158.7    1.9           0.4          21.70
Central          180.4    1.2           1.0           6.80
Providence        64.2    0.4           0.4          12.60
Nashua            74.6    0.6           0.5          31.30
Dunster          143.4    1.3           0.6          18.60
Endicott         120.6    1.6           0.8          19.90
Five-Towns        69.7    1.0           0.3          25.60
Waldeboro         67.8    0.8           0.2          27.40
Jackson          106.7    0.6           0.5          24.30
Stowe            119.6    1.1           0.3          13.70

Multiple Regression

• In general, there are many factors in addition to advertising expenditures that affect sales.
• Multiple regression allows more than one independent variable.

Independent variables: $x_1, x_2, \ldots, x_k$ (k of them)

Data: $(y_1, x_{11}, x_{21}, \ldots, x_{k1}), \ldots, (y_n, x_{1n}, x_{2n}, \ldots, x_{kn})$

Population model:

$$Y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \varepsilon_i$$

$\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are i.i.d. random variables, ~ $N(0, \sigma)$.

Regression coefficients: $b_0, b_1, \ldots, b_k$ are estimates of $\beta_0, \beta_1, \ldots, \beta_k$.

Regression estimate of $y_i$: $\hat{y}_i = b_0 + b_1 x_{1i} + \ldots + b_k x_{ki}$

Goal: Choose $b_0, b_1, \ldots, b_k$ to minimize the residual sum of squares, i.e., minimize:

$$SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
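The slides rely on Excel to carry out this minimization. As a rough sketch of what happens underneath, the snippet below solves the normal equations $(X^T X)\,b = X^T y$ for the Nature-Bar data with a small Gaussian-elimination routine. The technique and both helper names are assumptions made for illustration, not a description of Excel's actual method.

```python
# Nature-Bar data: (sales, advertising, promotions, competitor's sales), $ millions.
data = [
    (101.8, 1.3, 0.2, 20.40), (44.4, 0.7, 0.2, 30.50), (108.3, 1.4, 0.3, 24.60),
    (85.1, 0.5, 0.4, 19.60), (77.1, 0.5, 0.6, 25.50), (158.7, 1.9, 0.4, 21.70),
    (180.4, 1.2, 1.0, 6.80), (64.2, 0.4, 0.4, 12.60), (74.6, 0.6, 0.5, 31.30),
    (143.4, 1.3, 0.6, 18.60), (120.6, 1.6, 0.8, 19.90), (69.7, 1.0, 0.3, 25.60),
    (67.8, 0.8, 0.2, 27.40), (106.7, 0.6, 0.5, 24.30), (119.6, 1.1, 0.3, 13.70),
]


def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def fit_multiple_regression(rows):
    """Return [b0, b1, ..., bk]; each row is (y, x1, ..., xk)."""
    design = [[1.0] + list(row[1:]) for row in rows]  # leading 1 for the intercept
    ys = [row[0] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


coefficients = fit_multiple_regression(data)
fitted = [coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], row[1:]))
          for row in data]
residuals = [row[0] - f for row, f in zip(data, fitted)]
ssr = sum(e * e for e in residuals)

print("b0, b1, b2, b3 =", [round(b, 2) for b in coefficients])
print(f"SSR = {ssr:.2f}")
```

The printed coefficients can be compared with the Excel regression output for the same data. Forming $X^T X$ explicitly is fine for a small, well-scaled problem like this one, but numerical libraries usually prefer a QR or SVD-based solver.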

Regression Output (from Excel)

Regression Statistics
Multiple R             0.913
R Square               0.833
Adjusted R Square      0.787
Standard Error        17.600
Observations              15

Standard error s: an estimate of $\sigma$; $s^2$ is an estimate of the variance.

Analysis of Variance
              df    Sum of Squares    Mean Square    F         Significance F
Regression     3    16997.537         5665.85        18.290    0.000
Residual      11     3407.473          309.77
Total         14    20405.009

                      Coefficients   Standard Error   t Statistic   P-value   Lower 95%   Upper 95%
Intercept              65.71          27.73             2.37         0.033      4.67       126.74
Advertising            48.98          10.66             4.60         0.000     25.52        72.44
Promotions             59.65          23.63             2.53         0.024      7.66       111.65
Competitor's Sales     -1.84           0.81            -2.26         0.040     -3.63        -0.047
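The columns of the coefficient table are related by simple arithmetic, which the following sketch reproduces from the rounded values in the output. The slides defer the theory, so this is only a preview: the critical value 2.201 (two-sided 95% point of Student's t with 11 degrees of freedom) is hard-coded from a standard t table rather than taken from the slides, and `summarize` is a made-up helper.

```python
# Assumption: two-sided 95% critical value of Student's t with 11 degrees of freedom.
T_CRITICAL = 2.201

# (coefficient, standard error) pairs copied from the Excel output.
output = {
    "Intercept": (65.71, 27.73),
    "Advertising": (48.98, 10.66),
    "Promotions": (59.65, 23.63),
    "Competitor's Sales": (-1.84, 0.81),
}


def summarize(coefficient, standard_error, t_critical=T_CRITICAL):
    """Return (t statistic, lower 95% bound, upper 95% bound)."""
    t_statistic = coefficient / standard_error
    margin = t_critical * standard_error
    return t_statistic, coefficient - margin, coefficient + margin


summary = {name: summarize(b, se) for name, (b, se) in output.items()}

for name, (t, lower, upper) in summary.items():
    print(f"{name:<20} t = {t:6.2f}   95% interval = [{lower:7.2f}, {upper:7.2f}]")
```

Because the inputs are rounded to two decimals, the results can differ from Excel's figures in the last digit.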

Understanding Regression Output

1) Regression coefficients: $b_0, b_1, \ldots, b_k$ are estimates of $\beta_0, \beta_1, \ldots, \beta_k$ based on sample data.

Fact: $E[b_j] = \beta_j$ (i.e., if we ran the multiple regression many, many times on new samples, the average value of the $b_j$'s we get would be $\beta_j$).

Example:
• $b_0 = 65.705$ (its interpretation is context dependent; in this case, sales if there is no advertising, no promotions, and no competitor sales)
• $b_1 = 48.979$ (an additional $1 million in advertising is expected to result in an additional $49 million in sales, with the other variables held fixed)
• $b_2 = 59.654$ (an additional $1 million in promotions is expected to result in an additional $60 million in sales, with the other variables held fixed)
• $b_3 = -1.838$ (an increase of $1 million in competitor sales is expected to decrease sales by $1.8 million, with the other variables held fixed)
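To make the interpretation of the coefficients concrete, here is a small sketch that plugs them into the regression equation. The region's inputs ($1.0M advertising, $0.5M promotions, $20M competitor sales) are hypothetical values chosen for illustration, and `predict_sales` is a made-up name.

```python
# Coefficients from the Nature-Bar regression output.
B0 = 65.705
B_ADVERTISING = 48.979
B_PROMOTIONS = 59.654
B_COMPETITOR = -1.838


def predict_sales(advertising, promotions, competitor_sales):
    """Predicted sales ($ millions) for inputs given in $ millions."""
    return (B0
            + B_ADVERTISING * advertising
            + B_PROMOTIONS * promotions
            + B_COMPETITOR * competitor_sales)


# Hypothetical region, then the same region with $1 million more advertising.
baseline = predict_sales(advertising=1.0, promotions=0.5, competitor_sales=20.0)
more_advertising = predict_sales(advertising=2.0, promotions=0.5, competitor_sales=20.0)

print(f"baseline prediction: {baseline:.3f}")
print(f"gain from an extra $1M of advertising: {more_advertising - baseline:.3f}")
```

The gain equals $b_1$ exactly because the other two inputs are held fixed, which is the sense in which each coefficient is read.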

Understanding Regression Output, Continued

2) Standard error s: an estimate of $\sigma$, the SD of each $\varepsilon_i$. It is a measure of the amount of "noise" in the model.
Example: s = 17.60

3) Degrees of freedom: to be explained later.

4) Standard errors of the coefficients: $s_{b_0}, s_{b_1}, \ldots, s_{b_k}$
They are the estimated standard deviations of the estimates $b_0, b_1, \ldots, b_k$. They are useful in assessing the quality of the coefficient estimates and validating the model (explained later).
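The summary statistics in the Excel output can be recovered from its analysis-of-variance table. The sketch below assumes the residual degrees of freedom are n − (k + 1), which matches the value of 11 in the output; the slides leave the explanation of degrees of freedom for later, so read this as a preview rather than a definition.

```python
import math

n = 15   # observations (regions)
k = 3    # predictor variables: advertising, promotions, competitor's sales

# Sums of squares from the analysis-of-variance table.
ss_regression = 16997.537
ss_residual = 3407.473
ss_total = ss_regression + ss_residual

df_residual = n - (k + 1)                       # assumed residual degrees of freedom
mean_square_residual = ss_residual / df_residual

s = math.sqrt(mean_square_residual)             # standard error of the regression
r_squared = ss_regression / ss_total
f_statistic = (ss_regression / k) / mean_square_residual

print(f"s = {s:.2f}, R^2 = {r_squared:.3f}, F = {f_statistic:.2f}")
```

All three values agree with the Regression Statistics and ANOVA rows in the output (17.60, 0.833 and 18.29).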

Coefficient of Determination: $R^2$

• A high $R^2$ means that most of the variation we observe in the $y_i$ data can be attributed to their corresponding x values, a desired property.
• In simple regression, the $R^2$ is higher if the data points are better aligned along a line. The corresponding picture in multiple regression is a plot of the predicted $\hat{y}_i$ vs. the actual $y_i$ data.
• In multiple regression, R is called "Multiple R".
• How high an $R^2$ is "good" enough depends on the situation (for example, the intended use of the regression and the complexity of the problem).
• Users of regression tend to be fixated on $R^2$, but it's not the whole story. It is important that the regression model is "valid."

Caution about $R^2$

• One should not include x variables unrelated to Y in the model just to make the $R^2$ fictitiously high. New x variables will account for some additional variance by chance alone ("fishing"), but these gains would not be validated in new samples.
• Adjusted $R^2$ modifies $R^2$ to account for the number of variables and the sample size, therefore counteracting "fishing":

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - (k + 1)}$$

Rule of thumb: $n \ge 5(k + 2)$, where n = sample size and k = number of predictor variables.
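A quick numerical check of the adjusted $R^2$ formula and the rule of thumb, using the Nature-Bar figures from the Excel output ($R^2$ = 0.833, n = 15, k = 3). Both function names are illustrative.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k predictor variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - (k + 1))


def meets_rule_of_thumb(n, k):
    """True when the sample size satisfies n >= 5(k + 2)."""
    return n >= 5 * (k + 2)


nature_bar_adjusted = adjusted_r_squared(0.833, n=15, k=3)
nature_bar_sample_ok = meets_rule_of_thumb(n=15, k=3)

print(f"adjusted R^2 = {nature_bar_adjusted:.3f}")
print(f"rule of thumb satisfied: {nature_bar_sample_ok}")
```

The adjusted value matches the 0.787 in the Excel output. With three predictors the rule of thumb asks for at least 25 observations, so the 15-region Nature-Bar sample falls short of it.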

Validating the Regression Model

Assumptions about the population:

$$Y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \varepsilon_i \quad (i = 1, \ldots, n)$$

$\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are i.i.d. random variables, ~ $N(0, \sigma)$.

1) Linearity
• If k = 1 (simple regression), one can check visually from the scatter plot.
• "Sanity check": do the signs of the coefficients make sense? Is there a reason to expect non-linearity?

2) Normality of $\varepsilon_i$
• Plot the residuals ($e_i = y_i - \hat{y}_i$).
• They should look evenly random, i.e., scattered.
• Then plot a histogram of the residuals. The resulting distribution should be approximately normal.

Usually, results are fairly robust with respect to this assumption.
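The residuals for the Appleglo simple regression can be computed directly and then plotted or binned into a histogram. The sketch below stops short of plotting and instead counts how many residuals fall within one and two standard errors of zero, a crude numerical stand-in for eyeballing the histogram (for a normal distribution roughly 68% and 95% would be expected). The divisor n − 2 for s is the usual choice for simple regression and is an assumption here.

```python
import math

# Appleglo data (advertising and first-year sales, $ millions).
x = [1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8]
y = [104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - y_hat_i, in the same order as the data.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Standard error of the regression (n - 2 degrees of freedom for one predictor).
s = math.sqrt(sum(e * e for e in residuals) / (n - 2))

within_1s = sum(abs(e) <= s for e in residuals)
within_2s = sum(abs(e) <= 2 * s for e in residuals)

print(f"s = {s:.2f}")
print(f"{within_1s} of {n} residuals within 1s, {within_2s} of {n} within 2s")
```

With only 14 points such counts are a rough guide at best, which fits the remark that results are fairly robust to this assumption.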

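Residual Plots

[Figure: two plots of residuals against X, each centered on 0. One panel is labeled "Healthy"; the other is labeled "Nonlinear".]

A nonlinear pattern can sometimes be fixed, e.g., by inserting $x^2$ as a variable.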

3) Heteroscedasticity
• Do the error terms have constant standard deviation? (i.e., is $SD(\varepsilon_i) = \sigma$ for all i?)
• Check scatter plots of the residuals vs. Y and the x variables.

[Figure: two plots of residuals (−20 to 20) against advertising expenditures (0 to 2.0). One panel shows no evidence of heteroscedasticity; the other shows evidence of heteroscedasticity.]

• May be fixed by introducing a transformation (e.g., use $x^2$ instead of x).
• May be fixed by introducing or eliminating some independent variables.
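The slides check for heteroscedasticity by eye. As a rough numerical companion, one can compare the spread of the residuals for small and large x values. The sketch below is a crude split-sample comparison, not a formal test, and both residual sequences are invented purely for illustration.

```python
import math


def root_mean_square(values):
    """Root-mean-square size of a list of residuals."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def spread_ratio(xs, residuals):
    """RMS residual in the larger-x half divided by RMS residual in the smaller-x half."""
    ordered = [e for _, e in sorted(zip(xs, residuals))]  # residuals sorted by x
    half = len(ordered) // 2
    return root_mean_square(ordered[-half:]) / root_mean_square(ordered[:half])


# Hypothetical residuals: one set with steady spread, one that fans out as x grows.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
steady = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
fanning = [0.1, -0.1, 0.1, -0.1, 0.1, -2, 2, -2, 2, -2]

steady_ratio = spread_ratio(xs, steady)
fanning_ratio = spread_ratio(xs, fanning)

print(f"steady spread ratio:  {steady_ratio:.1f}")
print(f"fanning spread ratio: {fanning_ratio:.1f}")
```

A ratio near 1 is consistent with constant spread, while a ratio far from 1 suggests the residual plot deserves a closer look. How far is "far" depends on the sample size, so the plot remains the primary check.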

4) Autocorrelation: Are the error terms independent?
- Plot the residuals in order and check for patterns.

[Figure: two time plots of residuals (−6 to 6) against observation order (0 to 20). One panel shows no evidence of autocorrelation; the other shows evidence of autocorrelation.]

• Autocorrelation may be present if observations have a natural sequential order (for example, time).
• May be fixed by introducing a variable (frequently time) or transforming a variable.
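One simple way to put a number on "patterns in the ordered residuals" is the lag-1 autocorrelation: the sum of products of neighboring residuals divided by the sum of squared residuals. This statistic is not part of the slides, which rely on the time plot, and the two residual sequences below are invented for illustration.

```python
def lag1_autocorrelation(residuals):
    """Sum of e[t] * e[t-1] divided by the sum of e[t]^2, for residuals in time order."""
    neighbor_products = sum(a * b for a, b in zip(residuals[1:], residuals[:-1]))
    return neighbor_products / sum(e * e for e in residuals)


# Hypothetical residual sequences (assumed to average about zero).
drifting = [2, 2, 1, 1, -1, -1, -2, -2]      # neighbors tend to share a sign
alternating = [2, -2, 1, -1, 2, -2, 1, -1]   # neighbors tend to have opposite signs

drifting_r1 = lag1_autocorrelation(drifting)
alternating_r1 = lag1_autocorrelation(alternating)

print(f"drifting:    {drifting_r1:.2f}")
print(f"alternating: {alternating_r1:.2f}")
```

Values well away from zero in either direction hint at dependence between consecutive errors. A value near zero does not rule out longer-range patterns such as seasonal cycles, so it complements the time plot rather than replacing it.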

Validating the Regression Model: Autocorrelation

Toothpaste monthly sales and promotions:

Month        Sales ($ thousands)    Promotions ($ thousands)
January      63.00                  26
February     65.25                  25
March        69.18                  38.5
April        74.34                  42
May          68.62                  25.1
June         63.71                  24.7
July         64.41                  24.3
August       64.06                  24.1
September    70.36                  42.1
October      75.71                  43
November     67.61                  22
December     62.93                  25

Evidence of autocorrelation in the simple regression of toothpaste monthly sales on promotions:

[Figure: time plot of the residual values (−3 to 5) against time (months 0 to 12).]

Graphs of Non-independent Error Terms (Autocorrelation)

[Figure: two plots of residuals against X, each centered on 0, showing patterns of non-independent error terms.]

Possible solution: Insert the time (sequence) of observation as a variable.

Pitfalls and Issues

1) Overspecification
• Including too many x variables to make $R^2$ fictitiously high.
• Rule of thumb: we should maintain that $n \ge 5(k + 2)$.

2) Extrapolating beyond the range of data (Carter Racing!!)

[Figure: plot against advertising, with the x-axis from 0.0 to 3.0 and the y-axis from 0 to 120.]

Pitfalls and Issues

3) Multicollinearity
• Occurs when two of the x variables are strongly correlated.
• Can give very wrong estimates for the $\beta_i$'s.
• Tell-tale signs:
  - Regression coefficients ($b_i$'s) have the "wrong" sign.
  - Addition or deletion of an independent variable results in large changes in the regression coefficients.
  - Regression coefficients ($b_i$'s) are not significantly different from 0.
• May be fixed by deleting one or more independent variables.

Can We Predict Graduate GPA from College GPA and GMAT?

Student Number    Graduate GPA    College GPA    GMAT
 1                4.0             3.9            640
 2                4.0             3.9            644
 3                3.1             3.1            557
 4                3.1             3.2            550
 5                3.0             3.0            547
 6                3.5             3.5            589
 7                3.1             3.0            533
 8                3.5             3.5            600
 9                3.1             3.2            630
10                3.2             3.2            548
11                3.8             3.7            600
12                4.1             3.9            633
13                2.9             3.0            546
14                3.7             3.7            602
15                3.8             3.8            614
16                3.9             3.9            644
17                3.6             3.7            634
18                3.1             3.0            572
19                3.3             3.2            570
20                4.0             3.9            656
21                3.1             3.1            574
22                3.7             3.7            636
23                3.7             3.7            635
24                3.9             4.0            654
25                3.8             3.8            633

Regression Output (Graduate GPA on College GPA and GMAT)

R Square           0.96
Standard Error     0.08
Observations       25

               Coefficients    Standard Error
Intercept       0.09540        0.28451
College GPA     1.12870        0.10233
GMAT           -0.00088        0.00092

What happened? The GMAT coefficient has the "wrong" sign and is smaller in magnitude than its standard error. College GPA and GMAT are highly correlated!

Correlations:
            Graduate    College    GMAT
Graduate    1
College     0.98        1
GMAT        0.86        0.90       1

Eliminate GMAT (HBS?):

R Square           0.958
Standard Error     0.08
Observations       25

               Coefficients    Standard Error
Intercept      -0.1287         0.1604
College GPA     1.0413         0.0455
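The multicollinearity check amounts to computing the correlation between the two predictors. The sketch below does this for College GPA and GMAT with a hand-written Pearson correlation; the helper name `pearson_correlation` is illustrative.

```python
import math

# Predictor columns for the 25 students, in student-number order.
college_gpa = [3.9, 3.9, 3.1, 3.2, 3.0, 3.5, 3.0, 3.5, 3.2, 3.2, 3.7, 3.9, 3.0,
               3.7, 3.8, 3.9, 3.7, 3.0, 3.2, 3.9, 3.1, 3.7, 3.7, 4.0, 3.8]
gmat = [640, 644, 557, 550, 547, 589, 533, 600, 630, 548, 600, 633, 546,
        602, 614, 644, 634, 572, 570, 656, 574, 636, 635, 654, 633]


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cross = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    a_spread = sum((ai - a_bar) ** 2 for ai in a)
    b_spread = sum((bi - b_bar) ** 2 for bi in b)
    return cross / math.sqrt(a_spread * b_spread)


predictor_correlation = pearson_correlation(college_gpa, gmat)
print(f"correlation(College GPA, GMAT) = {predictor_correlation:.2f}")
```

A correlation this close to 1 means the two predictors carry largely the same information, which is why dropping GMAT barely changes R Square while the College GPA standard error shrinks from 0.10233 to 0.0455.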

Checklist for Evaluating a Linear Regression Model

• Linearity: scatter plot, common sense, and knowing your problem.
• Signs of regression coefficients: do they agree with intuition?
• Normality: plot a histogram of the residuals.
• $R^2$: is it reasonably high in the context?
• Heteroscedasticity: plot the residuals against each x variable.
• Autocorrelation: time series plot of the residuals.
• Multicollinearity: compute correlations between the x variables.
• Statistical test: are the coefficients significantly different from zero? (next time)

Summary and Look Ahead

• Regression is a way to make predictions from one or more predictor variables.
• There are a lot of assumptions that must be checked to make sure the regression model is valid.
• We may not get to Croq'Pain.
