Let the n observed values of x and y be termed xi and yi, where i = 1, 2, 3, ... , n.
Σε² is minimized when b0 and b1 take on the following values:
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b0 = ȳ − b1x̄
| Province | Income | Alcohol |
|---|---|---|
| Newfoundland | 26.8 | 8.7 |
| Prince Edward Island | 27.1 | 8.4 |
| Nova Scotia | 29.5 | 8.8 |
| New Brunswick | 28.4 | 7.6 |
| Quebec | 30.8 | 8.9 |
| Ontario | 36.4 | 10.0 |
| Manitoba | 30.4 | 9.7 |
| Saskatchewan | 29.8 | 8.9 |
| Alberta | 35.1 | 11.1 |
| British Columbia | 32.5 | 10.9 |
Income is family income in thousands of dollars per capita, 1986. (independent variable)
Alcohol is litres of alcohol consumed per person 15 years of age or over, 1985-86. (dependent variable)
Is alcohol a superior good?
Sources: Saskatchewan Alcohol and Drug Abuse Commission, Fast Factsheet, Regina, 1988.
Statistics Canada, Economic Families – 1986 [machine-readable data file], 1988.
Hypotheses
H0: β1 = 0. Income has no effect on alcohol consumption.
H1: β1 > 0. Income has a positive effect on alcohol consumption.
| Province | x | y | x − x̄ | y − ȳ | (x − x̄)(y − ȳ) | (x − x̄)² |
|---|---|---|---|---|---|---|
| Newfoundland | 26.8 | 8.7 | -3.88 | -0.6 | 2.328 | 15.0544 |
| PEI | 27.1 | 8.4 | -3.58 | -0.9 | 3.222 | 12.8164 |
| Nova Scotia | 29.5 | 8.8 | -1.18 | -0.5 | 0.59 | 1.3924 |
| New Brunswick | 28.4 | 7.6 | -2.28 | -1.7 | 3.876 | 5.1984 |
| Quebec | 30.8 | 8.9 | 0.12 | -0.4 | -0.048 | 0.0144 |
| Ontario | 36.4 | 10 | 5.72 | 0.7 | 4.004 | 32.7184 |
| Manitoba | 30.4 | 9.7 | -0.28 | 0.4 | -0.112 | 0.0784 |
| Saskatchewan | 29.8 | 8.9 | -0.88 | -0.4 | 0.352 | 0.7744 |
| Alberta | 35.1 | 11.1 | 4.42 | 1.8 | 7.956 | 19.5364 |
| British Columbia | 32.5 | 10.9 | 1.82 | 1.6 | 2.912 | 3.3124 |
| sum | 306.8 | 93 | -6.8E-14 | -7.1E-15 | 25.08 | 90.896 |
| mean | 30.68 | 9.3 | | | | |

b1 = 0.275919732
b0 = 0.834782609
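The table calculation above can be reproduced in a few lines. This is a minimal sketch in plain Python rather than the spreadsheet the table came from; the function name `least_squares` and the list names are illustrative choices, not part of the original notes.

```python
# Income (x, thousands of dollars per capita, 1986) and
# alcohol (y, litres per person aged 15 or over, 1985-86), in table order.
income = [26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5]
alcohol = [8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9]


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squared deviations of x,
    # i.e. the last two columns of the table.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(income, alcohol)
print(f"b1 = {b1:.6f}, b0 = {b0:.6f}")
```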
SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.790288 |
| R Square | 0.624555 |
| Adjusted R Square | 0.577624 |
| Standard Error | 0.721104 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 6.920067 | 6.920067 | 13.30803 | 0.006513 |
| Residual | 8 | 4.159933 | 0.519992 | | |
| Total | 9 | 11.08 | | | |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 0.834783 | 2.331675 | 0.358018 | 0.729592 |
| X Variable 1 | 0.27592 | 0.075636 | 3.648018 | 0.006513 |
Analysis. b1 = 0.276 and its standard error is 0.076, for a t value of 3.648. At α = 0.01, the null hypothesis can be rejected and the alternative hypothesis accepted: the P-value reported in the output is two-tailed (0.0065), so under H0 the one-tailed probability of a t this large or larger is about 0.0033. At the 0.01 level of significance, there is evidence that alcohol is a superior good, i.e. that income has a positive effect on alcohol consumption.
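The t value in this analysis can be checked from the raw data. The sketch below recomputes the standard error of b1 and the t statistic using only the standard library. The critical value 2.896 (one-tailed, α = 0.01, 8 degrees of freedom) is an assumption taken from a standard t table, not a figure from these notes, and the variable names are illustrative.

```python
import math

income = [26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5]
alcohol = [8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(alcohol) / n
s_xx = sum((x - x_bar) ** 2 for x in income)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, alcohol))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual sum of squares and the standard error of the estimate (n - 2 df).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(income, alcohol))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic for H0: beta1 = 0.
se_b1 = s / math.sqrt(s_xx)
t_stat = b1 / se_b1

T_CRIT = 2.896  # assumed: one-tailed critical t, alpha = 0.01, 8 df (t table)
reject_h0 = t_stat > T_CRIT

print(f"se(b1) = {se_b1:.6f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```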
Draw line: select two x values (e.g. 26 and 36) and compute the predicted y values (8.0 and 10.8, respectively). Plot these two points and draw the line through them.
Interpolation. If a city had a mean income of $32,000, the expected level of alcohol consumption would be 9.7 litres per capita.
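Predicted values such as these come from substituting x into the fitted line. A minimal sketch, using the rounded coefficients from the summary output; the function name `predict` is a hypothetical helper, not something defined in the notes.

```python
# Coefficients as reported in the regression output.
B0 = 0.834783   # intercept
B1 = 0.27592    # slope: litres per additional thousand dollars of income


def predict(income_thousands):
    """Predicted alcohol consumption (litres) at a given income ($000s)."""
    return B0 + B1 * income_thousands


# Two end points for drawing the line, plus the interpolated value at 32.
for x in (26, 36, 32):
    print(f"income = {x} thousand -> predicted consumption = {predict(x):.1f} litres")
```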
Extrapolation
Suppose a city had a mean income of $50,000 in 1986. From the equation, expected alcohol consumption would be 14.6 litres per capita.
Cautions:
Model was tested over the range of income values from 26 to 36 thousand dollars. While it appears to be close to a straight line over this range, there is no assurance that a linear relation exists outside this range.
Model does not fit all points – only 62% of the variation in alcohol consumption is explained by this linear model.
Confidence intervals for prediction become larger the further the independent variable x is from its mean.
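The last caution can be illustrated numerically. The sketch below assumes the usual simple-regression formula for the standard error of an individual prediction, s·sqrt(1 + 1/n + (x0 − x̄)²/Σ(xi − x̄)²). That formula is not stated in these notes. The constants are taken from the tables above, and `se_prediction` is an illustrative name.

```python
import math

S = 0.721104     # standard error of the estimate, from the summary output
N = 10           # number of observations
X_BAR = 30.68    # mean income
S_XX = 90.896    # sum of squared deviations of x from its mean


def se_prediction(x0):
    """Standard error of an individual predicted y at x = x0 (assumed formula)."""
    return S * math.sqrt(1 + 1 / N + (x0 - X_BAR) ** 2 / S_XX)


# At the mean, inside the observed range, and well outside it.
for x0 in (30.68, 32, 50):
    print(f"x = {x0:5.2f}: standard error of prediction = {se_prediction(x0):.3f}")
```

The standard error at x = 50 is more than twice the value near the mean, which is why the extrapolated estimate deserves much less confidence than the interpolated one.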
Change in y resulting from change in x
Estimate of change in y resulting from a change in x is b1.
For the alcohol consumption example, b1 = 0.276.
A 10.0 thousand dollar increase in income is associated with a 2.76 litre increase in annual alcohol consumption per capita, at least over the range estimated.
This can be used to calculate the income elasticity for alcohol consumption.
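One way to carry out that calculation is sketched below. It assumes the elasticity is evaluated at the sample means, b1 × (x̄ / ȳ), which is one common convention. The notes do not say which point on the line should be used, so treat the choice as an assumption.

```python
B1 = 0.275919732   # estimated slope, litres per thousand dollars
X_BAR = 30.68      # mean income, thousands of dollars
Y_BAR = 9.3        # mean alcohol consumption, litres

# Point elasticity at the means: percentage change in y per
# one percent change in x, evaluated at (x-bar, y-bar).
elasticity_at_means = B1 * X_BAR / Y_BAR

print(f"income elasticity at the means = {elasticity_at_means:.2f}")
```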
How much of y is explained statistically from the regression model, in this case the line?
Total variation in y is termed the total sum of squares, or SST.
The common measure of goodness of fit of the line is the coefficient of determination, the proportion of the variation or SST that is “explained” by the line.
The difference of any observed value of y from the mean is the difference between the observed and predicted value plus the difference of the predicted value from the mean of y:
(yi − ȳ) = (yi − ŷi) + (ŷi − ȳ)
Difference from mean = “error” of prediction + value of y “explained” by the line.
From this, it can be proved that:
Σ(yi − ȳ)² = Σ(yi − ŷi)² + Σ(ŷi − ȳ)²
SST = SSE + SSR, where
SST = total variation of y
SSE = “unexplained” or “error” variation of y
SSR = “explained” variation of y
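The proof that total variation splits into “unexplained” and “explained” parts is not given in these notes. A sketch of the usual argument follows, using the least squares conditions Σ(yi − ŷi) = 0 and Σ(yi − ŷi)xi = 0; this is the standard textbook route, not necessarily the one the notes had in mind.

```latex
\begin{align*}
y_i - \bar{y} &= (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \\
\sum_{i=1}^{n} (y_i - \bar{y})^2
  &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
   + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
   + 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\
\hat{y}_i - \bar{y} &= b_1 (x_i - \bar{x})
  \quad \text{since } \bar{y} = b_0 + b_1 \bar{x} \\
\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
  &= b_1 \Bigl[ \sum_{i=1}^{n} (y_i - \hat{y}_i)\, x_i
     - \bar{x} \sum_{i=1}^{n} (y_i - \hat{y}_i) \Bigr] = 0 \\
\text{SST} &= \text{SSE} + \text{SSR}
\end{align*}
```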
[Figure: Variation in y. Axes: x, y. Labelled: line ŷ = b0 + b1x, yi, ŷi, xi.]
[Figure: Variation in y “explained” by the line. Axes: x, y. Labelled: line ŷ = b0 + b1x, yi, ŷi, xi, “explained” portion.]
[Figure: Variation in y that is “unexplained” or error. Axes: x, y. Labelled: line ŷ = b0 + b1x, yi, ŷi, xi, yi − ŷi (“unexplained” or error).]
Coefficient of determination
The coefficient of determination, r² or R² (the notation used in many texts), is defined as the ratio of the “explained” or regression sum of squares, SSR, to the total variation or sum of squares, SST: R² = SSR / SST.
The coefficient of determination is the square of the correlation coefficient r. As noted by ASW (483), the correlation coefficient, r, is the square root of the coefficient of determination, but with the same sign (positive or negative) as b1.
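These definitions can be checked against the alcohol data. The sketch below computes the three sums of squares, their ratio, and the correlation coefficient with the sign of b1 attached; the variable names are illustrative, and the approach is plain Python rather than the spreadsheet used in the notes.

```python
import math

income = [26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5]
alcohol = [8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(alcohol) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(income, alcohol))
      / sum((x - x_bar) ** 2 for x in income))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in income]

sst = sum((y - y_bar) ** 2 for y in alcohol)                    # total
sse = sum((y - p) ** 2 for y, p in zip(alcohol, predicted))     # unexplained
ssr = sum((p - y_bar) ** 2 for p in predicted)                  # explained

r_squared = ssr / sst
# r is the square root of R squared, carrying the sign of the slope b1.
r = math.copysign(math.sqrt(r_squared), b1)

print(f"SST = {sst:.6f}, SSE = {sse:.6f}, SSR = {ssr:.6f}")
print(f"R squared = {r_squared:.6f}, r = {r:.6f}")
```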
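Calculations for SSE, SSR and SST:

| Province | x | y | Predicted Y | Residuals | SSE: (y − ŷ)² | SSR: (ŷ − ȳ)² | SST: (y − ȳ)² |
|---|---|---|---|---|---|---|---|
| Nfld | 26.8 | 8.7 | 8.229431 | 0.470569 | 0.221435 | 1.146117 | 0.36 |
| PEI | 27.1 | 8.4 | 8.312207 | 0.087793 | 0.007708 | 0.975734 | 0.81 |
| NS | 29.5 | 8.8 | 8.974415 | -0.17441 | 0.03042 | 0.106006 | 0.25 |
| NB | 28.4 | 7.6 | 8.670903 | -1.0709 | 1.146833 | 0.395763 | 2.89 |
| Que | 30.8 | 8.9 | 9.33311 | -0.43311 | 0.187585 | 0.001096 | 0.16 |
| Ont | 36.4 | 10 | 10.87826 | -0.87826 | 0.771342 | 2.490907 | 0.49 |
| Man | 30.4 | 9.7 | 9.222742 | 0.477258 | 0.227775 | 0.005969 | 0.16 |
| SK | 29.8 | 8.9 | 9.057191 | -0.15719 | 0.024709 | 0.058956 | 0.16 |
| Alb | 35.1 | 11.1 | 10.51957 | 0.580435 | 0.336905 | 1.487339 | 3.24 |
| BC | 32.5 | 10.9 | 9.802174 | 1.097826 | 1.205222 | 0.252179 | 2.56 |
| Sum | | | | | 4.159933 | 6.920067 | 11.08 |

R squared = SSR / SST = 6.920067 / 11.08 = 0.624555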
Interpretation of R²
Proportion, or percentage if multiplied by 100, of the variation in the dependent variable that is statistically explained by the regression line.
0 ≤ R² ≤ 1.
Large R² means the line fits the observed points well and the line explains a lot of the variation in the dependent variable, at least in statistical terms.
Small R² means the line does not fit the observed points very well and the line does not explain much of the variation in the dependent variable.