Basic Econometrics - The Economist

Friday, 10 March 2017

ECONOMETRICS NOTES

Econometrics: Econometrics may be defined as the social science in which the tools of economic theory, mathematics and statistical inference are applied to the analysis of economic phenomena.

Scope of Econometrics: The scope of econometrics is much broader than economic measurement.
1. Econometricians are at times mathematicians, busy formulating economic theory as models suitable for statistical testing.
2. At times they are accountants, concerned with the problems of finding and collecting economic data and relating theoretical economic variables to observable ones.
3. At times they are applied statisticians, spending long hours with the computer trying to estimate economic relationships or predict economic events.
4. Econometrics attempts to quantify economic reality and bridge the gap between the abstract world of economic theory and the real world of human activity.

Importance of Econometrics: It deals with the empirical study of economic laws. It combines three important areas:
1) Economics and Business.
2) Mathematical Methods.
3) Statistical Tools.

Econometrics is an empirical discipline: in almost all fields we need econometrics, because different theories or principles can be tested empirically using basic econometric tools.

Uses of Econometrics: There are three major uses of econometrics:
1. Describing economic reality (Modelling and Estimating): Description is one of the simplest uses of econometrics. Econometrics allows us to quantify economic activity and to put numbers into equations that previously contained only abstract symbols. For example, suppose the quantity supplied (Qs) is determined by four factors: the commodity price (P), the cost of production (C), the technique of production (T) and weather conditions (W). The functional relationship between quantity supplied and these factors is:
Qs = f (P,C,T, W)
In mathematical economics, the above abstract relationship of supply may be expressed as:

Qs = a0 + a1P +  a2 C + a3 T + a4 W

Where a0 is the intercept, while a1, a2, a3, a4 are the coefficients (slopes) of the supply equation. The equation above implies that no other factors affect supply, but in economic reality many other factors may affect supply: the invention of new products, improvements in technology, a change in the size of the market, institutional changes, changes in taxes, etc.
In econometrics, a random variable is introduced to take into account the influence of these other factors:
Qs = a0 + a1P + a2C + a3T + a4W + V
Where V stands for the random (stochastic) factors which affect the quantity supplied.
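As a sketch, the stochastic supply equation can be simulated; the coefficient values and the normal disturbance below are purely illustrative assumptions, not estimates from any data:

```python
import random

# Illustrative (assumed) coefficients for the supply equation
# Qs = a0 + a1*P + a2*C + a3*T + a4*W + V
a0, a1, a2, a3, a4 = 5.0, 2.0, -1.5, 0.8, 0.3

def quantity_supplied(P, C, T, W):
    """Deterministic part of the supply equation plus a random disturbance V
    standing in for omitted factors (taxes, inventions, market size, etc.)."""
    V = random.gauss(0, 1)
    return a0 + a1*P + a2*C + a3*T + a4*W + V

random.seed(0)
print(quantity_supplied(P=10, C=4, T=2, W=1))
```

Each call returns the deterministic part (here 20.9 for these inputs) plus a draw of V, so repeated calls scatter around the systematic relationship, just as observed data would.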



2. Testing Hypothesis about Economic Theory (Analysis):
Hypothesis testing is a procedure which enables us to decide, on the basis of information about an economic phenomenon obtained from sample data, whether to accept or reject a statement.
The procedure for testing a hypothesis about a population parameter involves the following six steps:

1. State your problem and formulate an appropriate null hypothesis H0, together with an alternative hypothesis H1 which is to be accepted when H0 is rejected.

2. Decide upon a significance level α of the test, which is the probability of rejecting the null hypothesis if it is true.

3. Choose an appropriate test statistic; determine and sketch the sampling distribution of the test statistic, assuming H0 is true.

4. Determine the rejection (critical) region in such a way that the probability of rejecting the null hypothesis H0, if it is true, is equal to the significance level α. The location of the critical region depends upon the form of H1; the critical value(s) separate the acceptance region from the rejection region.

5. Compute the value of the test statistic from the sample data in order to decide whether to accept or reject the null hypothesis H0.

6. Formulate the decision rule as below:
a) Reject the null hypothesis H0 if the computed value of the test statistic falls in the rejection region, and conclude that H1 is true.
b) Accept the null hypothesis H0 otherwise. When a hypothesis is rejected, we can give a measure of the strength of the rejection by reporting the P-value, the smallest significance level at which the null hypothesis would be rejected.
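The six steps above can be sketched for a one-sample t-test; the sample values and hypothesized mean are illustrative assumptions, and the critical value 2.262 is the standard two-sided 5% point of the t-distribution with 9 degrees of freedom:

```python
from statistics import mean, stdev
from math import sqrt

# Step 1: H0: population mean mu = 50, H1: mu != 50 (two-sided)
sample = [52, 55, 48, 53, 57, 51, 54, 56, 50, 53]   # assumed sample data
mu0 = 50
# Step 2: significance level
alpha = 0.05
# Step 3: test statistic t = (x_bar - mu0) / (s / sqrt(n)), t-distributed under H0
n = len(sample)
t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
# Step 4: critical region |t| > 2.262 (two-sided 5% point, n - 1 = 9 df, from a t table)
t_crit = 2.262
# Steps 5-6: compute the statistic and apply the decision rule
decision = "reject H0" if abs(t) > t_crit else "accept H0"
print(round(t, 3), decision)
```

With these numbers t is about 3.31, which falls in the rejection region, so H0 is rejected at the 5% level.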

3. Forecasting future economic activity: If the chosen model does not refute the hypothesis or theory under consideration, we may use it to predict the future value(s) of the dependent, or forecast, variable Y on the basis of known or expected future value(s) of the explanatory, or predictor, variable X.

Division of Econometrics: Econometrics may be divided into two types as;
1. Theoretical Econometrics
2. Applied Econometrics
1. Theoretical Econometrics: It involves the development of appropriate methods for the measurement of economic phenomena (relationships). Since economic data are observations of real life and are not derived from controlled experiments, econometric methods have been developed for such non-experimental data. These methods may be classified into two groups: single-equation techniques, which involve one relationship at a time, and simultaneous-equation techniques, which are applied to all the relationships of a model simultaneously.
2. Applied Econometrics: It involves the application of the techniques of theoretical econometrics to the analysis and forecasting of economic relationships. Applied econometric research predicts the values of economic variables by means of the measurement of economic parameters.


Econometric Modelling: A model is a simplified picture of reality in which only the most important factors are selected for study, ignoring the others, since it is humanly impossible to study all such influences simultaneously. This difficulty is removed by building models: we assume that only a few factors are really important, and by concentrating our study on these few variables we can come to a correct conclusion, because all the other factors have so small an influence that they can be ignored.
Or a model is simply a set of mathematical equations.
If the model has only one equation, it is called a single-equation model, whereas if it has more than one equation, it is known as a multiple-equation model. The variable appearing on the left side of the equality sign is called the dependent variable and the variable(s) on the right side are called the independent or explanatory variable(s). Thus in the Keynesian consumption function (Y = β1 + β2 X), consumption expenditure is the dependent variable and income is the independent/explanatory variable.
In econometric modelling, a random variable is introduced to take into account the influence of other, omitted factors:
Qs = a0 + a1P +  a2 C + a3 T + a4 W + V
Where
Qs= Quantity supply
P= Price
C= Cost of production
T= Techniques of production
W= Weather
a0 = Intercept; a1, a2, a3, a4 = coefficients of the supply equation
V = Random (stochastic) factors which affect the quantity supplied.











Methodology of Econometrics:  Traditional or Classical Methodology of Econometrics proceeds along the following steps or lines;
1. Statement of theory or hypothesis.
2. Specification of Mathematical Model.
3. Specification of Econometric Model.
4. Obtaining the data.
5. Estimation of the Econometric Model.
6. Hypothesis Testing.
7. Forecasting or prediction.
8. Use of the Model for Control or Policy Purposes.
To illustrate the preceding steps, let us consider the well-known Keynesian theory of consumption.
1. Statement of Theory or Hypothesis: Keynes postulated that the marginal propensity to consume (MPC), the rate of change of consumption for a unit change in income, is greater than zero but less than 1.
2.  Specification of Mathematical Model: Keynes postulated positive relationship between consumption and income. The Mathematical Model can be expressed as;
Y = β1 + β2 X             0 < β2 < 1    ---------------------- 1
Where Y = Consumption expenditure, X = Income, where β1 and β2 known as parameters of the Model, are respectively the intercept and slope coefficients.
The slope coefficient β2 measures the MPC. Geometrically, equation 1 is shown in Figure 1. This equation, which states that consumption is linearly related to income, is an example of a mathematical model of the relationship between consumption and income, called the consumption function in economics. It is a single-equation model, because it contains only one equation. The variable appearing on the left side of the equality sign is called the dependent variable and the variable(s) on the right side are called the independent or explanatory variable(s). In the Keynesian consumption function, consumption (expenditure) is the dependent variable and income is the explanatory variable.
C = f (Y)
C = β1  + β2 Y
Where β1 is intercept and β2 is slope
[Figure: a straight line with intercept β1 and slope β2; consumption expenditure on the vertical axis, income on the horizontal axis]
Figure: 1  Keynesian Consumption Function
3. Specification of Econometric Model: The purely mathematical model of consumption is of limited interest to the econometrician, for it assumes an exact relationship between consumption and income. But relationships between economic variables are generally inexact. Thus, if we were to obtain data on consumption expenditure and disposable income and plot them on graph paper with consumption expenditure on the vertical axis and disposable income on the horizontal axis, we would not expect all 500 observations to lie exactly on a straight line, because in addition to income, other variables affect consumption expenditure. For example, the size of the family, the ages of family members, family religion, etc. are likely to exert some influence on consumption.
The Econometric Model can be expressed as;
Y = β1 + β2 X + μ   --------------------------------- 2                     
Where μ, known as the disturbance or error term, is a random variable that has well-defined probabilistic properties. The disturbance μ represents all those factors that affect consumption but are not taken into account explicitly.
Equation 2 is an example of an econometric model; more technically, it is an example of a linear regression model. The econometric consumption function hypothesizes that the dependent variable Y (consumption) is linearly related to the explanatory variable X (income), but that the relationship between the two is not exact: it is subject to individual variation. The econometric model of the consumption function can be depicted as shown in Figure 2.1.
[Figure: a scatter of consumption-income observations; the vertical deviations of the points from the regression line are the disturbances μ]
Figure 2.1   Econometric Model of the Keynesian Consumption Function.
4. Obtaining the data. To estimate the econometric model given in equation 2, that is, to obtain the numerical values of β1 and β2, we need data.
Data on Y (personal consumption expenditure) and X (gross domestic product).

Observation    Y     X
1              10    15
2              15    30
3              18    40
4              20    45
5              30    50
6              40    60
7              50    70
8              70    75
9              80    85
10             90    100

Figure: 4   Personal consumption expenditure in relation to GDP (X)
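Using the textbook least-squares formulas, the parameters of the consumption function can be estimated from the table above; this is a minimal sketch using plain sums, with no regression library:

```python
# OLS slope and intercept for the consumption data above (Y on X), via
# b = [n*SXY - SX*SY] / [n*SXX - SX^2]  and  a = Ybar - b*Xbar
Y = [10, 15, 18, 20, 30, 40, 50, 70, 80, 90]   # personal consumption expenditure
X = [15, 30, 40, 45, 50, 60, 70, 75, 85, 100]  # income (GDP)
n = len(X)
SX, SY = sum(X), sum(Y)
SXY = sum(x*y for x, y in zip(X, Y))
SXX = sum(x*x for x in X)
b = (n*SXY - SX*SY) / (n*SXX - SX**2)   # estimated slope (beta2)
a = SY/n - b * SX/n                     # estimated intercept (beta1)
print(round(a, 3), round(b, 3))
```

For these illustrative numbers the estimates are roughly a = -18.89 and b = 1.07; with real consumption data the slope would be interpreted as the MPC.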

5. Estimation of the Econometric Model: Now that we have the data, our next task is to estimate the parameters of the consumption function. The numerical estimates of the parameters give empirical content to the consumption function. Using OLS (ordinary least squares) or regression analysis (the main tool used to obtain the estimates), the parameters β1 and β2 can be obtained and the estimated consumption function can be stated.
Example: The following marks have been obtained by a class of students of agricultural economics in the subject of statistics (out of 100) in Paper I and Paper II.

Paper I:   45   55   56   58   60   65   68   70   75   80   85
Paper II:  56   50   48   60   62   64   65   70   74   82   90

Compute the coefficient of correlation for the above data. Find also the equations of the lines of regression.


Solution:

X      Y      XY       X²       Y²
45     56     2520     2025     3136
55     50     2750     3025     2500
56     48     2688     3136     2304
58     60     3480     3364     3600
60     62     3720     3600     3844
65     64     4160     4225     4096
68     65     4420     4624     4225
70     70     4900     4900     4900
75     74     5550     5625     5476
80     82     6560     6400     6724
85     90     7650     7225     8100
717    721    48398    48149    48905

Coefficient of Correlation

r = [n ΣXY – (ΣX)(ΣY)] / √{[n ΣX² – (ΣX)²][n ΣY² – (ΣY)²]}
  = [11(48398) – (717)(721)] / √{[11(48149) – (717)²][11(48905) – (721)²]}
  = 0.91884    (-1 ≤ r ≤ 1)

Coefficient of Determination (r²): It tells us the percentage of variation in the response variable that is explained (determined) by the model and the explanatory variable.
If r² = 0.84, it means that 84% of the variability in y is explained by x using this model.

Case 1
Estimated linear regression of Y on X is ŷ = a + bX
Where
a = Intercept (the point where the line crosses the Y axis)
b = Slope

b = [n ΣXY – (ΣX)(ΣY)] / [n ΣX² – (ΣX)²] = [11(48398) – (717)(721)] / [11(48149) – (717)²] = 0.9917

a = Ȳ – bX̄ = 65.545 – 0.9917(65.182) = 0.9047

Thus the equation for the line of best fit is ŷ = a + bX = 0.9047 + 0.9917X    ---------(3)


Case 2
Estimated linear regression of X on Y is X̂ = a + bY
Where
a = Intercept (the point where the line crosses the X axis)
b = Slope

b = [n ΣXY – (ΣX)(ΣY)] / [n ΣY² – (ΣY)²] = [11(48398) – (717)(721)] / [11(48905) – (721)²] = 0.8513

a = X̄ – bȲ = 65.182 – 0.8513(65.545) = 9.38

Thus the equation for the line of best fit is X̂ = a + bY = 9.38 + 0.8513Y  ---------(4)
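The same computations can be checked in a few lines; this sketch recomputes r and both regression slopes from the raw marks, and the values agree with the worked results above up to rounding:

```python
from math import sqrt

# Correlation and both regression slopes for the Paper I / Paper II marks
X = [45, 55, 56, 58, 60, 65, 68, 70, 75, 80, 85]  # Paper I
Y = [56, 50, 48, 60, 62, 64, 65, 70, 74, 82, 90]  # Paper II
n = len(X)
SX, SY = sum(X), sum(Y)
SXY = sum(x*y for x, y in zip(X, Y))
SXX = sum(x*x for x in X)
SYY = sum(y*y for y in Y)
num = n*SXY - SX*SY
r = num / sqrt((n*SXX - SX**2) * (n*SYY - SY**2))  # correlation coefficient
b_yx = num / (n*SXX - SX**2)   # slope of the regression of Y on X
b_xy = num / (n*SYY - SY**2)   # slope of the regression of X on Y
print(round(r, 4), round(b_yx, 4), round(b_xy, 4))
```

Note that r² = b_yx * b_xy, which is a quick consistency check between the two regression lines.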
   
[Figure: the scatter of consumption-income observations around the fitted regression line]
Figure 1   Econometric Model of the Keynesian Consumption Function

Equation 4 indicates that the regression line fits the data quite well, in that the data points are very close to the regression line. From this result, we see that the slope coefficient (i.e. the MPC) is about 0.85, suggesting that for the sample period a one-unit increase in real income leads, on average, to an increase of about 0.85 units in real consumption expenditure. We say "on average" because the relationship between consumption and income is inexact; as is clear from Figure 1, not all the data points lie exactly on the regression line.

How to compute or calculate the value of the residual μ̂i:
μ̂i = Yi – Ŷi,  and Σ μ̂i = 0
Y = 100 + 0.75X + μ      ------------ Econometric Model

As income increases, consumption also increases, but at a lower rate: with a slope of 0.75, if income rises by 100 units, consumption rises by about 75 units.
To see whether these results are statistically significant, we use the Standard Error (Se):

Se(β̂2) = √(σ̂² / Σxi²)      (in practice, the value of Se is calculated by computer)

The value of the Standard Error (Se) must be less than half the value of the parameter estimate; if it is, the results are statistically significant, otherwise not.

e.g. if the value of the parameter β2 = 0.99 and the value of Se works out to, say, 0.33 (less than half the value of the parameter), then the results are statistically significant.

For a multiple regression model we use R² (overall fit, 0 < R² < 1), while for a simple regression model we use r² (overall fit, 0 < r² < 1).
e.g. r² = 0.55: the value is closer to 1, hence the relationship is stronger;
r² = 0.33: the value is farther from 1, hence the relationship is weaker.

Serial Correlation: Depending on the type/nature of the data (time series or cross-sectional), we use the Durbin-Watson statistic to check for serial correlation.
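As a sketch, the Durbin-Watson statistic can be computed directly from the regression residuals; the residual values below are made-up numbers for illustration only. Values near 2 indicate no first-order serial correlation, values toward 0 indicate positive serial correlation, and values toward 4 indicate negative serial correlation:

```python
# Durbin-Watson statistic: d = sum((e_t - e_{t-1})^2) / sum(e_t^2)
# The residuals e below are assumed, illustrative values.
e = [0.5, -0.3, 0.8, -0.6, 0.2, -0.4, 0.7, -0.5]
d = sum((e[t] - e[t-1])**2 for t in range(1, len(e))) / sum(x*x for x in e)
print(round(d, 3))
```

These alternating residuals give d above 2, suggesting negative serial correlation.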

6. Hypothesis Testing: Assuming that the fitted model is a reasonably good approximation of reality, we have to develop suitable criteria to find out whether the estimates obtained in equation 5 are in accord with the expectations of the theory being tested. Keynes expected the MPC, β2, to be positive but less than 1. In our example we found the MPC to be about 0.8, which supports Keynes's theory that the MPC is positive but less than 1. Such confirmation of economic theory on the basis of sample evidence is based on the branch of statistical theory known as statistical inference (hypothesis testing).

How do we interpret the theory for its validity? We use two hypotheses, as under:

· Hypothesis No. 1:     β1 > 0  or  α > 0   (the value of β1 or α must be positive, i.e. greater than zero).
· Hypothesis No. 2:     0 < β2 < 1  or  0 < α1 < 1   (the value of β2 or α1 must lie between 0 and 1; the closer it is to one, the stronger the relationship, otherwise the weaker).

7. Forecasting or Prediction: If the chosen model does not refute the hypothesis or theory under consideration, we may use it to predict the future value(s) of the dependent, or forecast, variable Y on the basis of known or expected future value(s) of the explanatory, or predictor, variable X.
To illustrate, suppose we want to predict the mean consumption expenditure; the predicted value of Y will be:
Ŷ = -3.3 – 0.8(57) = -48.9
8. Use of Model for Control or Policy Purposes: Suppose we have the estimated consumption function given in equation 5. Suppose further the government believes that a consumer expenditure of about 4900 billion will keep the unemployment rate at its current level of about 4.2%. What level of income will guarantee the target amount of consumption expenditure?
If the regression results given in equation 5 seem reasonable, simple arithmetic shows that
-48.9 = -3.3 – 0.8 X
which gives X = 57, approximately: that income level, given an MPC of about 0.8, will produce the target expenditure. As these calculations suggest, an estimated model may be used for control, or policy, purposes. By an appropriate mix of fiscal and monetary policy, the government can manipulate the control variable X to produce the desired level of the target variable Y.
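Using the coefficients as they appear in the text's example, solving the estimated consumption function for the income level X that delivers a target expenditure is simple arithmetic; a sketch:

```python
# Solving the estimated consumption function for the control variable X:
# Y = b1 + b2*X  =>  X = (Y - b1) / b2
# Coefficient values below are taken from the text's example as written.
b1, b2 = -3.3, -0.8
Y_target = -48.9
X = (Y_target - b1) / b2
print(X)
```

This recovers X = 57, matching the arithmetic in the text.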

 
 
 

 
 

 
 

 


 

 

 

 


[Figure: flow chart of the eight steps above, from statement of theory through estimation, testing, forecasting and policy use]
Figure: 6   Anatomy of Econometric Modeling.

Structure of Economic Data: The success of any econometric analysis ultimately depends on the availability of the appropriate data. It is therefore essential that we spend some time discussing the nature, sources and limitations of the data that one may encounter in empirical analysis.
Types of Data: Three types of data may be available for empirical analysis: time series, cross-section, and pooled (i.e. a combination of time series and cross-section) data.
Time Series Data: The data shown in Table 7 are an example of time series data.


Table No. 7      U.S. Egg Production

States    Y1       Y2       X1       X2
AL        2,206    2,186    92.7     91.4
AK        0.7      0.7      151.0    149.0
AZ        73       74       61.0     56.0
AR        3,620    3,737    86.3     91.8
CA        7,472    7,444    63.4     58.4
CO        788      873      77.8     73.0
CT        1,029    948      106.0    104.0
Note: Y1 = Eggs produced in 1990 (millions)
Y2 = Eggs produced in 1991 (millions)
X1 = Price per dozen (cents) in 1990
X2 = Price per dozen (cents) in 1991
A time series is a set of observations on the values that a variable takes at different times. Such data may be collected at regular time intervals: daily (e.g. stock prices, weather reports), weekly (e.g. money supply figures), monthly (e.g. the unemployment rate, the Consumer Price Index (CPI)), quarterly (e.g. GDP), annually (e.g. government budgets), quinquennially, that is, every 5 years (e.g. the census of manufactures), or decennially (e.g. the census of population). Sometimes data are available both quarterly and annually, as in the case of GDP and consumer expenditure. Although time series data are used heavily in econometric studies, they present special problems for econometricians: most empirical work based on time series data assumes that the underlying time series is stationary. Figure 8 depicts the behaviour of the M1 money supply in the United States from January 1, 1959 to July 31, 1999. As we can see from this figure, the M1 money supply shows a steady upward trend as well as variability over the years, suggesting that the M1 time series is not stationary.
 









 
[Figure: the M1 money supply series, showing a steady upward trend over the sample period]
Figure No. 8  M1 money supply: United States, 1951:01-1999:09.
Cross-Section Data: Cross-section data are data on one or more variables collected at the same point in time, such as the census of population in Pakistan conducted every 10 years. Just as time series data create their own special problems (because of the stationarity issue), cross-sectional data too have their own problems, specifically the problem of heterogeneity. Some areas of a country produce huge amounts of eggs and some produce very little. When we include such heterogeneous units in a statistical analysis, the size or scale effect must be taken into account so as not to mix apples with oranges. To see this clearly, Figure 9 plots the data on eggs produced and their prices in 50 areas for the year 1999; the figure shows how widely scattered the observations are.




 
[Figure: scatter plot of the price of eggs per dozen against the number of eggs produced, showing widely dispersed observations]
Figure No. 9  Relationship between eggs produced and prices, 1999.
Pooled Data: Pooled, or combined, data contain elements of both time series and cross-section data. The data in Table No. 7 are an example of pooled data. Likewise, a table giving the Consumer Price Index (CPI) for each of seven countries for 1973-1997 would be pooled data: the CPI series for each country is time series data, whereas the data on the CPI for the seven countries for a single year are cross-sectional data. In such pooled data there are 175 observations: 25 annual observations for each of the seven countries. Panel, longitudinal or micropanel data are a special type of pooled data in which the same cross-sectional unit (say, a family or a firm) is surveyed over time.
Measurement scales of variables: Variables fall into four broad categories: nominal, ordinal, interval and ratio scales.
Nominal Scales: We assign a code to express identity or category, e.g. good or bad, success or failure, literate or illiterate.
Ordinal Scales: There is a proper order or ranking, e.g. grades A, B, C, D, or Excellent, Very Good, Good, Satisfactory, Bad. Ordinal scales show sequence and direction, but not the size of the differences.
Interval Scales: These measure exact differences in performance. On an ordinal scale we cannot assess how much better A (or Excellent) is than B (or Very Good); if A attains 90 marks and B attains 80 marks, the difference of 10 marks is meaningful, and it is the interval scale that provides it.
Ratio Scales: These are based on an absolute zero. e.g. the Centigrade and Fahrenheit scales do not start from absolute zero, while the Kelvin scale does, so Kelvin is a ratio scale.

Regression Analysis: Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, with a view to estimating or predicting the average value of the former in terms of the known or fixed values of the latter.
For example, an agronomist may be interested in studying the dependence of crop yield, say of wheat, on seed quality, temperature, irrigation and fertilizer. Such analysis may enable prediction or forecasting of the average crop yield, given information about the explanatory variables.
Output = f (Inputs)
Dependent variable = f (Independent variable(s) )
Crop Yield = f (Seed quality, temperature, irrigation, fertilizer)
Difference between Correlation and Regression

Correlation:
1. It expresses the relationship between two or more variables.
2. There is no mention of dependent or independent variables.
3. We do not see the impact of Y on X or of X on Y.
4. There are no parameters or coefficients.
5. Here r, the correlation coefficient, is reported.

Regression:
1. Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables.
2. There is a clear-cut statement of which variable is dependent and which independent.
3. We see the impact of one variable on the other variable(s).
4. There are parameters or coefficients.
5. Here r², the Coefficient of Determination, is reported.
For example, a quantitative analysis of the inverse relationship between the Y variable (the amount of money people hold, as a proportion of their income) and the X variable (the inflation rate) will enable a monetary economist to predict the amount of money that people would want to hold at various rates of inflation.


[Figure: a downward-sloping regression line of money held (Y) against the inflation rate (X1, X2, X3)]
Figure: 10  Regression Line
Regression Equation: A line in two-dimensional (two-variable) space is defined by the equation:
Y = a + b*X
The Y variable can be expressed in terms of a constant (a) and a slope (b) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient or β coefficient. For example, GPA (the Y variable) may be predicted as 1 + 0.02*IQ. Thus knowing that a student has an IQ (the X variable) of 130 would lead us to predict a GPA of 3.6 (since 1 + 0.02*130 = 3.6).
To understand the regression line, note that as weekly income increases, weekly consumption expenditure also increases, as shown by the regression line in Figure-6.


 
[Figure: an upward-sloping regression line E(Y|X), weekly consumption expenditure as a function of weekly income]
Figure- 6  Regression line
Sample Regression: When sample regression lines are drawn through a scattergram so as to fit the scatter of points reasonably well, such lines are called sample regression lines, as shown in Figure-19 below.
 
[Figure: scatter of weekly consumption expenditure against weekly income, with sample regression lines fitted through the points]
Figure- 19    Sample Regression Lines

Correlation and Regression
Correlation: Correlation is a measure of association between numerical values.

Strength of Linear Association

r value    Interpretation
 1         Perfect Positive Linear Relationship
 0         No Linear Relationship
-1         Perfect Negative Linear Relationship
            
[Figure: three scatter plots illustrating a strong positive linear correlation, no linear correlation, and a strong negative linear correlation]

r = [n ΣXY – (ΣX)(ΣY)] / √{[n ΣX² – (ΣX)²][n ΣY² – (ΣY)²]}

r = Correlation Coefficient
-1 ≤ r ≤ 1
r² = Coefficient of Determination
Coefficient of Determination: It tells us the percentage of variation in the response variable that is explained (determined) by the model and the explanatory variable.
If r² = 0.927, it means that 92.7% of the variability in y is explained by x using this model.
Covariance: Covariance provides a measure of the strength of the association between two or more sets of random variates. Covariance measures how much movement in one variable predicts the movement in the corresponding variable. For two random variates X and Y, symbolically:

Cov(X, Y) = Σ (X – X̄)(Y – Ȳ) / n
Calculating Covariance

X (Smoker)   Y (Lung capacity)   (X – X̄)   (Y – Ȳ)   (X – X̄)(Y – Ȳ)
0            45                  -10        9          -90
5            42                  -5         6          -30
10           33                  0          -3         0
15           31                  5          -5         -25
20           29                  10         -7         -70
X̄ = 10       Ȳ = 36              0          0          Σ = -215

Cov (x, y) = -215/5 = -43.0
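The covariance calculation above can be verified directly; this sketch uses the population form (dividing by n = 5), as in the table:

```python
from statistics import mean

# Covariance of the smoking / lung-capacity data above (population form)
X = [0, 5, 10, 15, 20]    # smoking measure
Y = [45, 42, 33, 31, 29]  # lung capacity
xbar, ybar = mean(X), mean(Y)
cov = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / len(X)
print(cov)
```

The negative value (-43.0) confirms the inverse relationship: higher values of X go with lower values of Y.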


Regression: Regression fits a line of best fit for one response (dependent) numerical variable based on one or more explanatory (independent) variables.
y = a + bx   ------------------------  Simple Regression Equation
y = a + bx + µ  --------------------- Econometric Regression Equation
a = Intercept
b = Slope
y = Dependent Variable
x= Independent Variable
µ = error term
How do we interpret the theory for its validity? We use two hypotheses, as under:

· Hypothesis No. 1:     α > 0   (the value of α must be positive, i.e. greater than zero).
· Hypothesis No. 2:     0 < b < 1   (the value of b must lie between 0 and 1; the closer it is to one, the stronger the relationship, otherwise the weaker).

Example: The following marks have been obtained by a class of students in agriculture economics in the subject of statistics (out of 100) in Paper 1 and 11.

Paper 1
45
55
56
58
60
65
68
70
75
80
85
Paper 11
56
50
48
60
62
64
65
70
74
82
90

Compute the coefficient of correlation for the above data. Find also the equation of the lines of regression.

Solution:

X
Y
XY
X2
Y2
45
56
2520
2025
3136
55
50
2750
3025
2500
56
48
2688
3136
2304
58
60
3480
3364
3600
60
62
3720
3600
3844
65
64
4160
4225
4096
68
65
4420
4624
4225
70
70
4900
4900
4900
75
74
5550
5625
5476
80
82
6560
4600
6724
85
90
7650
7225
8100
717
721
48398
48149
48905

Coefficient of Correlation


= 0.91884    (-1 ≤ r ≤ 1)

Coefficient of Determination (r2): It tells us percent of variations in response variable that is explained (determined) by the model and the explanatory variable.
If  r2 = 0.84 means that 84% variability in the amount of y is explained by x using this model.

Case 1                                
Estimated linear regression of Y on X is  Å· = a + b
Where
a= Intercept (a point where the line crosses the y axis)
b= Slope

           n Î£ XY – (ΣX) (ΣY)     11(48398) – (717) (721)
  b =   ------------------------- =  ----------------------------- = 0.991
              n Î£ X2 – (Σ X)2            11 (48149) – (717)2
      _        _
a = Y – b X

= 65.54 -1.00 (65.18) = 0.9047
                                           
Thus the equation for the line of best fit is  Å· = a + b   = 0.904 + 0.991    

Case 2
Estimated linear regression of X on Y is  xÌ‚ = a + bY
Where
a = Intercept (the point where the line crosses the x-axis)
b = Slope

           n ΣXY – (ΣX)(ΣY)      11(48398) – (717)(721)
  b =   ---------------------- = ------------------------ = 0.8513
           n ΣY² – (ΣY)²          11(48905) – (721)²
      _     _
a = X – b Y = 65.182 – 0.8513 (65.545) = 9.38

Thus the equation for the line of best fit is  xÌ‚ = a + bY = 9.38 + 0.8513 Y
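The whole worked example can be verified with a short Python sketch (pure standard library; the data are the Paper I/Paper II marks from the table above):

```python
# Verify the worked example: correlation and both regression lines
# for the Paper I / Paper II marks.
x = [45, 55, 56, 58, 60, 65, 68, 70, 75, 80, 85]  # Paper I
y = [56, 50, 48, 60, 62, 64, 65, 70, 74, 82, 90]  # Paper II
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

# Pearson coefficient of correlation
r = (n * sxy - sx * sy) / (((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5)

# Case 1: regression of Y on X, y_hat = a1 + b1 * X
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a1 = sy / n - b1 * sx / n

# Case 2: regression of X on Y, x_hat = a2 + b2 * Y
b2 = (n * sxy - sx * sy) / (n * syy - sy ** 2)
a2 = sx / n - b2 * sy / n
```

As a check, the coefficient of determination equals the product of the two slopes: r² = b1 × b2.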
     
Ordinary Least Square (OLS):
A line can be drawn that constitutes the "best fit" in the sense that it minimizes the sum of squared deviations of the observed Ys from the line, compared with any alternative line. Such a criterion for drawing a line is referred to as ordinary least squares (OLS). The square root of the mean of these squared deviations is the standard error of estimate.
Assumptions underlying the Method of Least Squares:
Are as follows;
Assumption No.1. Linear Regression Model. The regression model is linear in the parameters.
    Yi = β1 + β2 Xi  + μi
Regression models that are linear in the parameters are the starting point of the classical linear regression model (CLRM).
Assumption No.2. X values are fixed in repeated sampling. The values taken by the regressor X are considered fixed in repeated samples. More technically, X is assumed to be nonstochastic.
Yi     Xi
10      5
14     10
20     15
22     20
30     25
Consider the various Y populations corresponding to the levels of income X. Keeping the value of income X fixed, say at the level $80, we draw at random a family and observe its weekly family consumption expenditure Y as, say, $60. Still keeping income X at $80, we draw at random another family and observe its Y value as $75. In each of these drawings (i.e. repeated sampling), the value of X is fixed at $80. We can repeat this process for all the X values. What all this means is that our regression analysis is conditional regression analysis, that is, conditional on the given values of the regressor(s) X.
Assumption No.3  Zero mean values of disturbance μi . Given the values of Xi, the mean or expected value of the random disturbance term μi is zero. Technically, the conditional mean value of μi is zero, Symbolically we have
E (μi / Xi) = 0
Assumption No.4  Homoscedasticity or equal variance of Ui. Given the value of X, the variance of Ui is the same for all observations. That is, the conditional variances of Ui are identical. Symbolically, we have
Var (Ui | Xi) = E[Ui – E(Ui | Xi)]² = σ²
where Var stands for variance.
Assumption No.5  No autocorrelation between the disturbances. Given any two X values, Xi and Xj (i ≠ j), the correlation between any two disturbances Ui and Uj (i ≠ j) is zero. Symbolically,
Cov (Ui, Uj | Xi, Xj) = E{[Ui – E(Ui)] | Xi}{[Uj – E(Uj)] | Xj} = 0
Assumption No.6  Zero covariance between Ui and Xi or E (ui Xi) = 0

Assumption No. 7  The number of observations n must be greater than the number of parameters to be estimated. Alternatively, the number of observations n must be greater than the number of explanatory variables.

Assumption No.8  Variability in X values. The X values in a given sample must not all be the same. Technically, var(X) must be a finite positive number.

Assumption No. 9 The regression model is correctly specified. Alternately, there is no specification bias or error in the model used in empirical analysis.

Assumption No. 10  There is no perfect multicollinearity. That is, there are no perfect linear relationships among the explanatory variables.

Properties of Ordinary Least Square Estimators: Least-squares estimates possess some ideal or optimum properties. These properties are contained in the well-known Gauss-Markov Theorem. To understand this theorem, we need to consider the best linear unbiasedness property of an estimator. The OLS estimator of β2 is said to be a best linear unbiased estimator (BLUE) of β2 if the following hold;
1. It is linear, that is, a linear function of a random variable, such as the dependent variable Y in the regression model.
2. It is unbiased, that is, its average or expected value, E(β2), is equal to the true value, β2.
3. It has minimum variance in the class of all such linear unbiased estimators; an unbiased estimator with the least variance is known as an efficient estimator.
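The unbiasedness property can be illustrated with a small Monte Carlo sketch; the true parameters, fixed X values, and normal errors below are illustrative assumptions, not values from the notes:

```python
# Monte Carlo sketch of unbiasedness: the average of the OLS slope estimates
# over many repeated samples should be close to the true beta2.
import random

random.seed(42)
beta1, beta2 = 2.0, 0.5      # assumed true intercept and slope
X = list(range(1, 21))       # regressor values, held fixed in repeated sampling

def ols_slope(x, y):
    """OLS slope: sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

estimates = []
for _ in range(5000):
    # fresh disturbances u_i ~ N(0, 1) in each replication, X kept fixed
    y = [beta1 + beta2 * xi + random.gauss(0, 1) for xi in X]
    estimates.append(ols_slope(X, y))

mean_b2 = sum(estimates) / len(estimates)   # should be close to beta2 = 0.5
```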
Overall Goodness of Fit or Coefficient of Determination r²: It measures how "well" the fitted sample regression line describes the data.
Computational Approach: The general computational problem that needs to be solved in Regression Analysis is to fit a straight line to a number of points.

 
 
 
 
 
[Graph No. 20: scatterplot of the dependent variable VAR_Y against the independent variable VAR_X, with the fitted regression line.]

In the scatterplot, we have an independent or X variable and a dependent or Y variable. These variables may, for example, represent IQ (intelligence as measured by a test) and university achievement (Grade Point Average, or GPA), respectively. Each point in the plot represents one student, that is, the respective student's IQ and GPA. The goal of linear regression procedures is to fit a line through the points. Specifically, the program will compute a line so that the squared deviations of the observed points from that line are minimized. Thus, this general procedure is sometimes also referred to as Least Squares Estimation. The Coefficient of Determination r² (two-variable case) or R² (more than two variables) then measures the proportion of the variation in the dependent variable explained by the fitted line.

Statistical Inference and Estimation
Statistical Inference: The process of drawing inference about a population on the basis of information contained in a sample taken from the population is called statistical inference.
Statistical inference is traditionally divided into two main branches: Estimation of parameters and Testing of Hypothesis.
Estimation of Parameters: It is a procedure by which we obtain an estimate of the true but unknown value of a population parameter by using the sample observations X1, X2,…..,Xn. For example we may estimate the mean and the variance of population by computing the mean and variance of the sample drawn from the population.
Testing of Hypothesis: It is a procedure which enables us to decide, on the basis of information obtained from sampling, whether to accept or reject any specified statement.
Estimates and Estimators: An estimate is a numerical value of the unknown parameter obtained by applying a rule or a formula called as estimator, to a sample X1, X2,…, Xn of size n, taken from population.
For example, if X1, X2, …, Xn is a random sample of size n from a population with mean μ, then XÌ„ = (1/n) ΣXi is an estimator of μ, and xÌ„, the computed value of XÌ„, is an estimate of μ.

Kinds of Estimates
There are two kinds of estimates as;
1. Point Estimates
2. Interval Estimates
1. Point Estimates: When an estimate of unknown population parameter is expressed by a single value, it is called point estimate.
2. Interval Estimates: An estimate expressed by a range of values within which true value of the population parameters is believed to lie, is referred to as an interval estimate.
Suppose we wish to estimate the average height of a very large group of students on the basis of a sample. If we find the sample average height to be 64 inches, then 64 inches is a point estimate of the unknown population mean. If, on the other hand, we state that the true average height is a value between 62 and 66 inches, that is an interval estimate.

Example: A random sample of n=6 has the elements of 6, 10, 13, 14, 18 and 20. Compute a point estimate of i) The population mean ii) The Population Standard Deviation iii) The Standard Error of Mean.
i) The sample mean is
XÌ„ = (1/n) ΣXi = (6+10+13+14+18+20)/6 = 81/6 = 13.5
Thus the point estimate of the population mean μ is 13.5 and XÌ„ is the estimator.
ii) The sample standard deviation is
S = √[Σ(Xi – XÌ„)²/n] = √{[(6–13.5)² + (10–13.5)² + (13–13.5)² + (14–13.5)² + (18–13.5)² + (20–13.5)²]/6} = 4.68
Thus the point estimate of the population standard deviation σ is 4.68 and S is the estimator.
iii) When the sample size is less than 5% of the population size, the standard error of the mean is
S/√n = 4.68/√6 = 1.91
Hence S/√n is the estimator and 1.91 is the point estimate of the standard error of the mean.
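The three point estimates above can be reproduced in a few lines of Python (note the divisor n in the standard deviation, matching the notes):

```python
# Reproducing the point-estimate example for the sample 6, 10, 13, 14, 18, 20.
data = [6, 10, 13, 14, 18, 20]
n = len(data)

mean = sum(data) / n                                    # point estimate of mu
sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5    # divisor n, as in the notes
se = sd / n ** 0.5                                      # standard error of the mean
```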
Properties of Estimators
A point estimator is considered to be a good estimator if it possesses the following four properties:
i. Unbiasedness ii. Consistency iii. Efficiency iv. Sufficiency
i. Unbiasedness: An estimator is said to be unbiased when the statistic used as the estimator has an expected value equal to the true value of the population parameter being estimated.
ii. Consistency: An estimator is consistent if it comes closer to the true parameter value as the sample size increases. For example, suppose we toss a coin n times, where the probability of a head on each independent toss is p, and let Y be the number of heads; then the proportion Y/n is a consistent estimator of p.
Theorem: An unbiased estimator θn for θ is consistent if lim n→∞ Var(θn) = 0
iii. Efficiency: If we have two unbiased estimators of a parameter, θ1 and θ2, we say that θ1 is relatively more efficient than θ2 if Var(θ2) > Var(θ1). The relative efficiency is
eff (θ1, θ2) = Var(θ2)/Var(θ1)
If eff (θ1, θ2) > 1, choose θ1
If eff (θ1, θ2) < 1, choose θ2
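A classic illustration of relative efficiency compares the sample mean and the sample median as estimators of the centre of a normal population; the simulation below (sample size and replication count are illustrative assumptions) shows eff(mean, median) > 1, so the rule above would choose the mean:

```python
# For normally distributed data both the sample mean and the sample median are
# unbiased for the centre, but the mean has the smaller variance, so
# eff(mean, median) = Var(median) / Var(mean) > 1.
import random
import statistics

random.seed(0)
means, medians = [], []
for _ in range(4000):
    sample = [random.gauss(0, 1) for _ in range(25)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

eff = statistics.variance(medians) / statistics.variance(means)
# eff comes out well above 1, so the sample mean is the more efficient estimator
```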
iv. Sufficiency: Let Y1, Y2, …, Yn denote a random sample from a probability distribution with unknown parameter θ. Then the statistic U = g(Y1, Y2, …, Yn) is sufficient for θ if the conditional distribution of Y1, Y2, …, Yn, given U, does not depend on θ.
Properties of Least Square Estimators:
To understand this theorem, we need to consider the best linear unbiasedness property of an estimator. The ordinary least squares (OLS) estimator of ß2 is said to be a best linear unbiased estimator (BLUE) of ß2 if the following hold;
1. It is linear, that is, a linear function of a random variable. Linear regression econometric model is symbolically expressed as;
Yi = ß1 + ß2 Xi + µ      
Where
Yi = Dependent Variable
Xi = Independent Variable
ß1 = Intercept or coefficient or parameter
ß2 = Slope or coefficient or parameter
µ = Error term, the stochastic disturbance that marks the model as econometric.
2. It is unbiased, that is, its average or expected value E(estimated ß2) is equal to the true value ß2.
3. It has minimum variance in the class of all such linear unbiased estimators; an unbiased estimator with the least variance is known as an efficient estimator.

Autocorrelation: The correlation between error terms in time series data. This type of correlation is called autocorrelation or serial correlation. The error term Ut at time period t is correlated with the error terms Ut+1, Ut+2, …, Ut-1, Ut-2, … and so on. The correlation between Ut and Ut-k is called autocorrelation of order k and is usually denoted by ρk; thus the correlation between Ut and Ut-1 is the first-order autocorrelation ρ1, the correlation between Ut and Ut-2 is the second-order autocorrelation ρ2, and so on.

Durbin-Watson Statistic (DW): The simplest and most commonly used model is one where the errors Ut and Ut-1 have a correlation ρ1. If d lies outside the critical values, a decision can be made regarding the presence of positive or negative serial correlation.

d = Σ (ût – ût-1)² / Σ û²t

For example, d = 3 (D.I.Khan) falls in the zone of indecision, while d = 2.8 (Peshawar) shows no serial correlation.

If ρ = +1, then d = 0
If ρ = –1, then d = 4

If d is close to either 0 or 4, the residuals are highly correlated.
If d < dL, we reject the null hypothesis of no autocorrelation.
If d > dU, we do not reject the null hypothesis.
If dL < d < dU, the test is inconclusive or uncertain.
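The statistic and its limiting cases can be sketched directly in Python; the residual series below are artificial illustrations of ρ = +1 and ρ = –1:

```python
# Direct implementation of d = sum_t (u_t - u_{t-1})^2 / sum_t u_t^2,
# checked against the limiting cases rho = +1 (d -> 0) and rho = -1 (d -> 4).
def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(u * u for u in resid)
    return num / den

d_pos = durbin_watson([1.0] * 50)                        # identical residuals: d = 0
d_neg = durbin_watson([(-1.0) ** t for t in range(50)])  # alternating signs: d near 4
```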

Whenever consecutive errors or residuals are correlated, we have autocorrelation or serial correlation, i.e. an OLS assumption is violated.
Say   E(Ui Uj) = 0   is the OLS assumption
If    E(Ui Uj) ≠ 0   the assumption is violated,
which results in autocorrelation or serial correlation.
When consecutive errors have the same signs, we have positive autocorrelation. When they change signs frequently, we have negative autocorrelation. Autocorrelation is frequently found in time series data (i.e. data with one observation on each variable for each time period). In economics, positive autocorrelation is more common than negative autocorrelation. In the presence of autocorrelation, the estimated coefficients are not biased, but the standard errors are biased (so that the values of their t-statistics are exaggerated). The values of the adjusted R² and the F-statistic will also be unreliable in the presence of autocorrelation. Autocorrelation can arise from the existence of trends and cycles in economic variables, from the exclusion of an important variable from the regression, or from nonlinearities in the data. Autocorrelation can be detected by plotting the residuals or, more usually, by the Durbin-Watson statistic (d). The value of d ranges between 0 and 4, that is, 0 ≤ d ≤ 4.
The D-W test can be conducted either at the 5 percent or at the 1 percent level of significance. We cannot use this test for checking autocorrelation if the number of observations is less than 15. If the calculated D-W value exceeds the critical value dU in the table, we conclude that there is no evidence of autocorrelation at the 5 or 1 percent level of significance. In general, a value of d = 2 indicates the absence of autocorrelation. If the calculated D-W value falls between the critical values dL and dU of the table, the test is inconclusive. Finally, if the D-W statistic is smaller than the critical value dL given in the table, there is evidence of autocorrelation; we must then adjust for its effect.

Multicollinearity: Multicollinearity refers to the situation where two or more explanatory variables in the regression are highly correlated. For example, for the cotton crop in NWFP in general and D.I.Khan in particular, production (nearly constant as a fraction of the cost of production over 2004-05 to 2007-08) is highly collinear with area and yield as independent variables. In such a case, running the regression led to exaggerated standard errors and therefore to low t-values followed by poor p-values, 0.470 (0.720) in the case of area and -3.119 (0.197) in respect of yield, for both estimated coefficients. This could lead to the conclusion that both slope coefficients are statistically insignificant, even though the R² may be very high (0.98). In such a case, serious multicollinearity can be overcome or reduced by;
1. Extending the sample size (collecting more data).
2. Utilizing a priori information (e.g. b2 = 2b1).
3. Transforming the functional relationship.
4. Dropping one of the highly collinear variables.
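One common diagnostic, used here for illustration (it is not computed in the notes), is the variance inflation factor VIF = 1/(1 – r²) between two regressors; the hypothetical area and yield series below move nearly in lockstep, as in the cotton example:

```python
# Sketch of a collinearity diagnostic: highly correlated regressors give a
# large VIF, which signals inflated coefficient variances (standard errors).
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

area = [100, 102, 104, 106, 108]      # hypothetical area series
yld = [50.0, 51.1, 51.9, 53.2, 54.0]  # hypothetical yield series, nearly in lockstep

r = pearson_r(area, yld)
vif = 1 / (1 - r * r)   # values far above ~10 are usually taken as serious
```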

Heteroscedasticity:
Another serious problem that may be faced in regression analysis is heteroscedasticity. This arises when the assumption that the variance of the error term is constant for all values of the independent variables is violated. It often occurs in cross-sectional data (e.g. production, area and yield of major crops of NWFP during each year from 2004-05 to 2007-08 as economic units), when the size of the error may rise (the more common form) or fall with the size of an independent variable. Heteroscedastic disturbances lead to biased standard errors and thus to incorrect statistical tests and confidence intervals for the parameter estimates. The researcher may overcome this problem by using the lag of the explanatory variable that leads to heteroscedastic disturbances, or by running a weighted least squares regression.
For example,
Qt= ao + a1 SPt + a2 Wt  + a3 FPt-1 + a4 Qt-1  + Vt  --------------1
      Yt = bo + b1 SPt + b2 Wt   + b3 FPt-1 + b4 Yt-1 + Vt  ------------- 2
At = co + c1 SPt + c2 Wt   + c3 FPt-1  + c4 At-1 + Vt  --------------3
Log Qt = Log ao + a1 Log SPt + a2 Log Wt + a3 Log FPt-1 + a4 Log Qt-1 + Log Vt ------4
Log Yt = Log bo + b1 Log SPt + b2 Log Wt + b3 Log FPt-1 + b4 Log Yt-1 + Log Vt ------5
Log At = Log co + c1 Log SPt + c2 Log Wt + c3 Log FPt-1 + c4 Log At-1 + Log Vt ------6
The adjustment coefficient β is derived by subtracting the coefficient of the lagged dependent variable from unity.
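The weighted least squares remedy mentioned above can be sketched for a straight line; the closed-form weighted normal equations are used, and the data and weights below are illustrative assumptions:

```python
# Minimal weighted least squares for y = a + b*x: each observation is weighted
# by w_i (in practice 1/variance_i), so noisier observations count for less.
def wls(x, y, w):
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    a = (swy - b * swx) / sw
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 5.9, 8.2, 9.8]
a, b = wls(x, y, [1.0] * 5)   # equal weights reduce WLS to ordinary least squares
```

With unequal weights (e.g. 1 over an assumed error variance per observation), the same formulas downweight the heteroscedastic observations.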

  
