The following sections describe some general statistical terminology that is important for understanding the analysis presented in this Technical Brief, especially in Field 2: The Statistical Model.
A1. Definition of a Variable
A variable is a data or questionnaire item on which individual responses (observations) are obtained. Variables fall into two major categories.
A2. Descriptive Statistics
Descriptive Statistics is the collection, organization, analysis, and presentation of data.
Table 1 presents some of the descriptive measures for the variables that will be used in Field 2: The Statistical Model. The first column presents the number of valid observations (N) for each variable. The second column shows the minimum value for that variable in the data set, and the third column shows the maximum value. The fourth column shows the arithmetic mean, often called the average value. The last column shows the standard deviation, a statistical measure of dispersion around the mean. For example, Figure A1 shows the dispersion around the mean for three sample schools. The innermost curve corresponds to the smallest dispersion (low standard deviation) around the mean; the outermost curve corresponds to the most variability (high standard deviation).
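The five descriptive measures in Table 1 can be computed directly. The sketch below uses a small hypothetical sample (not data from this Brief) and Python's standard `statistics` module:

```python
import statistics

# Hypothetical sample: test scores from one school (illustrative only).
scores = [62, 71, 75, 75, 80, 84, 88, 93]

n = len(scores)                  # N: number of valid observations
minimum = min(scores)            # smallest value in the data set
maximum = max(scores)            # largest value in the data set
mean = statistics.mean(scores)   # arithmetic mean (the "average")
sd = statistics.stdev(scores)    # sample standard deviation: dispersion around the mean

print(n, minimum, maximum, mean, round(sd, 2))
```

A low standard deviation, as in the innermost curve of Figure A1, would mean most scores sit close to the mean of 78.5; a high one would mean they spread widely around it.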
Figure A1: Example of Deviations Around the Mean
A3. Correlation
Researchers find it convenient to have a single number that measures the strength of the relationship between two variables and is independent of the units used to make the measurement. The correlations between the response variable and the explanatory variables are shown in Table 2, giving the reader a better idea of the predictive ability of a single explanatory variable on the response variable. Table 3, on the other hand, shows the correlations of the explanatory variables among themselves. The higher the absolute value of the correlation, the stronger the relationship between the two variables. A correlation of zero indicates that there is no positive or negative relationship between the two variables. A positive correlation indicates that the variables increase together (Figure A2 shows a line with an upward slope, bottom left to top right). A negative correlation indicates that as one variable increases the other decreases (Figure A3 shows a line with a downward slope, top left to bottom right). A correlation of ±1 (or 100%) indicates a perfect linear relationship between the two variables. Such correlations represent extremely strong relationships and are rarely observed when exploring relationships between different variables.
Figure A2: Example of Positive Correlation between Two Variables
Figure A3: Example of Negative Correlation between Two Variables
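The single number described above is the Pearson correlation coefficient. A minimal sketch, using small hypothetical data sets, shows how it captures both the direction and the strength of a linear relationship:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship between
    two variables, independent of the units of measurement."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]     # generally rises with x -> positive correlation
y_down = [9, 7, 6, 5, 3]   # falls as x rises -> negative correlation

print(round(pearson_r(x, y_up), 2))    # positive, upward slope (cf. Figure A2)
print(round(pearson_r(x, y_down), 2))  # negative, downward slope (cf. Figure A3)
```

A perfectly linear set of points, such as y = 2x, would return exactly +1; real data almost never do.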
A4. Correlation vs. Causation
Most statisticians know that “Correlation does not imply causation.” That is, even if two variables are related or correlated, they may not have any causal relationship between them. In other words, changes in one variable may not be directly caused by the operation of the other variable. One variable may fluctuate in relation to the other due to chance (coincidence), or both may be strongly affected by one or more other variables. Other possible reasons include both variables changing over time, or cases where it is unclear whether there was causation or merely a contributing effect. In the absence of any other evidence, data from an observational study cannot be used to establish causation. However, a causal connection probably does exist if we can establish that: 1) there is a reasonable explanation of cause and effect, 2) the connection holds under varying conditions, and 3) potential confounding variables are ruled out. In the statistical model, it is our intention to present the significant causal connections between a set of explanatory variables and a response variable.
A5. Statistical Significance
Researchers want to be able to say with some degree of confidence whether any relationships they have found between various types of data are different from relationships they would find solely due to chance. The measure of the degree of confidence we have in a statistical relationship is called the confidence level. A relationship with a confidence level of 95% or above is considered to have statistical significance. In other words, most researchers are willing to state that a relationship is statistically significant if the probability of observing that relationship in the sample purely due to chance is less than 5%.
For interpretational reasons, statisticians generally present a confidence interval for a prediction. A confidence interval is the lower and upper bound within which the predicted value will occur 95% of the time; 95% is the value of the confidence level. In Information Works! this confidence interval is represented as a band (sample range) on Field 2 of the school report. In essence, a school should expect the actual achievement score (presented as a heavy solid line) for a particular sub-test to fall within the band. If the actual achievement score is above or below the band in a single year, it is assumed that the school achieved those results purely by chance. However, if the school consistently performs above or below the band in multiple years and/or on more than one test, the school requires further study. If the results hold over multiple years, it becomes increasingly likely that they are not due solely to chance (i.e., random fluctuations) but represent “real” under- or over-achievement compared to what was expected from the model.
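A minimal sketch of a 95% confidence band, using hypothetical scores and the normal approximation (the model underlying the actual report band is more elaborate):

```python
import statistics

# Hypothetical achievement scores for one sub-test (illustrative only).
scores = [71, 74, 78, 80, 81, 85, 86, 90, 92, 95]

n = len(scores)
mean = statistics.mean(scores)
se = statistics.stdev(scores) / n ** 0.5   # standard error of the mean

# 95% confidence interval using the normal approximation (z = 1.96):
# the band within which the true mean score is expected to fall 95% of the time.
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(round(lower, 1), round(upper, 1))
```

A single observed score outside this band could still be chance (it happens 5% of the time); repeated results outside the band across years are what warrant further study.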
A6. Simple Linear Regression
Figure A4: Regression Line
As mentioned in Section A3, the correlation coefficient measures the degree of linear relationship between two variables. However, it does not describe the exact linear association between a response variable (y) and an explanatory variable (x); that role is played by regression analysis. Regression analysis helps us determine whether a specific relationship exists between x and y, thereby allowing one to use x to predict y.
Figure A4 shows the causal relationship of Spending per Student (x) on Per Capita Income (y). The original data are represented by dots. The line, often termed the regression line, is obtained via the principle of Least Squares. Given two points, we can draw exactly one line through them; given many points, infinitely many candidate lines could be drawn. The problem of finding the best line through a set of points is solved by the Least Squares principle, which minimizes the squared deviations from the actual data points to the fitted line. This is shown in Figure A5. The formula that defines the regression line is termed the regression equation and is explained next.
Figure A5: Regression Line
A simple linear regression equation is written as: y = a + bx, where y represents the values on the vertical axis (per capita income in our example); x represents the values on the horizontal axis (spending per student in our example); and a and b are parameters (values) obtained from the least squares solution. The intercept, a, is the point where the regression line crosses the vertical axis. The slope, b, determines whether there is a positive or negative relationship between x and y. A positive slope, indicated by a positive value for b, shows that for every unit increase in x the response variable y increases by b units. Conversely, a negative value of b indicates that for every unit increase in x, y decreases by b units.
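The least squares solution for a and b can be written in a few lines. The sketch below uses hypothetical numbers, not the data behind Figure A4:

```python
def least_squares(x, y):
    """Fit y = a + b*x by minimizing the squared deviations from the line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)   # slope
    a = my - b * mx                       # intercept: line's crossing of the y-axis
    return a, b

# Hypothetical data: spending per student (x) and an outcome score (y).
x = [4, 5, 6, 7, 8]
y = [10, 13, 15, 16, 21]

a, b = least_squares(x, y)
predicted = a + b * 9    # use the regression equation to predict y at x = 9
print(a, b, predicted)
```

Here b comes out positive, so every unit increase in x predicts an increase of b units in y, exactly as described above.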
A7. Multivariate Analysis
The section above presented an overview of how one explanatory variable can predict the outcome of one response variable. However, in most real-world scenarios it is unlikely that a single explanatory variable is all one needs to build an effective model. The study of multiple explanatory variables acting simultaneously to produce the outcome on one or more response variables is termed Multivariate Analysis. For the purposes of this Technical Brief, we will only address models with one response variable and more than one explanatory variable. The Rhode Island Statistical Model is based on one such technique -- Hierarchical Regression. As shown in the following sections, Hierarchical Regression is a variation of two other forms of analysis that we did not use: Multiple Regression and Stepwise Regression. A brief discussion of each of the three techniques follows.
A7.1 Multiple Regression
In a multiple regression model, a set of explanatory variables, called independent variables, jointly predicts the outcome of the response variable, or dependent variable. All of the independent variables are specified simultaneously in the regression equation, and the solution is obtained through Least Squares. The assumption is that all n independent variables (xi) together are necessary to explain the variation in the dependent variable (y). The regression equation is written as:

y = a + b1x1 + b2x2 + … + bnxn
Once again, a and the bi are parameters obtained from the least squares solution. The intercept, a, is often termed the constant term. Each bi determines whether there is a positive or negative relationship between a given xi and y. A positive slope, indicated by a positive value for bi, shows that for every unit increase in xi the response variable y increases by bi units. A negative slope, indicated by a negative bi value, shows that for every unit increase in xi the response variable decreases by bi units.
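The simultaneous least squares solution for a and all the bi can be sketched with a design matrix whose first column of ones carries the constant term. The data below are hypothetical and noise-free so the known coefficients are recovered exactly:

```python
import numpy as np

# Hypothetical data: two independent variables jointly predicting y.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2   # exact relationship, for illustration

# Design matrix: a column of ones (for the constant term a), then x1, x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares solution for y = a + b1*x1 + b2*x2.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coeffs
print(round(a, 2), round(b1, 2), round(b2, 2))
```

With real data, y would contain noise and the recovered a, b1, b2 would be estimates rather than the exact generating values.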
A7.2 Stepwise Regression
In stepwise regression, the researcher assumes that the independent variables (xi) are correlated. The researcher simultaneously specifies all of the variables in the regression equation, just as in the multiple regression model. However, the stepwise procedure systematically adds variables that make a significant contribution to explaining the variation in y, and eliminates variables that are not significant from the equation. Thus, the final prediction equation may have fewer independent variables. Stepwise regression does not address theoretical or model-building requirements; it simply adds and removes variables based on mathematical criteria.
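The add-variables half of the procedure (forward selection) can be sketched as follows. The data are hypothetical, and a fixed R² gain stands in for the significance tests (partial F-tests) a real stepwise procedure would use:

```python
import numpy as np

def r_squared(columns, y):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    total = y - y.mean()
    return 1 - (resid @ resid) / (total @ total)

def forward_stepwise(candidates, y, min_gain=0.01):
    """Greedy forward selection: repeatedly enter the candidate variable that
    most improves R^2, stopping when no addition gains at least min_gain."""
    chosen = []            # (name, column) pairs already in the equation
    current_r2 = 0.0
    remaining = dict(candidates)
    while remaining:
        trial = {name: r_squared([c for _, c in chosen] + [col], y)
                 for name, col in remaining.items()}
        best = max(trial, key=trial.get)
        if trial[best] - current_r2 < min_gain:
            break          # nothing left makes a meaningful contribution
        chosen.append((best, remaining.pop(best)))
        current_r2 = trial[best]
    return [name for name, _ in chosen], current_r2

# Hypothetical data: y depends on x1 and x2 but not on the "noise" variable.
rng = np.random.default_rng(0)
x1, x2, noise = rng.normal(size=(3, 50))
y = 3 * x1 + 1.5 * x2 + rng.normal(scale=0.5, size=50)

selected, r2 = forward_stepwise({"x1": x1, "x2": x2, "noise": noise}, y)
print(selected)
```

The procedure keeps the two real predictors and drops the irrelevant one, illustrating why the final equation can have fewer variables than were offered.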
A7.3 Hierarchical Regression
In hierarchical regression, the approach used in the RI model, a core set of m independent variables (m < n) forms the basis of the regression equation. In essence, these are forced into the model. The other variables are then introduced into the model one at a time to measure the incremental gain; thus the order in which the variables are entered is important. The variable that explains the most additional variance is the better predictor of the variance in the dependent variable. For example, if we know that IQ, age, and parents’ income predict a student’s ability to do math, we might investigate whether Test A or Test B is the better predictor of a student’s math score. This can be solved by hierarchical regression by first entering all known factors (IQ, age, and parents’ income) into the model, and then adding the performance on Test A and Test B in every possible order. The sample results are presented below:
Table A1: Predicting Math Performance Using Hierarchical Regression
| STEPS FOR MODEL 1           | R2 CHANGE | STEPS FOR MODEL 2           | R2 CHANGE |
| 1. Age, IQ, Parents’ Income |           | 1. Age, IQ, Parents’ Income |           |
| 2. Test A                   | 18%       | 2. Test B                   | 6%        |
| 3. Test B                   | 4%        | 3. Test A                   | 16%       |
In this example, Test A accounts for more unique variation in math scores (18% when entered before Test B, 16% when entered after it) than Test B, which accounts for only 4% or 6% of the variation. Test A is thus considered the better predictor of math scores and should be selected for the model.
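The R² change logic above can be sketched numerically. The data below are hypothetical (not the RI data): math ability is constructed to depend on the core variables plus Test A, while Test B tracks ability only weakly, so Test A should show the larger R² change when entered after the forced-in core set:

```python
import numpy as np

def r_squared(columns, y):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    total = y - y.mean()
    return 1 - (resid @ resid) / (total @ total)

rng = np.random.default_rng(1)
n = 200
iq = rng.normal(100, 15, n)
age = rng.normal(14, 1, n)
income = rng.normal(50, 10, n)
test_a = 0.6 * iq + rng.normal(0, 5, n)   # Test A tracks ability closely
test_b = 0.2 * iq + rng.normal(0, 8, n)   # Test B tracks it only weakly
math = 0.5 * iq + 2 * age + 0.1 * income + 0.5 * test_a + rng.normal(0, 5, n)

core = [iq, age, income]                  # Step 1: forced into the model
r2_core = r_squared(core, math)

# Step 2: R^2 change from entering each test after the core set.
gain_a = r_squared(core + [test_a], math) - r2_core
gain_b = r_squared(core + [test_b], math) - r2_core
print(gain_a > gain_b)   # Test A adds more unique explained variance
```

The difference in R² change, not the raw R² of either full model, is what identifies the better predictor, mirroring the comparison in Table A1.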