Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model. There are a variety of coding systems that can be used when recoding categorical variables. Regardless of the coding system you choose, the overall effect of the categorical variable will remain the same.
Ideally, you would choose a coding system that reflects the comparisons that you want to make. For example, you may want to compare each level of the categorical variable to the lowest level or any given level.
In that case you would use a system called simple coding. Or you may want to compare each level to the next higher level, in which case you would want to use repeated coding. By deliberately choosing a coding system, you can obtain comparisons that are most meaningful for testing your hypotheses.
Below is a table listing various types of contrasts and the comparison that they make. We should note that some forms of coding make more sense with ordinal categorical variables than with nominal categorical variables. Below we will show examples using race as a categorical variable, which is a nominal variable. Because dummy coding compares the mean of the dependent variable for each level of the categorical variable to the mean of the dependent variable for the reference group, it makes sense with a nominal variable.
However, it may not make as much sense to use a coding scheme that tests the linear effect of race. As we describe each type of coding system, we note those coding systems with which it does not make as much sense to use a nominal variable. Within SPSS there are two general commands that you can use for analyzing data with a continuous dependent variable and one or more categorical predictors, the regression command and the glm command.
If using the regression command, you would create k-1 new variables where k is the number of levels of the categorical variable and use these new variables as predictors in your regression model.
The values for these new variables will depend on the coding system you choose. From this point we will refer to a coding scheme when used with the regression command as regression coding.
When used with the glm command, we will refer to the coding scheme as contrast coding. So, if you are using the regression command, be sure to choose the regression coding scheme, and if you are using the glm command, be sure to choose the contrast coding scheme. The examples in this page will use a dataset called hsb2.
Although our example uses a variable with four levels, these coding systems work with variables that have more categories or fewer categories. No matter which coding system you select, you will always have one fewer recoded variables than levels of the original variable. In our example, our categorical variable has four levels. We will therefore have three new variables. A variable corresponding to the final level of the categorical variable would be redundant and therefore unnecessary.
This will help in interpreting the output from the analyses. Perhaps the simplest and most common coding system is called dummy coding. It is a way to make the categorical variable into a series of dichotomous variables (variables that can have a value of zero or one only).
For all but one of the levels of the categorical variable, a new variable will be created that has a value of one for each observation at that level and zero for all others. In our example using the variable race, the first new variable x1 will have a value of one for each observation in which race is Hispanic, and zero for all other observations.
Likewise, we create x2 to be 1 when the person is Asian, and 0 otherwise, and x3 to be 1 when the person is African American, and 0 otherwise.
A dummy variable is a dichotomous variable which has been coded to represent a variable with a higher level of measurement.
Dummy variables are often used in multiple linear regression (MLR). Dummy coding refers to the process of coding a categorical variable into dichotomous variables.
For example, we may have data about participants' religion, with each participant coded as follows:

A categorical or nominal variable with three categories
Religion    Code
Christian   1
Muslim      2
Atheist     3

This is a nominal variable (see level of measurement), which would be inappropriate as a predictor in MLR. However, this variable could be represented using a series of three dichotomous variables coded as 0 or 1, as follows:

Full dummy coding for a categorical variable with three categories
Religion    Christian  Muslim  Atheist
Christian   1          0       0
Muslim      0          1       0
Atheist     0          0       1

There is some redundancy in this dummy coding. For instance, in this simplified data set, if we know that someone is not Christian and not Muslim, then they are Atheist. So we only need to use two of these three dummy-coded variables as predictors. More generally, the number of dummy-coded variables needed is one less than the number of categories. Choosing which dummy variable to leave out is statistically arbitrary; in practice it depends on the researcher's logic.
For example, if I'm interested in the effect of being religious, my reference or baseline category would be Atheist. I would then be interested to see the extent to which being Christian (0 = No, 1 = Yes) or Muslim (0 = No, 1 = Yes) predicts the variance in a dependent variable such as Happiness in a regression analysis.
In this case, the dummy coding to be used would be the following subset of the previous full dummy coding table:

Dummy coding for a categorical variable with three categories, using Atheist as the reference category
Religion    Christian  Muslim
Christian   1          0
Muslim      0          1
Atheist     0          0

Alternatively, I may simply be interested to recode into a single dichotomous variable to indicate, for example, whether a participant is Atheist (0) or Religious (1), where Religious is Christian or Muslim.
The coding would be as follows:

A categorical or nominal variable with two categories
Religiosity  Code
Atheist      0
Religious    1
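The religion tables above can be reproduced with a few lines of code. The following is only a minimal sketch in plain Python: the participant list and the helper name dummy_code are made up for illustration and do not come from any statistics package.

```python
# Hypothetical data: one religion label per participant.
religions = ["Christian", "Muslim", "Atheist", "Christian", "Atheist"]
categories = ["Christian", "Muslim", "Atheist"]


def dummy_code(values, categories, reference):
    """Return one dict of 0/1 indicators per value, omitting the reference category."""
    kept = [c for c in categories if c != reference]
    return [{c: int(v == c) for c in kept} for v in values]


# Two dummy variables (Christian, Muslim); Atheist is the reference category.
coded = dummy_code(religions, categories, reference="Atheist")
for label, row in zip(religions, coded):
    print(label, row)

# Collapsing to a single dichotomous variable: Atheist = 0, Religious = 1.
religious = [int(v != "Atheist") for v in religions]
print(religious)
```

Passing a different value for reference changes which category is represented by all zeros, without changing the information the columns carry.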
Give a concrete example of each. Why might I choose one model rather than another (that is, choose either dummy, effect or orthogonal coding) to analyze my data?
What effect does unbalanced data (unequal cell sizes) have on the interpretation of dummy [effect, orthogonal] coded regression slopes and intercepts? What we are doing here is ANOVA with regression techniques; that is, we are analyzing categorical (nominal) variables rather than continuous variables.
There are some advantages to doing this, especially if you have unequal cell sizes. The computer will be doing the work for you. However, I want to show you what happens with the 3 kinds of coding so you will understand it. We are going to cover lots of ground quickly here. This is designed merely to familiarize you with the correspondence between regression and analysis of variance.
Both methods are specific cases of a larger family called the general linear model. With this kind of coding, we put a '1' to indicate that a person is a member of a category, and a '0' otherwise.
Category membership is indicated in one or more columns of zeros and ones. If we did, we would have a column variable indicating status as male or female. Ordinarily if we wanted to test for group differences, we would use a t-test or an F-test.
But we can do the same thing with regression. Let's suppose we want to know whether people in general are happier if they are married or single. So we take a small sample of people shopping at University Square Mall and promise them some ice cream if they fill out our life satisfaction survey, which some do.
The sum of squared deviations from the grand mean is To test for the difference we find the ratio of the two mean squares:
And if we square this result, we get
We can apply dummy coding to categorical variables with more than two levels. We can keep the use of zeros and ones as well. However, we will always need as many columns as there are degrees of freedom. With two levels, we need one column; with three levels, we need two columns. With C levels, we need C-1 columns. Suppose we have three groups of people, single, married, and divorced, and we want to estimate their life satisfaction.
Note how the first vector selects (identifies) the single group, and the second identifies the married group. The divorced folks are left over. The overall results will be the same, however, no matter which groups we select. The significance of this is found by: Note that there are three groups and thus two degrees of freedom between groups.
There are 15 people and thus 12 df for error. The group that gets all zeros is the base group or comparison group. The regression coefficients present a contrast or difference between the group identified by the vector and the base or comparison group. For our example, the comparison group is the divorced group.
Nominal variables, or variables that describe a characteristic using two or more categories, are commonplace in quantitative research, but are not always usable in their categorical form.
A common workaround for using these variables in a regression analysis is dummy coding, but there is often a lot of confusion about it (sometimes even among dissertation committees!).
With this in mind, it is important that the researcher knows how and why to use dummy coding so they can defend its correct and, in many cases, necessary use. Dummy coding is a way of incorporating nominal variables into regression analysis, and the reason why is pretty intuitive once you understand the regression model.
Regressions are most commonly known for using continuous variables (for instance, hours spent studying) to predict an outcome value (such as grade point average, or GPA). In this example, we might find that increased study time corresponds with increased GPAs. Now, what if we also wanted to know whether favorite class (e.g., science, math, or language) corresponds with GPA? Looking at the nominal favorite class variable, we can see that there is no such thing as an increase in favorite class: math is not higher than science, and is not lower than language either.
This is sometimes referred to as directionality, and knowing that a high versus low score means something is an integral part of regression analysis. Luckily, there is a way around this! Enter: dummy coding.
Dummy coding allows us to turn categories into something a regression can treat as having a high (1) and low (0) score. Any binary variable can be thought of as having directionality, because if it is higher, it is category 1, but if it is lower, it is category 0. This allows the regression to look at directionality by comparing two sides, rather than expecting each unit to correspond with some kind of increase.
To give the regression something to work with, we can make a separate column, or variable, for each category. Each of these dummy variables, as they are called, takes a value of 1 if that category is the student's favorite class and 0 if it is not. Below is an example of how this ends up working out:

Dummy variables
Student  Favorite class  Science  Math  Language
1        Science         1        0     0
2        Science         1        0     0
3        Language        0        0     1
4        Math            0        1     0
5        Language        0        0     1
6        Math            0        1     0

Now, looking at this you can see that knowing the values for two of the variables tells us what value the final variable has to be. Take student 5: we know that science is not their favorite, nor is math, so language has to have a yes, or 1.
For this reason, we do not use all three categories in a regression. Doing so would give the regression redundant information, result in perfect multicollinearity, and break the model. This means we have to leave one category out, and we call this omitted category the reference category. All interpretation is then made in reference to that category. The reference category is usually chosen based on how you want to interpret the results, so if you would rather talk about students in comparison to those with math as their favorite class, simply include the other two instead.
Now that we have covered the basics of one of the most common data transformations done for regression, next time we will cover a little more of a general interpretation of the linear regression.
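The redundancy among the three favorite-class dummy variables can be checked directly. The sketch below uses the six students from the table above; the variable names are chosen here for illustration only.

```python
# Favorite class for students 1 to 6, as in the table above.
favorite = ["Science", "Science", "Language", "Math", "Language", "Math"]

# One 0/1 dummy variable per category.
science = [int(f == "Science") for f in favorite]
math_fav = [int(f == "Math") for f in favorite]
language = [int(f == "Language") for f in favorite]

# For every student the three dummies add up to exactly 1, which makes them
# collinear with the column of ones used for the regression intercept.
row_sums = [s + m + l for s, m, l in zip(science, math_fav, language)]
print(row_sums)

# Equivalently, the third dummy is fully determined by the other two.
recovered_language = [1 - s - m for s, m in zip(science, math_fav)]
print(recovered_language == language)
```

Because the third column can always be rebuilt from the other two, leaving it out loses no information; the omitted category simply becomes the reference category.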
Categorical variables (also known as factor or qualitative variables) are variables that classify observations into groups. They have a limited number of different values, called levels.
For example, the gender of individuals is a categorical variable that can take two levels: Male or Female. Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.
In these steps, the categorical variables are recoded into a set of separate binary variables. This is done automatically by statistical software, such as R. For simple demonstration purposes, the following example models the salary difference between males and females by computing a simple linear regression model on the Salaries data set [car package].
R creates dummy variables automatically: The p-value for the dummy variable sexMale is highly significant, suggesting that there is statistical evidence of a difference in average salary between the genders. The contrasts function returns the coding that R has used to create the dummy variables: R has created a sexMale dummy variable that takes on a value of 1 if the sex is Male, and 0 otherwise.
The decision to code males as 1 and females as 0 baseline is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients.
The fact that the coefficient for sexFemale in the regression output is negative indicates that being a Female is associated with a decrease in salary relative to Males. This results in the model: So, if the categorical variable is coded as -1 and 1, then if the regression coefficient is positive, it is subtracted from the group coded as -1 and added to the group coded as 1.
If the regression coefficient is negative, then addition and subtraction are reversed. Generally, a categorical variable with n levels will be transformed into n-1 variables each with two levels. These n-1 new variables contain the same information as the single variable. This recoding creates a table called a contrast matrix.
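The -1/1 coding described above can be illustrated with a small least-squares fit. This is only a sketch in plain Python: the salary figures are invented for illustration and are not taken from the Salaries data set, and coding Male as 1 and Female as -1 is an arbitrary choice.

```python
# Hypothetical salaries (in thousands); NOT values from the Salaries data set.
salary = [100, 110, 120, 120, 130, 140, 130]
sex = ["Female", "Female", "Female", "Male", "Male", "Male", "Male"]

# Code Female as -1 and Male as 1 (an arbitrary choice for illustration).
x = [1 if s == "Male" else -1 for s in sex]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(salary) / n

# Simple linear regression by ordinary least squares.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, salary))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# The coefficient is added for the group coded 1 and subtracted for the group coded -1.
predicted_male = intercept + slope
predicted_female = intercept - slope
print(intercept, slope, predicted_male, predicted_female)
```

With this coding the intercept sits midway between the two group means, and the coefficient is half of the difference between them, so the two predictions reproduce the group means of the made-up data.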
This variable could be dummy coded into two variables, one called AssocProf and one Prof: This dummy coding is automatically performed by R. For demonstration purposes, you can use the function model. When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and interpret the remaining levels relative to this level.
Note that ANOVA (analysis of variance) is just a special case of the linear model where the predictors are categorical variables. And, because R understands the fact that ANOVA and regression are both examples of linear models, it lets you extract the classic ANOVA table from your regression model using the R base anova function or the Anova function [in car package]. We generally recommend the Anova function because it automatically takes care of unbalanced designs.
Taking other variables yrs. Significant variables are rank and discipline. For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase of
In this chapter we described how categorical variables are included in a linear regression model.
As regression requires numerical inputs, categorical variables need to be recoded into a set of binary variables. We provide practical examples for the situations where you have categorical variables containing two or more levels.
Note that, for categorical variables with a large number of levels it might be useful to group together some of the levels. Some categorical variables have levels that are ordered. They can be converted to numerical values and used as is.
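Both suggestions above, grouping sparse levels and converting ordered levels to numbers, can be sketched in a few lines. The level names, the data and the min_count threshold below are hypothetical. Note also that treating ordered levels as numbers assumes the steps between adjacent levels are equally spaced.

```python
from collections import Counter

# Ordered (ordinal) levels converted to numerical values; level names are hypothetical.
order = ["low", "medium", "high"]
to_number = {level: i for i, level in enumerate(order, start=1)}
ratings = ["medium", "low", "high", "high"]
numeric = [to_number[r] for r in ratings]
print(numeric)

# A nominal variable with several sparse levels; data are hypothetical.
cities = ["A", "A", "B", "C", "A", "D", "B"]
counts = Counter(cities)

# Levels seen fewer than min_count times are pooled into a single "Other" level.
min_count = 2
grouped = [c if counts[c] >= min_count else "Other" for c in cities]
print(grouped)
```

After grouping, the variable has fewer levels, so fewer dummy variables are needed when it is entered into the regression.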
Dummy coding provides one way of using categorical predictor variables in various kinds of estimation models (see also effect coding), such as linear regression.
Dummy coding uses only ones and zeros to convey all of the necessary information on group membership. Consider the following example in which there are four observations within each of four groups: For this example we will need to create three dummy coded variables. In general, with k groups there will be k-1 coded variables. Each of the dummy coded variables uses one degree of freedom, so k groups has k-1 degrees of freedom, just like in analysis of variance.
Here is how we will create the dummy variables which we will call d1, d2 and d3. For d1, every observation in group 1 will be coded as 1, and observations in all other groups will be coded as zero. We then code d2 with 1 if the observation is in group 2 and zero otherwise.
For d3, observations in group 3 will be coded 1 and zero for the other groups. For d4, there is no d4. Note that every observation in group 1 has the dummy code value of 1 for d1 and zero for the others. Those in group 2 have 1 for d2 and 0 otherwise, and for group 3 d3 equals 1 with zero for the others.
Observations in group 4 have all zeros on d1, d2 and d3. These three dummy variables contain all of the information needed to determine which observations are included in which group. If you are in group 1 then d1 is equal to 1 while d2 and d3 are zero. The group with all zeros is known as the reference group, which in our example is group 4. We will see exactly what this means after we look at the regression analysis results.
With dummy coding the constant is equal to the mean of the reference group. In this case, the value is equal to 10, which is the mean of group 4.
The coefficient of each of the dummy variables is equal to the difference between the mean of the group coded 1 and the mean of the reference group. In our example the mean of group 1 is 2 and the difference of 2 - 10 is -8, which is the value of the regression coefficient for d1. The t-test associated with that coefficient is the test of group 1 versus group 4. What if you used group 1 as the reference group? That is, what if group 1 was the group coded with all zeros?
In that case, the value of the constant would be the mean of group 1 which is 2 and the regression coefficients would be equal to the differences between the group mean and the mean of group 1. In all other respects the models are identical with the same F-ratio and R-squared regardless of which group is selected as the reference group.
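The relationship between the dummy-coded coefficients and the group means can be checked with a small least-squares fit. The sketch below is plain Python and solves the normal equations directly. The sixteen observations are made up so that group 1 has mean 2 and group 4 has mean 10, as in the text; the means chosen for groups 2 and 3, and the helper names solve, ols and design, are assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def design(groups, levels, reference):
    """Design matrix: a column of ones plus one dummy per non-reference level."""
    kept = [g for g in levels if g != reference]
    return [[1] + [int(g == k) for k in kept] for g in groups]


# Hypothetical data: four observations in each of four groups.
levels = [1, 2, 3, 4]
groups = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4
y = [1, 2, 2, 3,  5, 6, 6, 7,  6, 7, 7, 8,  9, 10, 10, 11]

# Group 4 as reference: constant = mean of group 4, slopes = differences from it.
b_ref4 = ols(design(groups, levels, reference=4), y)
# Group 1 as reference: constant = mean of group 1, slopes = differences from it.
b_ref1 = ols(design(groups, levels, reference=1), y)
print(b_ref4)
print(b_ref1)
```

With group 4 as the reference, the constant comes out as the mean of group 4 and the first coefficient as the difference between the means of group 1 and group 4; with group 1 as the reference, the constant becomes the mean of group 1.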
What if you try to include dummy variables for all four groups?
Dummy coding a variable means representing each of its values by a separate dichotomous variable. These so-called dummy variables contain only ones and zeroes (and sometimes missing values). The same logic goes for the other three dummy variables, representing values 1, 3 and 4.
Dummy coding is mainly used for including nominal and ordinal variables in linear regression analysis. Since such variables don't have a fixed unit of measurement, assuming a linear relation between them and an outcome variable doesn't make sense. However, dichotomous variables are metric by definition; since they have only two values, there is only a single interval. There are various schemes for creating dummy variables.
The one presented here is known as indicator coding. Note that for each original variable, exactly one of its dummy variables should be excluded from regression analysis. Cases having 1 on this excluded dummy variable are referred to as the reference group.
A more in-depth theoretical discussion on dummy variables is beyond the scope of this tutorial but you'll find one in most standard texts on multivariate statistics. In this new variable, all other values of pet are recoded into zero. The syntax below shows how to do this and next screenshot should further clarify how it works.
The screenshot below shows the result in the output viewer window (just ignore the first columns with line numbers). In principle, we're done now. However, it's usually good practice to label any new variables.
First, this will make our output more readable. Unfortunately, SPSS doesn't offer an efficient way for applying proper variable labels here. Next, adjust the second through last commands as needed (see syntax below). Although the meaning of the values 0 and 1 is reasonably obvious, we could apply basic value labels to them as well.