# ‘DCEmgmt’: DCE data reshaping and processing

## Introduction

When a discrete choice experiment (DCE) is conducted online, the output data will very likely be in the wrong format for analysis. Most online survey tools export the respondents’ answers in ‘wide’ format. The following table shows a fraction of the database used in Pérez-Troncoso (2020) https://doi.org/10.1007/s40258-021-00647-3.

| B1_1 | B2_16 | sex | height | weight | age | educ |
|---|---|---|---|---|---|---|
| NA | Prefiero el servicio A | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) |
| Prefiero el servicio A | NA | Hombre | 172 | 69 | 52 | Educación secundaria (entre los 12 y los 16 años) |
| Prefiero el servicio B | NA | Hombre | 183 | 123 | 55 | Postgrado universitario |
| Prefiero el servicio B | NA | Mujer | 167 | 56 | 42 | Postgrado universitario |
| Prefiero el servicio B | NA | Mujer | 162 | 68 | 57 | Educación secundaria (entre los 12 y los 16 años) |
| NA | Prefiero el servicio A | Hombre | 169 | 65 | 36 | Grado universitario (anteriormente diplomatura o licenciatura) |
| NA | Prefiero el servicio B | Hombre | 158 | 90 | 29 | Grado medio en formación profesional |
| NA | Prefiero el servicio A | Mujer | 170 | 80 | 52 | Grado medio en formación profesional |
| Prefiero el servicio A | NA | Hombre | 181 | 83 | 27 | Grado medio en formación profesional |
| Prefiero el servicio A | NA | Mujer | 150 | 90 | 40 | Grado superior en formación profesional |

This table might look unclear, but it contains the results of a discrete choice experiment carried out with an online form tool (such as Google Forms or Qualtrics). Only part of the table is displayed above, but the user can access the complete database with the following commands.

install.packages("DCEmgmt") # Install DCEmgmt from the CRAN repository
library(DCEmgmt) # Attach the package so that its example data can be found
data(survey) # Load the original database (it will appear as "data" in the RStudio environment)

This DCE was based on 16 choice sets. The choice sets were divided into two blocks, and each block was presented to half of the respondents. Thus, the first block, represented by the variables starting with “B1_”, ranges from B1_1 to B1_8, and the second block ranges from B2_9 to B2_16. Since each respondent answered only one block, every row contains NA in the columns of the other block. This is because the database is coded in ‘wide’ format, where each row represents one respondent. However, since the results will be analyzed with discrete choice models, the data needs to be coded in ‘long’ format. For instance, the following table shows the first rows of the same database coded in ‘long’ format.

| id | cs | alts | resp | sex | height | weight | age | educ | gid | choice |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 9 | 1 | Prefiero el servicio B | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 9 | 0 |
| 1 | 9 | 2 | Prefiero el servicio B | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 9 | 1 |
| 1 | 10 | 1 | Prefiero el servicio B | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 10 | 0 |
| 1 | 10 | 2 | Prefiero el servicio B | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 10 | 1 |
| 1 | 11 | 1 | Prefiero el servicio B | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 11 | 0 |
| 1 | 11 | 2 | Prefiero el servicio B | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 11 | 1 |
| 1 | 12 | 1 | Prefiero el servicio A | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 12 | 1 |
| 1 | 12 | 2 | Prefiero el servicio A | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 12 | 0 |
| 1 | 13 | 1 | Prefiero el servicio B | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 13 | 0 |
| 1 | 13 | 2 | Prefiero el servicio B | Mujer | 169 | 60 | 33 | Grado universitario (anteriormente diplomatura o licenciatura) | 13 | 1 |

As can be observed, the rows now represent the alternatives forming the choice sets. Thus, each respondent (denoted by the ‘id’ variable) has 16 rows (8 choice sets x 2 alternatives). The ‘cs’ variable identifies the choice set, and its value also reveals which block the respondent answered (cs from 1 to 8: first block; cs from 9 to 16: second block). The ‘choice’ variable indicates the chosen alternative (1: chosen, 0: not chosen). The ‘alts’ variable denotes the alternatives forming each choice set; in this case there were only two alternatives per choice set. The remaining variables (sex, height, weight, age, educ) are known as case-specific variables, and they do not change within the same ‘id’.
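
The wide-to-long idea described above can be illustrated with a small base R sketch. This is not the code used inside ‘DCEmgmt’; the toy data frame `wide`, the column names `B1_1` and `B2_2`, and the answer strings are made up for illustration only.

```r
# Toy 'wide' data: two respondents, one choice set per block (hypothetical values)
wide <- data.frame(
  B1_1 = c("Option A", NA),
  B2_2 = c(NA, "Option B"),
  age  = c(33, 52),
  stringsAsFactors = FALSE
)
wide$id  <- seq_len(nrow(wide))
opts     <- c("Option A", "Option B")  # position in this vector defines alts = 1, 2
set_cols <- c("B1_1", "B2_2")          # columns holding the choice sets

rows <- list()
for (col in set_cols) {
  cs       <- as.integer(sub(".*_", "", col))   # choice-set index taken from the name
  answered <- wide[!is.na(wide[[col]]), ]       # respondents who saw this choice set
  for (a in seq_along(opts)) {
    rows[[length(rows) + 1]] <- data.frame(
      id     = answered$id,
      cs     = cs,
      alts   = a,
      resp   = answered[[col]],
      age    = answered$age,                    # case-specific variable, repeated
      choice = as.integer(answered[[col]] == opts[a]),
      stringsAsFactors = FALSE
    )
  }
}
long <- do.call(rbind, rows)
long <- long[order(long$id, long$cs, long$alts), ]
rownames(long) <- NULL
long
```

Each respondent ends up with one row per alternative of every choice set they answered, and ‘choice’ equals 1 only in the row of the alternative they picked.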

However, discrete choice models try to determine the probability of choice based on the characteristics of the alternatives. Thus, the database is missing the variables describing the characteristics of the alternatives. Once these variables have been added, the resulting database is shown in the following table.

id cs alts gid choice pers1 pers2 pers3 pers4 form1 form2 rut1 rut2 prec1 prec2 prec3 prec4
1 9 1 9 0 0 1 0 0 0 1 0 1 0 1 0 0
1 9 2 9 1 0 0 1 0 1 0 1 0 0 0 1 0
1 10 1 10 0 1 0 0 0 1 0 1 0 0 1 0 0
1 10 2 10 1 0 0 1 0 0 1 0 1 1 0 0 0
1 11 1 11 0 0 1 0 0 1 0 1 0 0 0 1 0
1 11 2 11 1 0 0 1 0 0 1 0 1 0 0 0 1
1 12 1 12 1 1 0 0 0 1 0 0 1 0 1 0 0
1 12 2 12 0 0 1 0 0 0 1 1 0 1 0 0 0
1 13 1 13 0 0 1 0 0 1 0 1 0 0 1 0 0
1 13 2 13 1 0 0 0 1 0 1 0 1 0 0 1 0
1 9 1 3822 1 0 0 0 1 0 1 0 1 1 0 0 0
1 9 2 3823 1 0 0 0 1 1 0 1 0 0 0 0 1
1 10 1 3823 0 1 0 0 0 0 1 0 1 0 1 0 0
1 10 2 3824 1 0 0 1 0 1 0 1 0 0 1 0 0
1 11 1 3824 0 0 0 0 1 0 1 0 1 0 0 0 1
1 11 2 3822 1 0 0 0 1 0 1 0 1 1 0 0 0
1 12 1 3823 1 0 0 0 1 1 0 1 0 0 0 0 1
1 12 2 3823 0 1 0 0 0 0 1 0 1 0 1 0 0
1 13 1 3824 1 0 0 1 0 1 0 1 0 0 1 0 0
1 13 2 3824 0 0 0 0 1 0 1 0 1 0 0 0 1

As can be seen, each alternative is now in the same row as its characteristics. The database is finally ready to be analyzed with a discrete choice model.
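
Conceptually, attaching the characteristics of the alternatives amounts to joining the long data with a table of alternatives on the choice set and alternative indices. The following base R sketch shows that join on toy data; it is only an illustration under assumed names (`long_toy`, `design_toy` and the `price` column are hypothetical) and does not claim to reproduce how ‘DCEmgmt’ performs this step internally.

```r
# Toy long data: two respondents answering the same choice set (hypothetical values)
long_toy <- data.frame(
  id     = c(1, 1, 2, 2),
  cs     = c(1, 1, 1, 1),
  alts   = c(1, 2, 1, 2),
  choice = c(1, 0, 0, 1)
)

# Toy table of alternatives: one row per (cs, alts) pair with its characteristics
design_toy <- data.frame(
  cs    = c(1, 1),
  alts  = c(1, 2),
  form2 = c(0, 1),    # dummy-coded level
  price = c(10, 20)   # hypothetical monetary attribute
)

# Join on the two keys so every alternative carries its own characteristics
full <- merge(long_toy, design_toy, by = c("cs", "alts"))
full <- full[order(full$id, full$cs, full$alts), ]
rownames(full) <- NULL
full
```

After the join, rows that share the same ‘cs’ and ‘alts’ carry identical attribute values, regardless of which respondent they belong to.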

## How to use DCEmgmt

### Prerequisites

The original database must be coded in wide format. The user can load the example data frame as a reference by running the following code.

install.packages("DCEmgmt") # Install DCEmgmt from the CRAN repository
library(DCEmgmt) # Attach the package so that its example data can be found
data(survey) # Load the original database (it will appear as "data" in the RStudio environment)

Normally, when the DCE is conducted on a platform such as Google Forms or Qualtrics, the exported data is ready to be transformed using DCEmgmt. However, we will take a look at the main elements in case the user needs to make any modifications. First, each row must represent one response to the survey. Thus, if the survey was completed by 30 users, the database must have 30 rows. Then, each column must contain one question of the survey. Thus, each column can represent either a choice set or a question on respondent characteristics. Finally, each choice set column must contain a string (or a number) denoting the choice that the user made. For instance, if the possible answers were “Option A” and “Option B”, when the user chooses the first option the value of that variable will be “Option A” (or any other string or number).

Another important aspect is that DCEs are often split into blocks. This happens when the DCE is designed with, for example, 16 choice sets, and 8 choice sets are presented to half of the respondents and the other 8 choice sets to the other half. In this case, the database will be similar, but the respondents will not have answers in every column. Here, the columns must follow a specific naming pattern. For example, Q1_2 could denote the second choice set of the first block and Q2_9 could denote the first choice set of the second block. Note that the second index in the second block (the 9 in Q2_9) continues from the last choice set of the first block (the 8 in Q1_8). When there is only one block, the choice sets can be called Q_1, Q_2, Q_3…
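
The naming pattern can be generated programmatically, which makes it easy to compare against the column names of an exported survey. The following sketch assumes two blocks with the prefixes “Q1_” and “Q2_” and 8 choice sets per block, as in the example above; the variable names are illustrative.

```r
blocks <- c("Q1_", "Q2_")  # block prefixes (taken from the naming example in the text)
sets   <- 8                # choice sets per block

# The index keeps counting across blocks: Q1_1..Q1_8, then Q2_9..Q2_16
set_names <- unlist(lapply(seq_along(blocks), function(b) {
  paste0(blocks[b], (b - 1) * sets + seq_len(sets))
}))
set_names
```

With a survey stored in a data frame, `setdiff(set_names, names(data))` would then list any expected choice set column that is missing or named differently.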

The last step requires the user to input the design matrix. The design matrix contains the characteristics of each alternative and is essential to estimate the discrete choice model. The user can load an example by running the following code.

install.packages("DCEmgmt") # Install DCEmgmt from the CRAN repository
library(DCEmgmt) # Attach the package so that its example data can be found
data(design) # Load the design matrix

As can be seen in the example design matrix (coded in long format), each row represents an alternative identified by its choice set, ‘cs’, and its position within the choice set, ‘alts’. The ‘cs’ variable must match the index used in the column names (for example, Q2_10 is represented by ‘cs’ = 10). The alternatives must be specified in order. Thus, ‘alts’ = 1 corresponds to the string “Option A” and ‘alts’ = 2 corresponds to “Option B”. The rest of the columns, for example ‘pers1’, are dummies indicating whether a level is (or is not) present in the alternative. These alternative-specific variables can be dummies or monetary values. Learn more about the coding of the design matrix at http://dx.doi.org/10.1016/j.jval.2016.04.004.
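
Before passing a hand-built design matrix to the package, it may help to check the two structural requirements described above: every (cs, alts) pair appears exactly once, and each choice set lists its alternatives 1, 2, … without gaps. The sketch below runs those checks on a made-up design (`design_toy`, with a hypothetical `price` column); it is a suggested sanity check, not a function provided by ‘DCEmgmt’.

```r
# Toy design: 2 choice sets x 2 alternatives (hypothetical attribute values)
design_toy <- data.frame(
  cs    = rep(1:2, each = 2),
  alts  = rep(1:2, times = 2),
  form2 = c(0, 1, 1, 0),      # dummy-coded level
  price = c(10, 20, 20, 10)   # hypothetical monetary attribute
)
n_alts <- 2L

# Check 1: no (cs, alts) pair is repeated
no_duplicates <- !any(duplicated(design_toy[, c("cs", "alts")]))

# Check 2: within every choice set, alts runs exactly from 1 to n_alts
alts_ok <- tapply(design_toy$alts, design_toy$cs, function(a) {
  identical(sort(a), seq_len(n_alts))
})

no_duplicates
all(alts_ok)
```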

### Use

First, the database, usually an Excel sheet, must be imported as a data frame. Once this has been done, ‘DCEmgmt’ can be used with the following syntax:

install.packages("DCEmgmt") # Install DCEmgmt from the CRAN repository
data.c <- DCEmgmt::DCEmgmt(data = data, keepvar = c("sex", "height", "weight", "age", "educ"), create.id = TRUE,
  blocks = c("B1_", "B2_"), sets = 8, alts = 2,
  options = c("Prefiero el servicio A", "Prefiero el servicio B"), design = design)

• data: the raw database as a data.frame or list. Check whether it is a data frame by typing “is.data.frame(data)”; note that “typeof(data)” returns "list" for data frames and plain lists alike.
• keepvar: a vector containing the names of the columns representing the case-specific variables (usually the users’ characteristics).
• create.id: if there is no variable called ‘id’, it will be added automatically when create.id = TRUE.
• blocks: a vector indicating how the blocks were denoted. If there is only one block, the vector will only have one element. In this case, “B1_” is the beginning of the name of the choice sets in block one (B1_1, B1_2, …).
• sets: the number of choice sets per block. The blocks need to have the same number of choice sets. In this case, there are 16 choice sets in total, 8 per block.
• alts: the number of alternatives per choice set. In this case only 2.
• options: the strings used in the ‘wide’ database to denote the user’s choice in each choice set, given in the same order as ‘alts’ in the design matrix.
• design: the design matrix as a data.frame or list. See “data(design)” for an example.
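
A short note on checking the type of the ‘data’ and ‘design’ arguments: in base R, `typeof()` reports the underlying storage type, which is the same for a data frame and a plain list, so `class()` or `is.data.frame()` is needed to tell them apart. The snippet below illustrates this with a throwaway object.

```r
df <- data.frame(x = 1:3)

typeof(df)                     # "list": a data frame is stored as a list of columns
class(df)                      # "data.frame"
is.data.frame(df)              # TRUE
is.data.frame(list(x = 1:3))   # FALSE: a plain list is not a data frame
```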

The code will create a data frame called ‘data.c’, which contains the same database in ‘long’ format. This new database is ready to be analyzed with discrete choice models. Using the following function, the user can quickly fit a conditional logit model and a mixed (random parameters) logit model.

DCEestm(data.c = data.c, model = "all",
  params = c("pers2", "pers3", "pers4", "form2", "rut2", "prec2", "prec3", "prec4"),
  rand = c("pers2", "pers3", "pers4", "form2", "rut2", "prec2", "prec3", "prec4"))

• data.c: data coded using DCEmgmt.
• model: the user can specify “clogit” for conditional logit, “mixlogit” for random parameters logit, or “all” for both.
• params: a vector including the names of the levels (columns representing the alternatives’ characteristics) of the alternative-specific variables.
• rand: a vector specifying which variables should be treated as random parameters in the mixlogit (this argument must be omitted when estimating only the conditional logit).
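
To give a sense of what a conditional logit on long-format choice data looks like, here is a minimal sketch that uses `clogit()` from the ‘survival’ package on simulated data. It is an independent illustration and not necessarily how `DCEestm` is implemented. The simulated attribute `x`, the true coefficient of 1, and the use of ‘gid’ as an identifier of each respondent-by-choice-set group are all assumptions made for this example.

```r
library(survival)  # recommended package, shipped with most R installations

set.seed(1)
n_sets <- 500  # number of simulated choice situations

# Long format: two rows (alternatives) per choice situation
d <- data.frame(
  gid  = rep(seq_len(n_sets), each = 2),     # assumed group identifier
  alts = rep(1:2, times = n_sets),
  x    = rbinom(2 * n_sets, 1, 0.5)          # hypothetical dummy attribute
)

# Random utility: true coefficient of 1 on x plus Gumbel-distributed noise
u <- 1 * d$x - log(-log(runif(nrow(d))))

# The alternative with the highest utility in each group is the chosen one
d$choice <- as.integer(ave(u, d$gid, FUN = function(v) v == max(v)))

# Conditional logit: choice explained by x, with one stratum per choice situation
fit <- clogit(choice ~ x + strata(gid), data = d)
coef(fit)
```

The estimated coefficient on `x` should be positive, since the simulated respondents prefer alternatives with `x = 1`. With real data, the dummy columns of the design matrix would take the place of `x`.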

For any questions or suggestions, please write to