Influence of Algorithms: An Empirical Study on the Influence of an Algorithm's Reliability and Transparency on the User's Decision-Making Process
Sören M. Schröder (399838)
Computer Science / Individual and Technology, RWTH Aachen University
Empirical Investigation of Communication in Human-Robot-Interaction
Prof. Dr. Astrid Rosenthal-von der Pütten
August 28, 2020

Abstract

Algorithmic decision systems (ADS) increasingly support human decisions, including medical diagnoses. This study investigates how the transparency and reliability of such a system influence the decision-making of medical professionals. In a 2×2 between-subjects online experiment, 61 participants assessed 15 images of nevi for melanoma, supported by a purported melanoma-detection algorithm whose transparency (low vs. high) and reliability (reliable vs. unreliable) were manipulated. Participants' predictions deviated significantly more from the unreliable algorithm, and confidence in it dropped for the cases it obviously misjudged. Transparency affected neither perceived fairness nor confidence in the algorithm, but it interacted with actual use in how the participants' reported understanding changed.

Keywords: algorithmic decision systems, transparency, reliability, confidence, decision-making, melanoma

Table of Contents
- Introduction
  - Algorithmic Decision Making
  - Research Questions
  - Hypotheses
- Method
  - Participants
  - Procedure
- Results
  - Fairness of the algorithm
  - Confidence in the algorithm
  - Prediction Deviation
  - Understanding of the algorithm
- Discussion
  - Limitations & Further Research
- Summary and Conclusions
- References
- Appendix A
Figure 1: Participants' reported age
Figure 2: Understanding of the algorithm before and after using it
Table 1: Statistical analysis of confidence in the algorithm
Table 2: Statistical analysis of prediction deviation

1. Introduction

During the last decade, algorithms have gained an increasingly important and omnipresent role in our society. In general, an algorithm is a computational procedure that receives a set of inputs and transforms them into a set of outputs (Cormen, Leiserson, Rivest, & Stein, 2013). Of particular interest for us are those algorithms that Cormen et al. (2013) define as correct: algorithms that halt for a given input and deliver the correct result, thereby solving the computational problem. Another characteristic of an algorithm is that the same input always yields the same corresponding output (Rogers, 1967); in other words, the algorithm is deterministic.

1.1 Algorithmic Decision Making

Algorithms are not only written manually by humans and then executed. They can also be generated by analyzing data and recognizing patterns in it. This approach (often referred to as machine learning) can be used to develop algorithmic decision systems (ADS), which are involved in the process of decision making (Castelluccia & Le Métayer, 2019). The authors describe human involvement as a spectrum that ranges from systems that merely advise a human who remains responsible for the final decision to systems that make decisions fully automatically. How algorithms on the autonomous end of this spectrum can be used for job applications was shown by Wang, Harper, & Zhu (2020) in their work on how affected individuals perceive the fairness of algorithmic decision systems. They mocked a system that behaved as if it could promote workers on a crowdsourcing workplace (an online platform where workers get paid for fulfilling micro tasks (Wang et al., 2020)), based on worker data provided to the algorithm. Even though this study did not use a real algorithm, since its focus was on perceived fairness, it illustrates that ADS for job applications will eventually become reality.

In other fields, ADS are already in use. One of these is the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) software developed by Northpointe, Inc. (Kirkpatrick, 2017). The tool is used to decide whether a defendant may be released on bail or should be kept in custody (Hao & Stray, 2019). Algorithmic decision systems have also gathered attention in the field of medicine. Esteva et al. (2017) developed an ADS that was able to detect skin cancer by analyzing images of the corresponding skin areas, achieving success rates similar to those of the experts they tested. This topic is of special interest since skin cancer is the most common malignant disease among humans (Esteva et al., 2017). Dermatologists diagnose melanomas primarily visually by applying the ABCDE method (Esteva et al., 2017). ABCDE is an abbreviation for different characteristics of a spot that can be used as an aid to detect melanoma at an early stage (Rigel, Friedman, Kopf, & Polsky, 2005): asymmetry, border irregularity, color, diameter, and evolution (of the spot over time). The development of algorithms like the one presented by Esteva et al. is likely to continue,
either to provide easy access to early skin cancer detection for many people by using smartphones (Esteva et al., 2017) or to support dermatologists in their work. For the latter, it is interesting to investigate the effects that emerge from their use. Therefore, this study investigates the use of ADS in the medical context, using the example task of assessing whether nevi on the skin are melanomas. We focus on the transparency and reliability of the ADS and how they influence the decision-making process of medical professionals. In the following, we use algorithm as a synonym for ADS, since the term ADS was not introduced during the study; the participants did not need to know it.

1.2 Transparency

In the area of jurisdiction, the final decision is made by a judge (Kirkpatrick, 2017), and the same would apply when using an ADS in a medical context. For COMPAS we know that it is biased against particular subgroups (Angwin, Larson, Mattu, & Kirchner, 2016), and similar problems are likely to occur in systems for other application areas. Since the users of an ADS (e.g. judges or doctors) usually do not have the technical knowledge necessary to understand how the algorithm reaches its decision, the level of transparency the algorithm provides could be of significant interest for the question of how such systems will be used. Kizilcec (2016) showed that the transparency level an ADS provides influences the trust placed in it. So we state our Research Question 1: How is the user's decision-making influenced by the transparency of the involved algorithm?

1.3 Fairness

Prior research could not find a relation between transparency and perceived fairness in the context of ADS (Wang et al., 2020). Two things might have led to this result. First, operationalizing transparency by merely stating that the algorithm was or was not developed transparently might have been too weak a stimulus to measure an effect. Second, their participants were directly affected by the result, which might have interfered with the perceived fairness. Overcoming these two problems by giving a stronger stimulus and using a setting in which the user is not directly affected, we state Hypothesis 1: Transparency of the algorithm is positively associated with perception of fairness.

1.4 Reliability

Another factor that might influence the perception of the algorithm is how well it performs (i.e. how reliable it is). Studies have shown that the performance of an algorithm can influence how it is perceived by the user, whether the performance is perceived directly by seeing the algorithm fail (Dietvorst, Simmons, & Massey, 2015) or indirectly through information about its error rates and biases (Wang et al., 2020). Since it is difficult to prevent errors during the development of software in general and of ADS in particular, because an already biased dataset can lead to a biased algorithm (Hao & Stray, 2019), it is important to know how the users of an ADS cope with the algorithm's shortcomings and how these influence their decision-making process. So this study also investigates Research Question 2: How is the user's decision-making influenced by the reliability of the algorithm involved?

1.5 Confidence

When people recognize that an algorithm makes an error, they lose confidence in the algorithm (Dietvorst et al., 2015).
In their study, Dietvorst et al. found that this loss of confidence is even greater than if the same mistake had been made by a human. We suggest that confidence will suffer not only when participants directly perceive poor performance, which leads us to Hypothesis 2: Users will show less confidence in decisions made by an unreliable algorithm. Confidence should also suffer when users merely know that the algorithm makes errors, for example from a stated error rate. Wang et al. (2020) reported that a known bias influenced the users' perception of the algorithm. So we state Hypothesis 3: Confidence in the algorithm will be lower when the algorithm's error rates are known than when no information about error rates is provided.

1.6 Conformity

Since confidence in the algorithm is expected to decrease for unreliable algorithms, we also expect users' own predictions to deviate from those stated by the algorithm. This expected lack of conformity gives us Hypothesis 4: The user's predictions (of the probability of the nevus being melanoma) will deviate more from the predictions of a less reliable algorithm.

1.7 Understanding

When using an ADS as an aid, it is important that users understand how the algorithm is used and how its results are to be interpreted; this is necessary to incorporate the algorithm's output into their own decisions. Transparency could be a key factor for understanding an ADS. Wortham, Theodorou, & Bryson (2017) showed that providing insight into the decision-making process of an algorithm can increase the user's understanding. Since insight into an algorithm is a way of making it more transparent, we expect this to hold as well when transparency is provided not during use but in advance, in the form of information about how the algorithm was developed. Therefore, we state our Hypothesis 5: The understanding of the algorithm, before it is used, will be higher for an algorithm with high transparency than for one with low transparency. Besides, we also want to examine how the understanding of the algorithm changes through working with it (i.e. using it). This leads us to Research Question 3: How does the use of the algorithm in the decision-making process change the understanding of it?
2. Method

To answer the stated hypotheses and research questions, an empirical study was performed as an online experiment, which had the advantage of being easier to distribute to more potential participants than a lab experiment. The survey was developed with SoSciSurvey (https://www.soscisurvey.de) and can be found in Appendix A.

2.1 Participants

In total, 61 people participated in the study. Since the exact age was not of interest, the participants were asked to select an age group. To enable the participants to give informed consent, participation was limited to people at least 18 years old. The results show that no participant was older than 44 and that most participants were between 25 and 34 years old (see Figure 1). The absence of older participants was expected and due to the recruitment strategy. The convenience sample was recruited in several ways: the survey was posted in Facebook groups of medical students, sent to medical faculties with the request to distribute it, and posted on online bulletin boards of medical faculties. Furthermore, personal contacts were asked to participate and distribute the survey, and a class coordinator from the medical faculty of RWTH Aachen University distributed the survey to their students.

Figure 1
Participants' reported age

Note. Most participants were between 18 and 34, quite evenly distributed between the 18–24 and 25–34 groups. Only a few participants were between 35 and 44; no one was 45 or above.

The participants were asked to take part only if they had knowledge of applying the ABCDE method, which is why recruitment was limited to the channels described. This requirement was stated in the introduction and verified during the survey by asking where the participants had learned about the method and what their educational background was. From this we could see that 52 of our participants already knew the ABCDE method. Nevertheless, we did not remove the other participants from the sample, given the already low number of participants and the imbalance of conditions that removal would have caused. As compensation for their time, the participants had the option to register their mail address, which was stored separately from the experimental data, for a lottery in which they could win one of five Amazon gift cards with a total value of 150 Euro (1 × 50 Euro and 4 × 25 Euro).

2.2 Procedure

The study was designed as a 2×2 between-subjects online experiment in which the participants were randomly assigned to one condition of each factor. The two factors, our independent variables, were the transparency of the algorithm (low transparency, high transparency) and the reliability of the algorithm (unreliable, reliable). This design allowed us to investigate, on the one hand, the effect of the algorithm's transparency on perceived fairness (Hypothesis 1), confidence (Hypothesis 3), and understanding of the algorithm before using it (Hypothesis 5), and on the other hand, the effects of the algorithm's reliability on confidence (Hypothesis 2) and on the deviation of the users' predictions from the algorithm's (Hypothesis 4).

The study consisted of three parts: an introductory part containing information about the experiment, basic knowledge, and a questionnaire; the task and its explanation;
and a concluding part with several questions regarding the perception of the algorithm, demographics, and general questions about the use of algorithms.

On the introduction page, visitors were informed about the conditions under which they could participate in the study (participation was anonymous, the required time was about 25 minutes, the rough structure of the experiment, a brief description of the task, etc.). They were also told that the purpose of the study was to evaluate an algorithm for detecting melanoma, which had been developed at the computer science department of RWTH Aachen University. This deception was resolved at the end of the experiment but was necessary to keep the participants uninformed during the experiment and thereby prevent a bias from knowing the actual research goals. Further, they were informed about the optional lottery. Afterwards, their attitude towards algorithms in general was measured with 28 questions presented in rotated order, of which 14 were reversed items, on a 7-point Likert scale from "completely disagree" to "completely agree" (e.g. "Algorithms should not make morally difficult decisions." and "Algorithms apply the same scale to everyone."). The participants were then asked to read basic information about the ABCDE method, as a short recap and to ensure a common basic knowledge level, which was verified by a knowledge check before continuing.

The second part started with the explanation of the task and an example of the cases provided later, each of which consisted of a picture showing a section of human skin with a nevus, information about the symptoms, and an assessment by the algorithm. The algorithm stated to which degree (0%–100%) it assessed the nevus to be a melanoma. Depending on the transparency condition, the participants received either very little information about how the algorithm was developed (low transparency) or a detailed explanation of how the algorithm works, how it was trained, and how its performance was tested (high transparency). To ensure the information was read attentively, the participants had to answer one (low transparency) or three (high transparency) corresponding questions correctly. Afterwards, the participants evaluated this explanation and stated how confident they felt in using the algorithm.

For each of the 15 provided cases, the participants were asked to state their own prediction of how likely the nevus is a melanoma (0%–100%) and whether they would perform a biopsy (yes, no). We also asked them to state on a Likert scale (1 = not at all sure, 5 = very sure) how sure they were about their own decision (operationalization of self-confidence) and on another Likert scale (1 = not at all reliable, 5 = very reliable) how reliable they rated the assessment of the algorithm (operationalization of confidence in the algorithm). The 15 cases were divided into 5 negative (clearly no melanoma), 5 positive (clearly melanoma), and 5 ambiguous (not clearly decidable) cases. The classifications and images were provided by a dermatologist from RWTH Aachen University. For the unreliable condition, the algorithm's assessments of two positive cases and one negative case were altered, resulting in two false negatives (clearly melanoma, but the algorithm stated the opposite) and one false positive (vice versa). The reliable condition received the 15 cases without any obviously wrong assessments. We decided not to add any more mistakes, to avoid destroying confidence in the algorithm completely, which might have led participants to ignore the algorithm altogether.
Since a false negative assessment has severe implications for the patient, we decided to add a second false negative instead of an additional false positive.

After performing the task, the participants answered a final questionnaire which asked again how confident they felt using the algorithm. We also asked them to evaluate the algorithm (e.g. "I have largely ignored the algorithm in my decisions"), rate how fair they perceived the algorithm to be, estimate how much they were influenced by it, and state whether responsible personnel in the medical system should be supported by such algorithms. They were also asked to sort several areas of use for ADS (e.g. "recommendations in dating apps", "diagnosis of skin cancer") according to their severity and the likelihood that they would agree to their use. Finally, demographic data was collected (gender, age, field of study, setting in which they learned the ABCDE method, etc.). In the debriefing, the deception about the algorithm's nature was revealed, together with the experimental conditions and which of them had been used in the participant's case. Furthermore, the participants were asked to visit a dermatologist if they observe conspicuous nevi on their skin.
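To make the design concrete, the following minimal Python sketch illustrates the random assignment to the 2×2 conditions and the construction of the stimulus set for the unreliable condition. All case identifiers and probability values are hypothetical, and which specific cases had their assessments altered is an assumption; the sketch only mirrors the structure described above.

```python
import random

# Hypothetical case pool: (case_id, class, algorithm's stated melanoma probability in %).
# The identifiers and scores are illustrative; the real assessments and images
# came from a dermatologist and the mocked algorithm.
CASES = (
    [(f"pos{i}", "positive", 85) for i in range(1, 6)]     # clearly melanoma
    + [(f"neg{i}", "negative", 10) for i in range(1, 6)]   # clearly no melanoma
    + [(f"amb{i}", "ambiguous", 50) for i in range(1, 6)]  # not clearly decidable
)

def assign_conditions() -> tuple[str, str]:
    """Randomly assign a participant to one level of each factor (2x2 between-subjects)."""
    return random.choice(["low", "high"]), random.choice(["reliable", "unreliable"])

def build_case_set(reliability: str) -> list[list]:
    """Return the 15 cases; in the unreliable condition, invert the algorithm's
    assessment for two positive cases (false negatives) and one negative case
    (a false positive). Which specific cases were altered is an assumption."""
    cases = [list(c) for c in CASES]
    if reliability == "unreliable":
        for case in cases:
            if case[0] in {"pos1", "pos2", "neg1"}:
                case[2] = 100 - case[2]  # flip the stated probability
    return cases

transparency, reliability = assign_conditions()
print(transparency, reliability, build_case_set(reliability)[:3])
```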
3. Results

The results of our measurements show that, overall, transparency and reliability had less impact than we expected beforehand. Nevertheless, the collected data gives some insight into how the users' perception of the algorithm is influenced and how this shapes the resulting decisions and actions.

3.1 Fairness of the algorithm

After assessing all 15 cases, the participants were asked to rate the fairness of the algorithm on a Likert scale (1 = very unfair to 5 = very fair). The difference between the high transparency condition (M = 3.52, SD = 0.72) and the low transparency condition (M = 3.52, SD = 0.72) was negligible, with the low transparency condition, if anything, rated marginally higher. A two-way ANOVA likewise did not reveal a significant effect of transparency on the perceived fairness of the algorithm, F(1, 57) = 1.04, p = .313, ηp² = .018, which indicates that the perceived fairness is not influenced by the algorithm's transparency. So, there is no support for Hypothesis 1.

3.2 Confidence in the algorithm

The confidence in the algorithm was stated by the participants for each case. Across all cases, it was between moderate and reliable (M = 3.48, SD = 0.42). A two-way ANOVA did not show any main effect on the confidence in the algorithm, neither of transparency, nor of reliability, nor of their interaction (see Table 1). Therefore, no support for Hypothesis 2 and Hypothesis 3 could be found. Nevertheless, the result for reliability (p = .140) hinted that some effect might exist, so we looked only at the cases in which the unreliable algorithm made obvious mistakes. There, a two-way ANOVA showed a main effect of reliability on the confidence in the algorithm, with a medium effect size (see Table 1). Still, confidence in the unreliable algorithm remained close to moderate (M = 2.92, SD = 0.72) and only somewhat below confidence in the reliable algorithm (M = 3.35, SD = 0.55).

Table 1
Statistical analysis of confidence in the algorithm

All cases

| Variable | df | F(1, 57) | p | ηp² |
|---|---|---|---|---|
| Transparency | 1 | 0.839 | .363 | .015 |
| Reliability | 1 | 2.234 | .140 | .038 |
| Transparency × Reliability | 1 | 0.063 | .802 | .001 |

Unreliable cases only

| Variable | df | F(1, 57) | p | ηp² |
|---|---|---|---|---|
| Transparency | 1 | 2.490 | .120 | .042 |
| Reliability | 1 | 6.379 | .014 | .101 |
| Transparency × Reliability | 1 | 0.129 | .720 | .002 |
3.3 Prediction Deviation

We used the participants' assessments of the cases to calculate their deviation from the algorithm, taking the absolute difference in percentage points between the participants' assessments and the algorithm's. Overall, the deviation can be described as moderate (M = 17.2%, SD = 7.3). For the reliable algorithm we can report lower deviations (M = 14.25%, SD = 7.36); in contrast, the deviation for the unreliable algorithm was about 6 percentage points higher (M = 20.1%, SD = 6.2). A two-way analysis of variance (ANOVA) revealed a large significant effect of reliability on the deviation of the assessments, F(1, 57) = 11.17, p = .001, ηp² = .164, whereas neither transparency nor the interaction between transparency and reliability had a significant effect (see Table 2). These results support Hypothesis 4.

Table 2
Statistical analysis of prediction deviation

| Variable | df | F(1, 57) | p | ηp² |
|---|---|---|---|---|
| Transparency | 1 | 0.41 | .522 | .007 |
| Reliability | 1 | 11.17 | .001 | .164 |
| Transparency × Reliability | 1 | 1.31 | .267 | .022 |
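An analysis of this kind could be reproduced roughly as sketched below: the per-participant deviation is computed as the mean absolute difference, and the 2×2 ANOVA is run with statsmodels. The file name and column names (participant, transparency, reliability, own_pred, algo_pred) are assumptions for illustration, not the names used in our actual analysis.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format results file: one row per participant and case.
df = pd.read_csv("results.csv")

# Absolute deviation in percentage points, averaged per participant.
df["deviation"] = (df["own_pred"] - df["algo_pred"]).abs()
per_participant = df.groupby(
    ["participant", "transparency", "reliability"], as_index=False
)["deviation"].mean()

# 2x2 between-subjects ANOVA (Type II sums of squares).
model = smf.ols("deviation ~ C(transparency) * C(reliability)",
                data=per_participant).fit()
anova = sm.stats.anova_lm(model, typ=2)

# Partial eta squared per effect: SS_effect / (SS_effect + SS_residual).
anova["eta_p2"] = anova["sum_sq"] / (anova["sum_sq"] + anova.loc["Residual", "sum_sq"])
print(anova)
```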
3.4 Understanding of the algorithm

With three Likert-scale items (1 = completely disagree to 7 = completely agree), the participants were asked how well they understood how to use the algorithm, how the algorithm works, and how well they could grasp the algorithm. The scale was administered twice, after the explanation of the algorithm (right before the cases) and directly after the cases, and showed high internal consistency at both points (α = .86 and α = .87). Regarding the measurement before assessing the cases, the understanding of the algorithm in the low transparency condition (M = 4.27, SD = 1.41) was below the understanding in the high transparency condition (M = 4.89, SD = 1.07). However, a two-way ANOVA revealed no significant main effect of transparency on the understanding, F(1, 57) = 3.81, p = .056, ηp² = .063, which gives no support for Hypothesis 5, although the result indicates that an effect of transparency on understanding might exist. Comparing the change in reported understanding between the two measurements shows that for the low transparency group, understanding was higher after using the algorithm's assessments than before, when only the explanation had been given. In contrast, for the high transparency group the value dropped slightly from before to after (see Figure 2). A two-way repeated-measures ANOVA revealed a significant interaction effect of transparency and using the algorithm on the understanding of it, F(1, 57) = 8.32, p = .006, ηp² = .127.

Figure 2
Understanding of the algorithm before and after using it

Note. The understanding of the algorithm increased for the low transparency group from before to after using the algorithm's results. For the high transparency group, the reported understanding decreased slightly from before to after, but not significantly.
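The scale analysis in this section could be computed, for example, with the pingouin library as in the following sketch: Cronbach's α for the three items, and a mixed ANOVA with transparency as between-subjects factor and measurement time (before/after the cases) as within-subjects factor. File and column names are again hypothetical.

```python
import pandas as pd
import pingouin as pg

# Hypothetical item-level data for one measurement point (three understanding items).
items = pd.read_csv("understanding_items_pre.csv")  # assumed columns: item1, item2, item3
alpha, ci = pg.cronbach_alpha(data=items)
print(f"Cronbach's alpha = {alpha:.2f} (95% CI {ci})")

# Hypothetical long-format scale scores: one row per participant and time point.
und = pd.read_csv("understanding.csv")  # assumed columns: participant, transparency,
                                        # time ('pre'/'post'), understanding

# Mixed ANOVA: transparency between subjects, measurement time within subjects.
aov = pg.mixed_anova(
    data=und,
    dv="understanding",
    within="time",
    subject="participant",
    between="transparency",
)
print(aov)  # the 'Interaction' row corresponds to the reported transparency x time effect
```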
4. Discussion

The goal of this study was to gain better insight into how ADS are used by medical professionals in general and (potential) dermatologists in particular. We looked at transparency and reliability as factors that influence the perception of the algorithm and thereby how it is used. For this purpose, we observed several dependent variables: fairness, confidence, conformity, and understanding.

• Fairness
  • Wang et al. (2020) relate fairness to the outcome for the person affected; since our participants were not affected by the algorithm's assessments, the concept of fairness may not be applicable here.
• Confidence
  • Confidence in the algorithm remained "moderate" even when it made obvious mistakes.
  • A few mistakes do not seem to destroy the confidence in the algorithm.
  • A possible reason: participants received no feedback on their decisions, similar to the obedience experiment with the MyKeepon robot.
• Transparency (Wortham, Theodorou, & Bryson, 2017)
  • A possible explanation: high transparency may lead people to overrate their understanding.
4.1 Limitations & Further Research

• One Likert scale was mislabeled, ranging from "agree completely" to "agree completely".
• Due to the low number of participants, participants without the relevant education were kept in the sample to avoid imbalanced conditions.
• The number of manipulation-check questions for transparency differed between conditions.
• The concept of fairness was arguably not applicable in this setting.
• The transparency information may have been forgotten by the time of the task, or may have been overwhelming.

For further research we suggest:
• A baseline group without the algorithm, to reveal the bias introduced by it
• Including algorithm appreciation
• A manipulation check for reliability
• Including confounding variables
• An analysis of context comparisons
• Different transparency levels/types
The focus on mainly students was due to the fact that this study acts as a pre-study for further research. By applying the improvements suggested above, the methodology should be refined; a later study might then be performed with professionals from the field of skin cancer diagnosis.
5. Summary and Conclusions

In a 2×2 between-subjects online experiment with 61 participants, we examined how the transparency and reliability of an algorithmic decision system influence medical decision-making in a melanoma-assessment task. Reliability mattered most: participants' predictions deviated significantly more from the unreliable algorithm, and their confidence dropped for the cases the algorithm obviously misjudged, although it never collapsed entirely. Transparency showed no significant effect on perceived fairness or confidence, but it interacted with use of the algorithm in how reported understanding changed. These results suggest that users adjust to an unreliable ADS by deviating from its predictions rather than by abandoning it.
6. References

Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016, May 23). Machine bias. ProPublica.

Castelluccia, C., & Le Métayer, D. (2019). Understanding algorithmic decision-making: Opportunities and challenges. European Parliament. https://doi.org/10.2861/536131

Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2013). Algorithmen - Eine Einführung. Walter de Gruyter. https://doi.org/10.1515/9783110522013

Dietvorst, B. J., Simmons, J. P., & Massey, C. (2015). Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1), 114–126. https://repository.upenn.edu/cgi/viewcontent.cgi?article=1392&context=fnce_papers

Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115–118. https://doi.org/10.1038/nature21056

Hao, K., & Stray, J. (2019). Can you make AI fairer than a judge? Play our courtroom algorithm game. MIT Technology Review. Retrieved August 16, 2020, from https://www.technologyreview.com/2019/10/17/75285/ai-fairer-than-judge-criminal-risk-assessment-algorithm/

Kirkpatrick, K. (2017). It's not the algorithm, it's the data. Communications of the ACM, 60(2), 21–23. https://doi.org/10.1145/3022181

Kizilcec, R. F. (2016). How much information? Effects of transparency on trust in an algorithmic interface. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2390–2395. https://doi.org/10.1145/2858036.2858402

Rigel, D. S., Friedman, R. J., Kopf, A. W., & Polsky, D. (2005). ABCDE—An evolving concept in the early detection of melanoma. Archives of Dermatology, 141(8), 1032–1034. https://doi.org/10.1001/archderm.141.8.1032

Rogers, H. (1967). Theory of recursive functions and effective computability. https://doi.org/10.1137/1011079

Wang, R., Harper, F. M., & Zhu, H. (2020). Factors influencing perceived fairness in algorithmic decision-making: Algorithm outcomes, development procedures, and individual differences. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–14. https://doi.org/10.1145/3313831.3376813

Wortham, R. H., Theodorou, A., & Bryson, J. J. (2017). Robot transparency: Improving understanding of intelligent behaviour for designers and users. In Y. Gao, S. Fallah, Y. Jin, & C. Lekakou (Eds.), Towards Autonomous Robotic Systems (pp. 274–289). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-64107-2_22

7. Appendix A