On the Effectiveness of Interactive Detection of Code Anomalies: An Empirical Assessment

Background: Code anomalies should be detected as early as possible so that refactoring opportunities are revealed in due time. Refactoring aims at improving software maintainability, but its late application is counter-productive or even prohibitive. Detection of code anomalies is traditionally supported by non-interactive detection (NID) techniques, which encourage developers to reveal anomalies only in later revisions or versions of a program. The reason is that NID does not support progressive interaction of developers with anomalous code. In addition, it reveals anomalies in the entire source code only upon an occasional, explicit developer request. More recently, the notion of interactive detection (ID) has emerged to address NID's limitations. This technique reveals anomalies while code fragments are still being edited and without an explicit developer request, thereby encouraging early anomaly detection. Problem Statement: Recent studies suggest the use of NID might lead to: (i) a low number of correctly identified anomalies, and (ii) ineffective refactoring actions. Although ID seems promising, there is no knowledge about its impact on anomaly detection and refactoring actions. Goal: We evaluate the effectiveness of an ID technique on early anomaly detection. In addition, we analyze whether an ID technique helps developers perform effective refactoring actions. Method: We conducted a controlled experiment with 14 subjects who performed tasks related to anomaly detection and judgments of required refactoring actions. Results: Our study reveals that the use of ID improves anomaly detection, as developers tend to identify more anomalies early when compared to the use of NID. Conclusions: Although ID helps detect more anomalies than NID, it may induce ineffective refactoring actions.
Keywords: Code Anomalies, Interactive Detection, Software Refactoring.


I. INTRODUCTION
Code anomalies are structures in a program that often indicate the presence of deeper maintainability problems [1]. Code anomalies should be detected early, during the ongoing implementation of a program rather than in later maintenance tasks. Early detection of anomalies is likely to lead to effective refactoring actions [2]. Refactoring is a behavior-preserving change in the program structure intended to remove code anomalies and improve software maintainability [1]. However, the early detection of code anomalies is not a trivial task, and many factors can hinder it. Among those factors, we highlight that developers may not be able to identify code anomalies early due to their lack of experience with this task [3]. In addition, conventional techniques may offer limited support for, or even discourage, early detection of code anomalies [3].
Several techniques for (semi-)automated detection of code anomalies have been proposed in the literature (e.g. [3][5][6][7]). Most of these existing techniques are characterized as supporting non-interactive detection (NID) [3][6]. NID techniques reveal a global list of code anomalies once the source code is completed and compiled. Moreover, the use of NID demands an explicit, occasional request from the developer to trigger the analysis of the full source code. More importantly, NID techniques do not offer means for developers to interact with the anomalous code elements while they are producing, editing or inspecting their program statements. All these characteristics of NID techniques encourage late detection of code anomalies.
On the other hand, the notion of interactive detection (ID) has been recently proposed [6]. An ID technique is intended to reveal code anomalies in program fragments without an explicit developer request, thereby encouraging early detection of code anomalies. In contrast to NID, ID provides support for developers interacting with anomalous code as they edit or browse program statements. Unfortunately, there is little empirical knowledge about the effectiveness of interactive detection of code anomalies [6].
Most of the empirical studies on anomaly detection focus strictly on the evaluation of NID [9][10][11][12]. These studies pointed out that NID techniques lead to a low number of correctly identified code anomalies. Other studies also suggested that NID techniques lead to ineffective refactoring actions [21][22]. Therefore, the expectation is that ID techniques can better promote early identification of code anomalies and, as a consequence, effective refactoring actions. Even though organizations and developers might want to consider the adoption of ID techniques, there is no evidence in the literature about their effectiveness on anomaly detection. In other words, there is still a lack of empirical knowledge about the use of ID.
Therefore, our goal is to address the following research question: "Can the use of ID improve the effectiveness on anomaly detection and refactoring actions?". To do so, we conducted a controlled experiment involving 14 subjects with different working experience and technical knowledge. Subjects performed tasks related to anomaly detection and judgments of refactoring with the support of both ID and NID techniques. In order to evaluate the effectiveness of both techniques, we used two measures: precision and recall. We selected these two measures because they have been widely adopted in other effectiveness studies involving code anomaly detection [13][14][15]. Our comparative analysis allowed us to evaluate whether some ID characteristics could bring benefits or drawbacks for effective anomaly detection.
The experimental results revealed that the use of ID achieved better effectiveness on code anomaly detection when compared to NID. Developers identified a much higher number of code anomalies when using ID. On the other hand, we observed that the use of ID might lead to a high number of false positives and, consequently, developers can be induced to perform ineffective refactoring actions.
The remainder of this paper is organized as follows. Section 2 introduces basic concepts required to understand the analysis performed in our study. Study settings are described in Section 3 while the results associated with interactive detection of code anomalies are discussed in Section 4. In Section 5, we present the threats to validity observed in our study. Related work is discussed in Section 6. Finally, we present our conclusions and point out directions for future work in Section 7.

II. BACKGROUND
This section presents essential concepts related to code anomalies, code refactoring and support for anomaly detection.

Code Anomalies and Refactoring
Code anomalies are symptoms in the program structure that may indicate the presence of deeper maintainability design problems [1]. They suggest where perfective maintenance is required in the source code [1]. Code anomalies have been proposed and cataloged by several researchers, including Fowler [1], van Emden and Moonen [13], and Arevalo [16]. Typical examples of code anomalies are Feature Envy and Long Method [1].
Early detection of code anomalies is key to promoting the longevity of a software system. Early detection is the ability to identify opportunities for refactoring [1][19][20] as soon as anomalies are introduced in the source code by programmers. The longer code anomalies remain in the source, the harder it becomes to refactor them out of a program. Refactoring [1][17] is defined as a behavior-preserving change made to the structure of a program with the aim of improving software maintainability. Fowler [1] has identified more than 70 different types of refactoring, which range from local changes in a specific code element (such as the Extract Local Variable refactoring) to global changes (such as the Extract Class refactoring).
The effectiveness of refactoring actions is largely dependent on the effectiveness of detecting the code anomalies. Preliminary studies [21][22] have exposed negative consequences on code quality whenever ineffective and late refactoring actions are performed. Thus, developers need to identify anomaly instances more effectively and in a timely manner so that refactoring actions can be performed. In contrast, if developers miss occurrences of anomaly instances, they can perform ineffective refactoring actions in the source code.
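To make the notions of a code anomaly and a behavior-preserving refactoring more concrete, the following is a minimal sketch of a Feature Envy instance and one way to remove it by moving the method. The classes `Order` and `Customer` and all method names are hypothetical and do not come from the study; the experiment itself used Java code, and Python is used here only for brevity.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    street: str
    city: str
    country: str

    # After refactoring (Move Method): the logic lives next to the data it uses.
    def mailing_label(self) -> str:
        return f"{self.street}, {self.city}, {self.country}"


@dataclass
class Order:
    customer: Customer

    # Before refactoring: this method reads only Customer's fields,
    # which is a typical Feature Envy symptom.
    def mailing_label_envious(self) -> str:
        c = self.customer
        return f"{c.street}, {c.city}, {c.country}"

    # After refactoring: Order simply delegates to Customer.
    def mailing_label(self) -> str:
        return self.customer.mailing_label()


order = Order(Customer("1 Main St", "Rio de Janeiro", "Brazil"))
# Behavior is preserved: both versions produce the same label.
print(order.mailing_label_envious() == order.mailing_label())
```

The refactored version changes where the logic lives but not what it computes, which is what "behavior-preserving" means in the definition above.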

Support to Detection of Code Anomalies
Usually, developers use (semi-)automated techniques to guide their effort on anomaly detection [18][23]. These techniques basically comprise two components [3][7]: (i) a mechanism for anomaly detection; and (ii) a user interface responsible for displaying detected anomaly instances, i.e. occurrences of code anomalies identified by the detection mechanism. The detection mechanism may allow developers to choose or define algorithms for anomaly detection. For example, developers can choose metrics and thresholds to compose their own detection algorithms [5]. Based on the developer's interaction with these components and with the anomalous code elements, anomaly detection can be classified into two different techniques, as shown in Figure 1.
Interactive detection (ID) is a technique that supports the developer's interaction with anomalous code elements (Figure 1). ID techniques reveal anomaly instances in code fragments without an explicit request from the developer. Thus, they constantly work on detecting anomaly instances in the code fragments being manipulated by the developer, so a developer using ID techniques can identify instances of code anomalies early. Since developers do not directly interact with the mechanism for anomaly detection, they are able to perform other programming activities. In summary, developers are able to analyze, modify and implement the source code while they interact with the anomalous code elements [6].
Non-interactive detection (NID) is a technique that does not support the developer's interaction with anomalous code elements (Figure 1). NID techniques reveal anomaly instances in the entire source code upon an explicit request from the developer. The mechanism for anomaly detection receives the request and then detects anomaly instances in the entire source code. Consequently, developers using NID techniques identify anomaly instances only later (e.g., when the code is already implemented). Since developers directly interact with the mechanism for anomaly detection, they are not able to concurrently perform other programming activities in the source code [6].
We analyze the ID technique through Stench Blossom [3]. This tool provides the programmer with three different views, which progressively offer information about the anomaly instances in the code fragment being visualized or edited. Initially, the developer interacts with the Ambient View (Figure 2A). This view relies on the metaphor of a "flower", where each "petal" represents the possible occurrence of a specific anomaly in the code fragment. The larger the radius of a "petal", the higher the probability of occurrence of the anomaly. The mechanism for anomaly detection of Stench Blossom calculates this probability. For more information about a specific anomaly instance, the developer must click on the "petal" displayed in the Ambient View. When the developer selects an anomaly, the name of the code anomaly is presented in a dialog box and then the Active View is displayed to the developer (Figure 2B).
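The difference between the two techniques can be illustrated with a minimal sketch, assuming a single metric-plus-threshold rule for Long Method. The threshold value, the function names and the `on_edit` callback are all hypothetical; they are not taken from any of the cited tools.

```python
from typing import Callable, Dict, List

LONG_METHOD_THRESHOLD = 20  # hypothetical threshold, in non-blank lines of code


def is_long_method(body: str, threshold: int = LONG_METHOD_THRESHOLD) -> bool:
    """Metric-plus-threshold rule: count the non-blank lines of a method body."""
    loc = sum(1 for line in body.splitlines() if line.strip())
    return loc > threshold


def detect_on_request(methods: Dict[str, str]) -> List[str]:
    """NID style: analyze the entire code base once, when explicitly asked."""
    return sorted(name for name, body in methods.items() if is_long_method(body))


class InteractiveDetector:
    """ID style: re-check only the fragment being edited, with no explicit request."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self.notify = notify

    def on_edit(self, name: str, body: str) -> None:
        # Assumed to be called by a (hypothetical) editor whenever a fragment changes.
        if is_long_method(body):
            self.notify(name)


methods = {"short": "x = 1\n", "long": "x = 1\n" * 25}
print(detect_on_request(methods))          # whole-program report, on demand

warnings: List[str] = []
detector = InteractiveDetector(warnings.append)
detector.on_edit("long", methods["long"])  # feedback arrives while editing
print(warnings)
```

In this sketch the detection rule is identical in both cases; only the trigger (explicit request versus edit event) and the scope (entire code versus current fragment) differ, which mirrors the classification above.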
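As a rough illustration of the Ambient View metaphor, the sketch below maps an estimated probability to a petal radius with a simple linear scale. This is only an assumption made for illustration: the text states that a larger radius means a higher probability, but not how Stench Blossom actually computes or scales it. The name `petal_radii` and the maximum radius are hypothetical.

```python
from typing import Dict

MAX_RADIUS = 40.0  # hypothetical maximum petal radius, in pixels


def petal_radii(probabilities: Dict[str, float],
                max_radius: float = MAX_RADIUS) -> Dict[str, float]:
    """Map each anomaly's estimated probability (0..1) to a petal radius."""
    radii = {}
    for anomaly, p in probabilities.items():
        clamped = min(max(p, 0.0), 1.0)  # keep out-of-range scores drawable
        radii[anomaly] = clamped * max_radius
    return radii


print(petal_radii({"Feature Envy": 0.5, "Long Method": 0.9}))
```

Any monotonic mapping would preserve the property described in the text; the linear one is chosen here only because it is the simplest.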

Fig. 2. Ambient View (A) and Active View (B).
Finally, if the developer requires detailed information about a specific instance of a code anomaly, the Explanation View (Figure 3) can be displayed by clicking on the name of the anomaly under analysis. The developer can use the color gradation to verify which code fragments are related to a specific instance of a code anomaly. Therefore, the interaction with anomalous code elements provided by Stench Blossom allows developers to better understand the origins of different instances of a given code anomaly.

III. STUDY SETTINGS
This section presents the main concepts related to execution of this research. The details related to the experiment, the choice of subjects and procedures for data analysis are described below.

Effectiveness evaluation
Effectiveness on detection of code anomalies is one of the most important criteria for choosing a technique to perform this activity [8][9]. When a technique for detection of code anomalies is considered effective, it means the technique is able to detect a high number of anomaly instances in a program. In addition, effective techniques should ideally detect only anomaly instances that are indeed a maintainability problem. If developers use effective techniques, they can identify anomaly instances and, consequently, refactoring opportunities that improve software maintainability [8][9]. We used precision and recall to evaluate the effectiveness of anomaly detection. In the following, we define the concepts required to understand these two measures.
Existing code anomalies (ECA) are the actual anomaly instances, i.e. instances confirmed by the experts as a maintainability problem. Experts are developers with deep knowledge about the system and its maintainability problems. Detected code anomalies (DCA) are anomaly instances identified through the use of an anomaly detection technique. Not all detected code anomalies are confirmed as existing (actual) code anomalies by the experts. True positives (TP) are those anomaly instances present in both the DCA and ECA sets, i.e. detected anomaly instances that the experts confirmed as an actual maintainability problem. False positives (FP) are anomaly instances identified by the programmers using a detection technique that are not in the ECA set. Finally, false negatives (FN) are anomaly instances in the ECA set that were not identified by the developers.
\[ \text{precision} = \frac{|TP|}{|DCA|} = \frac{|TP|}{|TP| + |FP|} \qquad (1) \]
\[ \text{recall} = \frac{|TP|}{|ECA|} = \frac{|TP|}{|TP| + |FN|} \qquad (2) \]
The precision and recall measures defined in the equations above (Eq. 1 and Eq. 2) were adapted from van Rijsbergen [26] and have been widely used in other studies [13][14][15]. These previous studies also focused on comparing techniques for anomaly detection. Precision is the ratio of true positives to the number of detected code anomalies. Recall is the ratio of true positives to the number of existing code anomalies.
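The definitions above translate directly into set operations. The following is a minimal sketch, assuming each anomaly instance has a unique identifier; the identifiers and the function name `precision_recall` are hypothetical and the data is invented for illustration only.

```python
from typing import Set, Tuple


def precision_recall(detected: Set[str], existing: Set[str]) -> Tuple[float, float]:
    """Compute (precision, recall) from the DCA and ECA sets."""
    tp = detected & existing   # true positives: detected and confirmed by experts
    fp = detected - existing   # false positives: detected but not confirmed
    fn = existing - detected   # false negatives: confirmed but missed
    precision = len(tp) / (len(tp) + len(fp)) if detected else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if existing else 0.0
    return precision, recall


# Hypothetical data: four confirmed anomalies, three reported by a subject.
eca = {"a1", "a2", "a3", "a4"}
dca = {"a1", "a2", "x9"}
print(precision_recall(dca, eca))
```

Returning 0.0 for an empty DCA or ECA set is a convention chosen for this sketch; the study does not state how such cases were handled.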

Research Questions
In order to address our general research question (Section 1), we defined two specific goals: (i) assess whether developers using the interactive detection (ID) technique identify code anomalies more effectively than developers using the non-interactive detection (NID) technique; and (ii) assess whether using the ID technique leads developers to perform more ineffective refactoring than using the NID technique. Thus, we defined three research questions (Table 1) to achieve the aforementioned goals. The first research question (RQ1) compares both techniques using the precision measure. This analysis is important because it shows the effectiveness of the ID technique regarding the number of true positives and false positives. Similarly, in our second research question (RQ2), we compared the recall measure of the ID and NID techniques. Recall is as important as precision. For example, it allows us to find which technique induced developers to miss more anomaly instances.
Finally, our third research question (RQ3) focuses on evaluating how the techniques interfere in the refactoring actions. As discussed, code anomalies are considered indicators for refactoring actions. Thus, our work considers as effective refactoring actions those modifications applied to anomalous code elements in order to improve the system maintainability. However, techniques for anomaly detection might indicate false positives, and hence developers may apply refactoring actions to code elements that do not represent a true threat to the system maintainability (i.e. ineffective refactoring actions).
For each research question, we defined hypotheses (H), which are summarized in Table 2. We defined H1 and H2 based on empirical evidence found in the work of Murphy-Hill and Black [3]. This work pointed out that the use of the interactive detection (ID) technique can increase the number of anomaly instances found in the source code. Therefore, our expectation is that the use of the ID technique may improve the effectiveness of code anomaly detection in terms of the precision and recall measures. We defined H3 as a consequence of H1 and H2. Since the ID technique constantly provides information about anomaly instances (i.e. regardless of the developers' request), this amount and availability of information may improve the developers' reliability on anomaly detection. Consequently, our expectation is that developers may reduce the number of false positives and hence perform fewer ineffective refactoring actions.

Table 2. Hypotheses
H1: The ID technique has a better recall than the NID technique.
H2: The ID technique has a better precision than the NID technique.
H3: The ID technique leads to fewer ineffective refactoring actions than the NID technique.

Method and Subjects
We followed the recommendations outlined in the work of Kitchenham et al. [24] as a guide for establishing and implementing a controlled experiment. The subjects accomplished tasks related to detection of code anomalies and identification of refactoring opportunities (Section 3.4). They performed these tasks with the support of ID and NID techniques. We chose the ID technique provided by Stench Blossom [3] for two main reasons: (i) it supports all ID features [6], as previously discussed (Section 2.2); and (ii) to the best of our knowledge, it is the only robust solution that provides automated support for ID. We chose manual inspection as the NID technique because it has been widely used in other comparative studies [10][3][4] of techniques for anomaly detection. In addition, this technique does not require automated detection, so the results are not influenced by a particular detection mechanism. Moreover, we have not found any other automated detection technique that supports the same set of anomalies addressed by Stench Blossom. For instance, the automated detection proposed by van Emden and Moonen [13] supports only two code anomalies (Instanceof and Typecast). Conducting a comparative experiment against just these two code anomalies would produce quite limited results. Finally, manual inspection also provides us with a reference for analyzing the impact of an automated ID technique.
The comparison between ID and NID techniques allowed us to analyze whether particular characteristics of ID (e.g. early detection) bring apparent (dis)advantages. It is not the intent of this experiment to compare various ID techniques, such as the one realized by Stench Blossom, because, to the best of our knowledge, there is no other robust automated solution that offers an interactive technique for supporting anomaly detection. Finally, many would consider ID and NID complementary rather than competitive techniques, as they are naturally targeted at different development stages (Section 2.2). Although ID and NID can be used in a complementary way, they can also be used with the same purpose during a programming activity (e.g. analysis of code fragments). In the context of our experiment, the techniques for anomaly detection were evaluated with the same purpose: detection of code anomalies while browsing code elements.
Regarding the subjects of this study, we recruited two main groups: (i) postgraduate students and (ii) professional developers. These subjects were selected based on their interest in participating in the experiment. We expected subjects to have at least intermediate knowledge of Java and refactoring. However, we did not expect subjects to have knowledge about code anomalies or the interactive detection technique used in the experiment. Due to space constraints, a detailed description of the subjects' profile may be found online in our paper's supplementary material [25].

Experiment Description
The subjects performed tasks related to identification of code anomalies and refactoring opportunities. In these tasks, the subjects manipulated Java code files extracted from the Java Core Library [25]. We chose this project because it is an open source industrial system, which makes it easier for independent researchers to replicate this study. Four code files were selected for their similar size and number of code anomalies. The experimental phase required two code files: one file for the ID task (e.g. file A) and the other for the NID task (e.g. file B). This criterion was adopted because both files could be used in the tasks, regardless of the order, reducing their influence on the results of the experiment. Each experimental task was individually conducted with the first author as an observer of the experiment. It is also important to mention that we provided the environment with all the files and tooling support required to execute the experimental tasks. Each subject had at most 60 minutes to execute the experimental tasks. A detailed description of the experimental tasks may be found online in our paper's supplementary material [25]. Finally, we organized the experiment into three different phases, namely:
Phase 1 - Pre-Experiment: Initially, the subjects answered a questionnaire to collect the data necessary to define the subjects' profile (Section 4). Then, the subjects received material with the definition of the eight (8) code anomalies supported by Stench Blossom, as well as an example of the occurrence of each one. A detailed description of the code anomalies used may be found online in our paper's supplementary material [25]. The subjects were given a maximum of 15 minutes to understand these definitions. This step was intended to level the knowledge of the subjects. Finally, the subjects underwent a training session about Stench Blossom and the Eclipse IDE version used in the experiment.
The data obtained from these tasks are used to evaluate the first and second hypotheses (H1 and H2), and Section 4.1 provides a detailed description.
Phase 3 - Judgments of Refactoring: Subjects performed judgments of refactoring using the ID and NID techniques. This phase consisted of identifying instances of the Feature Envy anomaly. We decided to focus on Feature Envy for this experimental phase, as this is the only code anomaly currently supported by the implementation of the Explanation View (Section 2.2). After identifying Feature Envy, the subject had to judge the usefulness of applying a refactoring action. If the judgment was positive, the subject had to answer the following questions: (i) how scattered the anomaly is in the analyzed code, (ii) how likely the subject would be to remove this anomaly, and (iii) which refactoring actions are required. These questions are directly related to judgments of refactoring [1][2]. The following concepts are required to understand this task: Ineffective Refactoring (IR) occurs when the developer judges refactoring to be necessary for an instance of the Feature Envy anomaly that has been considered a false positive. Effective Refactoring (ER) occurs when the developer judges refactoring to be necessary for an instance of the Feature Envy anomaly that has been considered a true positive. The data obtained from these tasks are used to evaluate the third hypothesis (H3), and their description can be seen in Section 4.2.
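The ER and IR definitions can be expressed as a small classification routine. This is a minimal sketch under the assumption that each judged Feature Envy instance has an identifier and that the set of expert-confirmed instances is known; all names and data are hypothetical.

```python
from typing import Iterable, Set, Tuple


def tally_refactoring_judgments(judged: Iterable[Tuple[str, bool]],
                                confirmed: Set[str]) -> Tuple[int, int]:
    """Return (effective, ineffective) refactoring counts.

    judged:    (instance_id, wants_refactoring) pairs given by a subject
    confirmed: ids of instances the experts confirmed (true positives)
    """
    effective = ineffective = 0
    for instance_id, wants_refactoring in judged:
        if not wants_refactoring:
            continue  # no refactoring proposed, so neither ER nor IR
        if instance_id in confirmed:
            effective += 1     # ER: refactoring proposed for a true positive
        else:
            ineffective += 1   # IR: refactoring proposed for a false positive
    return effective, ineffective


# Hypothetical judgments from one subject.
judgments = [("m1", True), ("m2", True), ("m3", False)]
print(tally_refactoring_judgments(judgments, confirmed={"m1"}))
```

Instances for which the subject did not recommend refactoring are ignored here, since both definitions above require a positive judgment.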

Analysis Method
We applied statistical analysis to the data obtained from the experimental tasks. The statistical analyses were carried out with the support of the R tool [27]. This tool provides means for calculating the statistical tests considered in this study: (i) the Wilcoxon signed-rank test [28], and (ii) the paired t-test [28]. The first one was applied to the number of correctly identified anomaly instances. This test was selected because these data did not follow a normal distribution. The second one was applied to the values of recall and precision, since these measures followed a normal distribution. The execution of the experimental tasks produced data for two samples: the sample obtained with the aid of the ID technique and the sample obtained with the aid of the NID technique. Both statistical tests can be applied because each observation in the first sample can be paired with one observation in the second sample.
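To show what the two paired tests compute, the following sketch derives the paired t statistic and the Wilcoxon signed-rank sums using only the Python standard library. The data is invented for illustration and is not the study's data, and p-values are deliberately left out. In R, the tool used in the study, the corresponding calls would presumably be `t.test(x, y, paired = TRUE)` and `wilcox.test(x, y, paired = TRUE)`, although the paper does not list the exact invocations.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def paired_t_statistic(first: Sequence[float], second: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


def signed_rank_sums(first: Sequence[float], second: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # drop zero differences
    ordered = sorted(diffs, key=abs)
    ranks: List[float] = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for tied values
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical per-subject values (NOT the study's data).
recall_id = [0.30, 0.35, 0.25, 0.40]
recall_nid = [0.20, 0.25, 0.20, 0.30]
print(paired_t_statistic(recall_id, recall_nid))

tp_id = [8, 7, 9, 6, 10]
tp_nid = [5, 6, 5, 7, 6]
print(signed_rank_sums(tp_id, tp_nid))
```

Both functions rely on the pairing of observations: each subject contributes one value per technique, and only the per-subject differences enter the statistics.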

IV. RESULTS AND DISCUSSION
In this section, we present the results of the experimental tasks described in Section 3.4. Each subject spent 45 minutes on average to execute the experiment. Therefore, the upper limit of one hour was enough for the subjects to conclude the tasks. Whenever appropriate, statistical analyses are presented. The first phase (Section 3.4) of the experiment involved the application of a questionnaire aiming to determine the subjects' profile. Table 3 summarizes the main characteristics of the subjects' profile. Their profile meets our study assumptions, since all subjects have at least intermediate knowledge about Java, detection of code anomalies and program refactoring. The following subsections present the key results and findings revealed by our study.

Table 3. Results of the pre-experiment questionnaire
Question: Results
Professional practice: 7 subjects were postgraduate students and 7 subjects were professional developers
Experience time: Half of the sample had between 5 and 8 years of experience in software development
Java proficiency: On a scale from 0 to 4 (*), 36% of subjects answered 2 and 57% of subjects answered 3
Anomaly detection proficiency: On a scale from 0 to 4 (*), approx. 80% of the subjects answered 1 or 2
Refactoring proficiency: On a scale from 0 to 4 (*), approx. 60% of the subjects answered 3 or 4

Identification of Code Anomalies
The second phase involved the execution of the tasks related to identification of code anomalies using the non-interactive detection (NID) and interactive detection (ID) techniques. The tasks focused on analyzing the effectiveness of using ID on the detection of code anomalies (Table 4).
ID technique increases both true and false positives: We observed that the subjects identified 22 false positives when using the ID technique. That is, the number of false positives is approximately 38% higher than the number of false positives (16) produced when subjects used the NID technique. Similarly, the subjects identified 106 true positives (i.e. anomalies correctly identified) using the ID technique, while subjects identified 73 true positives using the NID technique. Therefore, the use of ID increased by 45% the total number of true positives identified by the subjects. Finally, the difference in true positives between the ID and NID techniques was statistically significant (p = 0.002, df = 12, z = 3.05, using a Wilcoxon signed-rank test [28]).
Aiming to provide an additional perspective on the effectiveness of the interactive detection of code anomalies, we also analyzed the precision and recall measures. To this end, we applied the collected values to the equations defined in Section 3.1. Table 5 shows the results of these metrics for both the ID and NID techniques. The precision and recall measures were calculated in order to address the research questions RQ1 and RQ2. In addition, these results were used to test the hypotheses H1 and H2, respectively.
ID increases recall: When analyzing the recall measures, we observed that, on average, the subjects using the ID technique achieved a score of 0.30, while the subjects using NID achieved 0.21. Thus, the results represent a difference of approximately 30% in favor of the ID technique. Similar results could be observed when analyzing the sub-samples (students and developers) separately. For instance, recall improved by 40% in the developers' sample and by 50% in the students' sample. Likewise, the difference in recall between ID and NID in the task of identification of code anomalies was statistically significant (p = 0.0013, df = 13, t = 4.06, using a paired t-test [28]).
We also found that recall is directly influenced by the subjects' working experience. The results allowed us to conclude that the use of ID can directly affect the recall values. The use of ID implies the interaction of subjects with the anomalous code elements as they progressively analyze code fragments. Therefore, developers are able to achieve more coverage of the correctly identified instances of code anomalies with ID. Finally, we can confirm the first hypothesis (H1), since the use of ID led to better recall values compared to the use of NID.
ID and NID techniques have similar precision: We observed that the average precision with ID was 0.82, while the use of NID achieved 0.84. As opposed to the recall values, the difference between the precision measures of NID and ID was not significant. This finding is also revealed when analyzing percentage values. We also noticed that the subjects' working experience directly affected the results. The professionals' sample achieved better precision values compared to the students' sample. As previously discussed, although the use of the ID technique increases the number of false positives, it also tends to increase the number of true positives, and both directly affect the precision values. According to the results in Table 5, there is no evidence that the subjects using ID have worse (or better) precision than subjects using the NID technique. Therefore, we can neither confirm nor refute the second hypothesis (H2).
The results indicated that there is no negative impact when the interactive detection of code anomalies is performed progressively, i.e. while the developer is browsing or editing the code. Software developers are likely to benefit from detecting anomalies earlier when they constantly receive feedback provided by ID. Moreover, the constant availability and higher amount of information provided by ID led developers to accept a higher number of anomaly instances.
However, subjects with a higher level of working experience can more confidently judge (i.e. accept or reject) the anomaly instances suggested by ID. The data described in Table 4 allow us to confirm this assumption. More experienced developers using ID obtained a lower number of false positives compared to the students (who have less working experience) using the same technique. Similarly, developers identified a higher number of true positives compared to students. Finally, these results are similar to those presented in the work of Murphy-Hill and Black [3], as developers identify more true positives using ID than developers using the NID technique.
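The percentages reported above can be checked with simple arithmetic on the published counts and averages. The sketch below is only a reader's consistency check, with a hypothetical helper `relative_change`. Note that the "approximately 30%" recall difference matches the rounded averages when the ID score (0.30) is used as the base; taking the NID score (0.21) as the base gives a larger relative gain. Which base the authors used is an assumption here.

```python
def relative_change(baseline: float, value: float) -> float:
    """Change of `value` relative to `baseline`, as a fraction."""
    return (value - baseline) / baseline


# Counts and averages as reported in the text (ID vs. NID).
fp_increase = relative_change(16, 22)             # 0.375, reported as approx. 38%
tp_increase = relative_change(73, 106)            # about 0.452, reported as 45%
recall_gap_vs_id = (0.30 - 0.21) / 0.30           # about 0.30, ID score as base
recall_gain_vs_nid = relative_change(0.21, 0.30)  # about 0.43, NID score as base

for name, value in [("false positives", fp_increase),
                    ("true positives", tp_increase),
                    ("recall gap (ID base)", recall_gap_vs_id),
                    ("recall gain (NID base)", recall_gain_vs_nid)]:
    print(f"{name}: {value:.1%}")
```

Because the recall averages are rounded to two decimals in the text, these recall figures are approximate.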

Judgments of Refactoring
In the third phase (Section 3.4), subjects performed judgments of refactoring using the non-interactive detection (NID) and interactive detection (ID) techniques. These tasks were performed in order to address the research question RQ3, which is validated by testing the hypothesis H3. In summary, we analyzed whether the subjects performed ineffective refactoring (IR) or effective refactoring (ER) related to occurrences of the Feature Envy anomaly. Section 3.4 provides a detailed description of the judgments of refactoring. Finally, Table 6 shows the results of these tasks.
ID technique may increase IR: We verified that the subjects performed 3 ineffective refactorings when using NID, while the subjects using ID performed 6. That is, the number of ineffective refactorings doubled with ID (equivalently, NID yielded 50% fewer). Moreover, when analyzing the results achieved by the developers' sample, subjects performed only 1 ineffective refactoring when using NID, while 2 ineffective refactorings were performed when ID was employed. The same proportion holds for the students' sample: 2 ineffective refactorings were performed when NID was used, while subjects using ID performed 4.
In summary, we observed that subjects using ID are likely to perform more ineffective refactorings compared to subjects using the NID technique. Moreover, we could observe that working experience also influences the results of this task, since the developers performed 50% fewer ineffective refactorings than the students. During the second experimental phase, we observed that most of the false positives were related to occurrences of the Feature Envy anomaly. Furthermore, the students using ID pointed out the majority of the false positives. This fact led us to conclude that occurrences of false positives might be directly associated with developers' working experience. Moreover, the use of the ID technique for anomaly detection also directly affects the refactoring actions.
In conclusion, we can refute the third hypothesis (H3) by analyzing the collected data associated with judgments of refactoring (Table 6). The use of ID might induce developers to perform ineffective refactoring actions because the reported anomaly instance that motivates the refactoring action may be a false positive. In short, if the developer performs refactoring on a false positive related to some anomaly, the effort spent on this task might not contribute to improving the system maintainability.

V. THREATS TO VALIDITY
Sample size and diversity: Fourteen subjects performed the controlled experiment. The results may be directly influenced by the size of the sample and by the subjects' working experience in anomaly detection and refactoring. To mitigate this threat, we chose a sample comprising both students and developers. Furthermore, we conducted training sessions in order to level the subjects' knowledge of these topics.
Experiment Complexity: Other threats to validity are related to: (i) the difficulty in understanding code files [...] working experience also directly affected the recall measures. Actually, the subjects' working experience improved recall values in a greater proportion compared to values associated with precision.
Finally, new experiments about ID effectiveness can be performed using a different set of code anomalies with different levels of granularity (i.e. anomalies that affect different code elements, such as packages, classes and methods). This recommendation is even more relevant for the second phase of the experiment (Section 3.4), which focused on the occurrence of the Feature Envy anomaly.