[PAST EVENT] Bogdan Dit, Computer Science - Ph.D. Defense

June 24, 2015
1pm - 2:30pm
McGlothlin-Street Hall, Room 002
251 Jamestown Rd
Williamsburg, VA 23185
Textual or unstructured data generated during the software development process contains a significant amount of useful information that captures design decisions and the rationale of developers. One of the ways to exploit this information in order to support various software engineering (SE) tasks (e.g., concept location, traceability link recovery, change impact analysis, etc.) is to use Information Retrieval (IR) techniques (e.g., Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, etc.).

Two of the most important steps in a typical process of applying IR techniques to support SE tasks are: (i) preprocessing the corpus (i.e., a set of documents associated with a software system) by removing special characters, splitting identifiers, removing stop words, stemming identifiers, etc., and (ii) configuring the IR technique (i.e., setting up its parameters) and applying it to the preprocessed corpus.
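
As a rough illustration of step (i), the preprocessing options can be pictured as switches on a small pipeline. The sketch below is only an assumption about what such a pipeline might look like: the function names, the stop-word list, and the toy suffix stripper are hypothetical and are not taken from the dissertation.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def split_identifier(token):
    """Split a token on underscores and camelCase boundaries."""
    parts = []
    for piece in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", piece))
    return parts

def naive_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text, split=True, remove_stop_words=True, stem=True):
    """Turn raw text into index terms; each flag toggles one preprocessing step."""
    # Keeping only letters, digits, and underscores drops special characters.
    tokens = re.findall(r"[A-Za-z0-9_]+", text)
    if split:
        tokens = [part for tok in tokens for part in split_identifier(tok)]
    tokens = [tok.lower() for tok in tokens]
    if remove_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(tok) for tok in tokens]
    return tokens
```

Calling `preprocess("openFile_handler(path)")` with the defaults yields the terms `open`, `file`, `handler`, and `path`, while turning off `split` keeps `openfile_handler` as a single term. This is the kind of choice that can change what an IR technique retrieves.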

In our previous work, we observed that the various options available for the preprocessing steps of the corpus (e.g., splitting identifiers), as well as the different parameter values for configuring IR techniques (e.g., configuring the parameters for LDA), can significantly influence the results produced by IR techniques on different datasets for various SE tasks.

This work proposes the use of Genetic Algorithms (GAs) to automatically configure and assemble an IR process to support software engineering tasks. The approach, named GA-IR, determines the (near) optimal solution to be used for each step of the IR process.

For example, for the corpus preprocessing steps, our GA-IR approach will determine which special characters to remove, will choose the method to split the identifiers, and will decide whether to remove stop words and how to stem identifiers. In addition, for the chosen IR technique, it will automatically determine its (near) optimal parameter values. As a preliminary study, we applied GA-IR to three different software engineering tasks: (i) traceability link recovery, (ii) feature location, and (iii) identification of duplicate bug reports. The results of the study indicate that GA-IR outperforms approaches previously used in the literature, and that it does not significantly differ from an ideal upper bound that could be achieved by a supervised approach (i.e., one that knows the results a priori) and a combinatorial approach (i.e., one that considers a large number of parameter combinations and knows the results beforehand).
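
To make the idea of searching over IR-process configurations more concrete, here is a minimal genetic-algorithm sketch. It is not the GA-IR implementation: the search space, the operators (uniform crossover, per-gene mutation, tournament selection, elitism), and especially `toy_fitness` are assumptions chosen for illustration. The abstract does not say how GA-IR scores a candidate configuration, so the fitness function here is a placeholder.

```python
import random

# Hypothetical search space; option names and values are illustrative only.
SEARCH_SPACE = {
    "split_identifiers": ["none", "camel_case", "camel_case_and_underscore"],
    "remove_stop_words": [False, True],
    "stemmer": ["none", "porter", "snowball"],
    "num_topics": [25, 50, 100, 200],
}
KEYS = list(SEARCH_SPACE)

def random_config(rng):
    """Draw one value per option to form a candidate configuration."""
    return {k: rng.choice(SEARCH_SPACE[k]) for k in KEYS}

def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one of the two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in KEYS}

def mutate(config, rng, rate=0.1):
    """Resample each gene with probability `rate`."""
    return {k: (rng.choice(SEARCH_SPACE[k]) if rng.random() < rate else v)
            for k, v in config.items()}

def tournament(population, scores, rng, k=3):
    """Return the fittest of k randomly picked individuals."""
    picks = rng.sample(range(len(population)), k)
    return population[max(picks, key=lambda i: scores[i])]

def evolve(fitness, generations=30, pop_size=20, seed=0):
    """Evolve configurations and return the best one found."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        children = [best]  # elitism: the best-so-far always survives
        while len(children) < pop_size:
            p1 = tournament(population, scores, rng)
            p2 = tournament(population, scores, rng)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = children
        best = max(population, key=fitness)
    return best

def toy_fitness(config):
    """Placeholder score; a real fitness would evaluate the IR process itself."""
    return (int(config["split_identifiers"] == "camel_case_and_underscore")
            + int(config["remove_stop_words"])
            + int(config["stemmer"] == "porter")
            + int(config["num_topics"] == 100))
```

Calling `evolve(toy_fitness)` returns a dictionary with one chosen value per option. Because the best individual is carried into every new generation, the returned configuration is never worse, under the given fitness function, than the best member of the initial population.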

Bogdan Dit is a Ph.D. candidate in the Computer Science Department at the College of William & Mary. He obtained his M.S. in Computer Science from Wayne State University in 2009. His research interests include software evolution and maintenance, program comprehension, the application of information retrieval in software engineering, and the reproducibility of experiments in software maintenance. He has published in various top Software Engineering venues such as ICSE, ICSM, and EMSE. He received the Best Paper Award at the 29th IEEE International Conference on Software Maintenance (ICSM '13).