Evaluating Clone Detection Techniques
Filip Van Rysselberghe
Lab On Re-Engineering
University Of Antwerp
Middelheimlaan 1, B 2020 Antwerpen
Filip.VanRysselberghe@ua.ac.be
Serge Demeyer
Lab On Re-Engineering
University Of Antwerp
Middelheimlaan 1, B 2020 Antwerpen
Serge.Demeyer@ua.ac.be
Abstract
In the last decade, several researchers have investigated techniques to detect duplicated code in programs
exceeding hundreds of thousands of lines of code. All of these techniques have known merits and deficiencies, but
as of today, little is known about where to fit these techniques into the software maintenance process. This paper
compares three representative detection techniques (simple line matching, parameterized matching, and metric
fingerprints) by means of five small to medium-sized cases and analyses the differences between the reported matches.
Based on this experiment, we conclude that (1) simple line matching is best suited for a first crude overview of the
duplicated code; (2) metric fingerprints work best in combination with a refactoring tool that is able to remove
duplicated subroutines; (3) parameterized matching works best in combination with more fine-grained refactoring
tools that work on the statement level.
1. Introduction
Code cloning, the act of copying code fragments and making minor, non-functional alterations, is a well-known
problem for evolving software systems, leading to duplicated code fragments or code clones. Of course, the
normal functioning of the system is not affected, but without countermeasures by the maintenance team, further
development may become prohibitively expensive [7, 18]. Fortunately, the problem has been studied intensively
and several techniques to both detect and remove duplicated code have been proposed in the literature.
As far as removal of duplicated code is concerned, the state of the art proposes refactoring, which is a technique
to gradually improve the structure of (object-oriented) programs while preserving their external behaviour [17].
Extract Method which extracts por