Qualitas Corpus Clone Collection Glossary

This page is intended to contain all the Collection-specific terms that are used, with links to the full documentation (if appropriate).
Term (link to details) Short description
Candidate Pair A candidate pair is a pair of code fragments for which there is some information regarding whether or not one is a clone of the other. The information may be that the pair is in fact a clone pair, but it could also be that the pair is not a clone pair.
Code Fragment A code fragment is any contiguous sequence of text lines in a source code file.
Clone One code fragment is a clone of another fragment if it is conceivable that a rational developer created one fragment by copying (and possibly modifying) the other.
Clone Pair A clone pair is a pair of code fragments for which there is some evidence that one fragment is a clone of the other. That is, it is a candidate pair where the information is in support of the clone relationship existing.
Cluster A cluster is a set of code fragments where, for every code fragment, there is at least one other code fragment such that the two fragments together are a clone pair. Note that this is the "connected component" definition. The "clique" variant would require that every pair of code fragments form a clone pair, but this variant is not used in the Collection.
Confidence Level Confidence level is an ordinal-scale value indicating the degree of confidence regarding some datum.
ELOC "Executable" lines of code: lines that are not blank, are not entirely comments, and contain more than just braces.
Master File This is the authoritative data source for clone information. There is one for each system version.
Provenance Provenance, in the context of the Collection, refers to identifying the origin of, and (ideally) the processes for creating, the data that provides supporting evidence for the clone data.
Clone type Several classifications have been proposed for code clones; the one that seems to be referred to the most is the clone 'type'. The categories in this classification are Type-1, Type-2, Type-3, and Type-4 (definitions taken from Roy et al.). There is not unanimous agreement on these categories, especially on what belongs in Type-3, and Type-4 is not considered in the Collection.
Type-1 clone Identical code fragments except for variations in whitespace, layout and comments.
Type-2 clone Syntactically identical fragments except for variations in identifiers, literals, types, whitespace, layout and comments.
Type-3 clone Copied fragments with further modifications such as changed, added or removed statements, in addition to variations in identifiers, literals, types, whitespace, layout and comments.
Type-4 clone Two or more code fragments that perform the same computation but are implemented by different syntactic variants.
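
The "connected component" definition of Cluster given above can be illustrated with a small sketch. It assumes clone pairs are supplied as pairs of fragment identifiers; the identifiers and the function name clusters_from_clone_pairs are hypothetical and are not part of the Collection's tooling or data format.

```python
from collections import defaultdict


def clusters_from_clone_pairs(pairs):
    """Group fragments into clusters, i.e. connected components of the clone-pair graph.

    `pairs` is an iterable of (fragment_id, fragment_id) tuples. The identifiers
    are hypothetical placeholders for real code fragments.
    """
    # Build an undirected graph: fragments are nodes, clone pairs are edges.
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


example_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
example_clusters = clusters_from_clone_pairs(example_pairs)
```

In this example A and C end up in the same cluster even though they are not themselves a clone pair, which is exactly where the connected-component definition differs from the "clique" variant.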
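
The ELOC definition above can be approximated with a short counting routine. This is a minimal sketch, assuming Java-style comments (// and /* */) and ignoring the possibility of comment markers inside string literals; count_eloc is a hypothetical name, and the Collection's own measurement may differ in detail.

```python
import re

_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")
_WHITESPACE_OR_BRACE = re.compile(r"[\s{}]")


def count_eloc(source: str) -> int:
    """Count lines that are not blank, not entirely comments, and more than braces.

    Assumption: Java-style comments; comment markers inside string literals
    are not handled specially.
    """
    # Replace each block comment with its newlines so line boundaries survive.
    text = _BLOCK_COMMENT.sub(lambda m: "\n" * m.group(0).count("\n"), source)
    text = _LINE_COMMENT.sub("", text)

    count = 0
    for line in text.splitlines():
        # Drop whitespace and braces; any remaining character makes the line count.
        if _WHITESPACE_OR_BRACE.sub("", line):
            count += 1
    return count


sample = (
    "/* header\n"
    " comment */\n"
    "public class A {\n"
    "\n"
    "    // note\n"
    "    int x = 1; // trailing\n"
    "    {\n"
    "    }\n"
    "}\n"
)
sample_eloc = count_eloc(sample)
```

Only the class declaration and the field declaration count here; the comment lines, the blank line, and the brace-only lines are excluded.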
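
As an illustration of the Type-1 definition above, two fragments can be compared after discarding comments and whitespace. This is a rough sketch rather than a clone detector: it assumes Java-style comments, the helper names are hypothetical, and removing all whitespace can in rare cases merge adjacent tokens, so a token-based comparison would be more precise.

```python
import re


def normalize_type1(fragment: str) -> str:
    """Strip comments and all whitespace (a crude stand-in for tokenization)."""
    text = re.sub(r"/\*.*?\*/", " ", fragment, flags=re.DOTALL)  # block comments
    text = re.sub(r"//[^\n]*", " ", text)                        # line comments
    return re.sub(r"\s+", "", text)                              # layout and whitespace


def looks_like_type1(a: str, b: str) -> bool:
    """True if the fragments differ only in whitespace, layout and comments."""
    return normalize_type1(a) == normalize_type1(b)


original = "int sum(int a, int b) {\n    return a + b; // add\n}"
reformatted = "int sum(int a,int b)\n{\n  /* add */ return a+b;\n}"
renamed = "int total(int x, int y) {\n    return x + y;\n}"
```

Here original and reformatted compare equal under this check, while renamed does not: it differs in identifiers, which is a Type-2 variation.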