Qualitas Corpus Clone Collection Data Description
The main goal of the clone collection is to provide data about code clones
in the Qualitas Corpus. A key principle is that any data provided comes with
its provenance: information as to where that data came from
(or what process led to its creation). This is important to establish the
trustworthiness of the data.
The most fundamental information is the identification of a clone pair. This
is a claim that one code fragment is a clone of another fragment. There is
currently no operational definition as to what it means for one fragment to
be a clone of another, and so there will be some error in such claims. The
Collection uses the following guideline:
One code fragment is a clone of another fragment if it is conceivable
that a rational developer created one fragment by copying (and possibly
modifying) the other.
The clone data is provided in an RCF format file. The
file provides a set of clone candidate pairs. The following information is
provided for each candidate. Unimplemented.
- Code Fragment 1: One of the code fragments in the candidate pair.
- Location 1: The location of Code Fragment 1. At minimum this is a path to a
  source code file and a start and end line number. Other information may be
  provided (e.g. if the fragment corresponds to a programming unit, such as a
  method), but is not required. See Other Attributes below.
- Code Fragment 2: The other code fragment in the candidate pair.
- Location 2: The location of Code Fragment 2.
- Clone/Not clone: Whether there is evidence that the candidate pair is a
  clone or not.
- Provenance: A summary of the evidence to support the Clone/Not clone value.
  This is intended to allow distribution of the Master file without the
  associated provenance data.
- Confidence Level: How much confidence there is in the Clone/Not clone value.
- Other Attributes: Other attribute values may be provided, depending on what
  other data is available. Examples include:
  - Other location information. For example, if the code fragment also
    corresponds to a method, then the method identification should be given.
  - Other size information. For example, ELOC, number of tokens, or number of
    nodes in the AST.
  - Clone type.
All attribute values must have associated (and identified) provenance
data.
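
The fields listed above can be pictured as a small record type. The following
is only an illustrative sketch in Python, not the RCF schema: the class names,
field names, the 0.0 to 1.0 confidence scale, and the use of a plain string to
identify provenance are all assumptions made for the example. Each code
fragment is represented by its location, which is the minimum the description
requires.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Location:
    """Minimum location data: a source file path and an inclusive line range."""
    path: str
    start_line: int
    end_line: int
    method: Optional[str] = None  # optional extra location information

    def __post_init__(self):
        if self.start_line < 1 or self.end_line < self.start_line:
            raise ValueError("expected 1 <= start_line <= end_line")


@dataclass(frozen=True)
class AttributeValue:
    """An attribute value paired with the provenance that backs it."""
    value: object
    provenance: str  # hypothetical identifier for the provenance data

    def __post_init__(self):
        # Mirrors the rule that every attribute value must have provenance.
        if not self.provenance:
            raise ValueError("every attribute value needs provenance")


@dataclass
class CloneCandidate:
    """One candidate pair, following the fields in the list above."""
    location1: Location   # where Code Fragment 1 is found
    location2: Location   # where Code Fragment 2 is found
    is_clone: bool        # the Clone/Not clone value
    provenance: str       # summary of the evidence for is_clone
    confidence: float     # assumed 0.0-1.0 scale; the source fixes no scale
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)


# Example record; the paths and values here are made up for illustration.
candidate = CloneCandidate(
    location1=Location("src/A.java", 10, 42, method="A.parse"),
    location2=Location("src/B.java", 100, 132),
    is_clone=True,
    provenance="reported by a clone detector and confirmed by manual review",
    confidence=0.8,
    attributes={"tokens": AttributeValue(215, "token-count-run-1")},
)
```

Keeping the provenance next to each attribute value, rather than in a separate
table, makes the "no value without provenance" rule checkable at construction
time. A real encoding may well separate the two.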