Qualitas Corpus Clone Collection Data Description

The main goal of the clone collection is to provide data about code clones in the Qualitas Corpus. A key principle is that any data provided comes with its provenance: information as to where that data came from (or what process led to its creation). This is important to establish the trustworthiness of the data.

The most fundamental information is the identification of a clone pair. This is a claim that one code fragment is a clone of another fragment. There is currently no operational definition as to what it means for one fragment to be a clone of another, and so there will be some error in such claims. The Collection uses the following guideline:

One code fragment is a clone of another fragment if it is conceivable that a rational developer created one fragment by copying (and possibly modifying) the other.

The clone data is provided as an RCF format file. The file provides a set of clone candidate pairs. The following information is provided for each candidate. Unimplemented.

Code Fragment 1

One of the code fragments in the candidate pair.

Location 1

The location of Code Fragment 1. At minimum, this is a path to a source code file together with a start and end line number. Other information may be provided (e.g. the method identification, if the fragment corresponds to a programming unit such as a method), but is not required. See Other Attributes below.

Code Fragment 2

The other code fragment in the candidate pair.

Location 2

The location of Code Fragment 2.

Clone/Not clone

Whether there is evidence that the candidate pair is a clone or not.

Provenance

A summary of the evidence to support the Clone/Not clone value. This is intended to allow distribution of the Master file without the associated provenance data.

Confidence Level

How much confidence there is in the Clone/Not clone value.

Other Attributes

Other attribute values may be provided, depending on what other data is available. Examples include:

Other location information. For example, if the code fragment also corresponds to a method, then the method identification should be given.
Other size information. For example, ELOC, number of tokens, or number of nodes in the AST.
Clone type.

All attribute values must have associated (and identified) provenance data.
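
The fields described above can be pictured as a simple record type. The following is a minimal sketch only, not the RCF schema: the class names (Location, Attribute, CandidatePair), the field names, the provenance_id string, and the 0 to 1 confidence scale are all assumptions made for illustration. Each code fragment is represented here by its location alone. The sketch enforces two rules from the text: a location needs a path plus a start and end line, and every attribute value must carry provenance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Location:
    """Minimum location data: a source file path and a line range."""
    path: str
    start_line: int
    end_line: int

    def __post_init__(self):
        if self.start_line < 1 or self.end_line < self.start_line:
            raise ValueError("expected 1 <= start_line <= end_line")


@dataclass(frozen=True)
class Attribute:
    """An optional attribute value plus the provenance backing it."""
    name: str             # e.g. "clone_type" or "token_count" (hypothetical names)
    value: object
    provenance_id: str    # hypothetical identifier for the provenance data

    def __post_init__(self):
        # Every attribute value must have identified provenance.
        if not self.provenance_id:
            raise ValueError(f"attribute {self.name!r} has no provenance")


@dataclass
class CandidatePair:
    """One clone candidate pair as described in the field list."""
    location_1: Location
    location_2: Location
    is_clone: bool            # the Clone/Not clone value
    provenance_summary: str   # summary of the evidence for is_clone
    confidence: float         # assumption: 0.0 (none) to 1.0 (certain)
    attributes: List[Attribute] = field(default_factory=list)


# Example with made-up paths and values.
pair = CandidatePair(
    location_1=Location("src/a/Foo.java", 10, 42),
    location_2=Location("src/b/Bar.java", 100, 131),
    is_clone=True,
    provenance_summary="reported by one detection tool, manually confirmed",
    confidence=0.8,
    attributes=[Attribute("clone_type", 2, "prov-001")],
)
```

Keeping the provenance summary on the pair itself, separate from the per-attribute provenance identifiers, mirrors the stated intent that the Master file can be distributed without the full provenance data.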