Qualitas Corpus Clone Collection: mete-cmcd Data Description

This page describes the format of the raw data provided by the clone detector tool mete-cmcd.

The output from mete-cmcd is a plain text file with fields separated by tab characters where appropriate. The example below shows the elided contents of such a file. These files are meant to be reasonably self-contained: anyone familiar with what the files are for should be able to work out what everything means from the documentation included in the file itself. In case that is not true, here are more details.

There are four main sections: the Parameters, the dataset summary information, the Clone Pair information, and the Cluster information.
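
As a rough illustration of this layout, the sketch below splits the text of a data file into its sections. The heading strings are taken from the example output at the end of this page, where the summary section appears under the heading "Global Values". The function name split_sections and the "Preamble" key (for the lines before the first heading) are hypothetical names chosen for this sketch, not part of mete-cmcd.

```python
# Section headings as they appear in the example output on this page.
SECTION_HEADINGS = (
    "Parameters",
    "Global Values",
    "Clone Pair Information",
    "Cluster Information",
)


def split_sections(text):
    """Group the lines of a data file under the heading that precedes them."""
    sections = {"Preamble": []}  # hypothetical key for lines before any heading
    current = "Preamble"
    for line in text.splitlines():
        if line in SECTION_HEADINGS:
            current = line
            sections[current] = []
        else:
            sections[current].append(line)
    return sections


# A heavily shortened data file built from lines of the example output.
sample = "\n".join([
    "Data from clone analysis.",
    "Tool:\tmete-cmcd: 2013-01-29T1615",
    "Parameters",
    "Minimum AST nodes:\t50\tThe definition of 'small' (in AST nodes) for omitting small methods.",
    "Global Values",
    "Files:\t843\tNumber of files analysed",
    "Clone Pair Information",
    "Cluster\tFile1\tMethod1\tLocation1\tELOC1\tNodes1\tFile2\tMethod2\tLocation2\tELOC2\tNodes2\tDiff\tRawDiff",
    "Cluster Information",
    "Cluster\tPairs\tMethods\tELOC\tELOC(cloned)",
    "C89\t1\t2\t12\t6",
])

sections = split_sections(sample)
print(list(sections))
```

Matching headings by exact line content keeps the sketch simple; it assumes no data line consists solely of one of the heading strings.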

Parameters

mete-cmcd has several parameters that can affect different aspects of its operation, including what output is produced. The parameters are as follows. (Emphasised text appears in the data file.)

Fold Path A path prefix that is elided when displaying paths
  mete-cmcd gives absolute paths to the source files in the specification of code fragments. This means there is a lot of long, repeated information. The fold path parameter says what can be sensibly elided in the code fragment specifications.
Difference Threshold The largest value of the normalised CM difference that is considered a clone pair
  mete-cmcd computes a "difference" score between each pair of methods, where 0 means no difference (other than whitespace and comments) and larger scores mean more different. This parameter indicates the maximum difference score used in the data set to determine whether a candidate pair is a clone pair.
Minimum AST nodes The definition of 'small' (in AST nodes) for omitting small methods.
  mete-cmcd is implemented using an ANTLR parser, and does its analysis by walking the ASTs produced by the parser. Frequently, clone detectors do not report matches of "small" code fragments, as small fragments can often look very similar just due to their size (e.g., two fragments consisting of a declaration of an integer variable and its initialisation). mete-cmcd determines whether or not a code fragment is small by counting the nodes in its AST. This parameter indicates the smallest code fragment that is considered, according to this measure of size.
Size ratio threshold If the method size (measure as number of nodes in AST) ratio is more than this then not clones.
  If code fragments are of quite different sizes, then it is unlikely that one is a clone of another. This parameter value is the ratio used to determine whether or not to proceed with the difference computation.
Text difference threshold If the method texts differ by more than this ratio then not clones.
  The means by which mete-cmcd determines the difference score can produce a small score for what are clearly quite different methods. To avoid such false positives, a simple text comparison is done between the methods first. If the difference is greater than this parameter value, then the fragments are considered not clones.
Comments ignored Whether to include comments when comparing method text.
  When determining the text difference, this parameter indicates whether or not comments should be considered. Mainly this is useful for performance.
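
In the example output at the end of this page, each parameter occupies one line consisting of a name ending in a colon, a value, and the description, separated by tabs. The following is a minimal sketch of reading such lines, assuming that layout holds for every parameter line. The function name parse_labelled_line and the type conversions are choices made for this sketch.

```python
def parse_labelled_line(line):
    """Split 'Name:<TAB>value<TAB>description' into its three parts."""
    name, value, description = line.split("\t", 2)
    return name.rstrip(":"), value, description


# Parameter lines copied from the example output on this page.
parameter_lines = [
    "Fold Path:\t/opt/qualitas/QualitasCorpus-20120401/Systems/ant/ant-1.8.2/src\tA path prefix that is elided when displaying paths",
    "Difference Threshold:\t45\tThe largest value of the normalised CM difference that is considered a clone pair",
    "Minimum AST nodes:\t50\tThe definition of 'small' (in AST nodes) for omitting small methods.",
    "Comments ignored:\ttrue\tWhether to include comments when comparing method text.",
]

params = {}
for line in parameter_lines:
    name, value, _description = parse_labelled_line(line)
    params[name] = value

# Values are stored as text; convert them as needed (assumed types).
difference_threshold = float(params["Difference Threshold"])
minimum_ast_nodes = int(params["Minimum AST nodes"])
comments_ignored = params["Comments ignored"] == "true"

print(params["Fold Path"])
print(difference_threshold, minimum_ast_nodes, comments_ignored)
```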

Summary Information

This section provides data that applies to the whole dataset. In the data file it appears under the heading Global Values.

Sysver Identification of what was analysed (typically a corpus System Version)
  The code base being analysed, identified by its Corpus identifier.
Files Number of files analysed
  Not all files in the corpus are analysed. The ones that are analysed are those for which the contents.csv file indicates that the file contains source code that is considered to be developed for the system under analysis, and for which there are both source and binary (byte code) versions in the corpus.
Methods Number of methods, not counting constructors or methods that are too small.
  Since mete-cmcd identifies clone pairs at the method level of granularity, it is useful to know how many methods were considered.
ELOC (Methods only) ELOC for methods, not counting constructors or methods that are too small. ELOC is lines of code, not counting lines that are blank, contain only comments, or only braces.
  The sum of the ELOC of methods that were analysed.
Clone pairs Number of clone pairs
  What it says.
Clusters Number of clusters
  What it says.
Code clones ELOC of code that is in a clone pair (proportion of ELOC).
  The sum of the ELOC over all code fragments that appear in a clone pair (and, in parentheses, the proportion with respect to the ELOC (Methods only) measurement).
Cloned code Sum of ELOC for code in a cluster minus the size of smallest fragment, summed over all clusters. (proportion of ELOC)
  Sum of the ELOC(cloned) values over all clusters (and, in parentheses, the proportion with respect to the ELOC (Methods only) measurement).
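
The proportions can be checked against the other summary values. The sketch below parses the two "value (proportion)" entries from the example output at the end of this page and recomputes each proportion from ELOC (Methods only). The helper name parse_eloc_with_proportion is hypothetical, and rounding to two decimal places is an assumption based on how the example values are printed.

```python
import re


def parse_eloc_with_proportion(value):
    """Parse a value such as '10302 (0.21)' into (eloc, proportion)."""
    match = re.fullmatch(r"(\d+) \((\d+(?:\.\d+)?)\)", value)
    if match is None:
        raise ValueError(f"unexpected value: {value!r}")
    return int(match.group(1)), float(match.group(2))


# Values copied from the example output on this page.
methods_eloc = 49791  # ELOC (Methods only)
code_clones, code_clones_proportion = parse_eloc_with_proportion("10302 (0.21)")
cloned_code, cloned_code_proportion = parse_eloc_with_proportion("6571 (0.13)")

# Recompute the proportions; two decimal places is an assumed precision.
print(round(code_clones / methods_eloc, 2))  # 0.21
print(round(cloned_code / methods_eloc, 2))  # 0.13
```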

Clone Pair Information

The Clone Pair information section begins with a brief description of the fields (in emphasised text in the table below), then a line with the field names as headers for the columns (meaning the data file can be usefully loaded into a spreadsheet), followed by the clone pair information itself.
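
Because of the header line, the section can also be read with a standard tab-separated reader. The following sketch uses Python's csv module and assumes, based on the example output at the end of this page, that the field description lines all start with "#". The function name read_table is hypothetical, and only two of the description lines are included to keep the sample short.

```python
import csv

# Part of a Clone Pair Information section, copied from the example output.
section_lines = [
    "# 12. \tDiff\tThe normalised difference score between the two methods.",
    "# 13. \tRawDiff\tThe raw difference score between the two methods.",
    "Cluster\tFile1\tMethod1\tLocation1\tELOC1\tNodes1\tFile2\tMethod2\tLocation2\tELOC2\tNodes2\tDiff\tRawDiff",
    "C89\t/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java"
    "\torg.apache.tools.ant.AntClassLoader.forceLoadClass(String)\t(645,654)\t6\t61"
    "\t/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java"
    "\torg.apache.tools.ant.AntClassLoader.forceLoadSystemClass(String)\t(672,681)\t6\t61"
    "\t0.00\t0.00",
]


def read_table(lines):
    """Return one dict per data row, keyed by the names in the header line."""
    # Drop empty lines and the '#' field descriptions (assumed prefix).
    data_lines = [line for line in lines if line and not line.startswith("#")]
    # QUOTE_NONE: treat quote characters, if any occur, as ordinary text.
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_table(section_lines)
print(rows[0]["Cluster"], rows[0]["Method1"], rows[0]["Diff"])
```

All values come back as strings; converting numeric fields is left to the caller.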

To be considered a clone pair (that is, listed in this section), a pair of code fragments must meet the various conditions described above (see Parameters) and neither code fragment can be a Java constructor. Each line of the clone pair information describes one clone pair, and is divided into the following fields:

Cluster A unique ID identifying the cluster the clone pair belongs to.
  Each cluster has an ID that is unique within each file. Further information about the cluster a code fragment belongs to is given in the Cluster information below.
File1 The name of the file (sans foldpath prefix) containing the lexically first method in the clone pair.
  This value prefixed by the foldpath gives the absolute path to the source file containing the "lexically first" code fragment. While there is no inherent order to the fragments in a clone pair, it is useful for presentation purposes (e.g. sorting clone pairs) to have a well-defined order of fragments. The one chosen is the lexical ordering determined by the method name.
Method1 The name of the lexically first method in the clone pair.
  mete-cmcd works at the method granularity, that is, all code fragments it considers are methods. These methods are identified by the (Java) name of the method, the types of the parameters, and the fully-qualified name of the class the method is declared in. All of this information is given in this field.
Location1 The beginning and ending line numbers in the file where the lexically first method can be found.
  The line numbers in the source file that bound the "first" code fragment. These are the physical line numbers referring to the file exactly as it appears. No normalisation or transformation of any kind is assumed.
ELOC1 The number of lines of code in the lexically first method.
  One indication of code fragment size can be determined from its beginning and ending line numbers. However, this count includes such things as blank lines, and so may be misleading. The ELOC metric is a lines-of-code variant that does not count blank lines, lines consisting only of comments, or lines consisting only of braces.
Nodes1 The number of nodes in the AST for the the lexically first method.
  This is a size measurement based on the AST for the code fragment. In order for a pair of code fragments to be listed, this value has to be at least as large as the Minimum AST nodes parameter value.
File2 The name of the file (foldpath common prefix) containing the lexically second method in the clone pair.
  Same as for File1 but for the other code fragment.
Method2 The name of the lexically second method in the clone pair.
  Same as for Method1 but for the other code fragment.
Location2 The beginning and ending line numbers in the file where the lexically second method can be found.
  Same as for Location1 but for the other code fragment.
ELOC2 The number of lines of code in the lexically second method.
  Same as for ELOC1 but for the other code fragment.
Nodes2 The number of nodes in the AST for the the lexically second method.
  Same as for Nodes1 but for the other code fragment.
Diff The normalised difference score between the two methods.
  The difference score used to determine whether or not a pair of code fragments is a clone pair. That is, to be listed as a clone pair, this value must be no larger than the Difference Threshold parameter value.
RawDiff The raw difference score between the two methods.
  This is the difference score produced by the basic algorithm used by mete-cmcd. However, this value is sensitive to the size of the code fragment, so the actual difference score used is normalised by the code fragment size.
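
To make the field descriptions concrete, the sketch below takes one clone pair line from the example output at the end of this page and interprets some of its fields: it restores the absolute path of the first file from the Fold Path, converts the location to line numbers, and checks the listed values against the parameter values. The helper parse_location and the variable names are hypothetical, and the comparisons follow the parameter descriptions above as assumptions rather than the tool's actual code.

```python
FIELDS = ("Cluster File1 Method1 Location1 ELOC1 Nodes1 "
          "File2 Method2 Location2 ELOC2 Nodes2 Diff RawDiff").split()


def parse_location(text):
    """Turn a location such as '(645,654)' into the pair (645, 654)."""
    first, last = text.strip("()").split(",")
    return int(first), int(last)


# Clone pair line for cluster C110, copied from the example output.
line = "\t".join([
    "C110",
    "/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java",
    "org.apache.tools.ant.AntClassLoader.getCertificates(File, String)",
    "(1191,1202)", "9", "95",
    "/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java",
    "org.apache.tools.ant.AntClassLoader.getJarManifest(File)",
    "(1169,1178)", "7", "66",
    "31.29", "31.29",
])
row = dict(zip(FIELDS, line.split("\t")))

# Parameter values from the same example output.
fold_path = "/opt/qualitas/QualitasCorpus-20120401/Systems/ant/ant-1.8.2/src"
difference_threshold = 45
minimum_ast_nodes = 50

absolute_file1 = fold_path + row["File1"]      # File1 prefixed by the fold path
start, end = parse_location(row["Location1"])
physical_lines = end - start + 1               # includes blank and comment lines
eloc1 = int(row["ELOC1"])

large_enough = min(int(row["Nodes1"]), int(row["Nodes2"])) >= minimum_ast_nodes
within_threshold = float(row["Diff"]) <= difference_threshold

print(absolute_file1)
print(physical_lines, eloc1)           # 12 9
print(large_enough, within_threshold)  # True True
```

The first method spans 12 physical lines but has an ELOC of 9, which shows the gap between the two size measures.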

Cluster Information

This section provides summary information for the clusters. This information can be inferred from the clone pair information, but it is convenient to have it stated explicitly. It consists of a brief description of the fields, a header line with the field names, and then one tab-separated line for each cluster.

Cluster A unique ID identifying the cluster the clone pair belongs to.
  This ID is unique only within the dataset. It may match an ID in another dataset. It has no meaning other than to identify a cluster.
Pairs Number of clone pairs in cluster
  This is one indication of cluster size (that is, number of "edges").
Methods Number of distinct methods in cluster
  This is another indication of cluster size (number of "vertices").
ELOC Sum of ELOC for all methods in cluster
  This provides one indication of how much "cloned code" there is.
ELOC(cloned) ELOC for all but the smallest method
  If a clone pair was formed by one code fragment being copied (and then perhaps modified), then in any cluster there is a fragment that is the "original". And technically, the original fragment is not a clone. So the "cloned code" is all code in the cluster other than the original. Because there is no way for mete-cmcd to tell which is the original, but also because all code fragments are roughly the same size (due to the Size ratio threshold), a good indication of how much code has been cloned can be given by just picking one fragment in the cluster as a proxy for the original. In order to ensure the same answer is given every time, the smallest fragment is chosen.
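
Since the cluster information can be inferred from the clone pairs, the following sketch shows one way such a derivation might look. It uses three clone pairs from the example output at the end of this page and reproduces the corresponding cluster lines. The function name summarise_clusters and the tuple layout are hypothetical, and identifying a method by its name string alone is a simplifying assumption.

```python
from collections import defaultdict


def summarise_clusters(pairs):
    """Summarise clusters from (cluster, method1, eloc1, method2, eloc2) tuples."""
    pair_count = defaultdict(int)
    method_eloc = defaultdict(dict)  # cluster -> {method name: ELOC}
    for cluster, method1, eloc1, method2, eloc2 in pairs:
        pair_count[cluster] += 1
        method_eloc[cluster][method1] = eloc1
        method_eloc[cluster][method2] = eloc2
    summary = {}
    for cluster, methods in method_eloc.items():
        total = sum(methods.values())
        summary[cluster] = {
            "Pairs": pair_count[cluster],
            "Methods": len(methods),
            "ELOC": total,
            # The smallest fragment stands in for the "original".
            "ELOC(cloned)": total - min(methods.values()),
        }
    return summary


# Cluster, Method1, ELOC1, Method2, ELOC2 taken from the example output.
prefix = "org.apache.tools.ant.AntClassLoader."
pairs = [
    ("C89", prefix + "forceLoadClass(String)", 6,
     prefix + "forceLoadSystemClass(String)", 6),
    ("C110", prefix + "getCertificates(File, String)", 9,
     prefix + "getJarManifest(File)", 7),
    ("C87", prefix + "getResource(String)", 23,
     prefix + "getResourceAsStream(String)", 23),
]

for cluster, values in summarise_clusters(pairs).items():
    print(cluster, values)
```

For C110 this gives an ELOC of 16 and an ELOC(cloned) of 9, matching the cluster line in the example output.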

Example output

Data from clone analysis.
Tool:	mete-cmcd: 2013-01-29T1615
Timestamp:	Wed Jan 30 10:56:32 NZDT 2013
Parameters
Fold Path:	/opt/qualitas/QualitasCorpus-20120401/Systems/ant/ant-1.8.2/src	A path prefix that is elided when displaying paths
Difference Threshold:	45	The largest value of the normalised CM difference that is considered a clone pair
Minimum AST nodes:	50	The definition of 'small' (in AST nodes) for omitting small methods.
Text difference threshold:	0.5	If the method texts differ by more than this ratio then not clones.
Size ratio threshold:	0.65	If the method size (measure as number of nodes in AST) ratio is more than this then not clones.
Comments ignored:	true	Whether to include comments when comparing method text.
Global Values
Sysver:	ant-1.8.2	Identification of what was analysed (typically a corpus System Version)
Files:	843	Number of files analysed
Methods:	2974	Number of methods, not counting constructors or methods that are too small.
ELOC (Methods only):	49791	ELOC for methods, not counting constructors or methods that are too small. ELOC is lines of code, not counting lines that are blank, contain only comments, or only braces.
Clone pairs:	963	Number of clone pairs
Clusters:	299	Number of clusters
Code clones:	10302 (0.21)	ELOC of code that is in a clone pair (proportion of ELOC).
Cloned code:	6571 (0.13)	ELOC of code in clone pair minus size of smallest fragment (proportion of ELOC)
Clone Pair Information
# 1. 	Cluster	A unique ID identifying the cluster the clone pair belongs to.
# 2. 	File1	The name of the file (sans foldpath prefix) containing the lexically first method in the clone pair.
# 3. 	Method1	The name of the lexically first method in the clone pair.
# 4. 	Location1	The beginning and ending line numbers in the file where the lexically first method can be found.
# 5. 	ELOC1	The number of lines of code in the lexically first method.
# 6. 	Nodes1	The number of nodes in the AST for the the lexically first method.
# 7. 	File2	The name of the file (foldpath common prefix) containing the lexically second method in the clone pair.
# 8. 	Method2	The name of the lexically second method in the clone pair.
# 9. 	Location2	The beginning and ending line numbers in the file where the lexically second method can be found.
# 10. 	ELOC2	The number of lines of code in the lexically second method.
# 11. 	Nodes2	The number of nodes in the AST for the the lexically second method.
# 12. 	Diff	The normalised difference score between the two methods.
# 13. 	RawDiff	The raw difference score between the two methods.
Cluster	File1	Method1	Location1	ELOC1	Nodes1	File2	Method2	Location2	ELOC2	Nodes2	Diff	RawDiff
C89	/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java	org.apache.tools.ant.AntClassLoader.forceLoadClass(String)	(645,654)	6	61	/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java	org.apache.tools.ant.AntClassLoader.forceLoadSystemClass(String)	(672,681)	6	61	0.00	0.00
C110	/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java	org.apache.tools.ant.AntClassLoader.getCertificates(File, String)	(1191,1202)	9	95	/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java	org.apache.tools.ant.AntClassLoader.getJarManifest(File)	(1169,1178)	7	66	31.29	31.29
C87	/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java	org.apache.tools.ant.AntClassLoader.getResource(String)	(868,904)	23	256	/apache-ant-1.8.2/src/main/org/apache/tools/ant/AntClassLoader.java	org.apache.tools.ant.AntClassLoader.getResourceAsStream(String)	(692,722)	23	183	30.01	30.01
...
C1	/apache-ant-1.8.2/src/main/org/apache/tools/ant/types/resources/selectors/ResourceSelectorContainer.java	org.apache.tools.ant.types.resources.selectors.ResourceSelectorContainer.dieOnCircularReference(Stack, Project)	(110,126)	12	114	/apache-ant-1.8.2/src/main/org/apache/tools/ant/types/selectors/BaseSelectorContainer.java	org.apache.tools.ant.types.selectors.BaseSelectorContainer.dieOnCircularReference(Stack, Project)	(331,347)	12	115	0.00	0.00
C1	/apache-ant-1.8.2/src/main/org/apache/tools/ant/types/selectors/AbstractSelectorContainer.java	org.apache.tools.ant.types.selectors.AbstractSelectorContainer.dieOnCircularReference(Stack, Project)	(325,340)	11	113	/apache-ant-1.8.2/src/main/org/apache/tools/ant/types/selectors/BaseSelectorContainer.java	org.apache.tools.ant.types.selectors.BaseSelectorContainer.dieOnCircularReference(Stack, Project)	(331,347)	12	115	0.00	0.00
Cluster Information
# 1. 	Cluster	A unique ID identifying the cluster the clone pair belongs to.
# 2. 	Pairs	Number of clone pairs in cluster
# 3. 	Methods	Number of distinct methods in cluster
# 4. 	ELOC	Sum of ELOC for all methods in cluster
# 5. 	ELOC(cloned)	ELOC for all but the smallest method
Cluster	Pairs	Methods	ELOC	ELOC(cloned)
C110	1	2	16	9
C89	1	2	12	6
C87	1	2	46	23
...
C1	161	27	300	293