Qualitas Corpus Metadata Attributes

This page defines all of the attributes for which metadata is provided in the Qualitas Corpus.

Attributes come in two flavours: some apply to systems, and some apply to sysvers (individual versions of a system).

System Attributes

system
The unique identifier for a unit of development that can be deployed.

status
An indication of the development status of the system.

description
A short description of the system.

systemnotes
Notes regarding the system, e.g. an explanation of its status.

sysvercount
Number of versions of the system in the release of the corpus.

Sysver Attributes

sysver
A unique identifier for the version of the system that is in the corpus. This identifier follows the naming conventions.

fullname
The full name of the system the version belongs to. This is often the same as the system attribute, but can differ when the full name is too awkward to use as the system identifier. It is a sysver-level rather than a system-level attribute in order to accommodate systems that change names.

distribution
The distribution in which the sysver appears: 'e' - evolution, 'r' - recent, 'f' - full.

domain
An indication of the purpose of the system.

jreversion
The earliest JRE version that the sysver depends on.

license
The license under which the system has been released. The format for this data is:
human readable license identifier ";" path to license text
An example is:
Eclipse Public License - v 1.0; src/eclipse/epl-v10.html
The human readable license identifier will typically be the title text of the license, or something similar that uniquely identifies it. Ideally, the path to license text is relative to the installation of the sysver. A relative path is preferred to (for example) a URL, which can become stale, so a URL is provided only when no license text is distributed. Details as to the reliability of this information are available.
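The format above can be read mechanically. The following is a minimal sketch, not part of the corpus tooling: the names LicenseInfo and parse_license are hypothetical, and it assumes that the identifier itself never contains a ";", so the value is split at the first one.

```python
from typing import NamedTuple, Optional


class LicenseInfo(NamedTuple):
    """Hypothetical container for the two parts of a license value."""
    identifier: str          # human readable license identifier
    location: Optional[str]  # path (or URL) to the license text, if given


def parse_license(value: str) -> LicenseInfo:
    """Split a license attribute value at the first ';'.

    Assumption: the identifier contains no ';' of its own.
    """
    identifier, sep, location = value.partition(";")
    location = location.strip()
    # Treat a missing or empty second part as "no location recorded".
    return LicenseInfo(identifier.strip(), location if sep and location else None)


info = parse_license("Eclipse Public License - v 1.0; src/eclipse/epl-v10.html")
print(info.identifier)  # Eclipse Public License - v 1.0
print(info.location)    # src/eclipse/epl-v10.html
```

Returning None for a missing location keeps the caller from mistaking an empty string for a path.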

loc(both)
The sum of the LOC values from the contents details metadata for all types that are in the source packages, and that appear in both src and bin.

This value has to apply to types for which there is source code (otherwise LOC has no meaning) and it doesn't seem sensible to include types for which there is source code but no binary, hence the "both" designation.

n_bin
The number of types found in the contents details metadata for all types that are in the source packages, and that appear in the bin. This is considered the definitive set of types for the system, as this is what is actually deployed.

This is one candidate for determining the number of types in the system. The reasoning for this choice is that types for which there is a compiled version are intended to be part of the system and so must be counted, even if no source is available (e.g. because the type is generated). Note that LOC/NCLOC measurements may not exist for all such types.

n_both
The number of types found in the contents details metadata for all types that are in the source packages, and that appear in both src and bin.

This is another way to measure the number of types in the system. The reasoning is that types for which there is source but no binary are probably infrastructural (e.g. testing), examples, or similar, while types for which there is a binary but no source are probably generated.

n_files
The number of source files found in the contents details metadata for all types that are in the source packages, and that appear in the src.

n_top(bin)
The number of top-level types found in the contents details metadata for all types that are in the source packages, and that appear in the bin.

This might not always match n_files if there are files with multiple top-level types declared (meaning some must be non-public).

ncloc(both)
The sum of the NCLOC values from the contents details metadata for all types that are in the source packages, and that appear in both src and bin.

This value has to apply to types for which there is source code (otherwise NCLOC has no meaning) and it doesn't seem sensible to include types for which there is source code but no binary, hence the "both" designation.
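The relationship between n_bin, n_both, loc(both) and ncloc(both) can be illustrated with a small sketch. The layout of the contents details metadata is not described on this page, so the TypeRecord fields below (name, in_src, in_bin, loc, ncloc) and the summarise function are hypothetical stand-ins, and the example records are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TypeRecord:
    """Hypothetical per-type record; not the real contents details format."""
    name: str                    # binary name of the type
    in_src: bool                 # source code for the type was found
    in_bin: bool                 # compiled form of the type was found
    loc: Optional[int] = None    # only meaningful when in_src is True
    ncloc: Optional[int] = None  # only meaningful when in_src is True


def summarise(records: Iterable[TypeRecord], sourcepackages: str) -> dict:
    """Compute the count attributes for types in the source packages."""
    prefixes = tuple(sourcepackages.split())
    system = [r for r in records if r.name.startswith(prefixes)]
    both = [r for r in system if r.in_src and r.in_bin]
    return {
        "n_bin": sum(1 for r in system if r.in_bin),
        "n_both": len(both),
        "loc(both)": sum(r.loc or 0 for r in both),
        "ncloc(both)": sum(r.ncloc or 0 for r in both),
    }


records = [
    TypeRecord("com.example.app.Main", True, True, loc=120, ncloc=90),
    TypeRecord("com.example.app.Generated", False, True),           # binary only
    TypeRecord("com.example.app.MainTest", True, False, 40, 30),    # source only
    TypeRecord("java.lang.String", False, True),                    # library type
]
print(summarise(records, "com.example."))
```

In this invented data set the binary-only type counts towards n_bin but not n_both, and the source-only test type contributes to neither, which mirrors the reasoning given for the two counts.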

releasedate
The release date of the sysver. Ideally, this date is determined from a notice in the distribution or on the system website, but sometimes it has to be guessed from the dates of files in the distribution. The versionnotes attribute should indicate how it was determined.

sourcepackages
Records the decision we have made regarding which types are part of the system (as opposed to, for example, third-party library types). The value is a space-separated list of package prefixes of Java types. Any type (class, interface, enum, annotation) whose binary name begins with one of the listed prefixes is considered a type that was developed for the system; everything else is considered a library type.

For example, the sourcepackages value for azureus-3.0.3.4 is "org.gudy. HTML. com.aelitis.", indicating that types such as com.aelitis.azureus.core.AzureusCore are considered part of that version of azureus, whereas java.lang.String is not.
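The prefix rule can be expressed in a few lines. This is a minimal sketch using the azureus-3.0.3.4 value quoted above; the function name is_system_type is hypothetical and is not taken from the corpus tooling.

```python
def is_system_type(binary_name: str, sourcepackages: str) -> bool:
    """Return True if binary_name starts with any listed package prefix."""
    # The attribute value is a space-separated list of prefixes.
    prefixes = tuple(sourcepackages.split())
    # str.startswith accepts a tuple and matches if any element is a prefix.
    return binary_name.startswith(prefixes)


azureus = "org.gudy. HTML. com.aelitis."
print(is_system_type("com.aelitis.azureus.core.AzureusCore", azureus))  # True
print(is_system_type("java.lang.String", azureus))                      # False
```

Because the listed prefixes end with a ".", a package such as com.aelitisfoo would not be matched by "com.aelitis." by accident.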

url
A web site for the version. Typically this is the system's home page, but because the home page may change over time, the URL is kept at the version level.

versionnotes
Version-specific notes, e.g. how the release date was determined.