What is a "System"?

Measurement is about assigning numbers or symbols to attributes of entities, so when doing measurement we must be clear about which attribute we are measuring and what the entities are. In the case of the Qualitas Corpus the entities are "systems", so we must be clear what we mean by "system".

The kinds of empirical studies the Qualitas Research Group carry out are intended to help us understand how software engineers create code and the relationship between the code structure and quality attributes such as modifiability, reusability, maintainability, and testability. We would like to understand what decisions developers have made when writing the code.

Many systems require third-party software. Should such software be considered part of the system? Given that such software is usually not under the control of the developers of the system, including it in the analysis would be misleading in terms of understanding what decisions have been made by developers for that system.

In theory we can just look at what's distributed and determine from that what is and what is not the system code. However, there is no common format for organising distributions, and consequently it has sometimes proved difficult to answer this question.

The kinds of issues we have faced in identifying what constitutes a system's code include:

For the compiled form of the system, some systems distribute the third-party code with it, and some don't. The former cases need to be identified, and some means of distinguishing third-party code from system code is needed.
It is not always obvious what is third-party code. Many systems are distributed as a single jar file. But sometimes these actually contain third-party systems (they've been unpacked and then jar'd up with the system), so we can't just rely on what's in the jar file.
Some systems are distributed as several jar files and it is not always easy to figure out which jar files are in the system and which aren't. (E.g. netbeans has over 400 jar files!)
Some systems are careful to identify what third-party systems are included in the distribution (eclipse for example). However, this is usually in a simple text document that must be processed by a human, and so some judgement is still needed.
Sometimes the third-party content has been modified from the original, from simply placing the classes in different packages to significant changes to the actual code (eclipse again).
Often the code that really belongs to a system can be identified by the packages (often one top-level package) its classes belong to. However, there are cases where systems have most of their classes in one top-level package, but a small number in completely different packages (possibly containing third-party code adapted from somewhere else).
Sometimes what is distributed includes classes in what appear to be third-party packages, but are actually system-specific implementations of third-party classes, or for some other reason are considered by the developers to belong to those packages (packages such as javax).
Sometimes uncertainty about what's distributed in the binary form can be cleared up by comparing with what's in the source distribution. Classes in what appear to be third-party packages that don't come with source can probably be considered to not be in the system, for example.
Using the source that is distributed doesn't always help. Some systems provide source of classes that do not appear in the compiled form. Are these in the system or not?
The most common case of getting extra classes in source form is when those classes are test classes. It seems reasonable to expect that the decisions made for these classes are not representative of the overall system, so we probably do not want to include them.
Some systems come with installers, example uses and demo systems, and other "extra" code that is intended to be used by those who install the system (unlike test code for example). Should this kind of code be considered part of the system? On the one hand it is code under control of the developers, but on the other hand it seems possible that such code is not representative of the decision-making process of the "main" code.
Some binary distributions of some systems contain references to classes that are not included in the distribution.
Some distributions contain multiple implementations of the same type (such as stubs for testing or platform-specific implementations).
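
Two of the checks mentioned above, identifying system code by package prefix and comparing the binary distribution with the source distribution, can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build the corpus. The function names, the org.example prefix, and the entry names are hypothetical, and the source comparison assumes source paths are given relative to the source root and that each top-level class has its own .java file, which is not always true.

```python
import zipfile


def class_names(entry_names):
    """Convert jar entry paths such as 'org/example/Foo.class' to 'org.example.Foo'."""
    return [name[:-len(".class")].replace("/", ".")
            for name in entry_names if name.endswith(".class")]


def jar_class_names(jar_path):
    """Read the class names from a jar file (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return class_names(jar.namelist())


def split_by_prefix(names, system_prefixes):
    """Partition class names into (system, other) using package prefixes."""
    system, other = [], []
    for name in names:
        # Append '.' so that 'org.example' does not match 'org.examples.Foo'.
        in_system = any(name.startswith(p + ".") for p in system_prefixes)
        (system if in_system else other).append(name)
    return system, other


def without_source(binary_names, source_entries):
    """Return top-level binary classes that have no matching .java entry.

    Assumes source_entries are paths relative to the source root,
    for example 'org/example/Foo.java' (an assumption of this sketch).
    """
    sources = {entry[:-len(".java")].replace("/", ".")
               for entry in source_entries if entry.endswith(".java")}
    # Inner classes (Foo$Bar) are compiled from the file of their outer class.
    outer = {name.split("$", 1)[0] for name in binary_names}
    return sorted(outer - sources)


# Hypothetical jar contents, used only for illustration.
entries = [
    "META-INF/MANIFEST.MF",
    "org/example/app/Main.class",
    "org/example/app/Main$Inner.class",
    "org/apache/commons/lang/StringUtils.class",
]
names = class_names(entries)
system, other = split_by_prefix(names, ["org.example"])
print("system:", system)
print("other:", other)
print("no source:", without_source(names, ["org/example/app/Main.java"]))
```

Neither check settles the question on its own. A class outside the expected prefix may still belong to the system, and a class without source may simply have been left out of the source distribution, so the output is best treated as a list of candidates for human judgement.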

Ideally we would have the exact specification as to what the developers consider to be "in" the system (assuming there is agreement amongst them!). However, it is a very time-consuming process to get such information, and so for the moment we have made our best guess following the principles described below.

Principles for identifying system contents

The two main principles we have used in making decisions about what is in a system and what is not are:

Do not include something in a given system if it could also appear in some other system in the corpus. This will avoid (or at least reduce) double-counting of code measurements that are done over the entire corpus.
Make some decision about what is in a system and document it. This means that even if the decision is not necessarily the best, others trying to reproduce a given analysis will know what actually was analysed.

The decision we have made regarding what is in a system is recorded in the sourcepackages attribute.
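
The first principle can be checked mechanically once the package prefixes of each system are written down. The following is a minimal sketch, assuming the prefixes for each system are available as a whitespace-separated string; that format is an assumption of this example and not a statement about how the sourcepackages attribute is stored. The system names and prefixes are hypothetical.

```python
from itertools import combinations


def parse_prefixes(value):
    """Split a whitespace-separated string of package prefixes into a list."""
    return value.split()


def prefixes_overlap(a, b):
    """True if the prefixes are equal or one is a package-level prefix of the other."""
    return a == b or a.startswith(b + ".") or b.startswith(a + ".")


def find_overlaps(systems):
    """Report prefix pairs from different systems that could cover the same classes.

    systems maps a system name to its list of package prefixes.
    Returns a list of (system1, system2, prefix1, prefix2) tuples.
    """
    overlaps = []
    for (s1, p1s), (s2, p2s) in combinations(sorted(systems.items()), 2):
        for p1 in p1s:
            for p2 in p2s:
                if prefixes_overlap(p1, p2):
                    overlaps.append((s1, s2, p1, p2))
    return overlaps


# Hypothetical systems and prefixes, used only for illustration.
systems = {
    "alpha": parse_prefixes("org.alpha"),
    "beta": parse_prefixes("org.beta org.alpha.util"),
    "gamma": parse_prefixes("com.gamma"),
}
for overlap in find_overlaps(systems):
    print(overlap)
```

An overlap reported this way signals possible double-counting in corpus-wide measurements. It does not show which of the two systems the code should be attributed to.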