What is a "System"?
Measurement is about assigning numbers or symbols to attributes of
entities, so when doing measurement we must be clear as to what the
attribute we are measuring is, and what the entities are. In the case of the
Qualitas Corpus the entities are "systems", so we must be clear what we mean
by "system".
The kinds of empirical studies the Qualitas Research Group carry out
are intended to help us understand how software engineers create
code and the relationship between the code structure and quality
attributes such as modifiability, reusability, maintainability, and
testability. We would like to understand what decisions developers
have made when writing the code.
Many systems require third-party software. Should such software be
considered part of the system? Given that such software is usually not
under the control of the developers of the system, including it in the
analysis would be misleading in terms of understanding what decisions have
been made by developers for that system.
In theory we can just look at what's distributed and determine from that
what is and what is not the system code. However, there is no common format
for organising distributions, and consequently it has sometimes proved
difficult to answer this question.
The kinds of issues we have faced in identifying what constitutes a
system's code include:
-
For the compiled form of the system, some systems
distribute the third-party code with it, and some don't. The former
cases need to be identified, and some means of distinguishing
third-party code from system code is needed.
-
It is not always obvious what is third-party code.
Many systems are distributed as a single jar file. But
sometimes these actually contain third-party systems (they've
been unpacked and then jar'd up with the system), so
we can't just rely on what's in the jar file.
-
Some systems are distributed as several jar files
and it is not always
easy to figure out which jar files are in the system and which
aren't.
(E.g. NetBeans has over 400 jar files!)
-
Some systems are careful to identify what third-party systems are included
in the distribution (Eclipse, for example). However, this is usually
in a simple text document that must be processed by a human, and so some
judgement is still needed.
-
Sometimes the third-party content has been modified from the original, from
simply placing the classes in different packages to significant changes to
the actual code (Eclipse again).
-
Often the code that really belongs to a system can be identified by the
packages (often one top-level package) its classes belong to. However,
there are cases where systems have most of their classes in one top-level
package, but a small number in completely different packages (possibly
containing third-party code adapted from somewhere else).
-
Sometimes what is distributed includes classes in what appear to
be third-party packages, but which are actually system-specific
implementations of third-party classes, or which for some other reason
are considered by the developers to belong to those packages
(packages such as javax).
-
Sometimes uncertainty about what's distributed in the binary form can be
cleared up by comparing with what's in the source distribution. Classes in
what appear to be third-party packages that don't come with source can
probably be considered to not be in the system, for example.
-
Using the source that is distributed doesn't always help. Some
systems provide source of classes that do not appear in
the compiled form. Are these in the system or not?
-
The most common case of getting extra classes in source form is
when those classes are test classes. It seems reasonable to expect
that the decisions made for these classes are not representative
of the overall system so we probably do not want to
-
Some systems come with installers, example uses and demo systems, and
other "extra" code that is intended to be used by those who install the
system (unlike test code for example). Should this kind of code be
considered part of the system? On the one hand it is code under control
of the developers, but on the other hand it seems possible that such code is
not representative of the decision making process of the "main" code.
-
Some binary distributions of some systems contain references to
classes that are not included in the distribution.
-
Some distributions contain multiple implementations of the same
type (such as stubs for testing or platform specific implementations).
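The comparison between binary and source distributions mentioned in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `compare_binary_and_source` is hypothetical, both inputs are assumed to be lists of fully qualified class names that have already been extracted, and nested classes (which compile to separate class files) are not handled.

```python
def compare_binary_and_source(binary_classes, source_classes):
    """Partition class names by where they appear.

    Hypothetical helper for illustration. Both arguments are iterables of
    fully qualified class names, e.g. "org.example.Main".
    """
    binary = set(binary_classes)
    source = set(source_classes)
    return {
        # Compiled but no source: possibly third-party code.
        "binary_only": sorted(binary - source),
        # Source but not compiled: possibly tests, demos, or other extras.
        "source_only": sorted(source - binary),
        # Present in both forms: the most likely system code.
        "both": sorted(binary & source),
    }
```

The two "only" groups are where human judgement is still needed; the sketch merely narrows down which classes to look at.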
Ideally we would have the exact specification as to what
the developers consider to be "in" the system (assuming
there is agreement amongst them!). However, it is a very
time-consuming process to get such information, and so for
the moment we have made our best guess following the principles
described below.
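As a rough illustration of the package-based heuristic discussed above, the following Python sketch counts the classes in a jar file by package prefix, which can help spot a small number of classes sitting outside the main package. The function name `classes_by_package_prefix` and the `depth` parameter are assumptions made for this example, not part of the corpus tooling.

```python
import zipfile
from collections import Counter


def classes_by_package_prefix(jar_path, depth=2):
    """Count .class entries in a jar, keyed by a package prefix.

    Hypothetical helper for illustration. A jar is an ordinary zip archive,
    so the standard zipfile module can read it. `depth` is the number of
    leading package components to keep (2 turns org/example/util into
    "org.example").
    """
    counts = Counter()
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue  # skip manifests, resources, and directory entries
            package_parts = name.split("/")[:-1]
            # Classes at the archive root are in the default package.
            prefix = ".".join(package_parts[:depth]) or "(default)"
            counts[prefix] += 1
    return counts
```

A prefix with only a handful of classes, next to one holding the bulk of them, is a candidate for closer inspection rather than proof of third-party code.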
Principles for identifying system contents
The two main principles we have used in making decisions about what
is in a system and what is not are:
- Do not include something in a given system if it could
also appear in some other system in the corpus. This will
avoid (or at least reduce) double-counting of code measurements
that are done over the entire corpus.
- Make some decision about what is in a system and
document it. This means that even if the decision is not
necessarily the best, others trying to reproduce a given analysis will know
what actually was analysed.
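The first principle can be checked mechanically, at least in part. The Python sketch below reports class names that occur in more than one system; the function name `classes_in_multiple_systems` and the shape of its input (a mapping from system name to fully qualified class names) are assumptions made for illustration.

```python
from collections import defaultdict


def classes_in_multiple_systems(system_classes):
    """Map each class name found in more than one system to those systems.

    `system_classes` is a hypothetical mapping from a system name to an
    iterable of fully qualified class names found in its distribution.
    """
    owners = defaultdict(set)
    for system, classes in system_classes.items():
        for cls in classes:
            owners[cls].add(system)
    # Keep only the names that would be double-counted across the corpus.
    return {cls: sorted(systems)
            for cls, systems in owners.items() if len(systems) > 1}
```

Matching on names alone would miss third-party classes that have been moved to different packages, so an empty result does not guarantee that there is no double-counting.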
The decision we have made regarding what is in a system is recorded in
the sourcepackages attribute.
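The format of this attribute is not described here. Assuming, purely for illustration, that it holds a whitespace-separated list of package prefixes, a recorded decision could be applied to a class name along these lines (the function name `in_system` is hypothetical):

```python
def in_system(class_name, source_packages):
    """Return True if class_name falls under one of the package prefixes.

    ASSUMPTION: `source_packages` is a whitespace-separated string of
    package prefixes, e.g. "org.example com.acme". This format is a guess
    made for the sketch, not a documented one.
    """
    prefixes = source_packages.split()
    # Append "." so that "org.example" does not match "org.examples.X".
    return any(class_name.startswith(prefix + ".") for prefix in prefixes)
```

For example, `in_system("org.example.util.B", "org.example com.acme")` is true, while `in_system("javax.xml.C", "org.example")` is false.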