Criteria for inclusion in Qualitas Corpus

Currently, the criteria for a system to be included in a release of the corpus are:

In the previous release
Written in Java
Distributes both source and binary forms
Distributes binary forms as a set of jar files
Available to anyone independent of the corpus
Most recent version is final
Identifiable contents

Eventually we hope that all of these may be relaxed (except perhaps the last one!). The rationale for each criterion is given below.

In the previous release
We do not want to remove things from a release that were in a previous release. This allows people to have the latest release and yet still be able to reproduce studies based on previous releases. While we intend to continue to distribute previous releases, we assume most people would prefer not to have to juggle multiple versions of the corpus.
This isn't to say that a new release will be a complete superset of all previous releases. If there are errors in a previous release (e.g. missing or wrong meta-data, mis-named systems or versions, problems with installation) then we will fix them, while providing enough information to allow people to determine how much the changes may affect attempts to reproduce previous studies.
Written in Java
The choice of Java is due to both the amount of open source code available (far more than C# at the moment, although perhaps not as much as C++) and the relative ease with which it can be analysed (unlike, for example, C++). Should the opportunity arise, other languages will be added, but doing so is not a priority at the moment.
Distributes both source and binary forms
One advantage of Java is that its "compiled" form is also fairly easy to analyse, easier in fact than the source code. However, there are slight differences between the source and binary forms. Having both forms means that analysis results from the binary form can be manually checked against the source.
In order for it to make sense to have both source and binary forms, the binary form must really be the binary form of the source. It is expensive (in time) to download source and then compile it, as every project has a different build technology (e.g. Ant, bat files, Eclipse infrastructure) that takes significant effort to understand. We have made the decision to simply take what is distributed by the developers, and assume that the binary form is built from the source that is distributed. For this reason, we only include systems that do actually distribute both forms in a clearly identifiable way.
This rules out, for example, systems whose source is only available through a source control system. While in theory it should be possible to extract the source relevant to a given binary release, being confident that we can extract exactly the right version of each file is sufficiently hard that we just avoid the problem at the moment. In the future we hope to relax this, at least for systems where the relevant source version is clearly labelled.
This also rules out, for example, JHotDraw from version 7 onwards, which only distributes the source. (It's more complicated than that - the problem is there are compiled forms but they aren't organised in a useful fashion.)
This criterion is in fact quite difficult to meet. It is not uncommon that the compiled form contains something for which there is no source (e.g., not all of the JRE is available in source form), and very common that source is distributed that does not appear in the compiled form (test code, examples, and so on). We just hope that the differences aren't so great as to invalidate any studies, and are working towards providing good documentation as to what the differences are so that people can judge for themselves how valid their results are.
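To give a feel for the kind of source/binary mismatch described above, here is a minimal sketch in Java. It is not the corpus tooling: the class SourceBinaryDiff and its methods are hypothetical names introduced for illustration. It assumes that source paths are already relative to the source root, that each source file declares one top-level type named after the file, and that a '$' in a class file name marks a nested type (which is a convention, not a guarantee).

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the corpus tools.
final class SourceBinaryDiff {

    // "org/example/Foo.java" -> "org/example/Foo"; null if not a Java source file.
    static String typeNameFromSource(String path) {
        if (!path.endsWith(".java")) return null;
        return path.substring(0, path.length() - ".java".length());
    }

    // "org/example/Foo$Bar.class" -> "org/example/Foo"; null if not a class file.
    // Assumes '$' separates a top-level type from its nested types.
    static String typeNameFromClassEntry(String entry) {
        if (!entry.endsWith(".class")) return null;
        String name = entry.substring(0, entry.length() - ".class".length());
        int slash = name.lastIndexOf('/');
        int dollar = name.indexOf('$', slash + 1);
        return dollar <= slash + 1 ? name : name.substring(0, dollar);
    }

    private static Set<String> sourceTypes(Collection<String> sourcePaths) {
        Set<String> types = new TreeSet<>();
        for (String path : sourcePaths) {
            String type = typeNameFromSource(path);
            if (type != null) types.add(type);
        }
        return types;
    }

    private static Set<String> binaryTypes(Collection<String> jarEntries) {
        Set<String> types = new TreeSet<>();
        for (String entry : jarEntries) {
            String type = typeNameFromClassEntry(entry);
            if (type != null) types.add(type);
        }
        return types;
    }

    // Types with source but no compiled form (typically tests and examples).
    static Set<String> sourceOnly(Collection<String> sourcePaths, Collection<String> jarEntries) {
        Set<String> result = sourceTypes(sourcePaths);
        result.removeAll(binaryTypes(jarEntries));
        return result;
    }

    // Types with a compiled form but no distributed source.
    static Set<String> binaryOnly(Collection<String> sourcePaths, Collection<String> jarEntries) {
        Set<String> result = binaryTypes(jarEntries);
        result.removeAll(sourceTypes(sourcePaths));
        return result;
    }
}
```

Given the file listing of a source distribution and the entry names of its jar files, sourceOnly and binaryOnly return the two kinds of difference mentioned above. Non-class entries such as manifests are ignored on both sides.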
Distributes binary forms as a set of jar files
The binary form of systems included in the corpus must be bundled as .jar files, that is, not .war, .ear, etc., and not unbundled .class files. This is solely due to the expectations of our tools for managing the corpus and doing analysis using the corpus. This criterion has already been relaxed in that a system not meeting it will not be rejected out of hand, but failing to meet it will still lower the chance of being included. We expect this criterion will eventually go away.
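The packaging rule above can be sketched as a simple check over the file names in a binary distribution. This is an illustrative sketch only: BinaryPackaging, Kind, classify and meetsJarCriterion are hypothetical names, and the check recognises only the .war and .ear alternatives named above, judging by file extension rather than by file contents.

```java
import java.util.Collection;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the corpus tools.
final class BinaryPackaging {

    enum Kind { JAR, OTHER_ARCHIVE, LOOSE_CLASS, NOT_BINARY }

    // Classifies a file by extension alone (an assumption; contents are not inspected).
    static Kind classify(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".jar")) return Kind.JAR;
        if (lower.endsWith(".war") || lower.endsWith(".ear")) return Kind.OTHER_ARCHIVE;
        if (lower.endsWith(".class")) return Kind.LOOSE_CLASS;
        return Kind.NOT_BINARY;
    }

    // True when the distribution contains at least one jar file and
    // no .war/.ear archives or unbundled .class files.
    static boolean meetsJarCriterion(Collection<String> fileNames) {
        boolean sawJar = false;
        for (String fileName : fileNames) {
            Kind kind = classify(fileName);
            if (kind == Kind.OTHER_ARCHIVE || kind == Kind.LOOSE_CLASS) return false;
            if (kind == Kind.JAR) sawJar = true;
        }
        return sawJar;
    }
}
```

Since the criterion is applied leniently, a result of false from meetsJarCriterion would be a reason to look more closely at a system rather than an automatic rejection.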
Available to anyone independent of the corpus
This criterion is intended to avoid ephemeral systems that crop up from time to time, and systems that are known only to us and cannot be acquired by other researchers. It allows others to independently check the decisions we have made.
This is the hardest one to meet, as we cannot be sure when development will stop on some system. Some systems we used (and analysed) before the first external release of the corpus have suffered this fate, and so are not in the corpus. In fact we already have the situation where the version of a system we have in the corpus is now apparently no longer available, as the developers only appear to keep (or at least make available) the most recent versions. Due to the first criterion, we have chosen to keep these versions, even though they no longer meet this criterion.
Most recent version is final
We want versions in the corpus that are somehow representative of their development. Beta versions, release candidates, and similar "transitional" versions may not be representative in some way, so we have chosen to only include releases considered "final" for the system.
For systems where there are multiple versions, it might be useful to have non-final releases for doing research on software evolution. To accommodate this, we include non-final versions, but only up to the most recent final release. This will mean that anyone doing studies on the "r" distribution of the corpus will only be dealing with final releases.
If there is some doubt as to whether or not a version is official, then we include it (and document any uncertainty under version notes).
Because this criterion is new as of 2010, not all existing entries in the corpus meet it, but they have been kept.
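The rule for systems with multiple versions (non-final versions are kept, but only up to the most recent final release) can be sketched as follows. This is a minimal illustration, not how the corpus is actually assembled: VersionSelection and upToMostRecentFinal are hypothetical names, versions are assumed to be listed oldest first, and the decision about which versions count as final is left to a caller-supplied predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not part of the corpus tools.
final class VersionSelection {

    // Keeps every version, final or not, up to and including the most
    // recent final one, and drops any non-final versions after it.
    // Returns an empty list when no version is final.
    static List<String> upToMostRecentFinal(List<String> oldestFirst, Predicate<String> isFinal) {
        int lastFinal = -1;
        for (int i = 0; i < oldestFirst.size(); i++) {
            if (isFinal.test(oldestFirst.get(i))) lastFinal = i;
        }
        return new ArrayList<>(oldestFirst.subList(0, lastFinal + 1));
    }
}
```

With this selection, the last version kept for a system is always a final release, while earlier betas and release candidates remain available for evolution studies.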
Identifiable contents
It is not always easy to determine what the contents of a system are. If there is uncertainty regarding the contents of a system, we do not include it.
For example, the binary form of NetBeans has 400+ jar files. Trying to determine what is relevant and what is not has proven to be a challenge, which delayed its inclusion until fairly recently. Other systems (e.g., OpenOffice) still have not been included for this reason.

These criteria were developed to simplify some aspects of the management of the corpus. Eventually we hope some of them will be relaxed (e.g. choice of language and distribution in jar files).