Qualitas Corpus Domain Model

In order for any corpus to be useful, it must be representative. If it contains only particular kinds of things, draws them from a limited source, or is restricted in some similar way, then bias may affect the validity of any conclusions drawn from its use (specifically, it poses a threat to external validity). Ideally, the corpus should contain a representative sample of its population, but in reality this is impractical. This is acknowledged in fields such as computational linguistics, which makes heavy use of corpora of language use. Hunston observes that "The real question as regards representativeness is how the balance of a corpus should be taken into account when interpreting data from that corpus." [Hun2002] That is the philosophy of the Qualitas Corpus.

To help users understand the balance of the Qualitas Corpus, there needs to be some way to characterise its representativeness. This is the goal of the Qualitas Corpus Domain Model. It provides a set of categories, and every entry in the corpus is classified into one of them. Hopefully, looking at what is in each category gives a sense of what is "in" the corpus.
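As a rough illustration of how a classification like this can be used to inspect balance, the sketch below tallies the share of entries in each category. The entry names and the "other" label are hypothetical placeholders, not the actual contents or category names of the corpus, and `category_balance` is a name invented for this example.

```python
from collections import Counter

# Hypothetical classification: entry name -> domain category.
# The entry names and the "other" label are placeholders, not real corpus data.
entries = {
    "system-a": "programming language",
    "system-b": "diagram/visualisation",
    "system-c": "programming language",
    "system-d": "other",
}


def category_balance(classified):
    """Return the fraction of entries that falls into each category."""
    counts = Counter(classified.values())
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


balance = category_balance(entries)
for category, share in sorted(balance.items()):
    print(f"{category}: {share:.0%}")
```

A heavily skewed distribution would suggest that results obtained from the corpus should be interpreted with that skew in mind, which is the point of Hunston's remark.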

The domain model described below is just a start. There are some issues with it:

But it is a start. As the corpus develops, hopefully some of these issues will be resolved.
Systems that provide some sort of media support, in particular graphics. This should be compared with diagram/visualisation.
Provides a tool that supports code development (particularly the edit/execute cycle).
Provides the base libraries for programming in a particular language.
Provides some sort of database management.
Provides diagrams or visual presentation of some sort of data.
Is either a game, or provides support for game development.
Provides support for typically middleware services, such as transactions, persistence, concurrency.
Provides support for creating parsers or building systems.
programming language
Provides a new programming language.
Provides support for automated support.
Everything else.


[Hun2002] Susan Hunston. Corpora in Applied Linguistics. Cambridge University Press, 2002.