Qualitas Corpus Domain Model
For any corpus to be useful, it must be representative. If it contains
only particular kinds of things, or its contents come from a limited
source, or its contents are restricted in some similar way, then there is
a possibility of bias that will affect the validity of any conclusions drawn
from its use (specifically, threats to external validity). Ideally, the
corpus should contain a representative sample of its population, but in
reality this is impractical. This is acknowledged in fields such as
computational linguistics, which makes heavy use of corpora of language
use. Hunston observes that "The real question as regards representativeness
is how the balance of a corpus should be taken into account when
interpreting data from that corpus." [Hun2002] That
is the philosophy of the Qualitas Corpus.
To support understanding the balance, or representativeness, of the
Qualitas Corpus, there needs to be some way to characterise it. This is
the goal of the Qualitas Corpus Domain Model. It provides a set of
categories, and every entry in the corpus is classified into one of them.
Hopefully, looking at what is in each category gives a sense of what is
"in" the corpus.
The domain model described below is just a start. There are some issues
with it:
- It only contains categories in which there is at least one system. This
means any "gaps" (sensible categories for which there are no entries in
the corpus) are hard to see.
- The categories are somewhat subjective in nature. There is no operational
means to classify a given system.
- Some categories are perhaps a bit too broad, or not very coherent.
- It's not clear that it makes sense to use a single dimension of
categorisation.
But it is a start. As the corpus develops, hopefully some of these
issues will be resolved.
- 3D/graphics/media
-
Systems that provide some sort of media support, in particular graphics.
This should be compared with diagram/visualisation.
- IDE
-
Provides a tool that supports code development (particularly the
edit/execute cycle).
- SDK
-
Provides the base libraries for programming in a particular language.
- database
-
Provides some sort of database management.
- diagram/visualisation
-
Provides diagrams or visual presentation of some sort of data.
- games
-
Is either a game, or provides support for game development.
- middleware
-
Provides support for typical middleware services, such as transactions,
persistence, and concurrency.
- parsers/generators/make
-
Provides support for creating parsers or building systems.
- programming language
-
Provides a new programming language.
- testing
-
Provides support for automated testing.
- tool
-
Everything else.
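One way to get a sense of the balance of the corpus is to count the entries in each category. The following is a minimal sketch in Python, not part of the corpus distribution. The category names are taken from the list above; the function name `category_counts` and the system names in `example` are hypothetical placeholders, not actual corpus entries. Because the counts are reported against the full category list, categories with no entries show up as zero rather than disappearing.

```python
from collections import Counter

# Category names taken from the domain model above.
CATEGORIES = [
    "3D/graphics/media",
    "IDE",
    "SDK",
    "database",
    "diagram/visualisation",
    "games",
    "middleware",
    "parsers/generators/make",
    "programming language",
    "testing",
    "tool",
]


def category_counts(classification):
    """Count systems per category, keeping empty categories as zero.

    `classification` maps a system name to exactly one category name.
    """
    unknown = set(classification.values()) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(classification.values())
    return {category: counts.get(category, 0) for category in CATEGORIES}


# Hypothetical entries, for illustration only (not real corpus contents).
example = {
    "system-a": "IDE",
    "system-b": "testing",
    "system-c": "testing",
}

counts = category_counts(example)
for category, n in counts.items():
    print(f"{category:26} {n}")
```

Such a table only describes the distribution; how that balance should be taken into account when interpreting results is still left to whoever uses the corpus.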
References
- [Hun2002]
-
Susan Hunston, 'Corpora in Applied Linguistics', Cambridge University
Press, 2002.