Search and Explore a 500 MW reference corpus
One of the biggest challenges facing language researchers is the lack of access to high-quality data. To address this problem for the Dutch language community, the SoNaR corpus was constructed. SoNaR contains over 500 million words of modern Dutch text across a variety of genres and sources. All texts are enriched with linguistic information, including lemmas, parts of speech, and named entities (i.e. proper nouns).
In addition, the OpenSoNaR project received funding to make SoNaR more accessible to its intended audience. To this end, we developed WhiteLab, an online search and exploration interface. WhiteLab includes four distinct search screens designed to suit the needs of both first-time users and those who are more familiar with the data.
The exploration interface presents corpus statistics, including frequency lists, n-grams and vocabulary growth charts. Both the search and the exploration interface include extensive filters on the corpus metadata. This allows researchers to zoom in on just the data that is relevant to their research.
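
To make the three statistics named above concrete, the following is a minimal Python sketch of how a frequency list, n-grams and a vocabulary growth curve can be computed from a tokenized text. It is an illustration only and does not reflect how WhiteLab computes these values; the function names and the toy sentence are assumptions introduced here.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary_growth(tokens):
    """Return the number of distinct tokens seen after each position."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Hypothetical toy input; a real corpus would supply tokens (or lemmas).
tokens = "de kat zit op de mat en de hond zit op de bank".split()

top_tokens = frequency_list(tokens)
bigram_counts = Counter(ngrams(tokens, 2))
growth_curve = vocabulary_growth(tokens)
```

Plotting `growth_curve` against token position gives a vocabulary growth chart. Applying the same functions to a subset of texts selected by metadata corresponds to the filtering described above.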