With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian-language corpus. Unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license. The Hungarian Webcorpus was created in the winter of 2003 as part of the WordSword project at the Media Research and Education Centre.
The corpus consists of 18 million pages downloaded from the .hu domain, so it covers a broad range of everyday written language. Texts that occurred more than once and files that contained no usable text were filtered out. We stratified the remainder into four sections according to the proportion of words on each page that were accepted by a spellchecker.
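
The filtering and stratification steps described above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the three cut-off values in THRESHOLDS, the function names, and the use of an exact hash of the normalized text for duplicate detection are all assumptions made for illustration, and a plain word set stands in for the spellchecker.

```python
import hashlib
import re

# Hypothetical cut-offs separating the four strata; the real values
# are not given in the description above.
THRESHOLDS = (0.96, 0.92, 0.80)


def tokenize(text):
    """Split a page into lowercase word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def accepted_ratio(words, lexicon):
    """Proportion of tokens found in the lexicon (spellchecker stand-in)."""
    if not words:
        return 0.0
    return sum(w in lexicon for w in words) / len(words)


def stratum(ratio, thresholds=THRESHOLDS):
    """Map a ratio to a stratum index; 0 is the cleanest stratum."""
    for index, cutoff in enumerate(thresholds):
        if ratio >= cutoff:
            return index
    return len(thresholds)


def filter_and_stratify(pages, lexicon):
    """Drop empty and duplicate pages, then bucket the rest by ratio."""
    seen = set()
    strata = [[] for _ in range(len(THRESHOLDS) + 1)]
    for text in pages:
        words = tokenize(text)
        if not words:
            continue  # no usable text
        # Assumption: duplicates are pages with identical token sequences.
        digest = hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # text already present
        seen.add(digest)
        strata[stratum(accepted_ratio(words, lexicon))].append(text)
    return strata
```

Called with a list of page texts and a set of accepted word forms, filter_and_stratify returns four lists, ordered from the highest to the lowest proportion of accepted words. Near-duplicate detection, which a web crawl of this size would likely need, is outside the scope of this sketch.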