News

13 February 2023
The HGC's YouTube channel has been launched, with educational videos in Hungarian: https://www.youtube.com/@magyarnemzetiszovegtar
9 March 2020
A detailed description of the morphosyntactic codes is available (in Hungarian).
8 June 2018 v2.0.5
Convenience services and improvements.
  1. In previous versions, certain XML tags appeared in the text. These are now encoded as structural information, and their display can be switched on and off.
  2. Previously, some texts did not belong to any subcorpus, so in some cases the numbers of hits in the subcorpora did not add up to the total number of hits. In the current version, every text is assigned to a subcorpus.
  3. Social media contains a lot of duplicated text, owing to the nature of the material. If duplicate-free text is important for a particular investigation, social media can be omitted. To make this easier, we split the személyes (personal) subcorpus into two parts: személyes-közösségi, which contains the social media data, and személyes-fórum, which contains the other texts.
  4. Punctuation marks are separate tokens in the corpus, so they appear in the concordance separated by spaces. To make concordances more legible, a partial solution was introduced that attaches punctuation marks to the adjacent word, as is usual in written text. This is achieved by turning on the <g> (glue) structure.
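The effect of the glue structure in item 4 can be illustrated with a minimal sketch. It assumes a vertical (one token per line) representation in which a `<g/>` line means "no space before the next token"; this layout and the function name `detokenize` are assumptions made for illustration, not a description of how the HGC interface implements the feature.

```python
def detokenize(vertical_lines):
    """Join tokens with spaces, except where a <g/> line precedes a token.

    Assumption: the input is one token (or structural tag) per line, and
    <g/> marks that the following token attaches to the previous one.
    """
    pieces = []
    glue_next = False
    for line in vertical_lines:
        line = line.strip()
        if not line:
            continue
        if line == "<g/>":
            glue_next = True
            continue
        if line.startswith("<"):
            # Other structural tags (e.g. <s>, </s>) carry no text.
            # Limitation: a token that itself starts with "<" is skipped too.
            continue
        token = line.split("\t")[0]  # first column is the word form
        if pieces and not glue_next:
            pieces.append(" ")
        pieces.append(token)
        glue_next = False
    return "".join(pieces)


sample = ["<s>", "Szia", "<g/>", ",", "világ", "<g/>", "!", "</s>"]
print(detokenize(sample))  # Szia, világ!
```

Without the `<g/>` lines the same tokens would be rendered as "Szia , világ !", which is the behaviour described for the concordance when the glue structure is switched off.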
There was no change in the text of the corpus compared to v2.0.4.
Versions v2.0.2 to v2.0.4 are still available so that previous research can be reproduced.
18 October 2017
Data on the sizes of the subcorpora are available (in Hungarian).
29 August 2016 v2.0.4
The size of the corpus is 1.04 billion running words (1.348 billion tokens).
19 February 2016 v2.0.3
The size of the corpus is 785 million running words (978 million tokens).
The whole corpus has been reanalysed.
New annotation: the mboundary field in the ana attribute, which contains the morpheme boundaries of words in this form: dolgoz+ó+i.
New attributes: word_syll – number of syllables in the wordform, lemma_syll – number of syllables in the lemma.
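As an illustration of the mboundary format and of what a syllable count looks like, here is a minimal sketch. The function names are hypothetical, and counting vowel letters is only a common approximation for Hungarian; how the corpus actually computes word_syll and lemma_syll is not described here.

```python
# Hungarian vowel letters; each is treated as one syllable nucleus.
HUNGARIAN_VOWELS = set("aáeéiíoóöőuúüű")


def split_morphemes(mboundary):
    """Split an mboundary value such as 'dolgoz+ó+i' into its morphemes."""
    return mboundary.split("+")


def count_syllables(wordform):
    """Approximate the syllable count by counting vowel letters."""
    return sum(1 for ch in wordform.lower() if ch in HUNGARIAN_VOWELS)


morphemes = split_morphemes("dolgoz+ó+i")
print(morphemes)                    # ['dolgoz', 'ó', 'i']

# Assumption: concatenating the morphemes gives back the word form.
wordform = "".join(morphemes)
print(count_syllables(wordform))    # 4  (dol-go-zó-i)
```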
26 September 2014 v2.0.2
The size of the corpus is 587 million running words (732 million tokens).
13 September 2014 v2.0.1
The HGC has been opened. It contains the material of the 187-million-word HNC, with a new linguistic analysis and a new interface.