korpus.cz - Newsletter 12/2020

December 2020

							SYN2020: a new representative corpus

Five years after SYN2015, the CNC is releasing SYN2020, a new 100-million-word representative corpus. SYN2020 supersedes SYN2015 as the CNC’s flagship corpus of synchronic written (printed) Czech. Its design is unchanged, so its composition matches that of SYN2015. Besides covering new texts from the 2015–2019 period, SYN2020 brings significant advances in annotation:

improved tokenization;
two-level lemmatization that keeps spelling and other variants under a single lemma while treating them separately on the sublemma level;
updated morphological tagset with an additional verbtag attribute;
multivalue lemmatization and tagging of aggregates, i.e. words spelled as a single word that exhibit features of more than one word (e.g. kdyby, vidělas, nač);
syntactic annotation with significantly improved accuracy.

For details, please refer to the main SYN2020 web page.
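The relationship between lemma, sublemma and aggregates can be pictured with a small sketch. It is only an illustration: the `Token` class, the `group_by` helper and the example annotations (for instance, treating "filosofie" as a sublemma of the lemma "filozofie") are assumptions made here, not the actual SYN2020 data or format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; an aggregate carries more than one value."""
    word: str
    sublemmas: Tuple[str, ...]
    lemmas: Tuple[str, ...]


# Illustrative annotations only (assumed, not taken from the corpus).
tokens = [
    Token("filozofie", ("filozofie",), ("filozofie",)),
    Token("filosofii", ("filosofie",), ("filozofie",)),  # spelling variant
    Token("nač", ("na", "co"), ("na", "co")),            # aggregate: na + co
    Token("vidělas", ("vidět", "být"), ("vidět", "být")),  # aggregate: viděla + jsi
]


def group_by(items: List[Token], level: str) -> Dict[str, List[str]]:
    """Group word forms by every value found on the given annotation level."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for token in items:
        for value in getattr(token, level):
            groups[value].append(token.word)
    return dict(groups)


by_lemma = group_by(tokens, "lemmas")        # variants are merged
by_sublemma = group_by(tokens, "sublemmas")  # variants stay apart
```

In this sketch, grouping by lemma puts both spellings under "filozofie", grouping by sublemma keeps them apart, and each aggregate is listed under both of its component lemmas.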

							KonText 0.15

At the end of 2020, we also launched a new version of KonText, our main corpus query interface. Besides substantial internal changes, the user interface has been extended with the following features:

reduction of the original six query types to just two, simple and advanced: the advanced query corresponds to the former "CQL" query, while the simple query, combined with suitable parameter settings, replaces all the others;
a special add-on for suggesting other possible variants of the input (sub)lemma (SYN2020 only).

							ONLINE: corpora of the Czech web

We are proud to announce the release of the ONLINE monitor corpora, which map the Czech web, i.e. internet news, discussions and social networks, from 2017 to the present day. The ONLINE corpora are compiled in cooperation with the Dataweps company, contain more than six billion tokens and are updated every day!

							New versions of spoken corpora

We are also releasing new versions of two spoken corpora: ORTOFON v2 and ORATOR v2. Apart from having doubled in size (to 2M and 1M words, respectively), they feature many small improvements in the consistency of the transcription and in their annotation.

							QuitaUp: stylometric web app

The third new web application released this year is QuitaUp, created in collaboration with R. Čech and M. Kubát (Faculty of Arts, University of Ostrava). QuitaUp analyzes texts supplied by the user by calculating basic quantitative stylometric indices (e.g. vocabulary richness, thematic concentration). Thanks to UDPipe, QuitaUp can analyze texts in 20 languages.
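As a rough illustration of what a vocabulary-richness index measures, here is a minimal sketch of the type-token ratio, one common measure of this kind. Whether QuitaUp computes this particular index, or computes it this way, is an assumption. The function name `type_token_ratio` is hypothetical, and the regex tokenizer merely stands in for the UDPipe processing mentioned above.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms divided by all word forms (0.0 if empty)."""
    # Naive tokenization: lowercase the text and take runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "the cat sat on the mat and the dog sat too"
ttr = type_token_ratio(sample)  # 8 distinct forms out of 11 tokens
```

A higher ratio means less repetition. The plain ratio tends to fall as texts get longer, so values are best compared between texts of similar length.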

							Season’s greetings

The entire CNC team wishes you all the best in the New Year 2021!

Institute of the Czech National Corpus, Faculty of Arts, Charles University
www.korpus.cz | ucnk@korpus.cz | +420 221 619 837
