korpus.cz - Newsletter 12/2017

Prosinec 2017

							Nové mluvené korpusy

Začátkem června 2017 byla zveřejněna trojice mluvených korpusů: zcela nové korpusy ORTOFON a nářeční DIALEKT s dvouúrovňovou transkripcí včetně propojení se zvukem, a dále korpus ORAL v1 sjednocující všechny korpusy řady ORAL. Všechny tyto korpusy jsou lemmatizovány a morfologicky označkovány. Podrobné informace o jednotlivých korpusech, jejich složení, anotaci a způsobu práce s nimi najdete na naší wiki (viz odkazy výše).

							SYN verze 6

Před několika dny byl zveřejněn korpus SYN verze 6. Svým zpracováním, strukturou, anotací i klasifikací textů plně odpovídá korpusu SYN verze 5, nově však přibyla publicistika s rokem vydání 2016 o objemu téměř 200 mil. slov. Celkový rozsah korpusu tak přesáhl 4 miliardy slov (4,8 mld. pozic včetně interpunkce).

							Rozhraní KonText 0.11

V prosinci 2017 byla zveřejněna také nová verze hlavního rozhraní pro práci s korpusy KonText. Vedle některých interních změn došlo k rozšíření funkcionality uživatelského rozhraní zejména o tyto funkce:

2-rozměrnou frekvenční distribuci s intervalovými odhady skutečných hodnot;
možnost vracet zpět jednotlivé kroky (funkce 'undo') v interaktivním výběru textů a morfologických kategorií;
vylepšenou historii dotazů s možností pojmenování a uložení jednotlivých položek.

Souhrn podstatných změn je k dispozici na samostatné stránce věnované historii verzí KonTextu.

							InterCorp verze 10

Paralelní korpus InterCorp je dostupný ve verzi 10. Pokrývá v nestejném rozsahu celkem 39 jazyků a stejně jako dosud je jeho obsah tvořen beletristickým ručně zkontrolovaným jádrem a automaticky zarovnanými kolekcemi. Celkový rozsah cizojazyčné části dosáhl 1,48 mld. slov, seznam rozdílů oproti předchozí verzi najdete na samostatné stránce.

							FicTree

Nedávno byl zveřejněn také syntakticky anotovaný korpus současné české beletrie FicTree. Korpus obsahuje 135 000 slov, jeho lemmatizace, morfologická a syntaktická anotace byly provedeny manuálně.

							Konference SlaviCorp 2018

ČNK zve všechny svoje uživatele na konferenci SlaviCorp 2018, která se bude konat 24.–26. září 2018 v Praze. Konference bude tematicky zaměřena na korpusový výzkum slovanských jazyků (včetně výzkumu kontrastivního), na vytváření jazykových zdrojů pro tyto jazyky a na jejich aplikace v jazykových technologiích.

Konferenci bude předcházet workshop na téma jazykové variability a multidimenzionální analýzy, na němž přednese plenární přednášku prof. Douglas Biber. Workshop bude všem účastníkům konference volně přístupný.

December 2017

							New spoken corpora

In early June 2017, three spoken corpora were released: the brand-new ORTOFON corpus and the dialectal DIALEKT corpus, both with two-level transcription linked to the audio, and ORAL v1, which unifies all corpora of the ORAL series. All three corpora are lemmatised and morphologically tagged. Detailed information about the corpora, their composition, their annotation and how to work with them can be found on our wiki (see the links above).

							SYN release 6

SYN release 6 was published a few days ago. It fully corresponds to SYN release 5 in text processing, structure, annotation and text classification, and it adds journalistic texts published in 2016 totalling almost 200 million words. The corpus now exceeds 4 billion running words (4.8 billion tokens including punctuation).

							KonText 0.11

In December 2017, a new version of KonText, our main corpus interface, was launched. Apart from some internal code enhancements, the functionality of the user interface has been extended to include the following features:

2-dimensional frequency distribution with confidence intervals;
support for 'undo' in the interactive text selection and morphological tag builder;
improved query history where individual items can be archived under a custom name.

A comprehensive KonText version history is available on a separate page.
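
To illustrate what a 2-dimensional frequency distribution with confidence intervals can look like, here is a minimal Python sketch. It is not KonText code: the newsletter does not say which estimation method KonText uses, so the Wilson score interval is an assumption chosen for illustration, and the function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def wilson_interval(hits, total, z=1.96):
    """Wilson score interval for the proportion hits/total.

    z=1.96 corresponds to a confidence level of roughly 95 %.
    This method is an assumption, not necessarily what KonText uses.
    """
    if total == 0:
        return (0.0, 0.0)
    p = hits / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def two_dim_distribution(pairs, z=1.96):
    """Count (row, column) pairs and attach an interval to each cell's share."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {cell: (n, *wilson_interval(n, total, z)) for cell, n in counts.items()}


# Hypothetical concordance hits: (word form, text type)
hits = (
    [("korpus", "fiction")] * 30
    + [("korpus", "journalism")] * 50
    + [("korpusy", "fiction")] * 5
    + [("korpusy", "journalism")] * 15
)

table = two_dim_distribution(hits)
for (form, text_type), (n, low, high) in sorted(table.items()):
    print(f"{form:8} {text_type:11} n={n:3d}  share in [{low:.3f}, {high:.3f}]")
```

Each cell reports the observed count together with an interval for its share of all hits, so a reader can see how much a small cell's share might vary.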

							InterCorp release 10

Release 10 of the InterCorp parallel corpus is now available online. It covers 39 languages with varying amounts of text and, as before, consists of a manually checked fiction core and several automatically aligned collections. The non-Czech part of InterCorp release 10 has reached 1.48 billion running words; a list of changes from the previous release can be found on a separate page.

							FicTree

The FicTree treebank, a syntactically annotated corpus of contemporary Czech fiction, has also been released recently. It consists of 135,000 words; its lemmatisation and its morphological and syntactic annotation were done manually.

							SlaviCorp 2018 conference

We are pleased to invite all CNC users to the SlaviCorp 2018 conference, to be held in Prague on 24–26 September 2018. The conference will focus on corpus research on Slavic languages (including contrastive research), on the development of language resources for these languages and on their use in language technologies.

The conference will be preceded by a workshop on language variability and multi-dimensional analysis, which will include a plenary talk by Prof. Douglas Biber. The workshop will be free and open to all conference participants.

Ústav Českého národního korpusu, Filozofická fakulta Univerzity Karlovy
www.korpus.cz | ucnk@korpus.cz | +420 221 619 837
