Prague Dependency Treebank


Daniel Zeman (zeman@ufal.ms.mff.cuni.cz)
Thu, 15 Oct 1998 17:48:14 +0000


The Institute of Formal and Applied Linguistics (UFAL) at Charles University, Prague, proudly announces that the first version of the PRAGUE DEPENDENCY TREEBANK has been made available to the research community.

The Prague Dependency Treebank (PDT) is a morphologically and syntactically annotated corpus of Czech, chosen as a representative of inflectionally rich free-word-order languages. (For example, all the Slavic languages, such as Russian, Polish, Serbo-Croatian and many others, spoken together by more than 350 million people, have typological properties similar to those of Czech in both morphology and syntax.)

The current version of PDT (0.5) contains 456705 tokens (words + punctuation) in 26610 sentences and 576 files. To keep the results of NLP applications comparable, the data has been divided into a training set (19126 sentences), a development test set (3697 sentences) and a (cross-)evaluation test set (3787 sentences).

The Prague Dependency Treebank is, to a certain extent, modelled after the Penn Treebank, but it uses a dependency syntax representation of sentences. It has three layers:

1. morphological (uses word forms, tags, lemmas)
2. analytical, or surface syntax (uses dependencies and analytical functions of dependencies)
3. tectogrammatical, which captures linguistic meaning (contains tectogrammatical functions such as Actor, Patient, Addressee, etc.)

The Prague Dependency Treebank is a long-term project which should end in the year 2000. At the moment (October 1998) we have at our disposal roughly half the material at levels 1 and 2, while level 3 is still in the specification phase and the rules of transition between the level 2 and level 3 representations are being formulated. The current version is thus preliminary and identified as "PDT version 0.5" (reflecting mostly the amount of material currently available).
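As a quick sanity check of the figures quoted above, the following minimal sketch confirms that the three subsets add up to the stated sentence total and shows the share of each subset. The names SPLIT, TOTAL_SENTENCES and split_shares are illustrative only and are not part of the PDT distribution.

```python
# Sentence counts as quoted in this announcement for PDT version 0.5.
SPLIT = {
    "training": 19126,
    "development test": 3697,
    "evaluation test": 3787,
}
TOTAL_SENTENCES = 26610


def split_shares(split, total):
    """Return each subset's share of the total in percent, rounded to one decimal."""
    return {name: round(100.0 * count / total, 1) for name, count in split.items()}


# The three subsets should account for every sentence in the treebank.
assert sum(SPLIT.values()) == TOTAL_SENTENCES

shares = split_shares(SPLIT, TOTAL_SENTENCES)
for name, pct in shares.items():
    print(f"{name}: {SPLIT[name]} sentences ({pct}%)")
```

The split works out to roughly 72% training data, with the remainder divided almost evenly between the two test sets.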
The text material contains samples from the following sources:

1. Lidove noviny (daily newspaper), 1991, 1994, 1995
2. Mlada fronta Dnes (daily newspaper), 1992
3. Ceskomoravsky Profit (business weekly), 1994
4. Vesmir (scientific magazine), Academia Publishers, 1992, 1993

The electronic source has been provided by the Institute of the Czech National Corpus, in a format jointly developed by the ICNK and UFAL.

The Treebank has been supported by the following grants and projects:

Grant Agency of the Czech Republic No. 405/96/0198 (Treebank Definition and Procedures Specification)
Grant Agency of the Czech Republic No. 405/96/K214 (Tools and Level 1 Annotation)
Ministry of Education of the Czech Republic Project No. VS96151 (Tools and Structural Annotation on the Level 2)
National Science Foundation grant No. IIS-9732388 (Version 0.5 Preparation for the Workshop 98)

The documentation of PDT is linked from its main page at UFAL. Go to the UFAL home page, http://ufal.ms.mff.cuni.cz/, then click on "Projects" and "Treebank".

PDT Version 0.5 is freely available for research purposes, provided that you fill in and submit a licence agreement. The appropriate form is also linked from the PDT web page.

--
Daniel Zeman, UFAL MFF UK, Praha
zeman@ufal.mff.cuni.cz
http://www.ms.mff.cuni.cz/~zeman/


