This is because both syntactic and semantic structure are commonly represented compositionally as a tree structure. The term parsed corpus is often used interchangeably with treebank.
Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.
Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the BulTreeBank follows HPSG) but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure (for example the Penn Treebank or ICE-GB) and those that annotate dependency structure (for example the Prague Dependency Treebank or the Quranic Arabic Dependency Treebank).
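The difference between the two groups can be made concrete with a small data-structure sketch. The snippet below stores a dependency analysis of a sentence such as John loves Mary as a flat list of tokens, each pointing at its head word. The `Token` type, the helper names and the relation labels (`nsubj`, `obj`, `root`, in the style of Universal Dependencies) are illustrative assumptions, not the format of any particular treebank listed here.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One word in a dependency analysis (hypothetical layout for illustration)."""
    index: int      # 1-based position in the sentence
    form: str       # the word itself
    head: int       # index of the governing word, 0 for the root
    relation: str   # label of the dependency (assumed UD-style labels)


# Assumed analysis: "loves" is the root, "John" its subject, "Mary" its object.
SENTENCE = [
    Token(1, "John", 2, "nsubj"),
    Token(2, "loves", 0, "root"),
    Token(3, "Mary", 2, "obj"),
]


def root(tokens: List[Token]) -> Token:
    """Return the token whose head is 0."""
    return next(t for t in tokens if t.head == 0)


def dependents(tokens: List[Token], head: Token) -> List[Token]:
    """Return the tokens directly governed by `head`."""
    return [t for t in tokens if t.head == head.index]


top = root(SENTENCE)
print(top.form)                                                  # loves
print([(t.form, t.relation) for t in dependents(SENTENCE, top)])
# [('John', 'nsubj'), ('Mary', 'obj')]
```

A dependency analysis has no phrasal nodes such as NP or VP: every node is a word, whereas a phrase structure analysis groups words under labelled constituents.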
It is important to clarify the distinction between the formal representation and the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar. The same grammar may be implemented by different file formats. For example, the syntactic analysis for John loves Mary, shown in the accompanying figure, may be represented by simple labelled brackets in a text file, following the Penn Treebank notation: (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))
This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation.
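Part of the appeal of labelled brackets is that they are easy to read programmatically. The following is a minimal sketch, using only the Python standard library, of a reader that turns a Penn-style bracketing into nested tuples. The function names and the tuple layout are assumptions made for this illustration, and the sketch ignores complications found in real treebank files, such as empty elements and function tags.

```python
import re


def parse_brackets(text):
    """Turn a labelled bracketing into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node in left-to-right order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


SENTENCE = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))"
tree = parse_brackets(SENTENCE)
print(tree[0])       # S
print(leaves(tree))  # ['John', 'loves', 'Mary', '.']
```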
From a computational perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems.
Most computational systems use gold-standard treebank data. However, an automatically parsed corpus that has not been corrected by human linguists can still be useful: applying a parser to large amounts of text and gathering rule frequencies provides evidence that can be used to improve the parser. Only by correcting and completing a corpus by hand, however, is it possible to identify rules that are absent from the parser's knowledge base. In addition, frequencies drawn from a hand-corrected corpus are likely to be more accurate.
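Gathering rule frequencies from parsed text can be sketched as follows. The example reads a toy collection of bracketed trees, counts each phrase structure rule (a parent label together with the labels of its children), and prints relative frequencies per parent label. The two sentences, the function names and the decision to skip lexical rules such as NNP -> John are assumptions made for illustration; this is not the procedure of any specific parser.

```python
import re
from collections import Counter


def parse_brackets(text):
    """Turn a labelled bracketing into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read_node()


def rules(node):
    """Yield (parent, child_labels) for every non-lexical node in the tree."""
    label, children = node
    if all(isinstance(c, str) for c in children):
        return  # preterminal: skip lexical rules such as NNP -> John
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)


# Hypothetical two-sentence "treebank" used only for this illustration.
TOY_TREEBANK = [
    "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))",
    "(S (NP (NNP Mary)) (VP (VBZ sleeps)))",
]

counts = Counter(r for s in TOY_TREEBANK for r in rules(parse_brackets(s)))

# Total occurrences of each parent label, used to normalise the counts.
lhs_totals = Counter()
for (lhs, _), n in counts.items():
    lhs_totals[lhs] += n

for (lhs, rhs), n in sorted(counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {n} ({n / lhs_totals[lhs]:.2f})")
```

In this toy data the rule VP -> VBZ NP accounts for half of all VP expansions. Estimates of this kind are only as reliable as the trees they are counted from, which is why hand-corrected data tends to give more accurate frequencies.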
In corpus linguistics, treebanks are used to study syntactic phenomena (for example, diachronic corpora can be used to study the time course of syntactic change). Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena.
Another use of treebanks in theoretical linguistics and psycholinguistics is interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices.
In syntactic research, annotated treebank data has been used to test linguistic theories of sentence structure against large quantities of naturally occurring examples.
A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is the Groningen Meaning Bank, developed at the University of Groningen and annotated using Discourse Representation Theory. An example of a shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form.
Many syntactic treebanks have been developed for a wide variety of languages:
Language | Treebank | Annotation type | |
Abaza language | Universal Dependencies, ATB | Dependency | |
Afrikaans | Universal Dependencies, AfriBooms | Dependency | |
Akkadian | Universal Dependencies, PISANDUB | Dependency | |
Albanian | Universal Dependencies, TSA | Dependency | |
Amharic language | Universal Dependencies, ATT | Dependency | |
Ancient Greek | Universal Dependencies, Perseus | Dependency | |
Ancient Greek | Universal Dependencies, PROIEL | Dependency | |
Ancient Greek | Ancient Greek Dependency Treebank[Mambrini, F. 2016. The Ancient Greek Dependency Treebank: Linguistic Annotation in a Teaching Environment. In: Bodard, G & Romanello, M (eds.) Digital Classics Outside the Echo-Chamber: Teaching, Knowledge Exchange & Public Engagement, Pp. 83–99. London: Ubiquity Press.] | Dependency | |
Ancient Greek | PROIEL Treebank[Dag Haug. 2015. Treebanks in historical linguistic research. In Carlotta Viti (ed.), Perspectives on Historical Syntax, Benjamins, 188-202. A preprint is available at http://folk.uio.no/daghaug/historical-treebanks.pdf.] | Dependency | |
Arabic language | Columbia Arabic Treebank (CATiB) | Dependency | |
Arabic language | Prague Arabic Dependency Treebank (PADT) | Dependency | |
Arabic language | Universal Dependencies, NYUAD | Dependency | |
Arabic language | Universal Dependencies, PADT | Dependency | |
Arabic language | Universal Dependencies, PUD | Dependency | |
Arabic language | Penn Arabic Treebank | Phrase structure | |
Armenian | Universal Dependencies, ArmTDP | Dependency | |
Assyrian (Neo-Aramaic) | Universal Dependencies, AS | Dependency | |
Bambara language | Universal Dependencies, CRB | Dependency | |
Basque language | Universal Dependencies, BDT | Dependency | |
Belarusian | Universal Dependencies, HSE | Dependency | |
Bhojpuri | Universal Dependencies, BhEn | Dependency | |
Bhojpuri | Universal Dependencies, BHTB | Dependency | |
Breton language | Universal Dependencies, KEB | Dependency | |
Bulgarian | Universal Dependencies, BTB | Dependency | |
Bulgarian | BulTreeBank | HPSG | |
Buryat language | Universal Dependencies, BDT | Dependency | |
Cantonese | Universal Dependencies, HK | Dependency | |
Catalan language | Cat3LB | Phrase structure | |
Catalan language | Universal Dependencies, AnCora | Dependency | |
Chinese language | Sinica Treebank | Case grammar | |
Chinese language | Universal Dependencies, CFL | Dependency | |
Chinese language | Universal Dependencies, GSD | Dependency | |
Chinese language | Universal Dependencies, GSDSimp | Dependency | |
Chinese language | Universal Dependencies, HK | Dependency | |
Chinese language | Universal Dependencies, PUD | Dependency | |
Chinese language | Penn Chinese Treebank | Phrase structure | |
Chinese language | Chinese Dependency Treebank | Dependency | |
Classical Arabic | Quranic Arabic Dependency Treebank (QADT) (Quranic Arabic Corpus) | Dependency | |
Classical Armenian | PROIEL Treebank | Dependency | |
Coptic language | Universal Dependencies, Coptic Scriptorium | Dependency | |
Croatian | Croatian Dependency Treebank | Dependency | |
Croatian | Universal Dependencies, SET | Dependency | |
Czech language | Prague Dependency Treebank | Dependency | |
Czech language | Universal Dependencies, CAC | Dependency | |
Czech language | Universal Dependencies, CLTT | Dependency | |
Czech language | Universal Dependencies, FicTree | Dependency | |
Czech language | Universal Dependencies, PDT | Dependency | |
Czech language | Universal Dependencies, PUD | Dependency | |
Danish language | Danish Dependency Treebank | Dependency | |
Danish language | Arboretum: A syntactic tree corpus of Danish | Phrase structure | |
Danish language | Universal Dependencies, DDT | Dependency | |
Danish language | Universal Dependencies, DTB | Dependency | |
Dutch language | Spoken Dutch Corpus (CGN) | Phrase structure | |
Dutch language | Universal Dependencies, Alpino | Dependency | |
Dutch language | Universal Dependencies, LassySmall | Dependency | |
Dutch language | LASSY Small and Large | Dependency | |
Dutch language | Alpino Treebank | Dependency | |
Egyptian | Universal Dependencies, UJaen | Dependency | |
English language | CCGbank | Combinatory categorial grammar | |
English language | LinGO Redwoods | HPSG | |
English language | Lancaster Parsed Corpus | Phrase structure | |
English language | Prague English Dependency Treebank | Dependency | |
English language | Universal Dependencies, BhEn | Dependency | |
English language | Universal Dependencies, ESL | Dependency | |
English language | Universal Dependencies, EWT | Dependency | |
English language | Universal Dependencies, GUM | Dependency | |
English language | Universal Dependencies, GUMReddit | Dependency | |
English language | Universal Dependencies, LinES | Dependency | |
English language | Universal Dependencies, ParTUT | Dependency | |
English language | Universal Dependencies, Pronouns | Dependency | |
English language | Universal Dependencies, PUD | Dependency | |
English language | Treebank Semantics Parsed Corpus | Phrase structure | |
English language | Christine Corpus | Phrase structure | |
English language | Lucy Corpus | Phrase structure | |
English language | Susanne Corpus | Phrase structure | |
English language | BLLIP WSJ corpus | Phrase structure | |
English language | Tübingen Treebank of English / Spontaneous Speech (TüBa-E/S) | HPSG | |
English language | Diachronic Corpus of Present-Day Spoken English (DCPSE) | Phrase structure | |
English language | British Component of the International Corpus of English (ICE-GB) | Phrase structure | |
English language | The PARC 700 Dependency Bank | Dependency | |
English language | Yahoo Query Treebank | Dependency | |
English language | Penn Treebank | Phrase structure | |
English language | Multi-Treebank | Phrase structure | |
English language | CHILDES Brown Eve corpus with dependency annotation | Dependency | |
English language | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | |
Erzya language | Universal Dependencies, JR | Dependency | |
Estonian | Arborest | Phrase structure | |
Estonian | Syntactically analyzed and disambiguated text corpus | Dependency | |
Estonian | Universal Dependencies, EDT | Dependency | |
Estonian | Universal Dependencies, EWT | Dependency | |
Faroese language | Universal Dependencies, FarPaHC | Dependency | |
Faroese language | Universal Dependencies, OFT | Dependency | |
Finnish language | Turku Dependency Treebank (TDT) | Dependency | |
Finnish language | Universal Dependencies, FTB | Dependency | |
Finnish language | Universal Dependencies, PUD | Dependency | |
Finnish language | Universal Dependencies, TDT | Dependency | |
French language | Rhapsodie | Dependency and macrosyntactic annotation | |
French language | L'Arboratoire | Phrase structure | |
French language | Universal Dependencies, CrapBank | Dependency | |
French language | Universal Dependencies, FQB | Dependency | |
French language | Universal Dependencies, FTB | Dependency | |
French language | Universal Dependencies, GSD | Dependency | |
French language | Universal Dependencies, ParTUT | Dependency | |
French language | Universal Dependencies, PUD | Dependency | |
French language | Universal Dependencies, Sequoia | Dependency | |
French language | Universal Dependencies, Spoken | Dependency | |
French language | French Treebank | Phrase structure | |
French language | Free French Treebank | Phrase structure | |
French language | Sequoia Treebank | Phrase structure & Dependency | |
Galician | Universal Dependencies, CTG | Dependency | |
Galician | Universal Dependencies, TreeGal | Dependency | |
German language | Hamburg Dependency Treebank (HDT) | Dependency | |
German language | Universal Dependencies, GSD | Dependency | |
German language | Universal Dependencies, LIT | Dependency | |
German language | Universal Dependencies, PUD | Dependency | |
German language | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | |
German language | NEGRA | Phrase structure | |
German language | TIGER | Phrase structure | |
German language | Tübingen Treebank of German / Spontaneous Speech (TüBa-D/S) | Phrase structure | |
German language | Tübingen Treebank of Written German (TüBa-D/Z) | Phrase structure | |
German language | Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) | Phrase structure | |
Gothic language | PROIEL Treebank | Dependency | |
Gothic language | Universal Dependencies, PROIEL | Dependency | |
Greek language | Greek Dependency Treebank | Dependency | |
Greek language | Universal Dependencies, GDT | Dependency | |
Hebrew language | Universal Dependencies, HTB | Dependency | |
Hebrew language | Hebrew Dependency Treebank | Dependency | |
Hindi English | Universal Dependencies, HIENCS | Dependency | |
Hindi language | Universal Dependencies, HDTB | Dependency | |
Hindi language | Universal Dependencies, PUD | Dependency | |
Hindi | AnnCorra | Dependency | |
English (historical) | Penn Parsed Corpora of Historical English | Phrase structure | |
English (historical) | York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) | Phrase structure | |
French (historical) | Corpus MCVF | Phrase structure | |
Portuguese (historical) | Tycho Brahe corpus | Phrase structure | |
Hungarian | Universal Dependencies, Szeged | Dependency | |
Hungarian | Hungarian Treebank | Phrase structure | |
Icelandic | IcePaHC - Icelandic Parsed Historical Corpus | Phrase structure | |
Icelandic | Universal Dependencies, IcePaHC | Dependency | |
Icelandic | Universal Dependencies, PUD | Dependency | |
Indonesian | Universal Dependencies, GSD | Dependency | |
Indonesian | Universal Dependencies, PUD | Dependency | |
Indonesian | ICON | Phrase structure | |
Irish language | Universal Dependencies, IDT | Dependency | |
Italian language | ISST - Italian Syntactic-Semantic Treebank | Phrase structure and dependency | |
Italian language | MIDT (Merged Italian Dependency Treebank), resulting from the merging and harmonization of the TUT and ISST-CoNLL/TANL treebanks | Dependency | |
Italian language | VIT - Venice Italian Treebank | Phrase structure and dependency | |
Italian language | Universal Dependencies, ISDT | Dependency | |
Italian language | Universal Dependencies, ParTUT | Dependency | |
Italian language | Universal Dependencies, PoSTWITA | Dependency | |
Italian language | Universal Dependencies, PUD | Dependency | |
Italian language | Universal Dependencies, TWITTIRO | Dependency | |
Italian language | Universal Dependencies, VIT | Dependency | |
Italian language | Italian Syntactic-Semantic Treebank for the CoNLL-2007 Shared Task (ISST-CoNLL) | Dependency | |
Italian language | SUT - Siena University Treebank | | |
Italian language | TUT - Turin University Treebank | Dependency | |
Italian language | ISDT (Italian Stanford Dependency Treebank) | Dependency | |
Japanese | Kyoto Text Corpus | | |
Japanese | Universal Dependencies, BCCWJ | Dependency | |
Japanese | Universal Dependencies, GSD | Dependency | |
Japanese | Universal Dependencies, KTC | Dependency | |
Japanese | Universal Dependencies, Modern | Dependency | |
Japanese | Universal Dependencies, PUD | Dependency | |
Japanese | Keyaki Treebank | Phrase structure | |
Japanese | Tübingen Treebank of Japanese / Spontaneous Speech (TüBa-J/S) | Phrase structure | |
Japanese | ATR Dependency corpus | Dependency | |
Karelian | Universal Dependencies, KKPP | Dependency | |
Kazakh language | Universal Dependencies, KTB | Dependency | |
Komi Permyak | Universal Dependencies, UH | Dependency | |
Komi Zyrian | Universal Dependencies, IKDP | Dependency | |
Komi Zyrian | Universal Dependencies, Lattice | Dependency | |
Korean language | Universal Dependencies, GSD | Dependency | |
Korean language | Universal Dependencies, Kaist | Dependency | |
Korean language | Universal Dependencies, Penn | Dependency | |
Korean language | Universal Dependencies, PUD | Dependency | |
Korean language | Universal Dependencies, Sejong | Dependency | |
Korean language | Korean Treebank | Phrase structure | |
Kurmanji | Universal Dependencies, MG | Dependency | |
Latin language | Universal Dependencies, ITTB | Dependency | |
Latin language | Universal Dependencies, LLCT | Dependency | |
Latin language | Universal Dependencies, Perseus | Dependency | |
Latin language | Universal Dependencies, PROIEL | Dependency | |
Latin | Index Thomisticus Treebank | Dependency | |
Latin | PROIEL Treebank | Dependency | |
Latin | Latin Dependency Treebank[Bamman David & al. 2008. Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3). http://nlp.perseus.tufts.edu/syntax/treebank/1.3/docs/guidelines.pdf] | Dependency | |
Latvian language | Universal Dependencies, LVTB | Dependency | |
Lithuanian | Universal Dependencies, ALKSNIS | Dependency | |
Lithuanian | Universal Dependencies, HSE | Dependency | |
Livvi language | Universal Dependencies, KKPP | Dependency | |
Magahi language | Universal Dependencies, MGTB | Dependency | |
Maltese language | Universal Dependencies, MUDT | Dependency | |
Marathi language | Universal Dependencies, UFAL | Dependency | |
Mbya Guarani | Universal Dependencies, Dooley | Dependency | |
Mbya Guarani | Universal Dependencies, Thomas | Dependency | |
Middle Irish | Universal Dependencies, CritMITB | Dependency | |
Middle Irish | Universal Dependencies, DipMITB | Dependency | |
Moksha language | Universal Dependencies, JR | Dependency | |
Naija language | Universal Dependencies, NSC | Dependency | |
North Sami | Universal Dependencies, Giella | Dependency | |
Norwegian | INESS treebanking infrastructure | LFG | |
Norwegian | Universal Dependencies, Bokmaal | Dependency | |
Norwegian | Universal Dependencies, Nynorsk | Dependency | |
Norwegian | Universal Dependencies, NynorskLIA | Dependency | |
Old Church Slavonic | Universal Dependencies, PROIEL | Dependency | |
Old Church Slavonic | TOROT Treebank | Dependency | |
Old French | Universal Dependencies, SRCMF | Dependency | |
Old Russian | Universal Dependencies, RNC | Dependency | |
Old Russian | Universal Dependencies, TOROT | Dependency | |
Old Russian | TOROT Treebank | Dependency | |
Persian language | Persian Dependency Treebank (PerDT) | Dependency | |
Persian language | PerTreeBank | HPSG | |
Persian language | Universal Dependencies, Seraji | Dependency | |
Polish language | A Treebank / Test Suite for Polish | HPSG | |
Polish language | Universal Dependencies, LFG | Dependency | |
Polish language | Universal Dependencies, PDB | Dependency | |
Polish language | Universal Dependencies, PUD | Dependency | |
Polish language | Składnica | Phrase structure and Dependency | |
Portuguese | Universal Dependencies, Bosque | Dependency | |
Portuguese | Universal Dependencies, GSD | Dependency | |
Portuguese | Universal Dependencies, PUD | Dependency | |
Portuguese | Projecto Floresta Sintá(c)tica | Dependency, Phrase structure | |
Romanian | Romanian Dependency Treebank | Dependency | |
Romanian | Universal Dependencies, Nonstandard | Dependency | |
Romanian | Universal Dependencies, RRT | Dependency | |
Romanian | Universal Dependencies, SiMoNERo | Dependency | |
Russian language | Universal Dependencies, GSD | Dependency | |
Russian language | Universal Dependencies, PUD | Dependency | |
Russian language | Universal Dependencies, SynTagRus | Dependency | |
Russian language | Universal Dependencies, Taiga | Dependency | |
Russian language | SynTagRus Dependency Treebank (Russian National Corpus) | Dependency | |
Sanskrit | Universal Dependencies, UFAL | Dependency | |
Sanskrit | Universal Dependencies, Vedic | Dependency | |
Scottish Gaelic | Universal Dependencies, ARCOSG | Dependency | |
Serbian language | Universal Dependencies, SET | Dependency | |
Sindhi language | Universal Dependencies, MazharDootio | Dependency | |
Skolt Sami | Universal Dependencies, Giellagas | Dependency | |
Slovak language | Universal Dependencies, SNK | Dependency | |
Slovene language | Slovene Dependency Treebank | Dependency | |
Slovenian | Universal Dependencies, SSJ | Dependency | |
Slovenian | Universal Dependencies, SST | Dependency | |
Spanish language | Cast3LB | Phrase structure and dependency | |
Spanish language | Universal Dependencies, AnCora | Dependency | |
Spanish language | Universal Dependencies, GSD | Dependency | |
Spanish language | Universal Dependencies, PUD | Dependency | |
Spanish language | UAM Treebank of Spanish | Phrase structure | |
Swedish language | Talbanken05 | Phrase structure and dependency | |
Swedish language | Swedish Treebank | Phrase structure | |
Swedish language | Universal Dependencies, LinES | Dependency | |
Swedish language | Universal Dependencies, PUD | Dependency | |
Swedish language | Universal Dependencies, Talbanken | Dependency | |
Swedish language | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | |
Swedish Sign Language | Universal Dependencies, SSLC | Dependency | |
Swiss German | Universal Dependencies, UZH | Dependency | |
Tagalog language | Universal Dependencies, TRG | Dependency | |
Tagalog language | Universal Dependencies, Ugnayan | Dependency | |
Tamil language | Universal Dependencies, TTB | Dependency | |
Telugu language | Universal Dependencies, MTG | Dependency | |
Thai language | NAiST Thai Treebank | Dependency | |
Thai language | Universal Dependencies, PUD | Dependency | |
Thai language | THTB | Phrase structure | |
Turkish language | METU-Sabanci Turkish Treebank | Dependency | |
Turkish language | Universal Dependencies, BOUN | Dependency | |
Turkish language | Universal Dependencies, GB | Dependency | |
Turkish language | Universal Dependencies, IMST | Dependency | |
Turkish language | Universal Dependencies, PUD | Dependency | |
Ukrainian | Institute for Ukrainian, NGO Gold Standard | Dependency | |
Ukrainian | Universal Dependencies, IU | Dependency | |
Upper Sorbian | Universal Dependencies, UFAL | Dependency | |
Urdu language | NU-FAST Treebank | Phrase structure | |
Urdu language | The URDU.KON-TB Treebank | Phrase and Hyper Dependency Structure | |
Urdu language | Universal Dependencies, UDTB | Dependency | |
Uyghur language | Universal Dependencies, UDT | Dependency | |
Vietnamese | Universal Dependencies, VTB | Dependency | |
Vietnamese | Vietnamese Treebank | Phrase structure | |
Vietnamese | Vietnamese Dependency Treebank | Dependency | |
Warlpiri | Universal Dependencies, UFAL | Dependency | |
Welsh language | Universal Dependencies, CCG | Dependency | |
Wolof language | Universal Dependencies, WTB | Dependency | |
Yoruba language | Universal Dependencies, YTB | Dependency | |
To facilitate research on multilingual tasks, some researchers have proposed universal annotation schemes that apply across languages. In this way, they try to combine or merge the advantages of different treebank corpora. Examples include a universal annotation approach for dependency treebanks and a universal annotation approach for phrase structure treebanks.
One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists. Wallis (2008) discusses the principles of searching treebanks in detail and reviews the state of the art around that time.
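At the expression-based end of that range, a query often amounts to a structural condition on nodes. The sketch below finds every node with a given label, optionally requiring that it dominate a node with another label. The function names and the tuple representation are assumptions made for this illustration; the sketch does not reproduce the query language of any actual treebank search tool.

```python
import re


def parse_brackets(text):
    """Turn a labelled bracketing into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read_node()


def subtrees(node):
    """Yield the node itself and every labelled node below it."""
    yield node
    for child in node[1]:
        if not isinstance(child, str):
            yield from subtrees(child)


def dominates(node, label):
    """True if some proper descendant of `node` carries `label`."""
    return any(
        sub[0] == label
        for child in node[1]
        if not isinstance(child, str)
        for sub in subtrees(child)
    )


def find(tree, label, dominating=None):
    """Return nodes labelled `label`, optionally only those dominating `dominating`."""
    return [
        node
        for node in subtrees(tree)
        if node[0] == label and (dominating is None or dominates(node, dominating))
    ]


tree = parse_brackets("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))")
print(len(find(tree, "NP")))                   # 2
print(len(find(tree, "VP", dominating="NP")))  # 1
print(len(find(tree, "NP", dominating="VBZ"))) # 0
```

Because the condition is expressed over node labels, a query like this depends directly on the annotation scheme of the corpus: the same search would need different labels, or a different notion of dominance, for a dependency treebank.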