How did we extract the LexIt distributional profiles?

LexIt profiles were extracted automatically from corpora using computational linguistics methods and tools. In the first phase, the corpora were annotated with incremental levels of linguistic analysis: lemmatization, part-of-speech tagging, and syntactic dependency parsing. Automatic annotation was performed with TANL, a pipeline of stochastic tools for Italian Natural Language Processing (NLP). In the second phase, the distributional profiles were extracted from the annotated corpora with Perl scripts. Distributional profile extraction is described in this article: Alessandro Lenci (in press), "Carving Verb Classes from Corpora", in Raffaele Simone and Francesca Masini (eds.), Word Classes, Amsterdam/Philadelphia: John Benjamins.
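The two-phase process above can be sketched in miniature. The following Python example (the actual LexIt extraction used Perl scripts, and the triple format shown here is a hypothetical simplification of a parser's output) counts how often each syntactic frame co-occurs with a verb lemma in dependency-parsed sentences:

```python
from collections import Counter

# Hypothetical dependency triples (head lemma, relation, dependent lemma),
# standing in for the output of a parser such as TANL after lemmatization
# and part-of-speech tagging.
parsed_sentences = [
    [("leggere", "subj", "ragazzo"), ("leggere", "obj", "libro")],
    [("leggere", "subj", "Maria"), ("leggere", "obj", "giornale")],
    [("leggere", "subj", "Luca")],
]

def frame_profile(sentences, verb):
    """Count how often each syntactic frame (the sorted set of dependency
    relations headed by the verb) occurs with the given verb lemma."""
    counts = Counter()
    for deps in sentences:
        rels = sorted(rel for head, rel, dep in deps if head == verb)
        if rels:
            counts["-".join(rels)] += 1
    return counts

profile = frame_profile(parsed_sentences, "leggere")
# e.g. Counter({"obj-subj": 2, "subj": 1})
```

Real profiles are of course richer (lexical fillers of each slot, association scores, and so on), but the basic idea is the same: aggregate counts over automatically annotated corpus sentences.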

From which corpora did we extract the LexIt distributional profiles?

Currently, in LexIt you can explore the distributional profiles extracted from two corpora:

The LexIt distributional profiles contain errors. Why?

The errors are due to the fact that linguistic analysis and distributional profile extraction were performed automatically, without any manual review. Most of the errors in LexIt stem from mistakes in lemmatization, part-of-speech tagging, or syntactic parsing. The latest generation of NLP tools, such as those used to annotate the LexIt corpora, is based on statistical techniques and machine learning algorithms, which have brought significant improvements in the accuracy of linguistic analysis. Despite this, current tools are still far from error-free, especially at the syntactic level. For example, even identifying the subject and the object of a verb in Italian is a very difficult task, in which automatic analyzers make many mistakes. Some errors were filtered out with statistical analysis; others have inevitably remained. When using resources such as LexIt, we must therefore be aware of the limits of the state of the art in computational linguistics, and be willing to tolerate a certain amount of "noise" in the data. However, we believe that the LexIt distributional profiles constitute a valuable resource for the study of Italian, and that their usefulness more than compensates for the errors that will be found.
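One common way to filter noise statistically is to drop entries whose frequency falls below a threshold. The sketch below is purely illustrative; the specific statistical filters applied in LexIt are not described in this FAQ, and the threshold values here are assumptions:

```python
from collections import Counter

def filter_profile(counts, min_count=2, min_prop=0.05):
    """Drop low-frequency entries that are likely parsing noise.
    The thresholds (min_count, min_prop) are illustrative, not the
    actual values used in LexIt."""
    total = sum(counts.values())
    return Counter({frame: n for frame, n in counts.items()
                    if n >= min_count and n / total >= min_prop})

noisy = Counter({"subj-obj": 120, "subj": 40, "subj-obj-obj": 1})
clean = filter_profile(noisy)
# The implausible double-object frame, seen only once and most likely a
# parse error, is filtered out; the well-attested frames survive.
```

Such frequency-based filtering removes many spurious frames, but rare genuine constructions can be lost too, which is one reason some noise inevitably remains.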

What types of lexical items are profiled in LexIt?

Currently LexIt contains distributional profiles of Italian verbs and nouns. The extraction of distributional profiles of adjectives is ongoing.

Do LexIt syntactic frames distinguish arguments from adjuncts?

No, this distinction is not represented in LexIt, as in many computational resources of this kind. There are two main reasons for this choice. The first is purely practical and reflects the current limits of parsing tools, which do not distinguish between arguments and adjuncts. The second lies in the inherent difficulty of finding reliable criteria for this distinction, which is still an open and widely debated issue in theoretical linguistics. From this perspective, the LexIt distributional profiles represent a starting point rather than an end point for the study of various aspects of argument structure, including the argument vs. adjunct opposition.

How was LexIt funded?

LexIt was entirely self-financed. It is the result of the voluntary work of members of the Laboratory of Computational Linguistics at the Department of Linguistics of the University of Pisa, including undergraduate and graduate students.

Can I contribute to the development of LexIt?

Any contribution is welcome. For suggestions and proposals for collaboration, please contact the director of the LexIt project, Alessandro Lenci.

What are the development plans for LexIt?

LexIt is an open project and is in constant evolution. Here are some current lines of development: