LexIt profiles were automatically extracted from corpora using computational linguistics methods and tools. In a first phase, the corpora were annotated with incremental levels of linguistic analysis: lemmatization, part-of-speech tagging, and syntactic dependency parsing. Automatic annotation was performed with TANL, a pipeline of stochastic tools for Italian Natural Language Processing (NLP). In a second phase, the distributional profiles were extracted from the annotated corpora with Perl scripts. Distributional profile extraction is described in this article: Alessandro Lenci (in press), "Carving Verb Classes from Corpora", in Raffaele Simone and Francesca Masini (eds.), Word Classes, Amsterdam/Philadelphia: John Benjamins.
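The extraction step described above can be sketched as follows. This is a minimal illustration, not the actual LexIt Perl code: it assumes CoNLL-like token lines (ID, FORM, LEMMA, POS, HEAD, DEPREL) and uses illustrative relation labels ("subj", "obj"), which are not TANL's actual tagset.

```python
from collections import Counter, defaultdict

# Tiny CoNLL-like sample (columns: ID, FORM, LEMMA, POS, HEAD, DEPREL).
# Column layout and relation labels are illustrative assumptions.
SAMPLE = """\
1\tIl\til\tDET\t2\tdet
2\tgatto\tgatto\tNOUN\t3\tsubj
3\tmangia\tmangiare\tVERB\t0\troot
4\til\til\tDET\t5\tdet
5\ttopo\ttopo\tNOUN\t3\tobj

1\tGianni\tGianni\tNOUN\t2\tsubj
2\tdorme\tdormire\tVERB\t0\troot
"""

def extract_frames(conll_text):
    """Count syntactic frames (sorted dependent relations) per verb lemma."""
    profiles = defaultdict(Counter)
    for sent in conll_text.strip().split("\n\n"):
        rows = [line.split("\t") for line in sent.splitlines()]
        for tid, form, lemma, pos, head, rel in rows:
            if pos != "VERB":
                continue
            # Collect the dependency relations of this verb's dependents,
            # ignoring determiners, and join them into a frame label.
            deps = sorted(r[5] for r in rows if r[4] == tid and r[5] != "det")
            frame = "+".join(deps) if deps else "intrans"
            profiles[lemma][frame] += 1
    return profiles

profiles = extract_frames(SAMPLE)
print(dict(profiles["mangiare"]))  # {'obj+subj': 1}
print(dict(profiles["dormire"]))   # {'subj': 1}
```

Aggregating such frame counts over a large corpus, and then normalizing them into frequencies, is the general idea behind a distributional profile.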
The errors arise because both linguistic analysis and distributional profile extraction were performed automatically, without any manual review. Most of the errors in LexIt are indeed due to mistakes in lemmatization, part-of-speech tagging, or syntactic parsing. Latest-generation NLP tools, such as those used to annotate the LexIt corpora, are based on statistical techniques and machine learning algorithms, which have brought significant improvements in the accuracy of linguistic analysis. Despite this, current tools are still far from error-free, especially at the syntactic level. For example, even identifying the subject and the object of a verb in Italian is a very difficult task, on which automatic analyzers make many mistakes. Some errors were filtered out through statistical analysis; others have inevitably remained. When using resources such as LexIt, we must therefore be aware of the limits of the state of the art in computational linguistics, and we must be willing to tolerate a certain amount of "noise" in the data. However, we believe that the LexIt distributional profiles constitute a valuable resource for the study of Italian, and that their usefulness can compensate for the errors that will be found.
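One common form of statistical filtering is a relative-frequency threshold: very rare frames attested for a verb are more likely to be parser noise than genuine argument structures. The sketch below is a hedged illustration of this idea; the counts and the threshold value are hypothetical, not LexIt's actual filtering procedure.

```python
from collections import Counter

# Hypothetical frame counts for one verb; the numbers are illustrative.
frame_counts = Counter({"subj+obj": 940, "subj": 820, "subj+obj+obj": 3})

def filter_noise(counts, min_rel_freq=0.01):
    """Drop frames whose relative frequency falls below a threshold,
    a simple proxy for discarding parser noise."""
    total = sum(counts.values())
    return {f: c for f, c in counts.items() if c / total >= min_rel_freq}

print(filter_noise(frame_counts))  # the rare 'subj+obj+obj' frame is dropped
```

More sophisticated filters (e.g., association measures comparing observed and expected frequencies) follow the same logic: keep the patterns the statistics support, discard the rest.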
LexIt currently contains distributional profiles of Italian verbs and nouns. The extraction of distributional profiles of adjectives is ongoing.
No, this distinction is not represented in LexIt, as in many computational resources of this kind. There are two main reasons for this choice. The first is purely practical and depends on the current limits of parsing tools, which do not distinguish between arguments and adjuncts. The second lies in the inherent difficulty of finding reliable criteria for this distinction, which is still an open and widely debated issue in theoretical linguistics. From this perspective, the LexIt distributional profiles represent a starting point rather than an end point for the study of various aspects of argument structure, including the argument vs. adjunct opposition.
LexIt was entirely self-financed. It is the result of the voluntary work of members of the Laboratory of Computational Linguistics at the Department of Linguistics of the University of Pisa, including undergraduate and graduate students.
Any contribution is welcome. For suggestions and proposals for collaboration, please contact the director of the LexIt project, Alessandro Lenci.
LexIt is an open project and is constantly evolving. Here are some current lines of development: