Italian Dictionary for Full-Text Search

Author: Daniele Varrazzo
Contact: piro (at) develer.com
Organization: Develer S.r.l.
Date: 2008-03-10
Version: 1.1
Copyright: 2001, 2002 Gianluca Turconi
Copyright: 2002, 2003, 2004 Gianluca Turconi and Davide Prina
Copyright: 2004, 2005, 2006 Davide Prina
Copyright: 2007, 2008 Daniele Varrazzo

Abstract

This package provides a dictionary and the other files required to perform full-text search on Italian documents using the PostgreSQL database.

With the provided dictionary, searches on Italian documents can take into account the morphological variations of Italian words, such as verb conjugations.

Contents

  • Spelling Dictionary Information
    • Presentation at PGDay
  • Download and installation
    • PostgreSQL 8.3
    • PostgreSQL 8.2 and older versions
  • License
  • Acknowledgments

Spelling Dictionary Information

This dictionary has been generated from the MySpell OpenOffice.org dictionary provided by the progetto linguistico.

The dictionary had to undergo a huge number of transformations and is now barely recognizable compared with the original. Above all, every verbal form, including those of irregular verbs, is now reduced to the infinitive. Furthermore, for each verb, constructions with pronominal and reflexive particles are recognized on the gerund, the present and past participles, the imperative, and the infinitive.

Great care has also been taken to reduce the different forms of adjectives (masculine and feminine, singular and plural, superlatives) to a single normal form, and to unify the masculine and feminine forms of nouns (e.g. ricercatore and ricercatrice, the masculine and feminine forms of "researcher").

Furthermore, in the original dictionary, many unrelated masculine and feminine nouns were joined together as if they were a single adjective (e.g. caso/casi + casa/case, with the unrelated meanings of "case(s)" and "house(s)"). Most of these false friends have been split apart to avoid false positives in search results, but some of them may still be lying around in the dictionary (this is a kind of error that no Python script can help fix...).
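
To make the false-positive problem concrete, here is a minimal Python sketch. The two mappings are hypothetical toy data written for this illustration, not entries from the dictionary, and the matching function is a simplification of what a full-text search does.

```python
# Hypothetical toy mappings from a surface form to its root.
# MERGED mimics an entry where caso/casi and casa/case share one root;
# SPLIT mimics the same words after they have been separated.
MERGED = {"caso": "caso", "casi": "caso", "casa": "caso", "case": "caso"}
SPLIT = {"caso": "caso", "casi": "caso", "casa": "casa", "case": "casa"}


def shares_root(query, text, lexicon):
    """Return True if any word in `text` has the same root as `query`."""
    query_root = lexicon.get(query, query)
    return any(lexicon.get(word, word) == query_root
               for word in text.lower().split())


text = "un caso difficile"
print(shares_root("casa", text, MERGED))  # merged entry: unrelated match
print(shares_root("casa", text, SPLIT))   # split entries: no match
```

With the merged entry, a search for "casa" (house) matches a document that only mentions "caso" (case); once the entries are split, it no longer does.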

Some statistics about the current dictionary edition:

  • 66,929 distinct roots
  • 7,300 completely conjugated verbs
  • 1,943,826 distinct recognized terms
  • 62 flags in the affix file
  • 10,365 production rules in the affix file
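
As a rough illustration of how roots, flags, and production rules fit together, the following Python sketch applies ispell-style suffix rules in reverse to map an inflected form back to its root. The rules, flags, and roots below are hypothetical toy data, not entries from the actual affix file, and the matching performed by PostgreSQL is more elaborate than this.

```python
import re
from typing import NamedTuple


class SuffixRule(NamedTuple):
    flag: str       # a root must carry this flag for the rule to apply
    strip: str      # ending removed from the root to build the form
    add: str        # ending appended to the root to build the form
    condition: str  # regular expression the root must match


# Hypothetical toy rules and roots, for illustration only.
RULES = [
    SuffixRule("A", "are", "o", r"are$"),       # parlare -> parlo
    SuffixRule("A", "are", "ando", r"are$"),    # parlare -> parlando
    SuffixRule("A", "are", "andosi", r"are$"),  # parlare -> parlandosi
    SuffixRule("B", "ore", "rice", r"tore$"),   # ricercatore -> ricercatrice
    SuffixRule("B", "e", "i", r"tore$"),        # ricercatore -> ricercatori
]
ROOTS = {"parlare": {"A"}, "ricercatore": {"B"}}


def normalize(word):
    """Return the sorted list of known roots that can produce `word`."""
    word = word.lower()
    found = set()
    if word in ROOTS:
        found.add(word)
    for rule in RULES:
        if not word.endswith(rule.add):
            continue
        # Undo the rule: drop the added ending, restore the stripped one.
        candidate = word[:len(word) - len(rule.add)] + rule.strip
        if (rule.flag in ROOTS.get(candidate, ())
                and re.search(rule.condition, candidate)):
            found.add(candidate)
    return sorted(found)


for form in ("parlando", "parlandosi", "ricercatrice", "casa"):
    print(form, normalize(form))
```

A form is accepted only when undoing a rule yields a root that exists and carries the rule's flag, so a word with no matching root, such as "casa" in this toy data, yields an empty list.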

Presentation at PGDay

The dictionary was presented at PGDay 2007, the first Italian PostgreSQL conference. The slideshow is available for download.

Download and installation

PostgreSQL 8.3

  • italian-fts-1.1.tar.gz

This package does not include a stemming dictionary, because one is already included in the PostgreSQL installation. The package can be used with database clusters in any encoding.

Please refer to the README.italian_fts file for installation details.

PostgreSQL 8.2 and older versions

The package is available in two encodings:

  • UTF8 encoding
  • LATIN1 encoding

Please install only the version matching your cluster locale (run psql -tc "SHOW LC_CTYPE" postgres to find out your cluster locale).

Please refer to the README.italian_fts_utf8 or README.italian_fts_latin1 file for installation details.

License

The Italian Dictionary for Full-Text Search is distributed under the GPL license.

Acknowledgments

I wish to thank Davide Prina and Gianluca Turconi, because without their progetto linguistico I would not have had anything to work with.

I also heartily thank Oleg Bartunov and Teodor Sigaev, the Tsearch2 authors.

And many thanks to Develer, one of the finest assemblies of hackers in Italy!