Academic Paper


Title: An information-theoretic, vector-space-model approach to cross-language information retrieval
Author: Peter A. Chew
Homepage: http://www.dissertation.com/library/1121784a.htm
Institution: Sandia National Laboratories
Author: Brett W. Bader
Institution: Sandia National Laboratories
Author: Stephen Helmreich
Institution: New Mexico State University
Author: Ahmed Abdelali
Institution: New Mexico State University
Author: Stephen J. Verzi
Institution: Sandia National Laboratories
Linguistic Field: Computational Linguistics
Abstract: In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a "standard" approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
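
To make the idea of information-theoretic term weighting in a vector-space model concrete, here is a minimal sketch. It is not the authors' method: the abstract does not say which weighting scheme they use, so this example swaps in log-entropy weighting, a commonly used entropy-based scheme, purely as an illustration. The toy counts and the names log_entropy_weight, columns, and cosine are hypothetical.

```python
import math

def log_entropy_weight(counts):
    """Weight a term-by-document count matrix (one row per term).

    Assumption: log-entropy weighting is used here only as an
    illustration; the abstract does not name a specific scheme.
    Requires at least two documents.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        total = sum(row)
        # Sum of p * log(p) over the documents the term occurs in,
        # where p is the share of the term's occurrences in each one.
        plogp = sum((c / total) * math.log(c / total) for c in row if c > 0)
        # Global weight: 1 for a term confined to one document,
        # 0 for a term spread evenly over all documents.
        global_w = 1.0 + plogp / math.log(n_docs)
        weighted.append([math.log1p(c) * global_w for c in row])
    return weighted

def columns(matrix):
    """Return the document vectors (columns) of a term-by-document matrix."""
    return [list(col) for col in zip(*matrix)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy data. Rows: cat, dog, pet, stock, market, the.
# Columns: two documents about pets, then two about finance.
counts = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [3, 3, 3, 3],
]

raw_docs = columns(counts)
weighted_docs = columns(log_entropy_weight(counts))

print("raw, pets vs finance:     ", round(cosine(raw_docs[0], raw_docs[2]), 3))
print("weighted, pets vs finance:", round(cosine(weighted_docs[0], weighted_docs[2]), 3))
print("weighted, pets vs pets:   ", round(cosine(weighted_docs[0], weighted_docs[1]), 3))
```

In this toy setup, the evenly distributed term ("the") carries no information about which document it came from, so its weight drops to zero and the two topics stop looking similar merely because they share it. A Latent Semantic Analysis baseline of the kind the abstract mentions would additionally apply a truncated singular value decomposition to the weighted matrix; that step is left out here.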

This article appears in Natural Language Engineering Vol. 17, Issue 1.