The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs, and Electronics. Each domain has several thousand reviews, but the exact number varies by domain. - anand
This paper describes a technique for making personalized recommendations from any type of database to a user based on similarities between the interest profile of that user and those of other users. - anand
Collaborative filters help people make choices based on the opinions of other people. GroupLens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. News reader clients display predicted scores and make it easy for users to rate articles after they read them. Rating servers, called Better Bit Bureaus, gather and disseminate the ratings. The rating servers predict scores based on the heuristic that people who agreed in the past will probably agree again. Users can protect their privacy by entering ratings under pseudonyms, without reducing the effectiveness of the score prediction. The entire architecture is open: alternative software for news clients and Better Bit Bureaus can be developed independently and can interoperate with the components we have developed. - anand
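The heuristic that people who agreed in the past will probably agree again can be sketched as a correlation-weighted prediction. This is a minimal illustration only: it assumes Pearson correlation over co-rated articles and a mean-offset weighted average, and the function names and toy ratings are hypothetical rather than taken from the GroupLens software.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over the articles both users rated (0.0 if undefined)."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, user, item):
    """Predict a score as the user's mean plus correlation-weighted offsets."""
    mine = ratings[user]
    my_mean = sum(mine.values()) / len(mine)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = pearson(mine, theirs)
        their_mean = sum(theirs.values()) / len(theirs)
        num += weight * (theirs[item] - their_mean)
        den += abs(weight)
    # Fall back to the user's own mean when no correlated rater exists.
    return my_mean + num / den if den else my_mean

# Hypothetical ratings on a 1-5 scale, keyed by pseudonym.
ratings = {
    "ann": {"a1": 5, "a2": 1, "a3": 4},
    "bob": {"a1": 4, "a2": 2, "a3": 5, "a4": 5},
    "cat": {"a1": 1, "a2": 5, "a4": 1},
}
score = predict(ratings, "ann", "a4")
```

Here "bob" tends to agree with "ann" and liked a4, while "cat" tends to disagree and disliked it, so both push the prediction for "ann" above her own average.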
This short paper reports on work in progress related to applying data partitioning/clustering algorithms to ratings data in collaborative filtering. - anand
Human language technology experts Franz Josef Och and Mike Cohen discuss their exciting research in machine translation and speech technology with Alfred Spector. - anand
We present a tutorial introduction to n-gram models for language modeling and survey the most widely-used smoothing algorithms for such models. We then present an extensive empirical comparison of several of these smoothing techniques. - anand
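As a rough illustration of what smoothing does for an n-gram model, here is a minimal sketch of a bigram model with additive (add-delta) smoothing, the simplest of the smoothing families. The function names, the sentence markers `<s>` and `</s>`, and the toy corpus are assumptions made for this example and are not drawn from the paper.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count bigrams, their left contexts, and the vocabulary."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])           # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def prob(word, prev, contexts, bigrams, vocab, delta=1.0):
    """P(word | prev) with add-delta smoothing over the vocabulary."""
    return (bigrams[(prev, word)] + delta) / (contexts[prev] + delta * len(vocab))

contexts, bigrams, vocab = train_bigrams(["the cat sat", "the dog sat"])
p_seen = prob("cat", "the", contexts, bigrams, vocab)    # bigram seen once
p_unseen = prob("the", "the", contexts, bigrams, vocab)  # bigram never seen
```

The unseen bigram receives a small non-zero probability instead of zero, at the cost of taking mass away from the observed bigrams.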
This paper describes one possible way to solve the “Who rated what?” task of the KDD Cup 2007. The proposed solution is a history-based model that predicts whether a user will rate a given movie. The key points of our approach are (1) the estimation of the model baseline, (2) the definition of the explanatory variables and (3) the mathematical model form. - anand
The Oxford English Corpus gives us the fullest, most accurate picture of the language today. It represents all types of English, from literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of chatrooms, emails, and weblogs. And, as English is a global language, used by an estimated one third of the world's population, the Oxford English Corpus contains language from all parts of the world - not only from the UK and the United States but also from Australia, the Caribbean, Canada, India, Singapore, and South Africa. It is the largest English corpus of its type: the most representative slice of the English language available. - anand
The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development. - anand
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. - anand
This year's tasks employ the Netflix Prize training data set. This data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles. The data were collected between October, 1998 and December, 2005 and reflect the distribution of all ratings received by Netflix during this period. The ratings are on a scale from 1 to 5 (integral) stars. (See below for details on downloading this data set.) - anand
Researchers at Carnegie Mellon University's Robotics Institute are working with colleagues at Caterpillar Inc. to develop autonomous versions of large haul trucks used in mining operations. - anand
When you go to college for linguistics, they teach you all kinds of stuff about grammar, about writing down in a formal way the rules that describe how a given language works. It all seems very scientific and complete, until you try to make a completely logical system, a computer, use it to try to act in some way like an actual speaker of the language you’ve been describing. I’ve done this two or three times now — in phonology, in my earlier academic incarnation, I’ve tried to emulate Turkish and Malagasy using a formal computational model of the phonological paradigm called Optimality Theory, and in syntax, at Cognition, the rest of the NLP team and I have been working on a parser for English. In every case, you discover two things: actual languages are very messy and complicated, and computers are smarter than you are. - anand
This year's competition is about classifying internet user search queries. The task was specifically designed to draw participation from industry, academia, and students. - anand
An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. - anand
This paper describes RESOLVE, a system that uses decision trees to learn how to classify coreferent phrases in the domain of business joint ventures. An experiment is presented in which the performance of RESOLVE is compared to the performance of a manually engineered set of rules for the same task. The results show that decision trees achieve higher performance than the rules in two of three evaluation metrics developed for the coreference task. In addition to achieving better performance than the rules, RESOLVE provides a framework that facilitates the exploration of the types of knowledge that are useful for solving the coreference problem. - anand
A tokeniser is a piece of software that splits a text into its component elements. These are typically individual words, but also punctuation marks and other symbols which are not normally considered to be words. The collective term for these elements that make up a text is tokens. So, the tokeniser takes as input a text, and splits it into its tokens. This is usually done by inserting separators, either blank spaces or linebreaks, so that subsequent programs (like a parts-of-speech tagger) can easily read in the tokens and process them further. - anand
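The behaviour described above can be sketched with a single regular expression. This is a minimal illustration, assuming that a token is either a run of word characters (optionally with one internal apostrophe) or a single punctuation mark; the names `TOKEN_RE`, `tokenise` and `one_per_line` are hypothetical and this is not the tokeniser the description refers to.

```python
import re

# A word (optionally with an apostrophe part, as in "don't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

def one_per_line(text):
    """Emit tokens separated by linebreaks for a downstream tagger."""
    return "\n".join(tokenise(text))

tokens = tokenise("Don't panic, it's fine!")
```

Real tokenisers need many more rules (abbreviations, numbers, URLs, hyphenation), so this only shows the basic split-and-separate idea.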
QTAG is a probabilistic parts-of-speech tagger. That means it's a program that reads text and for each token in the text returns the part-of-speech (e.g. noun, verb, punctuation, etc). It works using statistical methods, hence the "probabilistic". As a result it does make mistakes (as does every POS tagger), but it is fairly robust and (from informal evaluation) tags texts with good accuracy. - anand
We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging. - anand
We identify and validate from a large corpus constraints from conjunctions on the positive or negative semantic orientation of the conjoined adjectives. A log-linear regression model uses these constraints to predict whether conjoined adjectives are of same or different orientations, achieving 82% accuracy in this task when each conjunction is considered independently. Combining the constraints across many adjectives, a clustering algorithm separates the adjectives into groups of different orientations, and finally, adjectives are labeled positive or negative. Evaluations on real data and simulation experiments indicate high levels of performance: classification precision is more than 90% for adjectives that occur in a modest number of conjunctions in the corpus. - anand
The evaluative character of a word is called its semantic orientation. A positive semantic orientation implies desirability (e.g., "honest", "intrepid") and a negative semantic orientation implies undesirability (e.g., "disturbing", "superfluous"). This paper introduces a simple algorithm for unsupervised learning of semantic orientation from extremely large corpora. - anand
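The abstract does not spell out the algorithm, so the following is only a sketch of one way an orientation score could be estimated from co-occurrence counts: as the difference between a word's pointwise mutual information with a positive reference word and with a negative one. The function name, the smoothing constant and the counts in the example are all assumptions made for illustration.

```python
from math import log2

def semantic_orientation(near_pos, near_neg, pos_total, neg_total, smooth=0.01):
    """PMI(word, positive) - PMI(word, negative), written as a single log ratio.

    near_pos / near_neg: how often the word co-occurs with the positive /
    negative reference word; pos_total / neg_total: how often each reference
    word occurs overall. `smooth` avoids division by zero for unseen pairs.
    """
    return log2(((near_pos + smooth) * neg_total) /
                ((near_neg + smooth) * pos_total))

# Hypothetical counts: the word appears near the positive reference
# four times as often as near the negative one.
so_positive = semantic_orientation(80, 20, 1000, 1000)
so_neutral = semantic_orientation(50, 50, 1000, 1000)
```

A score above zero suggests a positive orientation, below zero a negative one, and the magnitude gives a rough sense of strength.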
This paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. - anand
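The decision rule in this abstract, averaging the semantic orientation of a review's phrases, can be sketched as follows. The sketch assumes the phrases have already been extracted and scored; the function name, the zero threshold and the orientation values are hypothetical.

```python
def classify_review(phrases, orientation):
    """Label a review by the average orientation of its scored phrases."""
    scores = [orientation[p] for p in phrases if p in orientation]
    average = sum(scores) / len(scores) if scores else 0.0
    label = "recommended" if average > 0 else "not recommended"
    return label, average

# Hypothetical orientation scores for phrases containing adjectives or adverbs.
orientation = {
    "very handy": 1.2,
    "low fees": 0.3,
    "unethical practices": -2.5,
}
label, average = classify_review(["very handy", "low fees"], orientation)
```

Phrases without a score are skipped, and a review with no scored phrases defaults to an average of 0.0, which this sketch treats as not recommended.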