Reviews and Natural Language Processing: Clustering

[The first of a multi-part series on Natural Language Processing, this blog post discusses clustering based on document similarity. Future posts will present sentiment analysis and statistically significant terms.]

"The Vacation Rental market is large but fragmented, with hundreds of thousands of suppliers (homeowners). It has few brands and no commonly accepted service levels or rating standards."

Douglas Quinby, PhoCusWright

This quote initiated a Natural Language investigation into the HomeAway Review corpus: do the Traveler reviews (of properties) adhere to some set of standards? Reviews contain text and a "star" rating; does the text align with the rating? Analyzing its various corpora with Natural Language Processing tools allows HomeAway to better listen to - and better serve - its customers.

Step One: Get the Reviews

The first order of business: get a corpus of reviews from somewhere, with text and associated star rating. A quick note to the DBA group, and a set of a few thousand reviews showed up in the email inbox, ready for analysis.

Step Two: Fun with Clusters

An individual review may share common words and concepts with others in the corpus - multiple Travelers could write similar things about different properties. For example, "great place for families" or "the saltwater pool has been awesome". A TF*IDF ("term frequency-inverse document frequency") distance metric can form the basis of a review clustering, to determine whether like-rated reviews contain similar text.

Alias-i.com provides a Natural Language Processing toolkit called "LingPipe" which performs many of the functions needed to analyze the reviews, including clustering. A LingPipe Developer license, combined with the Protovis JavaScript library for visualization, provided an easy means to tease clusters out of the reviews. The following listing presents the main driver.

[codesyntax lang="java"]

// randomize the list of all reviews, but use a known seed to
// recreate the study if needed
Collections.shuffle(allReviews, new Random(12));

int sampleSize = 1000;
Set<Review> inputSet = new HashSet<Review>(sampleSize); // just a subset
for (int i = 0; i < sampleSize; i++) {
    inputSet.add(allReviews.get(i));
}

// prepare for magic
Clusterer cl = new Clusterer(inputSet, tokenizerFactory());

// here be magic -> cluster the reviews in the input set based
// on their distance from each other (defined in the tokenizer)
Tree clusters = cl.buildTree();

// just for display
outputJavaScript(clusters, "reviews.js");

[/codesyntax]

The tokenizerFactory method creates the rules by which to extract tokens out of each distinct review. It looks like this (APIs are LingPipe):

[codesyntax lang="java"]

public static TokenizerFactory tokenizerFactory() {
    TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
    factory = new WhitespaceNormTokenizerFactory(factory);
    factory = new LowerCaseTokenizerFactory(factory);
    factory = new EnglishStopTokenizerFactory(factory);
    return factory;
}

[/codesyntax]

A Tokenizer provides streams of tokens, whitespace and positions to some consumer. A TokenizerFactory constructs the Tokenizers. Here, the software configures the factory to build a tokenizer that (a) handles alphanumerics and other common constructs in Indo-European languages, (b) converts sequences of whitespace into a single space, (c) lower cases everything, and (d) removes common English stop words. All this ensures that the token streams analyzed by the Clusterer conform to some common rules.
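To see what the factory actually produces, here is a small, hypothetical usage sketch (the review text is invented for illustration; Tokenizer and TokenizerFactory come from LingPipe's com.aliasi.tokenizer package):

[codesyntax lang="java"]

// illustration only: tokenize a single made-up review with the chained factory
String review = "The saltwater pool has been AWESOME for our two children!";
char[] cs = review.toCharArray();
Tokenizer tokenizer = tokenizerFactory().tokenizer(cs, 0, cs.length);
String token;
while ((token = tokenizer.nextToken()) != null) {
    // emits lower-cased tokens with whitespace normalized and
    // common English stop words ("the", "has", "been", ...) removed
    System.out.println(token);
}

[/codesyntax]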

The Clusterer's constructor builds a mechanism to calculate some distance between two reviews - in this case, a TF*IDF distance. To calculate a simplified distance between two reviews, first model the reviews as vectors in the "term vector space", an n-dimensional space consisting of many of the terms in the set of all reviews. Figure 1 shows two such vectors in a "Cats-Dogs" space.

The blue vector represents a review that mentions "cats" $latex c_1$ times and "dogs" $latex d_1$ times. The red vector represents a different review. The dot-product of the vectors divided by the product of their magnitudes computes the cosine of the angle between them - a similarity measure (one minus this value serves as a distance):

$latex \displaystyle \cos(\Theta) = \frac{\mathbf{v_1} \bullet\mathbf{v_2}}{|\mathbf{v_1}| \times |\mathbf{v_2}|}$
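For a concrete, made-up example: if $latex \mathbf{v_1} = (2, 1)$ (two mentions of "cats", one of "dogs") and $latex \mathbf{v_2} = (1, 3)$, then

$latex \displaystyle \cos(\Theta) = \frac{2 \cdot 1 + 1 \cdot 3}{\sqrt{2^2 + 1^2} \times \sqrt{1^2 + 3^2}} = \frac{5}{\sqrt{50}} \approx 0.71$

so the two reviews point in fairly similar directions.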

In practice, the vectors exist in an n-dimensional space (where n approaches the number of terms in the entire corpus), making the above calculation a very expensive operation.

A TF*IDF distance weights each coordinate by combining the frequency of a term within a document (term frequency) with the number of different documents (e.g. reviews) in which the term appears (inverse document frequency): terms that show up in many reviews contribute less to the distance than rare, distinctive terms.
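To make that weighting concrete, here is a simplified, stand-alone sketch - not LingPipe's implementation; the class, method names, and weighting scheme are assumptions chosen for illustration - that computes a TF*IDF-weighted cosine similarity between two term-count maps, using $latex \log(N / df(t))$ as the inverse-document-frequency weight:

[codesyntax lang="java"]

import java.util.*;

public class TfIdfCosineSketch {

    // idf(term) = log(totalDocs / docFrequency(term)); docFreqs maps a term
    // to the number of reviews containing it
    static double idf(String term, Map<String, Integer> docFreqs, int totalDocs) {
        Integer df = docFreqs.get(term);
        if (df == null || df == 0) return 0.0;
        return Math.log((double) totalDocs / df);
    }

    // cosine of the angle between two TF*IDF-weighted term vectors;
    // a and b map each term to its count within one review
    static double cosine(Map<String, Integer> a, Map<String, Integer> b,
                         Map<String, Integer> docFreqs, int totalDocs) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        Set<String> terms = new HashSet<String>(a.keySet());
        terms.addAll(b.keySet());
        for (String term : terms) {
            double w = idf(term, docFreqs, totalDocs);
            double wa = (a.containsKey(term) ? a.get(term) : 0) * w;
            double wb = (b.containsKey(term) ? b.get(term) : 0) * w;
            dot += wa * wb;
            normA += wa * wa;
            normB += wb * wb;
        }
        return (normA == 0.0 || normB == 0.0)
                ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

[/codesyntax]

A term that appears in every review earns an idf of zero and drops out of the calculation entirely, so only distinctive vocabulary drives the clustering.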

This listing drills down into a LingPipe API that actually creates the clusters:

[codesyntax lang="java"]

HierarchicalClusterer<Review> clusterer = new CompleteLinkClusterer<Review>(distance);
Dendrogram<Review> dendrogram = clusterer.hierarchicalCluster(inputSet);
Set<Set<Review>> clusters = dendrogram.partitionDistance(partitionDistance);

// convert the Set<Set<Review>> into a Tree data structure

[/codesyntax]

The "outputJavaScript" method transforms the Tree into a Protovis file for display.

In the resulting visualization (Figure 2), each dot represents an individual review, colored by its original star rating from red (1 star) to dark green (5 stars). The reviews are clustered according to their relative distances.

Step Three: Analyze the Clusters

The "7 o'clock" cluster shows many reviews of various ratings. These reviews all contain similar "canned" text defined by the UI: "This property was clean", "The property was not clean", "The owner provided the keys", "We had no problem accessing the property". These Reviews contains similar enough text that they cluster together and semantically serve to demonstrate effective clustering. The "7:30" cluster contains reviews of similar characteristics, but the review solicitation UI did not allow for a star rating so these reviews contain a neutral rating.

Starting at "8 o'clock", the clusters contain "free-form" reviews and demonstrate that similarly rated reviews contain similar text. For example, two reviews taken from the "9 o'clock" position contain the following text:

We Just returned from a great week at your wonderful villa. The location by the pool is really convenient and the short walk to the beach. The villa is very comfortable (especially the beds) and we always have a very relaxing time. We have stayed here numerous times and are never disappointed. Fiddler's Cove is great for families.

and

We have recently stayed here for 2 weeks and have really enjoyed our time. The house is brand new and immaculate, and the saltwater pool has been awesome for our 2 children (ages 1 and 4). The beach is a very short walk from the house- easy with a double stroller and all of our gear. The house is set up really well to accommodate kids or adults and has an outdoor shower, shady areas to hang out in and sunny spots to lay out, as well. The owners live nearby and were very helpful with the house and suggestions for things to do. The location is great for walking to get coffee, going for a run, taking a beach walk or just hanging out with little kiddos or adults. Plus, it's near the best beach areas on the island. We highly recommend it!

Both of these reviews discuss "pools" and emphasize families, but don't have many terms in common. The Clusterer placed these two reviews together because it found each of them similar enough to a third (or fourth, or fifth...) review that it also placed in the same cluster. Nifty! Figure 2 shows that reviews tend to clump together, indicating that some "natural" review language - an implied standard - exists. (Once the clusters move beyond "12 o'clock", the number of reviews per cluster trails off, which indicates that many reviews use text that cannot be clustered with these techniques.)

At about 8:30 and 9:30, two clusters contain many similarly rated reviews (negative and positive, respectively) along with one or two reviews of the opposite polarity (e.g. a green dot surrounded by red and orange dots). Occasionally, a review's rating does not match the ratings of other reviews that use similar text, which echoes the quote that started this investigation. How frequently does that occur in the reviews corpus?

Next step ("Step Four", for those keeping score): Sentiment Analysis, covered in a future post. Stay tuned.