Using R for content analysis of short documents.

Recently I was trying to extract structure from a large corpus of documents. Nearly all the documents were short; many were just notes of one or two lines. Standard approaches to clustering do not work well on text like this. Nonetheless, after doing some research I found a suitable method that I was able to apply to the data using the statistical programming language R.

Clustering short documents presents a challenge. The low word count in each document means that approaches based on a vector space model suffer from high dimensionality and sparsity in the feature vectors: the vocabulary is large, but each document uses only a handful of its terms. Other approaches such as Latent Dirichlet Allocation (LDA) also tend to fare badly when the documents being analysed are short. LDA models word-document co-occurrence, so when documents have few words, sparsity is a problem there too. An example of this is given in the analysis of tweets about the Rugby World Cup that I did a couple of years ago.
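To make the sparsity problem concrete, here is a small base R sketch that builds a document-term matrix for a few made-up short texts and measures the share of zero cells. The example documents are invented for illustration and are not from the corpus discussed here.

```r
# Four made-up short documents (illustrative only).
docs <- c("rugby world cup final",
          "great match today",
          "cup final tickets",
          "world news")

# Whitespace tokenisation; a real analysis would also clean and filter tokens.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per vocabulary term.
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# Share of cells that are zero.
sparsity <- mean(dtm == 0)
print(sparsity)
```

Even with only four documents and nine distinct terms, two thirds of the cells are zero. With a realistic vocabulary of thousands of terms and documents of a few words each, almost every cell is empty.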

I found a paper by Yan, Guo, Lan & Cheng (2013) that outlines a topic modelling approach suitable for short texts, called Biterm Topic Modelling (BTM). BTM models word-word co-occurrences (biterms) across the whole corpus. A biterm is an unordered pair of words that co-occur within a short segment of text. The model is estimated using a Gibbs sampling algorithm. Because topics are learned from the corpus-level biterm distribution (rather than document-level word occurrence as in LDA), BTM can overcome the problem of sparse data in individual documents.
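The idea of a biterm can be sketched in a few lines of base R. This only illustrates how unordered word pairs are pooled across a corpus; it is not how the model itself is estimated. The helper name `extract_biterms` and the toy documents are assumptions made for the example, and each whole document is treated as one short segment.

```r
# Return every unordered word pair (biterm) in one tokenised short text.
# Hypothetical helper for illustration; assumes tokens are already cleaned.
extract_biterms <- function(tokens) {
  if (length(tokens) < 2) {
    return(data.frame(term1 = character(0), term2 = character(0),
                      stringsAsFactors = FALSE))
  }
  pairs <- combn(tokens, 2)
  # Order each pair alphabetically so (a, b) and (b, a) are the same biterm.
  data.frame(term1 = pmin(pairs[1, ], pairs[2, ]),
             term2 = pmax(pairs[1, ], pairs[2, ]),
             stringsAsFactors = FALSE)
}

# Three made-up short documents, already tokenised.
docs <- list(c("rugby", "world", "cup"),
             c("world", "cup", "final"),
             c("cup", "final"))

# Pool the biterms of all documents, then count them at corpus level.
biterms <- do.call(rbind, lapply(docs, extract_biterms))
counts  <- table(paste(biterms$term1, biterms$term2))
print(sort(counts, decreasing = TRUE))
```

Each document contributes very few pairs on its own, but pooled over the corpus the counts for pairs such as "cup world" start to accumulate, and that pooled distribution is what the topic model is fitted to.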

R is often seen as lagging behind Python somewhat when it comes to natural language processing functionality, though on the CRAN task view for NLP I did find the package BTM. This package acts as a wrapper in R for the BTM C++ library created by an author of the original BTM paper cited above. It worked quite well with my data and I was able to extract an interpretable topic model using it.
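A minimal sketch of fitting a model with the package might look like the following, assuming BTM is installed from CRAN. The toy documents, k = 2 and the iteration count are made up for illustration and are not the data or settings from the analysis described here. The package expects a tokenised data frame with a document identifier in the first column and one token per row in the second.

```r
library(BTM)

# Made-up short documents (illustrative only).
docs <- c(d1 = "rugby world cup final",
          d2 = "world cup final tickets",
          d3 = "rugby match tickets",
          d4 = "election results news",
          d5 = "election news today",
          d6 = "results news today")

# One row per token: document id first, token second.
tokens <- strsplit(docs, " ", fixed = TRUE)
x <- data.frame(doc_id = rep(names(tokens), lengths(tokens)),
                token  = unlist(tokens, use.names = FALSE),
                stringsAsFactors = FALSE)

set.seed(123)                        # make the Gibbs sampling run repeatable
model <- BTM(x, k = 2, iter = 200)   # k and iter are arbitrary choices here

# Most probable tokens per topic, and topic probabilities per document.
topic_terms <- terms(model, top_n = 3)
doc_topics  <- predict(model, newdata = x)

print(topic_terms)
print(round(doc_topics, 2))
```

On real data the tokenisation step would normally involve more cleaning (lower-casing, stop word removal, perhaps keeping only nouns and adjectives) before the data frame is passed to `BTM()`.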

One aspect of topic modelling that can be a bit tricky is choosing the appropriate number of topics, k. One can fit models for several values of k and compare the results for interpretability. In the issues list for the BTM package on GitHub there is a suggestion to incorporate measures of topic quality as a guide for analysts. While these have not been implemented in the package yet, code is provided there for functions that calculate an exclusivity measure and a semantic coherence measure for a biterm topic model. One can use these to help choose the most suitable number of topics. This is definitely a useful package for anybody using R to analyse the content of collections of short documents.
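To give a feel for what a semantic coherence measure does, here is a rough base R sketch of a common document co-occurrence formulation (often called UMass coherence). This is not the code from the GitHub issue. The function name `semantic_coherence` and the toy documents are assumptions made for illustration. In practice the top terms for each topic would come from `terms(model, top_n = ...)` for each candidate k.

```r
# Document co-occurrence coherence for one topic.
# top_terms: character vector of the topic's top terms, most probable first.
# docs:      list of character vectors, one vector of tokens per document.
# Assumes every top term occurs in at least one document.
semantic_coherence <- function(top_terms, docs) {
  # present[d, j] is TRUE when document d contains top term j.
  present <- vapply(top_terms,
                    function(term) vapply(docs, function(d) term %in% d,
                                          logical(1)),
                    logical(length(docs)))
  present <- matrix(present, nrow = length(docs))

  score <- 0
  for (m in seq_along(top_terms)[-1]) {
    for (l in seq_len(m - 1)) {
      co_df <- sum(present[, m] & present[, l])  # docs containing both terms
      df_l  <- sum(present[, l])                 # docs containing term l
      score <- score + log((co_df + 1) / df_l)
    }
  }
  score
}

# Made-up tokenised documents.
docs <- list(c("rugby", "world", "cup"),
             c("rugby", "cup"),
             c("world", "news"),
             c("cup", "final"))

coherent   <- semantic_coherence(c("rugby", "cup"), docs)
incoherent <- semantic_coherence(c("rugby", "news"), docs)
print(c(coherent = coherent, incoherent = incoherent))
```

Terms that tend to appear in the same documents give a higher (less negative) score. Averaging such a score over the topics of each fitted model gives one number per value of k to set alongside a manual check of interpretability.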
