When I was asked to do a workshop for the recent whyR mini conference running alongside Career Zoo at Thomond Park, I had to give some thought as to what to present on. I wanted to demo something that showed the power of R while at the same time being easy to use and something that might be at least somewhat interesting and fun for participants!
Eventually I decided to demo the stylo package. This is a really nice package with some great functions for performing stylometric analyses in R. Unlike many R packages, you can run it using a GUI or from the command line, so it is easy to run analyses with it, but it’s a powerful package at the same time.
Not many people are familiar with stylometry, but it is an active and quite interesting area of research. Essentially it’s the statistical study of an author’s style, often used for authorship attribution. Stylometry rests on the assumption that there is a statistically quantifiable set of features consistent across works by the same author but different in works by different authors, a sort of literary fingerprint. Research seems to support this assumption even if there isn’t universal agreement on the best features or method to use: see this paper by Juola (2015) for discussion and a proposed protocol.
The default features used in stylo are most frequent words or character n-grams, although the package allows users to include their own features in models if they wish. A variety of supervised and unsupervised machine learning methods for authorship attribution are supported. The stylo function includes options to perform clustering (hierarchical cluster analysis and consensus trees) and dimensionality reduction (multidimensional scaling, principal components analysis and t-distributed stochastic neighbour embedding). The main classifier function (classify) supports classification using delta, k-nearest neighbours, support vector machines, naive Bayes and nearest shrunken centroids.
One can also export results of analyses in stylo for use in other programs. For example, using the stylo.network function with the consensus tree method will generate spreadsheets of nodes and edges defining a stylometric network that you can then import into a network analysis tool such as Gephi, something I demo’d in the workshop.
Input: stylo accepts plain text, XML and HTML files in a variety of languages. I used plain text UTF-8 files downloaded from www.gutenberg.org for the workshop. There are a couple of options for tokenising documents in English. You can choose whether or not to treat compound words and contracted words as single words. You can also choose whether to preserve upper case letters and whether to remove pronouns.
Features: As already indicated, the default features in stylo are most frequent words. One can set the number of features to any value; 100 provided reasonably good results for me when training a classifier. There is also the option to cull words. For example, setting culling to 20 will remove any of the most frequent words that don’t appear in at least 20% of documents in the corpus.
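To make the culling idea concrete, here is a minimal sketch written in Python purely for illustration. It is not stylo's own code: the function name `most_frequent_words`, the toy corpus, and the order of operations (rank by corpus frequency, cull, then keep the top n) are all assumptions of mine.

```python
from collections import Counter

def most_frequent_words(docs, n_features, culling_percent=0):
    """Return up to n_features words, ranked by corpus-wide frequency.

    docs maps a document name to its list of tokens. A word survives
    culling only if it occurs in at least culling_percent % of documents.
    (Hypothetical helper; the ordering of steps is an assumption.)
    """
    totals = Counter()
    doc_freq = Counter()
    for tokens in docs.values():
        totals.update(tokens)
        doc_freq.update(set(tokens))  # count each word once per document

    # Rank by descending total count, breaking ties alphabetically.
    ranked = sorted(totals, key=lambda w: (-totals[w], w))

    # Integer comparison avoids floating point issues at the threshold.
    kept = [w for w in ranked
            if doc_freq[w] * 100 >= culling_percent * len(docs)]
    return kept[:n_features]

# Toy corpus, invented for this example.
docs = {
    "play_a": "the king and the queen".split(),
    "play_b": "the king and the fool".split(),
    "play_c": "the fool and the fool and the fool".split(),
    "play_d": "the queen and the king".split(),
}

no_culling = most_frequent_words(docs, 3, culling_percent=0)
with_culling = most_frequent_words(docs, 3, culling_percent=75)
print(no_culling)
print(with_culling)
```

In this toy corpus "fool" is frequent overall but concentrated in only two of the four documents, so a 75% culling threshold drops it in favour of a word that is spread more evenly.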
Statistics: Once the list of most frequent words across the whole corpus has been compiled, stylo calculates the relative frequency of each word on the list for each document. A distance measure can then be used to calculate the distances in vector space between documents. Stylo offers standard distance measures such as Euclidean, Manhattan and so on, but also includes delta measures developed specifically for authorship attribution work. It also allows analysts to specify their own measures. If you are interested in learning more about distance measures for authorship attribution, check out this great paper by Evert, Proisl, Jannidis, Reger, Pielstrom, Schoch and Vitt (2017). As outlined already, stylo supports various supervised and unsupervised learning methods and also allows you to export word frequency tables and lists for use with other algorithms if you wish. Stylo also supports sampling, so if you have documents that are very different in length, for example, you may wish to use sampling so that similar-sized portions of text are compared.
Output: Output options include the type of file you want to write graphs to, and options for colours, size and font on graphs. You also get some options for how particular graph types should be displayed, depending on the statistical analyses you are running. In addition to graphs, stylo will also write some files to your working directory, including a list of the most frequent words used in the analysis and their relative frequencies per document. These can be imported into other packages or programs for further analysis.
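As a rough illustration of the statistics described above, the following Python sketch computes relative frequencies and then a cosine delta distance matrix, taking cosine delta to mean the cosine distance between z-scored relative frequencies. It is a sketch under assumptions rather than stylo's implementation: the function names, the use of the sample standard deviation, and the frequency table are mine and made up for the example.

```python
import math
from statistics import mean, stdev

def relative_frequencies(tokens, features):
    """Share of the document's tokens taken up by each feature word."""
    return [tokens.count(word) / len(tokens) for word in features]

def z_scores(table):
    """Standardise each column (feature) across the rows (documents)."""
    columns = list(zip(*table))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample sd: an assumption
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in table
    ]

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / norm

def cosine_delta(table):
    """Pairwise cosine distances between z-scored document rows."""
    z = z_scores(table)
    return [[cosine_distance(a, b) for b in z] for a in z]

# Invented relative frequencies: 3 documents (rows) x 3 words (columns).
table = [
    [0.10, 0.05, 0.02],
    [0.08, 0.06, 0.03],
    [0.02, 0.09, 0.07],
]
distances = cosine_delta(table)
for row in distances:
    print([round(d, 3) for d in row])
```

In this made-up table the first two documents use the three words in similar proportions, so they end up much closer to each other than either is to the third.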
I chose Shakespeare for this workshop, partly because works by him and his contemporaries are easily accessible on Project Gutenberg in a format suitable for analysis, but also because there is a not inconsiderable body of scholarly work that suggests that the historical Shakespeare did not in fact write some or all of the works attributed to him. Various reasons other than style are given for this, among which are that Shakespeare does not appear to have been particularly well educated or travelled, has not left any surviving letters and does not mention any of his works in his will.
One of the people proposed as an alternative author of Shakespeare’s works is Christopher Marlowe. While it’s generally acknowledged that he would have influenced Shakespeare, some have gone a step further and suggested that he is the author of works attributed to Shakespeare. This is despite the fact that Marlowe is supposed to have died in a drunken knife fight in a tavern in 1593, years before most of Shakespeare’s works were written.
In the workshop, I did a cluster analysis on the 7 plays known to have been written by Marlowe and 15 of Shakespeare’s best known plays, using the 100 most frequent words as features, a culling value of 20% and cosine delta as the distance measure. Hierarchical clustering produced the dendrogram below.
It’s apparent from the diagram above that the clustering algorithm finds differences in style between Shakespeare’s plays and Marlowe’s plays, indicating they were not written by the same person. Certainly the style differs between the two sets of work: not a single work of Shakespeare’s is in the same cluster as a work by Marlowe.
The above dendrogram shows the results of a single analysis based on the 100 most frequent words. A more robust version of this analysis combines the results of multiple cluster analyses into a single consensus tree. Running the bootstrap consensus tree analysis in stylo for values of most frequent words from 100 to 500 in increments of 100, with 20% culling, cosine delta as the distance measure and a consensus of 0.5 produces the consensus tree below.
Clustering the works in terms of stylistic similarity using the bootstrap consensus method has resulted in all but one of Marlowe’s plays clustering together, indicating again that there seem to be stylistic differences between Marlowe’s and Shakespeare’s work.
How about analysing a work that we know Shakespeare collaborated on? For this analysis I used the play Henry VIII with the rolling.classify function in stylo. It’s thought that Henry VIII was the result of a collaboration between Shakespeare and a playwright called John Fletcher. The rolling.classify function allows us to check whether parts of a text are more similar to a particular author based on a reference sample of their works. In this case the reference sample was made up of the 15 Shakespeare works used in the analysis above, and 11 of Fletcher’s plays. Since Fletcher seems to have collaborated a lot, I only included plays that he is thought to be the sole author of in the reference set. When using rolling.classify, the text being classified is broken into overlapping segments or slices, and each one is compared to the reference set to see which author’s style it is most similar to.
For this analysis the analyst specifies the size of the slices and the slice overlap. For example, I set a slice size of 1000 words and a slice overlap of 500 words. This means the first segment of Henry VIII analysed would be words 1-1000, the second would be words 501-1500, the third would be words 1001-2000 and so on. I used support vector machines to classify the segments against the reference set, giving the results below.
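The slicing arithmetic can be sketched in a few lines of Python. This only illustrates how slice size and overlap determine the word ranges and is not how stylo itself is written: the function name `slice_bounds` is hypothetical, and dropping a trailing partial slice is my assumption.

```python
def slice_bounds(n_words, slice_size, overlap):
    """Return 1-based, inclusive (first_word, last_word) pairs.

    Consecutive slices share `overlap` words, so each new slice starts
    slice_size - overlap words after the previous one. A trailing slice
    shorter than slice_size is dropped (an assumption for this sketch).
    """
    step = slice_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than slice_size")

    bounds = []
    start = 1
    while start + slice_size - 1 <= n_words:
        bounds.append((start, start + slice_size - 1))
        start += step
    return bounds

# A hypothetical 2500-word text with the settings used in the workshop.
example = slice_bounds(2500, slice_size=1000, overlap=500)
print(example)
```

With these settings every word other than those in the first and last 500 falls into two slices, which is part of why the choice of slice size and overlap can shift where the classifier appears to change its mind.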
So the classifier found that parts of Henry VIII are stylistically more similar to Shakespeare and parts are stylistically more similar to Fletcher. The parts similar to Fletcher are in red and the parts similar to Shakespeare in green, with the thickness of the line indicating how confident the classifier is that this particular portion of the text was written by the indicated author. The results of this analysis are therefore consistent with the hypothesis that Henry VIII was a collaborative work.
One downside to the rolling.classify function I found, at least with these plays, is that the results seem sensitive to the slice size and slice overlap chosen. The exact position of the segments thought to be written by Fletcher changes somewhat depending on the slice size chosen. Also, because the play is relatively short (<10000 words) and mostly written by Shakespeare, if you choose a large slice size the classifier will indicate the whole play was written by him. It would be interesting to see if there is any academic work indicating how to choose the optimal slice size and slice overlap for analysing the style of collaborative works such as this.
There’s a lot more one can do with this package, and I demo’d some other functions at the workshop. One update that would be nice would be for some other classifiers, for example neural networks, to be included in the package, but this is definitely an interesting and useful package to learn.