I was recently asked about running sentiment analysis over various forms of customer feedback. How can we help with scaling this over large sets of data? This is an area of analytics I haven’t done much work in before, so a bit of research was in order – bring on the Amazon Review Data.
I set myself the task of answering the following question:
Like many data science projects, digging into this data for the first time opened up many areas that could be followed up in future work, but in this blog I am going to concentrate on my initial analysis of the 130M+ reviews using Tableau Desktop.
Utilizing Kognitio, available on the AWS Marketplace, we used a Python package called textblob to run sentiment analysis over the full set of 130M+ reviews. There was no need to code our own algorithm: just a simple wrapper for the package to pass data in from Kognitio and results back from Python. Kognitio automatically scales processing based on the available compute resource, and we simply connect Tableau to Kognitio to explore the data. Interested in more technical details? Then check out Mark Marsh’s excellent technical blog.
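The wrapper really can be that small. Below is a hedged sketch of the pattern: Kognitio streams rows to an external script on stdin and reads results back from stdout. The column layout and the `sentiment` scorer here are illustrative stand-ins only (in the real analysis textblob’s `TextBlob(text).sentiment` supplies the polarity and subjectivity); see Mark Marsh’s blog for the actual setup.

```python
import csv
import sys


def sentiment(text):
    """Trivial stand-in scorer so this sketch is self-contained.

    In the real wrapper this is textblob:
        TextBlob(text).sentiment -> (polarity, subjectivity)
    with polarity in [-1, 1] and subjectivity in [0, 1].
    """
    positive = {"great", "excellent", "love", "good"}
    negative = {"poor", "terrible", "hate", "bad"}
    words = text.lower().split()
    hits = [1 for w in words if w in positive] + [-1 for w in words if w in negative]
    polarity = sum(hits) / len(hits) if hits else 0.0
    subjectivity = len(hits) / len(words) if words else 0.0
    return polarity, subjectivity


def main(instream=sys.stdin, outstream=sys.stdout):
    # One CSV row per review in (id, review text);
    # one CSV row per review out (id, polarity, subjectivity).
    reader = csv.reader(instream)
    writer = csv.writer(outstream)
    for review_id, text in reader:
        polarity, subjectivity = sentiment(text)
        writer.writerow([review_id, f"{polarity:.3f}", f"{subjectivity:.3f}"])


if __name__ == "__main__":
    main()
```

Because the scoring function is isolated behind one call, swapping the stand-in for textblob (or any other sentiment library) changes a single line.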
Looking for patterns in the sentiment metrics (produced with textblob) by star rating, there appear to be strong correlations.
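The check behind that chart is simple aggregation: average the polarity within each star rating and look for a monotonic rise from 1 star to 5 stars. A minimal sketch, using a small hand-made sample in place of the 130M+ scored rows that live in Kognitio:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (star_rating, polarity) pairs standing in for the
# scored output of the full review set.
scored = [
    (1, -0.6), (1, -0.4), (2, -0.1), (3, 0.1),
    (4, 0.4), (5, 0.7), (5, 0.8),
]

by_star = defaultdict(list)
for stars, polarity in scored:
    by_star[stars].append(polarity)

# Average polarity per star rating; a strong correlation shows up as
# the means rising steadily with the star rating.
avg_polarity = {stars: mean(vals) for stars, vals in sorted(by_star.items())}
```

In practice this aggregation runs as SQL in Kognitio (a `GROUP BY` over star rating) with Tableau rendering the result.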
This suggests that the Amazon star rating is a good indicator of the sentiment, but the data only includes reviews up to August 2015. Behavior might have changed in the last few years, so let’s look at trends.
In the line chart above I looked at the change in behavior over time. From 2000 (the first year with 1M+ reviews), the average polarity for each star rating appears quite stable until 2011. From there the average polarity values start to diverge, which suggests reviewers are becoming clearer in expressing their opinions.
Differences in subjectivity for lower star ratings are not really prevalent before 2004; after this, 1 star ratings start to become less subjective. This drops further in 2013, when the average subjectivity values for all star ratings diverge significantly. Are we really more likely to “own” positive feedback than we were in previous years, and vice versa with negative feedback? This requires deeper research into changes in the language used in reviews. I am going to park subjectivity here, purely on the basis of blog length, though I may revisit it in a future blog.
It’s certainly easier than ever to post our opinions. Perhaps this increasing clarity reflects that? It occurred to me after my analysis that it would be interesting to look at the length of reviews and how this has changed too. My gut feel is that it is easier for the algorithm to derive sentiment from shorter reviews. Are reviews getting shorter and more succinct? What did I say about initial analysis opening up more questions and areas of interest?
Another possible cause could be an explosion in fake reviews on Amazon, where the reviewer obviously sets out to be clear in their sentiment. There has certainly been quite a lot written about this lately, but I’m not convinced it is driving the divergent polarity seen above. I would have thought fake reviews would be concentrated in 1 and 5 star ratings, but the other star ratings are diverging too.
Looking at the top 10 categories by number of reviews posted, we can see slight differences in behavior by category. Books and Music reviews are generally positive. Is it more difficult for the algorithm to isolate the sentiment? It seems likely, as these reviews may use more florid language. Mobile App reviews are generally more negative, regardless of star rating, when compared to other categories. This category is quite new, as is Digital Ebook Purchases; both show stronger negative polarity.
In short: no. Bringing Year onto the Pages shelf in Tableau allows us to play through the number of reviews and polarity by year (see the video below). On the right we can see the explosion and diversification of reviews as Amazon grows. However, there is a clear divergence of polarity in all categories. By 2014 all categories barring Music had negative polarity in their 1 star rated reviews, and the 5 star reviews were also becoming more positive.
Obviously, for me as a learning exercise, yes. I now have the framework in place for running sentiment analysis at scale (see Mark Marsh’s blog for technical details), and I can swap in my preferred sentiment code and corpus as required.
My client has seen that sentiment analysis can be scaled easily (on demand) using Kognitio on AWS and readily available Python packages. It is possible for them to put the analysis in the hands of end users, who can use their preferred BI tool without exposure to the underlying algorithms or code base.
Well – sort of. The star ratings are a good indicator of the sentiment of the review text; on the face of it, reading the content is not necessary. However, if this were a client project there is definitely further work required:
Finally, there are also a few areas of this sentiment analysis I would like to explore more, because I think they may have applications for future client projects:
Kognitio is great for supporting Tableau data discovery and dashboard development as it is specifically designed for running complex SQL and analytics over large data sets. This means you don’t have to sample or extract data prior to discovery – the super-fast response times from Kognitio mean you can follow your train of thought directly from Tableau.
In this analysis I didn’t have to go back into the database at all; I simply built out parameters, filters and metrics in Tableau as I went.
Note that Kognitio supports access from any JDBC/ODBC connection, so if you prefer a different tool you can connect it directly to Kognitio.