Sentiment Analysis of Amazon Review Data using Kognitio on AWS

This blog post provides the details of how to load the data we created for Sharon Kirkham’s article that describes how she analysed the sentiment in the publicly available Amazon review data set. Mark Chopping has also explored the data from a more general perspective in his LinkedIn post.

For the sentiment analysis we’ll be using the TextBlob python library which provides an easy to use sentiment analysis based on the “bag of words” approach.

To launch a Kognitio on AWS cluster for this exercise, refer to the documentation. We suggest you use an r4.16xlarge EC2 instance for the cluster but any combination of nodes that provides a minimum of 350GB of memory and 32 CPUs will be adequate. Using spot instances will save some money at the (small) risk of losing the instance.

Once the cluster is running we can reconfigure it for higher external script performance by clicking the cog icon on the clusters management page and adding “fixed_pool_size=30” to the “[boot options]” section. Once you’ve done that, click apply and then “Apply Changes” in the dialog box.

You will now need to restart the server; select the “Server / Restart Server” menu option, untick “With image recovery” in the dialog box and click restart.

While the server is restarting we can install the TextBlob python library. You will need to do this for every node in the cluster (you can get the list of nodes from the nodes menu on the cluster management page).

ssh ec2-user@node-ip-address
sudo pip3 install textblob

Once this has been done and the server has restarted, Connect to your Kognitio cluster as the sys user and create a user for this exercise. We add the user to GRP_S3 to enable access to the S3_PARQUET connector and to group GRP_LOCAL_ENVS to allow access to the LOCAL_SHELL script environment.

CREATE USER sentiment PASSWORD "xyzzy" SCHEMA sentiment;

Disconnect from the sys user and connect as the sentiment user.

We are now ready to load the data and we’ll use the S3_PARQUET connector to create an external table to directly reference the Parquet files (unlike Mark Chopping’s post we also want to read the review body text so we’ll do it slightly differently). The external table definition truncates the review body to 5K characters as some reviews are very long and we don’t need that much text to determine the sentiment. We define this transformation at the table level rather than in a view because that gives us more control over what is read from the files.

marketplace varchar(2) character set utf8,
customer_id bigint,
review_id char(16),
product_id char(12),
product_parent bigint,
product_title varchar(250) character set utf8,
star_rating integer,
helpful_votes integer,
total_votes integer,
vine char(1),
verified_purchase char(1),
review_headline varchar(5000) character set utf8,
review_body varchar(5000) character set utf8,
review_date date,
"YEAR" integer,
product_category varchar(5000) character set utf8
TARGET 'uri_location s3://amazon-reviews-pds, uri_path /parquet/,
fmt_filename_partitions product_category,
repeatable_results yes,
ignore_invalid_records 1,
fmt_truncate_strings true';

Tip: you can create a table using the embedded Parquet schema as Mark Chopping’s LinkedIn post does and then use the


command to get the full create table definition.

We can now image the data by creating a view over the table and then imaging that view. In the view we do some deduplication and remove the spurious review from 1973. The image is hashed so that joins with the sentiment view image we create below are more efficient. This will take around 10 minutes on the recommended system.

create view opt_reviews as
select distinct * from reviews where "YEAR" >= 1995;
create view image opt_reviews hashed on (review_id);

Generating the Sentiment scores

We’re going to use an external script (described below) to generate the sentiment scores – run the following SQL to create the external script:

create external script textblob_sentiment
environment local_python3
receives(review_id varchar(25) CHARACTER SET UTF8, review_body varchar(5000) CHARACTER SET UTF8)
input 'fmt_field_separator "^"'
sends(review_id varchar(25) CHARACTER SET UTF8, polarity decimal(8,6), subjectivity decimal(8,6))
requires 480 MB RAM
script S'python(
import os
os.environ["OMP_NUM_THREADS"] = "1" # number makes no difference to performance but uses more memory if higher and fails if too high
import sys
from textblob import TextBlob 

for ln in sys.stdin:
  if(len(v) > 1):
    analysis = TextBlob(v[1])
    print("ERROR- Couldn't process : " + ln, end = "", file = sys.stderr)

Very simply, the external script is a wrapper around a Python 3 script. the parameters mean:

  • environment – use the Python 3 interpreter (installed by default on AWS).
  • receives – the script will receive (on stdin) the review_id (which we’ll use as a key) and the review body text.
  • input – the separator is set to a caret ‘^’ because the body text has commas in it (the default separator).
  • sends – the Python script is expected to write the review_id, the polarity score and the subjectivity score to stdout (separated by the default comma separator).
  • requires – tells the script scheduler how much memory to allocate for each python script.
  • script – the Python script (enclosed in here strings) described below.

The Python script:

  • imports the os library
  • sets an environment variable to stop the next libraries allocating too many threads
  • imports the TextBlob library
  • Iterates over each line in stdin (each record sent to the script)
  • splits the line into review_id (v[0]) and review_body (v[1])
  • checks that there are two entries
  • If there are, it
  • uses the TextBlob function to get an analysis for the body text and writes it out along with the review_id
  • If there aren’t, it writes a message to stdout which will be recorded in the servers ipe_script_debug table.

External scripts can be configured for a variety of behaviours – for example, the incoming data can be partitioned and ordered which makes processes like creating mobile phone call records from call fragments very straightforward and performant.

Parallelism is automatic but with a large degree of control. You can:

  • limit the total number of concurrent processes (to reduce the impact a resource intensive process has on the cluster)
  • limit the processes to a subset of nodes (if only some nodes have access to a resource, maybe an IoT feed)
  • exclude processes from a node

We can now create a view using the external script to process the review body records from the opt_review view image for all movie downloads. We image the view because the process of generating the sentiment metrics is relatively time consuming so we want to save the results in memory for further analysis.

Notice that using the external script to create the view does not require any knowledge of Python or sentiment analysis techniques. External scripts are an excellent mechanism for abstracting complexity away. The Python example here is relatively simple but it could be very complex and it could be in any language that will run on Linux.

create view body_sentiment as
external script textblob_sentiment
from (
select review_id, review_body
from opt_reviews
where marketplace in ('US', 'UK')
create view image body_sentiment hashed on (review_id);

This will run the sentiment analysis on around 150 million reviews. If you used the recommended cluster, it will use approximately 60 concurrent python processes to do this in around 50 minutes.

Kognitio is a scalable system and can run hundreds or even thousands of concurrent external script processes on larger clusters so you could reduce this time by creating a larger cluster.

Once the body_sentiment view has finished imaging you can query the data using Kognitio’s ANSI standard SQL.

As a starter we can look at how sentiment correlates with star rating across the product categories.

select product_category, star_rating, avg(polarity), avg(subjectivity), count(*)
from body_sentiment
inner join opt_reviews using (review_id)
group by 1, 2 order by 1, 2;
Sentiment analysis graph

We can see from this graph that there is a definite correlation between star rating and both average polarity and average subjectivity though there are some categories such as Music or Video where subjectivity is high even for low star ratings. Adding query predicates using HELPFUL_VOTES or VERIFIED_PURCHASE changes the averages though the same basic pattern stays.

For more insight into the data see Sharon Kirkham’s article describing how she built a Tableau dashboard to analyse this data.

Leave a Reply

Your email address will not be published nor used for any other purpose. Required fields are marked *