Tuesday, April 11, 2017

Twitter Analysis Using MongoDB Atlas and Knowi

GUEST POST: Matthew Plummer, Plymouth University U.K.

This project is the product of a final-year honors project by student Matthew Plummer of Plymouth University, U.K. At its core is sentiment analysis of large data sets, specifically data drawn from social network platforms, focusing mainly on Twitter. The project has multiple parts: the Twitter API, Java, a NoSQL database storage model using MongoDB, and data visualization using Knowi.

Why carry out sentiment analysis on social networks?
It is widely known that online social networking has grown rapidly in recent years and will continue to do so in the years to come. The volume of data produced on social media by its users is growing at a phenomenal rate, with a report by PaCroffre (2012) predicting that by the year 2020 data volumes on social media platforms will increase by a staggering factor of 44.

These vast volumes of data provide hugely valuable insight into the public's opinion of certain topics and events. Businesses that recognize this source of information and take the opportunity to analyze it could benefit greatly in many respects. This is where sentiment analysis comes into play: by analyzing the sentiment of tweets on specific topics or products, filtered by keywords, phrases or hashtags, valuable information can be extracted and used to make business decisions.

System architecture
As stated previously, this project is made up of four key components: the Twitter API, Java, a NoSQL database storage model using MongoDB, and data visualization using Knowi. The architecture can be seen in the architecture diagram created for this project, which was iterated upon many times as the project developed and came into its own.


Java and Twitter4J
At its core this project is built using Java. Twitter4J (http://twitter4j.org), a Java-based library for the Twitter API, was used to construct and make API calls to Twitter in order to retrieve tweets. With Twitter4J, access to the Twitter API can be easily integrated into any Java application or IDE. Twitter4J provides a set of predefined classes which are used to request specific data from the Twitter servers. Because Twitter4J is open source and free to use, it is easily downloaded and installed, and the number of developers working on and using the package has been rising since its inception. The library has grown over the years, adding new functionality; two examples used in this project are live tweet streams and tweet searches. A stream of live tweets can be retrieved, or a search for past tweets can be carried out. In both cases the user inputs search words which filter the tweets down to only those needed, and in the case of a search the user also selects a past date (within the 6-9 days of history the Twitter API retains) for the search to run against.
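A tweet search along these lines can be sketched with Twitter4J as follows. The hashtag, date and result count are illustrative; Twitter4J reads the OAuth credentials from a twitter4j.properties file or system properties, so none appear in the code:

```java
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class TweetSearch {
    public static void main(String[] args) throws Exception {
        // Credentials come from twitter4j.properties or system properties
        Twitter twitter = TwitterFactory.getSingleton();

        // Search recent tweets matching a hashtag; the Search API only
        // reaches back roughly a week, so the since-date must be recent
        Query query = new Query("#Trump");
        query.setSince("2017-04-05");   // yyyy-MM-dd
        query.setCount(100);            // results per page

        QueryResult result = twitter.search(query);
        for (Status status : result.getTweets()) {
            System.out.println(status.getCreatedAt() + " " + status.getText());
        }
    }
}
```

For a live stream rather than a search, Twitter4J's `TwitterStream` with a `StatusListener` and a `FilterQuery` of the same keywords serves the equivalent role.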

Java was used not only with Twitter4J but also to carry out the sentiment analysis and to access, populate and update the database, which was created and is stored on MongoDB Atlas (https://www.mongodb.com/cloud/atlas).

Data storage using MongoDB Atlas
This project relies heavily on the concepts of cloud storage and cloud computation. The chosen storage method is MongoDB Atlas; the official definition from MongoDB is that:
“MongoDB Atlas is a cloud service for running, monitoring, and maintaining MongoDB deployments, including the provisioning of dedicated servers for the MongoDB instances.”
(MongoDB, 2017)
In order to store the vast volumes of data retrieved from Twitter, a cloud-based storage system such as MongoDB Atlas was needed. A number of cloud storage options were considered, including, but not limited to, AWS (Amazon Web Services) Redshift and Oracle. Upon considering the options, it was decided that the NoSQL approach of MongoDB was the best fit for this project. As a result, a cluster was started in MongoDB Atlas, which is where all the data was stored.
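As a sketch of what connecting to such a cluster looks like with the current MongoDB Java driver (the connection string, database and collection names below are placeholders, not this project's actual values; the real mongodb+srv URI comes from the Atlas cluster's Connect dialog):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class AtlasConnection {
    public static void main(String[] args) {
        // Placeholder URI -- substitute the string Atlas generates for the cluster
        String uri = "mongodb+srv://user:password@cluster0.example.mongodb.net";
        try (MongoClient client = MongoClients.create(uri)) {
            MongoDatabase db = client.getDatabase("twitterAnalysis");
            MongoCollection<Document> tweets = db.getCollection("tweets");
            System.out.println("Documents stored: " + tweets.countDocuments());
        }
    }
}
```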

MongoDB is a NoSQL data storage system. If you aren't sure what that is, or how the storage architecture of a typical NoSQL system is constructed, it is briefly explained below in terms of the more traditional relational database model. If you already have an understanding of NoSQL data structures, you can skip over the following.


Database: this is the same as a database in relational data models. The database acts as the highest level of storage, where all other data and information is stored. In NoSQL, data is modeled by means other than the tabular format which relational models employ.
Collection: in terms of a relational data model this acts as a ‘table’; all data is inserted into and stored in collections. Collections are used for sorting similar data into groups for ease of access, processing and analysis.
Document/object: an object, most commonly referred to as a document, acts as a ‘record’. Within a document, pairs of corresponding keys and values are stored. Documents organize data beyond the basic level of key-value matching. Each document is assigned a unique object ID which can be used to uniquely identify and retrieve it.
Key: the key is, in a sense, a ‘field name’; it gives information about the stored value. This allows the user, or an analytics tool, to relate value data to its corresponding key in order to understand what the retrieved values represent.
Value: this is the data ingested into the storage system, whether raw data or processed information. The value, along with its matching key, is the core of the data and its structure. In relational data models, the value corresponds to the data held in a record's field.
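Putting these terms together, a single document in one of this project's collections might look something like the following. The field names and values are illustrative, and the `_id` would in practice be a BSON ObjectId rather than a plain string:

```json
{
  "_id": "58ec4a1f9d2b4c0012345678",
  "tweetDate": "11-04-2017 13:45:02",
  "tweetText": "What a great win",
  "overallSentiment": 2,
  "polarity": "positive",
  "sentimentWords": ["great", "win"]
}
```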

In researching both relational and NoSQL database management systems, it was decided that NoSQL would be best suited to this project. NoSQL databases are best suited to managing large, unrelated data sets, and they scale while maintaining the performance of data access and manipulation. Since, for this project, there would be no relations between data sets, NoSQL was the approach taken forward for this piece of work.

Sentiment analysis and Knowi
Each tweet received was analyzed for sentiment words. A collection of over 6,000 sentiment words was constructed and used for the analysis. Java code accessed the MongoDB collection to retrieve the sentiment words, which were then iterated through alongside the tweet in order to find matches. The overall sentiment of the tweet was then calculated as follows:

  • By default, the overall sentiment of all tweets is initially 0.
  • If a positive sentiment is found the overall sentiment is increased by 1.
  • If a negative sentiment is found the overall sentiment is decreased by 1.
  • After all sentiment words have been checked against the tweet, a polarity is assigned to the tweet. If the overall sentiment is:
    • Greater than 0, it is positive
    • Less than 0, it is negative
    • Equal to 0, with no sentiment words present, it is neutral
    • Equal to 0, with sentiment words present, it is balanced
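The scoring rules above can be sketched in plain Java. The word lists below are a tiny stand-in for the project's 6,000-word collection, which in the real system is fetched from MongoDB:

```java
import java.util.Arrays;
import java.util.List;

public class SentimentScorer {

    // Tiny stand-in for the 6,000-word sentiment collection stored in MongoDB
    static final List<String> POSITIVE = Arrays.asList("great", "good", "win");
    static final List<String> NEGATIVE = Arrays.asList("bad", "sad", "lose");

    /** Overall sentiment: starts at 0, +1 per positive word, -1 per negative word. */
    static int score(String tweet) {
        int sentiment = 0;
        for (String word : tweet.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) sentiment++;
            if (NEGATIVE.contains(word)) sentiment--;
        }
        return sentiment;
    }

    /** Polarity assigned from the overall sentiment, per the rules above. */
    static String polarity(String tweet) {
        int s = score(tweet);
        if (s > 0) return "positive";
        if (s < 0) return "negative";
        // Score of 0: neutral if no sentiment words matched, balanced otherwise
        for (String word : tweet.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word) || NEGATIVE.contains(word)) {
                return "balanced";
            }
        }
        return "neutral";
    }

    public static void main(String[] args) {
        System.out.println(score("What a great win")); // 2
        System.out.println(polarity("good but sad"));  // balanced
        System.out.println(polarity("nothing here"));  // neutral
    }
}
```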

The tweet date, tweet text, overall sentiment, polarity and sentiment words found are all inserted into a collection.
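Writing such a record with the MongoDB Java driver might look like the following sketch; the connection string, database, collection and field names are assumptions, not the project's actual values:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;

public class TweetStore {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create(
                "mongodb+srv://user:password@cluster0.example.mongodb.net")) {
            MongoCollection<Document> tweets = client
                    .getDatabase("twitterAnalysis")
                    .getCollection("tweets");

            // One document per analyzed tweet: date, text, score,
            // polarity and the sentiment words that were matched
            Document doc = new Document("tweetDate", "11-04-2017 13:45:02")
                    .append("tweetText", "What a great win")
                    .append("overallSentiment", 2)
                    .append("polarity", "positive")
                    .append("sentimentWords", Arrays.asList("great", "win"));
            tweets.insertOne(doc);
        }
    }
}
```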

At this point, Knowi takes over. MongoDB Atlas integrates easily into Knowi and is supported natively. Within Knowi, additional fields can be created as needed; one example for this project is a stripped-down date field used as a parameter for graphical visualizations, i.e. from dd-mm-yyyy hh:mm:ss to dd-mm-yyyy. A number of graphs and charts are created by the user, and these widgets are added to a dashboard for presentation, sharing and so on.
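Knowi derives such fields with its own functions; for illustration, the equivalent date-stripping transformation in Java might look like this:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateStripper {
    /** Strip a dd-MM-yyyy HH:mm:ss timestamp down to dd-MM-yyyy. */
    static String stripTime(String timestamp) throws ParseException {
        SimpleDateFormat full = new SimpleDateFormat("dd-MM-yyyy HH:mm:ss");
        SimpleDateFormat dateOnly = new SimpleDateFormat("dd-MM-yyyy");
        Date parsed = full.parse(timestamp);
        return dateOnly.format(parsed);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(stripTime("11-04-2017 13:45:02")); // 11-04-2017
    }
}
```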

This system has been run over several days using #Trump as an example search word in order to retrieve data from Twitter.

Matthew Plummer