Thursday, April 27, 2017

Native Analytics on MongoDB Atlas - Tutorial

This is a 10-minute, hands-on tutorial on setting up connectivity to MongoDB Atlas and building visualizations on it using Knowi. The end result is a dashboard of restaurants in the NYC area. In a previous post, we reviewed MongoDB Atlas.


This assumes that you have an Atlas account set up. If you don't, go to the MongoDB Atlas page and sign up for a free sandbox account. (More details on setting up your database on Atlas here.)

  1. Importing data into Atlas: use MongoHub or mongoimport to import the restaurants JSON file into a collection.
The JSON structure looks like this, with some nested elements:

{
  "address": {
    "building": "1007",
    "coord": [-73.856077, 40.848447],
    "street": "Morris Park Ave",
    "zipcode": "10462"
  },
  "borough": "Bronx",
  "cuisine": "Bakery",
  "grades": [
    {
      "date": { "$date": 1393804800000 },
      "grade": "A",
      "score": 2
    }
  ],
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445"
}
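As a hedged sketch of the import step in Python (an alternative to MongoHub or mongoimport), the export can be parsed as newline-delimited JSON and inserted with pymongo. The sample line and connection string below are placeholders, not the actual dataset or cluster.

```python
# Parse a mongoimport-style export (one JSON document per line) into a
# list of documents ready for insertion.
import json

def parse_export(text):
    """Return the list of documents in a newline-delimited JSON export."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

sample = '{"name": "Morris Park Bake Shop", "borough": "Bronx"}'
docs = parse_export(sample)

# To insert into Atlas (requires pymongo and a live cluster):
# from pymongo import MongoClient
# client = MongoClient("mongodb+srv://<user>:<password>@<cluster-host>/test")
# client.test.restaurants.insert_many(docs)
```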
2. Sign up for a free Knowi account.
3. Create a MongoDB datasource connection:
  • Whitelist our IP addresses in Atlas to enable connectivity. (Alternatively, an on-premise agent can be used to set up connectivity; see the agent docs for more details.)
  • Create a new connection in Knowi via the Datasources icon --> New Datasource --> MongoDB.
  • Add the replica set hosts into the Host(s) section, separated by commas.
  • Leave the port empty if you already have ports in the hosts section; specify the database to connect to, along with the user and password.
  • Under the 'Database Properties' field, add the following: ssl=true&authSource=admin&replicaSet=<yourReplicaSetName>
  • Use the 'Test Connection' button to ensure that the connection is successful. Save.
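The settings above amount to a standard MongoDB connection string. Here is a minimal Python sketch of the URI they produce; the host names, user, password and replica set name are hypothetical examples, not real Atlas values.

```python
# Build a MongoDB connection URI from hosts, database, credentials and the
# database properties shown above (ssl, authSource, replicaSet).
def atlas_uri(hosts, database, user, password, replica_set):
    props = "ssl=true&authSource=admin&replicaSet=" + replica_set
    return "mongodb://%s:%s@%s/%s?%s" % (
        user, password, ",".join(hosts), database, props)

uri = atlas_uri(
    ["cluster0-shard-00-00.example.net:27017",
     "cluster0-shard-00-01.example.net:27017"],
    "test", "dbuser", "secret", "Cluster0-shard-0")
```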
Now the fun begins. Let's bring this data to life.


After the datasource is saved, click on the Configure Queries link. This will open up the query page. Let's start with a count of restaurants by borough by cuisine.

  • Open up the Query Generator.
  • Collections should be automatically populated with the collections from Atlas. Choose the restaurants collection. This will trigger our field discovery process to determine the fields in the collection.
  • Add borough and cuisine to the dimensions dropdown.
  • Add _id into the Metrics section; click on it to add a count aggregation to it.
  • Notice the auto-generated MongoDB query with aggregations.
  • Click on Preview to instantly preview the results.
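For illustration, the steps above correspond to a MongoDB $group aggregation. This sketch shows the kind of pipeline the Query Generator emits (the exact generated query may differ), plus the same grouping in plain Python to make the computation concrete.

```python
# A $group pipeline counting documents per (borough, cuisine) pair.
pipeline = [
    {"$group": {
        "_id": {"borough": "$borough", "cuisine": "$cuisine"},
        "count": {"$sum": 1},
    }}
]

# The same grouping in plain Python, to show what the pipeline computes:
from collections import Counter

def count_by_borough_cuisine(docs):
    return Counter((d["borough"], d["cuisine"]) for d in docs)

docs = [
    {"borough": "Bronx", "cuisine": "Bakery"},
    {"borough": "Bronx", "cuisine": "Bakery"},
    {"borough": "Queens", "cuisine": "Thai"},
]
```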


Create a dashboard and add our newly created report/widget into it:

Set up a stacked bar chart with Borough on the X axis and cuisine on the Y axis.

Set up a few filters (via the filter icon):

In a few simple steps, you have your first visualization from data in Atlas.


Now, let's take this a step further with a drilldown to produce a cluster map of restaurants for a given 'borough' & 'cuisine'.

Add another query, this time without aggregations and including the nested address object.

Drill into the address object to drag & drop the 'coord' array into the ad hoc analysis grid, along with the 'borough', 'cuisine' and 'name':


Set it up as a Geo Marker Cluster Visualization. Note that the coord nested array is automatically unwound as lat/long coordinates on the map.
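The unwinding described above can be sketched in Python. The dataset stores GeoJSON-style [longitude, latitude] pairs, so the values are flipped into a (lat, long) pair for plotting; the coordinates below are illustrative.

```python
# Extract a (latitude, longitude) pair from the nested address.coord array.
def to_lat_long(doc):
    lng, lat = doc["address"]["coord"]   # stored as [longitude, latitude]
    return lat, lng

doc = {"name": "Morris Park Bake Shop",
       "address": {"coord": [-73.856077, 40.848447]}}
```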

On the Visualization settings, set the Name to the restaurant name, and center the map on NYC (Lat/Long of 40, -74). Save the new query/widget.

Set up a drilldown from our parent bar chart into the map.

Click on any point on the bar chart to get a cluster map of restaurants for the clicked combination 'borough' & 'cuisine':

To summarize, we imported data into an Atlas cluster, connected to it using Knowi, generated a few native MongoDB queries on it and set up a few visualizations from it.

Now it's your turn to bring your own data to life!


Wednesday, April 19, 2017

We've Got Some Exciting News: New Machine Learning Capabilities and a New Name!

We have a couple of exciting updates.

First, our latest release brings all of the power of our Business Intelligence platform and integrates Artificial Intelligence. With this release, you can blend hindsight and foresight to drive actions from your data. Not only that, it also enables you to offer value added machine learning capabilities to your customers, for embedded use cases.

Second, we are rebranding Cloud9 Charts to Knowi. With the latest product updates, it was the right time to pull the trigger on the name update that we’ve been simmering on for some time.

In a short period, we’ve built a leadership position in NoSQL/Polyglot analytics, with customers ranging from Fortune 500 companies to startups. Now we are aiming even higher. With Knowi at the top of your modern data stack, you can blend SQL and NoSQL data and integrate machine learning to deliver smarter data applications.

If you’d like us to go over the Machine Learning capabilities, please contact us. It’s included as part of your service during the beta period for the next four months.

Thank you from all of us for being part of our journey thus far...we look forward to many more years to come.

Onwards & Upwards!

Jay Gopalakrishnan
Founder and CEO, Knowi

Sunday, April 16, 2017

4 Reasons Why Your BI Tool Is Preventing You from Becoming Data-Driven

Becoming data-driven company-wide, where data is integrated into the decision-making process at all levels of the organization because employees, partners, and customers can access the right data at the right time, is often the ultimate goal when implementing an analytics solution.

There are dozens of BI tools, from Tableau to Qlik to more modern tools like Looker, that promise to help achieve this goal.  However, the path is often overwhelmingly difficult because of hidden barriers to implementation and adoption that only reveal themselves as you try to implement.  What many don’t realize is that the most common issues are not on the front-end, where end users enjoy high-quality experiences with most traditional BI tools, but rather on the back-end, where data engineers and IT developers knit together complex data platforms to move and manipulate unstructured data back into structured tables and get data prepped just so these BI tools will work.

In recent years, we’ve seen a fast and furious evolution of data, particularly in the emergence of Big Data and IoT technologies.  Not so long ago, most data used for analytics lived inside your four walls in well-understood systems and was always structured.  Data warehouses provided a single source for analytics, and business analysts happily built retrospective visualizations and reports using day-old data.

Then came Big Data with a whole new world of real-time insights into customer sentiment and behavior tracking.  Now, looking at yesterday's data was so, well, yesterday.  The new face of business intelligence was real-time dashboards built using data from mostly external sources and housed in multiple data stores.  As the data evolution progresses further into IoT, data only gets bigger and faster and is almost always unstructured or multi-structured.

However, virtually every BI tool is stuck in the age of structured data.  Modern data analytics blends structured, unstructured and multi-structured data together to glean insights.  If your organization cannot leverage NoSQL data, then how do you integrate that data into your decision-making process?  These traditional SQL-friendly BI tools tell you they support NoSQL, but there is a catch, and that catch permeates everything else, putting your data-driven dreams at risk. Here are four reasons why:

They are not NoSQL-friendly

No data lives in a silo, and one of the significant barriers to fully leveraging your data with traditional BI tools like Tableau, Qlik or Looker is that they only understand structured data.  To use NoSQL data with any SQL-based BI tool, you are:
  1. Writing and maintaining custom extract queries using proprietary query languages provided by the NoSQL database that someone must learn
  2. Installing proprietary ODBC drivers for each NoSQL database
  3. Using batch ETL processes to move unstructured NoSQL data into relational tables
  4. Doing analytics on unstructured data using different data discovery tools and then doing ETL
  5. Dumping everything into Hadoop or a data lake to clean and prep, eventually moving it to a traditional data warehouse
  6. Some mash-up of the above
In all cases, schemas must be defined, extract queries must be written and unstructured data must be shoehorned back into relational tables.  Congrats, you’ve just taken your beautiful modern NoSQL data and made it old again!

All kidding aside, this is the main barrier to becoming data-driven using your traditional BI tool.  If you cannot natively integrate your analytics platform to your modern data stack, then you simply cannot fully leverage your enterprise data.

Ex.  MongoDB BI Connector Moves Data From MongoDB to MySQL

They are slow to adapt and lack intelligent actions

The bottom line is that the way people interact with analytics in data-driven enterprises is not “one-size-fits-all”.  You have a diverse set of users with differing experience expectations and use-case complexity levels, all of which must be managed to achieve company-wide adoption.

For example, not surprisingly, many people hate looking at data.  They don’t wake up every morning excited about looking at dashboards and drilling down to try to figure out why an application, a market, a region or a product is not performing like it did yesterday.  They don’t think trying to find the needle in the haystack is a good use of their time.  Instead, they want their analytics platform to tell them where the problem is, why it happened, whether it is likely to happen again and, in some cases, automatically trigger what to do next.  At the same time, not everyone is looking for a needle; some are just looking for the haystack.  For this group, a shareable or embeddable dashboard works great.

Then you have the citizen business analyst who just wants to ask a question using Slack or even Siri and have the right dashboard or report appear. 

Traditional BI tools were built with the purpose of handling the “haystack” scenario but are falling behind in helping people find the “needle.”  The “needle” challenge requires more advanced analytics capabilities.  In many cases, this means predictive analytics with integrated machine learning.  Additionally, natural language processing (NLP) interfaces are emerging as alternatives to embedded dashboards to help empower non-technical users with simple analytics needs like “show me the sales for today at store 123”.

Many traditional BI tools lack the ability to adapt to these emerging modern analytics requirements, and coupled with their lack of native integration with modern data stacks, other solutions are acquired to manage these different use cases and user-experience requirements.  As a result, instead of one analytics solution, most organizations have multiple solutions which operate in data silos to solve very specific use cases using a subset of the enterprise’s data.  The proliferation of analytics solutions is hardly a recipe for becoming a data-driven enterprise.

Not as data democratic as you think

By having to move and manipulate your unstructured data back into relational tables, an artificial wall is built between business users and all available enterprise data.  Expensive IT developers must be involved in virtually every project to integrate NoSQL data.  This adds months to projects, makes changes very expensive and increases the overall cost of data. Arguably, the real cost is the loss of understanding of the value of newer NoSQL data sources.  The business ends up so far away from the original data that it is hard for them to know what questions are possible to ask, and therefore to realize the value of newer data.  As a result, the questions instead become centered around the cost of acquiring and storing NoSQL data, and that is where many Big Data initiatives start to fail.

Not Big Data scalable

As mentioned earlier, data is only getting bigger and faster as IoT analytics hit the mainstream.  When ETL or ODBC drivers are used to move and manipulate NoSQL data to relational tables, then data limits are also added to prevent these processes from failing at volume.  Typically, record retrieval limits or aggregated data sets are employed to combat these performance issues.  A separate data discovery process is used to determine what data to retrieve or what aggregations to create.  Data Discovery, in this case, requires a different tool or custom coding and is almost always done in IT with input from the business.  From a process, technology and people perspective that is just not scalable or sustainable when it comes to leveraging Big Data to become data-driven.  There are simply too many restrictions on data and moving parts for the performance to meet business needs.

These traditional BI tools claim to reduce data silos, reduce time to insights and enable data discovery across the enterprise.  However, our customers repeatedly tell us they fail at achieving this when it comes to integrating NoSQL data into their business analytics. 

To be data-driven requires a modern analytics platform that enables data discovery across your modern data stack, with support for descriptive, predictive and prescriptive analytics, to derive insights and drive actions in a way that supports the specific needs of a diverse end-user community and their use cases.

Cloud9 Charts is a modern analytics platform that has already enabled dozens of organizations, large and small, to become data-driven.  We natively integrate NoSQL and SQL data sources and enable multi-data source joins for seamless blending of structured, unstructured and multi-structured data without the need to move it.  You can share or embed over 30 different visualizations or use our advanced analytics capabilities to automate alerts and actions.

In the coming days, we have an exciting release announcement that will bring you one big step closer to achieving your goal of becoming a data-driven enterprise using a single analytics platform.  Stay tuned…

Tuesday, April 11, 2017

Twitter Analysis Using MongoDB Atlas and Knowi

GUEST POST: Matthew Plummer, Plymouth University U.K.

This project is the product of a final-year honors project by student Matthew Plummer from Plymouth University U.K. At the core of this project is sentiment analysis of large data sets; more specifically, sentiment analysis of social network platforms, focusing mainly on Twitter. There are multiple parts to this project, including the Twitter API, Java, a NoSQL database storage model using MongoDB and data visualization using Knowi.

Why carry out sentiment analysis on social networks?
It is widely known that online social networking has grown rapidly in recent years and will continue to do so in the years to come. The volume of data produced via social media by its users is growing at a phenomenal rate, with a report (PaCroffre, 2012) predicting that by the year 2020 data volumes on social media platforms will increase by a staggering factor of 44.

These vast volumes of data provide hugely beneficial insight into the public's opinion on certain topic areas and/or events. By recognizing this source of information and taking this opportunity, businesses could benefit greatly in many aspects by analyzing this data. This is where sentiment analysis comes into play: by analyzing Twitter tweets for specific topic areas or products using keywords, phrases or hashtags, valuable information can be extracted and used to make business decisions.

System architecture
As stated previously, this project is made up of 4 key components; to reiterate, these are the Twitter API, Java, a NoSQL database storage model using MongoDB and data visualization using Knowi. The architecture can be seen in the diagram created for this project, which was iterated upon many times as the project developed and came into its own.


Java and Twitter4J
At its core this project is built using Java; Twitter4J was used to construct and make API calls to Twitter in order to retrieve tweets. Twitter4J is a Java-based library for use with the Twitter API. By using Twitter4J, access to the Twitter API can be easily integrated into any Java application or IDE. Twitter4J provides a set of predefined classes which are used to request specific data from the Twitter servers. Because Twitter4J is an open-source, free-to-use library, it is easily downloaded and installed. The number of developers working on and using this package has been on the rise since its inception, and the platform has grown over the years, adding new functionality for multiple tasks; two examples, both used for this project, are live Twitter tweet streams and Twitter tweet searches. A stream of live tweets can be retrieved, or a search for past tweets may be carried out. In both cases, the user inputs search words which are used to filter tweets to only those needed, and in the case of a search the user selects a past date (within the 6-9 days of history which the Twitter API stores) for the search to execute against.

Java was used not only for Twitter4J, but also to carry out the sentiment analysis and to access, populate and update the database, which was created and is stored on MongoDB Atlas.

Data storage using MongoDB Atlas
This project heavily implemented the concepts of both cloud storage and cloud computation. The chosen storage method is MongoDB Atlas; the official definition from MongoDB is that:
“MongoDB Atlas is a cloud service for running, monitoring, and maintaining MongoDB deployments, including the provisioning of dedicated servers for the MongoDB instances.”
(MongoDB, 2017)
In order to store the vast volumes of data retrieved from Twitter, a cloud-based storage system, such as MongoDB Atlas, was needed. A number of cloud storage options were considered including, but not limited to, AWS (Amazon Web Services) Redshift and Oracle. Upon considering the options, it was decided that the NoSQL approach of MongoDB was the best fit for this project. As a result, a cluster was started in MongoDB Atlas, which is where all the data was stored.

MongoDB is a NoSQL data storage system. If you aren't sure what that is, or how the storage architecture of a typical NoSQL system is constructed, it is briefly explained below in terms of the more traditional relational database model. If you already have an understanding of NoSQL data structures, you can skip over the section below.


Database: this is the same as a database in relational data models. The database acts as the highest level of storage, where all other data and information is stored. In the case of NoSQL, it is modeled by means other than the tabular format which relational models employ.
Collection: in terms of a relational data model this acts as a ‘table’, here all data is inserted into and stored. Collections are used for sorting similar data into groups for ease of accessing, processing and analyzing.
Document/object: an object, or document as it is most commonly referred to, acts as a ‘record’. Within a document, pairs of corresponding keys and values are stored. Documents organize data beyond the basic level of key-value matching. Each document is assigned a unique object ID which can be used to uniquely identify and retrieve it.
Key: the key is, in a sense, a ‘field name’, the key will give information regarding the stored value. This allows the user, or analytics tool, to relate the value data to a corresponding key in order to understand what the retrieved values represent.
Value: this is the data that is ingested into the data storage system. This can be raw data or processed information. The value, along with its matching key, is the core of the data and the structure; it corresponds to a record's data in relational data models.
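The five concepts above can be illustrated on one hypothetical tweet document (the field names and values here are made up for illustration):

```python
# One document (~ a record in relational terms):
tweet_doc = {
    "_id": "tweet-0001",        # unique object ID
    "text": "Great day!",       # key "text" paired with its value
    "overall_sentiment": 1,     # a processed value stored alongside raw data
}

# A collection (~ a table): a group of similar documents.
tweets_collection = [tweet_doc]
# A database holds tweets_collection alongside other collections,
# e.g. a collection of sentiment words.
```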

In researching both relational and NoSQL database management systems, it was decided that NoSQL would be best suited to this project. NoSQL databases are best suited to managing large, unrelated data sets, and they scale while maintaining the performance of data access and manipulation; since, for this project, there would be no relations between data sets, NoSQL was the approach taken forward for this piece of work.

Sentiment analysis and Knowi
Each tweet received was analyzed for sentiment words. A collection of over 6,000 sentiment words was constructed and used for the analysis. Java code was used to access the MongoDB collection to retrieve the sentiment words, which were then iterated through, along with the tweet, in order to find matches. The overall sentiment of the tweet was then calculated as follows:

  • By default, the overall sentiment of all tweets is initially 0.
  • If a positive sentiment is found the overall sentiment is increased by 1.
  • If a negative sentiment is found the overall sentiment is decreased by 1.
  • After all sentiment words have been checked against the tweet, a polarity is assigned to the tweet. If the overall sentiment is:
    • Greater than 0, it is positive
    • Less than 0, it is negative
    • Equal to 0, with no sentiment words present, it is neutral
    • Equal to 0, with sentiment words present, it is balanced

The tweet date, tweet text, overall sentiment, polarity and sentiment words found are all inserted into a collection.
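The project's scoring was done in Java; a minimal Python sketch of the same rules looks like this, with tiny hypothetical word lists standing in for the 6,000-word collection:

```python
# Hypothetical stand-ins for the sentiment-word collection.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def score_tweet(text):
    """Return (overall sentiment, polarity, sentiment words found)."""
    words = text.lower().split()
    found = [w for w in words if w in POSITIVE or w in NEGATIVE]
    # Start at 0; +1 per positive word, -1 per negative word.
    overall = sum(1 if w in POSITIVE else -1 for w in found)
    if overall > 0:
        polarity = "positive"
    elif overall < 0:
        polarity = "negative"
    elif found:
        polarity = "balanced"   # net zero, but sentiment words present
    else:
        polarity = "neutral"    # no sentiment words at all
    return overall, polarity, found
```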

At this point, Knowi takes over. MongoDB Atlas integrates into Knowi easily and is supported natively. Within Knowi, additional fields are created if needed; an example for this project is a stripped-down date field used as a parameter for graphical visualizations, i.e. from dd-mm-yyyy hh:mm:ss to dd-mm-yyyy. A number of graphs and charts are created by the user, and these widgets are added to a dashboard for presentation, sharing etc.
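The stripped-down date field can be sketched in a couple of lines of Python, assuming the stored timestamp string uses the dd-mm-yyyy hh:mm:ss layout described above:

```python
from datetime import datetime

def strip_time(ts):
    """Reduce 'dd-mm-yyyy hh:mm:ss' to 'dd-mm-yyyy' for charting."""
    return datetime.strptime(ts, "%d-%m-%Y %H:%M:%S").strftime("%d-%m-%Y")
```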

This system has been run over several days using #Trump as an example search word in order to retrieve data from Twitter.

Matthew Plummer