Tuesday, August 25, 2015

New BI Paradigm

We are in the midst of a fundamental shift in how data is stored, with rapid adoption of NoSQL and other technologies for unstructured and semi-structured data.

With an estimated 80% of enterprise data in unstructured/semi-structured form, enterprises are increasingly adopting a “Polyglot Persistence” architecture, which is a fancy term for using different database technologies to handle different persistence needs. In practice, this means traditional SQL-based relational databases for structured data, in conjunction with newer NoSQL databases and file-based sources for unstructured/semi-structured data.
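To make the contrast concrete, here is a minimal, purely illustrative sketch of the same (invented) customer entity persisted two ways — a flat relational row versus a nested document. The field names and values are made up for this example.

```python
# Hypothetical illustration: the same "customer" entity persisted two ways.
# Neither schema comes from a real system; both are invented for contrast.

# Relational (SQL) style: a fixed, flat schema known up front.
columns = ("id", "name", "street", "zip")
customer_row = ("c-42", "Acme Corp", "100 Main St", "94105")

# Document (NoSQL) style: nested and semi-structured; fields can vary per record.
customer_doc = {
    "_id": "c-42",
    "name": "Acme Corp",
    "address": {"street": "100 Main St", "zip": "94105"},
    "tags": ["enterprise", "priority"],  # a field the relational schema never planned for
}

# Polyglot persistence means each shape lives in the store best suited to it.
print(dict(zip(columns, customer_row)))
print(customer_doc["address"]["zip"])
```

The document form absorbs new or optional fields without a schema change, which is exactly why forcing it back into a fixed relational schema (as traditional BI requires) is painful.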

So, lots of great options on how to store the data. But, when it comes to unlocking the value of this data across heterogeneous sources, there are some significant challenges given that traditional Business Intelligence solutions are built for structured data.

The standard approach to BI looks like this:

1. Extract, Transform & Load (ETL): Relevant data across heterogeneous sources is typically stored in a data warehouse, but the first step in that process is the dreaded ETL process. This usually entails:
a)  A team writing custom scripts to ETL data from various SQL-based data sources into the predefined schema/warehouse

b)  Another set of ETL processes to pull relevant data from NoSQL databases, and yet another for files, shoehorning that data into a relational structure

2.  Data Warehouse: Generally a SQL-based database, this requires schema definitions mapped upfront that determine what gets stored where, plus a dedicated database team to manage it. Any new metric, or change to an existing metric, means changes to both the ETL layer and the warehousing tier.

3.  BI Tool: Once the data is in a structured form in the warehouse, a BI tool comes into play. With traditional BI tools, this means:
a)  Desktop licenses for use by report creators
b)  Server licenses with IT involvement for publishing and sharing the reports
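The ETL and warehousing steps above can be sketched in a few lines. This is a hypothetical transform with invented field names, showing why semi-structured data has to be "shoehorned" into a fixed schema — and why adding a metric ripples through both the transform and the table it targets.

```python
# A minimal ETL sketch (hypothetical field names): flattening semi-structured
# documents into rows for a predefined warehouse schema. Adding a new metric
# means touching both this transform and the table definition it targets.

WAREHOUSE_SCHEMA = ("customer", "sent", "country")  # fixed up front

def to_row(doc):
    """Shoehorn a nested document into the flat warehouse schema."""
    return (
        doc.get("customer"),
        doc.get("stats", {}).get("sent", 0),
        doc.get("location", {}).get("country"),
    )

docs = [
    {"customer": "Acme", "stats": {"sent": 120}, "location": {"country": "US"}},
    {"customer": "Initech", "stats": {"sent": 45}},  # missing field: ETL must cope
]

rows = [to_row(d) for d in docs]
print(rows)  # [('Acme', 120, 'US'), ('Initech', 45, None)]
```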

Each stage of this process involves specific product suites along with specialized engineering and IT resources: typically a custom SQL database or Amazon Redshift for the warehousing layer, Pentaho/Alteryx for ETL and data processing, Tableau/Qlik for data visualization, and so on.

The problem with this approach is that the path from data to insights is far too cumbersome, brittle and expensive, requiring significant time, resources and skills before any tangible results are in sight. Because the database schema has to be designed upfront, every new metric or modification to an existing metric requires BI and engineering touch points, increasing the time and effort needed to obtain insights and impacting the organization’s ability to react to data changes quickly.

Cloud9 Charts takes a different approach. Built specifically for modern unstructured and structured data, we dramatically simplify every step of the path from data to insights.
Let’s contrast the above with the example below, shown as an animated GIF:

The example sets up connectivity to MongoDB and MySQL, auto-generates queries against each, and performs an inner join on the results (determining sent message volume by customer from MongoDB and combining it with each customer’s address from MySQL). Finally, it creates a shareable, embeddable dashboard from the data.
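Conceptually, the pipeline in the GIF does the following. This is a plain-Python sketch with invented data and field names, not the product's implementation: aggregate sent-message volume per customer from document data, then inner-join with addresses from a relational source.

```python
# Illustrative sketch of the example's logic (all data here is made up):
# 1) aggregate sent-message volume by customer from MongoDB-style documents,
# 2) inner-join the result with a MySQL-style customer/address table.
from collections import defaultdict

# Stand-in for MongoDB message documents.
messages = [
    {"customer": "Acme", "sent": 100},
    {"customer": "Acme", "sent": 50},
    {"customer": "Initech", "sent": 25},
]

# Stand-in for a MySQL customer table.
addresses = {"Acme": "100 Main St", "Initech": "42 Elm Ave"}

# Aggregate: sum(sent) grouped by customer.
volume = defaultdict(int)
for m in messages:
    volume[m["customer"]] += m["sent"]

# Inner join on the customer field.
joined = [
    {"customer": c, "sent": s, "address": addresses[c]}
    for c, s in volume.items() if c in addresses
]
print(joined)
```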

 Key things to note: 

-  We’ve obtained insights from data within MongoDB, then combined it with data from MySQL in just a few clicks. No cumbersome data cleanup, complex ETL processing or data warehousing setup/management.

- The query results are seamlessly stored into a flexible, scalable data lake/warehouse, without the need to pre-define schemas or write complex ETL jobs. The warehouse provides fast access to the results, stores multi-structured data, offers schema-on-read and SQL-like capabilities, and incrementally tracks the results. Reporting and dashboards can be driven off this warehouse, shielding your production systems from reporting workloads.

- While the warehousing approach provides significant benefits, there may be cases where you need to execute queries and visualizations directly against the source database; that is also supported in a single click.

-  Dashboard creation, sharing, etc. are enabled right inside the portal, eliminating the need for multiple product suites (one for data discovery and report creation, yet another for sharing).

-  A range of sophisticated ways to query, aggregate and pipeline the data is built in (for example: run a query against thousands of databases and merge the results, query logs/JSON/XML, plug in custom machine learning algorithms, etc.)
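As one illustration of the "run a query against many databases and merge the results" pattern (a sketch under assumed data, not Knowi's actual implementation): execute the same aggregation per source, then combine the partial results.

```python
# Illustrative merge of per-database query results (data invented for this sketch):
# each database returns sent counts by customer; the partials are summed together.
from collections import Counter

shard_results = [
    {"Acme": 100, "Initech": 25},  # database 1
    {"Acme": 50, "Hooli": 10},     # database 2
    {"Initech": 5},                # database 3
]

merged = Counter()
for partial in shard_results:
    merged.update(partial)  # Counter.update adds counts rather than replacing them

print(dict(merged))  # combined totals across all sources
```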

By quickly connecting the dots from the raw source to insights in an integrated platform, we not only enable teams to make data-driven decisions across all their data sources, but also to react to data changes quickly.

Monday, August 24, 2015

Elasticsearch Business Intelligence & Reporting

Blend Elasticsearch and MySQL Data on the Fly Using Knowi

UPDATED: December 2018

This integration combines Elasticsearch query and search capabilities to quickly produce shareable visual insights. In addition, for the first time, our join capabilities provide unified insights into Elasticsearch as well as other NoSQL and SQL-based datasources, eliminating cumbersome engineering and heavy ETL processes into relational databases.

To get started, visit our Elasticsearch page, which is preconfigured with demo Elasticsearch cluster settings.

In this example, we’ll walk through the following:
  • Connect to a demo Elasticsearch cluster
  • Execute queries against it
  • Join the results with some data in MySQL
  • Visualize, track and share the results

The following animated GIF shows how to connect to our Elasticsearch cluster.

(Note: If the cluster is behind your firewall, please use our agent)

Let's also set up connectivity to our demo MySQL datasource using the default settings; we’ll use it for joins later.

Once the datasource is set up, click on the Configure Queries link:
  1. Specify the Elasticsearch JSON query. In our example, let's search for all documents that match Transactional.

     {
         "query": {
             "query_string": {
                 "query": "Transactional*"
             }
         }
     }

  2. Click on Preview to return the documents matching that query. This returns the first 1000 matches.

  3. Let's add some Cloud9QL to post-process the data further. Note that Cloud9QL is optional, and is not a replacement for the Elasticsearch query; it provides a powerful way to complement it by transforming the results. In the following example, let's get the sum of the sent field, grouped by customer.

     select sum(sent) as Sent, customer
     group by customer

     Click on Preview to immediately see the updated results.

  4. Joins: Now that we have some basic data from Elasticsearch, suppose we need to join it with customer details that exist in a MySQL database.
    • Click on Join --> Select the new MySQL datasource --> Under Query Generator, select the customer table
    • Add customer=customer in the Join Fields box. This means the customer field from the Elasticsearch query will be used to look up (inner join) against the customer field in the MySQL query.
            More on Joins: https://docs.knowi.com/hc/en-us/articles/115006062748-multi-data-source-joins
            Click on Preview.

The results now include customer details from MySQL combined together with the data from Elasticsearch.
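The customer=customer join-field semantics can be sketched in plain Python (the rows below are made up for illustration; this is not how Knowi executes the join internally): each Elasticsearch result row looks up matching MySQL rows on the customer field, and rows without a match on either side drop out, as in any inner join.

```python
# Illustrative inner-join semantics for "customer=customer" (data invented here).
es_rows = [
    {"customer": "Acme", "Sent": 150},
    {"customer": "Hooli", "Sent": 10},  # no MySQL match: dropped
]
mysql_rows = [
    {"customer": "Acme", "address": "100 Main St"},
    {"customer": "Initech", "address": "42 Elm Ave"},  # no ES match: dropped
]

# Index the MySQL side by the join field, then look up each ES row.
mysql_by_customer = {r["customer"]: r for r in mysql_rows}
joined = [
    {**e, **mysql_by_customer[e["customer"]]}
    for e in es_rows if e["customer"] in mysql_by_customer
]
print(joined)  # [{'customer': 'Acme', 'Sent': 150, 'address': '100 Main St'}]
```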

  5. Click on 'Save', then add it to a dashboard to easily share and embed it.