Tuesday, July 24, 2018

Knowi Product Update - Q2 2018



Join us on Wednesday, August 8, 2018, at 10:30AM Pacific for a live demo of the new features and enhancements we've added to the Knowi platform. If you can't make the time, register anyway and we will send you the recording of the session.  In the meantime, please keep reading to see a summary of the product updates for this past quarter.

Anomaly Detection

Time-series anomaly detection identifies unusual patterns that do not conform to expected behavior, otherwise known as outliers. There are a number of business applications for this type of machine learning. Examples include IoT use cases, where an unusual traffic pattern might indicate an issue with a traffic light, and business applications such as detecting strange network behavior that could indicate a hack attempt.

We provide a number of anomaly forecasting algorithms within the workspace so you can determine the best one for your specific use case. To use, simply select Anomaly Detection as your machine learning option and create a new workspace. Follow the configuration steps and test out different algorithms for accuracy. The precision of the model increases over time as more data is made available.



The anomaly detection visualization itself consists of a configurable blue band range of expected values (acceptable threshold limit) along with the actual metric data points. Any values outside of the blue band range are considered anomalies and will appear in red.
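As a rough illustration of the band idea described above, here is a minimal sketch that builds an expected range from a rolling mean and standard deviation and flags points outside it. This is an assumed, generic technique, not necessarily one of the algorithms Knowi ships; the function name, the `window` and `k` parameters, and the sample readings are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=5, k=3.0):
    """Flag points outside mean +/- k * stdev of the preceding `window` points.

    Returns a list of (index, value, lower, upper) tuples.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = mean(history)
        spread = stdev(history)
        # The "band" of expected values for this point
        lower, upper = center - k * spread, center + k * spread
        if not (lower <= values[i] <= upper):
            anomalies.append((i, values[i], lower, upper))
    return anomalies


# Hypothetical sensor readings with one obvious spike
readings = [20, 21, 19, 20, 22, 21, 20, 45, 21, 20]
anomalies = flag_anomalies(readings)
for index, value, lower, upper in anomalies:
    print(f"point {index}: {value} is outside [{lower:.1f}, {upper:.1f}]")
```

Widening `k` plays the same role as widening the configurable band: fewer points are treated as anomalies.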





Sparkly New Data Visualizations

Data Grid

You may think data grids are boring but then you haven't tried our new data grid visualization. You can do a ton with this new grid type, including formatting, conditional highlighting, sorting, grouping, and search along with the ability to download formatted grid information in Excel format.

The other pretty cool thing you can do is add charts to cells. The embedded chart options are sparkline, area, bar, spline, spline area, and pie. Not so boring anymore, right?

Knowi Data Grid with Sparkline


Image Overlay Heatmap

You can now upload an image and overlay x, y coordinate values. Example use cases are tracking how a stadium is filling up as people are scanned at the entry gates or showing the location of IoT sensors deployed on each floor of a building.

Knowi image overlay data visualization




It's Your Dashboard - Customize It!

We enhanced the level of customization of the look and feel of your dashboards by adding several new configurable options.





Getting Clicky With It

Tightly integrating Knowi visualizations within your applications is an important step to give users a seamless experience. To further that integration, we've added an OnClick Event Handler which is available for various visualizations. You can use it to customize what action to take when a data point is clicked on the visualization.






Other Cool Stuff
Connected Charts
Added a drill down feature to allow the current dashboard to be filtered with settings based on the clicked points.

Hidden Filters
Admins or dashboard owners can now hide certain filters from end users.

REST-API Enhancements
Added epoch seconds support for REST API datasource URL parameters, which allows you to dynamically pass in dates in epoch time format.

Added a way to handle multiple JSON objects returned within the same file.
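To make the two REST items concrete, here is a minimal sketch of what they involve on the client side: converting dates to epoch seconds for URL parameters, and reading several JSON objects that arrive back to back in one payload. The endpoint URL and the parameter names `start` and `end` are hypothetical, and this is plain Python rather than Knowi's own implementation.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_url(base_url, start_epoch_secs, end_epoch_secs):
    """Append epoch-second date parameters to a REST URL.

    The parameter names "start" and "end" are hypothetical.
    """
    query = urlencode({"start": start_epoch_secs, "end": end_epoch_secs})
    return f"{base_url}?{query}"


def parse_concatenated_json(text):
    """Parse several JSON objects that appear one after another in one string."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


# Epoch seconds for the start of July and August 2018 (UTC)
start = int(datetime(2018, 7, 1, tzinfo=timezone.utc).timestamp())
end = int(datetime(2018, 8, 1, tzinfo=timezone.utc).timestamp())
url = build_url("https://api.example.com/events", start, end)

payload = '{"a": 1}\n{"b": 2} {"c": [3]}'
records = parse_concatenated_json(payload)
```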

Cloud9QL Enhancements
We've added an ARRAY function that returns a field's values as an array for a specified grouping. We've also added UPPER and LOWER functions to change the case of string fields.

Baseball Analytics - A Knowi Case Study

Knowi Baseball Stats Analytics

Baseball Stats and Analytics

How do Runners on Base Impact a Batter's Ability to get a Hit?


Oftentimes, what we perceive as logic contradicts reality. In psychology, this tendency to believe we know more than we actually do is known as overconfidence. Too frequently, people rely solely on their intuition and logic to make decisions rather than basing them on data and analysis. During an observational study, I discovered a contradiction between my own logical reasoning and what the data actually suggested with regard to baseball.

When considering the correlation between runners on base and batting average, you would think when runners are on base, batters will have a lower batting average than when the bases are empty.

Quick Summary of American Baseball

For those who are unfamiliar with baseball, there are three bases a runner can occupy: first, second, and third. The goal of the offense is for the batter to hit the ball and get to first base before the defense can get them out. Once on first base, the batter becomes a runner and attempts to reach each base before running home to score. The goal of the defense is to get three outs by striking out the batter, catching the ball in the air, tagging a runner between bases, or beating a runner to the base with the ball. Batting average is a stat that measures how often a batter gets a hit, calculated by dividing the player’s number of Hits by At-bats.

A Hit is defined as when a batter hits the ball into fair territory and then reaches first base without the defense making an error or getting a runner out. If the batter hits the ball and is safe, but the defense gets a runner out, it is called a Fielder’s Choice and is not counted as a hit. At-bats differ from Plate Appearances: a Plate Appearance is any time a batter is up to bat, whereas an At-bat is only counted if the player gets a hit, an out, or a fielder’s choice.
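The batting average definition above boils down to a single division. Here is a minimal sketch, with hypothetical function names and made-up hit and at-bat counts, that also shows how a gap between two averages is expressed in "points" (thousandths).

```python
def batting_average(hits, at_bats):
    """Hits divided by At-bats; None when there are no At-bats."""
    if at_bats == 0:
        return None
    return hits / at_bats


def format_average(avg):
    """Baseball convention: three decimals with no leading zero, e.g. '.275'."""
    return f"{avg:.3f}".lstrip("0")


def point_difference(avg_a, avg_b):
    """Difference between two averages in batting average points."""
    return round((avg_a - avg_b) * 1000)


# Made-up example: 55 hits in 200 At-bats
avg = batting_average(55, 200)
print(format_average(avg))
print(point_difference(0.275, 0.261))
```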

The Question

Logically, it would make sense that if a runner is on base, the batter would have a lower chance of getting a hit because the defense could get the runner or the batter out. This would maximize the defense’s odds of making an out and preventing the batter from obtaining a hit. As a result, this would lead to lower batting averages when there is a runner on base, and higher batting averages when there is nobody on base.

The Data

I chose a sample size of 270 MLB players, the nine players from each team with the most plate appearances in 2017. Using Baseball Reference, I was able to collect stats on each player and created a database of 216 stats per player in Google Sheets, totaling over 62,000 data points. The sample of 270 players is only 28% of the players who had an At-bat in 2017, but they account for 74% of the season’s At-bats.

All my charts and calculations were exclusively derived from the data I collected; team statistics only use data from the 9 players sampled from that team. I then accessed the data in Google Sheets via an API using the Knowi REST-API integration in order to build queries and visualize my data. One of the advantages of using the Knowi API method to connect to my data was that I could edit my spreadsheet to fix errors and add new statistics, and the changes were immediately reflected in my visualizations. Not having to upload my data every time I changed it was a huge time saver.

What the Data Shows

After collecting all of my data, I created a dashboard (click to play around with it) and visualizations using Knowi in order to make observations and analyze the data. The first visualization I created compared team batting averages when there are runners on base and when there are no runners on base. 

The results were surprising. Only two teams, Philadelphia and Kansas City, had a higher team average when the bases were empty than when there were runners on base, and Philadelphia's gap was only 1 batting average point. This means that only 6% of the teams fit my hypothesis of having a higher batting average when the bases are empty.



Next, I decided to look at a graph of the players’ averages. Only 37% of the sample of players had a higher batting average when the bases were empty compared to when there were runners on base. 

Again, the data did not fit my hypothesis. I also observed that most players had a large deviation between their two averages. In other words, most players had either a really high average when runners were on base and a really low average when the bases were empty, or the exact opposite.

I then created two new charts: one that only contained players who had higher averages when runners were on base which I called Group A, and one that only included players who had higher averages when the bases were empty, called Group B. By comparing the two groups to one another, I observed that players in Group A had greater deviations in their averages than players in Group B.



To model this, I created another chart where I calculated the difference between each player’s highest average and their lowest average. The top eleven players with the highest differential were all in Group A. The average deviation of Group A players was 41 batting average points and the average for Group B players was 30.
            

My Conclusion

It turns out that the results did not support my hypothesis whatsoever. If anything, they showed the opposite. Only 6% of the teams and 37% of players fit my hypothesis. Not only did more players have a higher average with runners on base, but the players who did had a larger deviation between their averages than players who did not.

Overall, the League average for when runners are on base was .275 and when the bases were empty was .261. A difference of 14 points seems insignificant, but to put it into context, given the sample size of more than 123,000 At-bats, it results in a difference of 1,800 hits. I also tested the statistical significance of the data. Although the name can be misleading, Batting Average is actually a proportion rather than an average, so in order to conduct hypothesis testing, I would need to use the difference between two proportions formula. 

Using that formula, I concluded with over 99% confidence that batting averages when runners are on base are greater than averages when the bases are empty. Additionally, with 97% confidence, averages when runners are on base are higher by 10 points than when the bases are empty.
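For readers who want to try this kind of test themselves, here is a minimal sketch of a standard pooled two-proportion z-test. The article does not state how the At-bats split between the two situations, so the counts below are HYPOTHETICAL: they were picked only to be consistent with the reported .275 and .261 averages and roughly 123,000 total At-bats. The output is not a reproduction of the study's exact figures.

```python
import math


def two_proportion_z_test(hits_a, at_bats_a, hits_b, at_bats_b):
    """One-sided pooled z-test for H1: proportion A > proportion B.

    Returns (z, p_value).
    """
    p_a = hits_a / at_bats_a
    p_b = hits_b / at_bats_b
    # Pooled proportion under the null hypothesis that both are equal
    pooled = (hits_a + hits_b) / (at_bats_a + at_bats_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / at_bats_a + 1 / at_bats_b))
    z = (p_a - p_b) / std_err
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# HYPOTHETICAL counts (not from the article): about .275 with runners on
# base and about .261 with the bases empty, about 123,000 At-bats in total
z, p_value = two_proportion_z_test(14493, 52700, 18348, 70300)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}")
```

A one-sided p-value below 0.01 corresponds to the "over 99% confidence" wording used above.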


The data clearly rejects my original hypothesis that batting averages would be higher when the bases are empty and goes as far as to provide solid evidence towards the exact opposite.

A Real-World Application

Without applying it to real life, data is useless. Using players’ situational batting averages when setting a team’s batting order can help increase the number of hits the team gets in a game, in turn leading to more runs. The leadoff hitter, the first batter in the lineup, is most likely to come up to bat with the bases empty, roughly 65% of the time. The second batter is the second most likely to come up to bat with the bases empty. By batting players with the highest average when the bases are empty in the 1 or 2 spot, those players will have a greater chance of getting a hit. However, only 47% of leadoff hitters and 24% of batters in the 2 spot have higher batting averages when the bases are empty. 

Additionally, the 3 and 4 spots in the lineup have the highest chance of coming to bat with a runner on base, so a team should put their best hitters with runners on base in those spots. A really good example of a team that bats its players in the best spots is the World Champion Houston Astros. Jose Altuve and Carlos Correa bat 3rd and 4th, and with runners on base they have batting averages of .350 and .340, respectively. To put those averages into perspective, let me remind you that the league overall batting average is .267. The Astros also bat either Alex Bregman or Josh Reddick 2nd, the players with the two highest averages when the bases are empty.

Now compare that to a really bad example, the last-place Detroit Tigers. The Tigers’ 4th hitter is Victor Martinez. As the 4th hitter, Martinez is the most likely player on the team to come up to bat with runners on base. However, Martinez is the worst on the team with runners on base, and the second best player on the team when the bases are empty. The Tigers’ best hitter with runners on base is Nicholas Castellanos with an average of .341. Despite his high average with runners on base, Castellanos only has an average of .223 when the bases are empty. Castellanos’ average drops by a whole 118 batting points when the bases are empty, yet he bats 2nd and is second most likely on the team to come to bat with the bases empty.

Being able to use data to make data-driven decisions is the most important reason to use data analytics.

Through this observational study and visualizing my data using Knowi, I was able to reshape the way I think about baseball and use real data to disprove my own logical reasoning. All it takes to bridge the gap between what you know and what you think you know is investigation, observation, and analysis. It can be done with baseball and it can be done with anything. How can data analysis transform your perception and more importantly your decision making?





Source(s):
Sports Reference LLC. Baseball-Reference.com - Major League Statistics and Information. https://www.baseball-reference.com/. (5/21/18)



Tuesday, June 19, 2018

The Six Steps of Creating a Machine Learning Model in Knowi


Knowi Machine Learning

Creating a Classification Model for Breast Cancer Screening

The UCI Machine Learning Repository contains many full data sets that can be used to test and train machine learning models. One such example is the Breast Cancer Wisconsin (Diagnostic) Data Set which relates whether breast cancer is benign or malignant to 10 specific aspects of the tumor. Based on this dataset, we can develop a model that will be able to determine the likelihood of breast cancer being benign or malignant.

The process of using machine learning to analyze data is made easy with Knowi Adaptive Intelligence. Given a training dataset, Knowi can apply either classification or regression algorithms to build valuable insights from the data.
Here is a step-by-step guide about how to turn that data into a powerful machine learning model using Knowi:

1. Create the Workspace and Upload Data

To start the machine learning process, go to www.knowi.com. If you are not already a Knowi user, sign up for a free trial to complete this tutorial. Once in, go to the machine learning section on the left-hand side of the screen. From there, start a new workspace and you will be given a choice of making either a classification or a regression model. In the case of the breast cancer example, the workspace will be classification, because the variable that we are predicting always falls into one of two categories. Next, upload the Breast Cancer Wisconsin (Diagnostic) Data Set.






Knowi Machine Learning Workspace

2. Choose Response Variable and View Full Dataset

After uploading, and possibly manipulating the file, choose the Attribute to Predict from the drop-down list. In the case of the breast cancer data, the attribute that is being predicted is the class of the tumor. Following the choice of the prediction variable, the initial analysis takes place by using the Analyze Data button. This displays the data on the screen and allows an opportunity to scroll through the data looking for patterns.
Knowi Machine Learning Prediction
Knowi Machine Learning Data Analysis

3.  Prepare the Data

After analyzing, data preparation begins. Data preparation is an optional, wizard-driven, step-by-step process in which the program confirms the training set datatypes, identifies and allows for the removal of outliers, reports missing data with the option to remove or impute values, allows for rescaling of the data, groups values into discrete bins and, finally, provides the option to create dummy variables. All decisions can be changed by moving backward and forward through the steps at any time.

For the Breast Cancer data, a small amount of rescaling and grouping were necessary to increase accuracy.
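To show what rescaling, grouping into bins, and dummy variables mean in practice, here is a minimal plain-Python sketch of one common way to do each. These are generic techniques chosen for illustration (min-max rescaling, equal-width bins, one-hot encoding), not necessarily what the Knowi wizard does internally, and the function names and sample values are hypothetical.

```python
def min_max_rescale(values):
    """Rescale numbers to the 0..1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value is placed in the last bin rather than a bin of its own
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def dummy_variables(labels):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(labels))
    return [{f"is_{c}": int(label == c) for c in categories} for label in labels]


# Hypothetical values for a single numeric attribute and the class column
radius = [10.0, 15.0, 20.0]
print(min_max_rescale(radius))
print(equal_width_bins(radius, 2))
print(dummy_variables(["benign", "malignant", "benign"]))
```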

Knowi Machine Learning Data Prep

4. Feature Selection


Whether you came in with prepared data or just finished the preparation process, the next step is to select which variables to use in the model. To make this decision, it is essential to look back at the data for patterns and correlations.

Knowi Machine Learning Feature Selection

5. Create and Compare the Models


At this point, you are left with choosing between the available algorithms (i.e. Decision Tree, Logistic Regression, K-Nearest Neighbor, or Naive Bayes). Knowi makes it easy to choose all available algorithms and compare them using useful attributes such as accuracy or absolute deviation. Pressing the little eye next to the model created in the results section will show a preview of the input data along with the predictions of the program. Next to the eye, there is a plus sign that, when pressed, will display the details of that specific model. It is beneficial to produce many models and tweak settings each time to find the best one for the situation. All past models are saved in the history and can be viewed, compared, and even published.

Knowi Machine Learning Model Training


6. Publish

The last step is publication. This step involves the button next to the plus sign. Upon publishing, a prompt to name the model will be displayed. It is possible to publish as many models as needed from the same data. All models that are created can be viewed and compared directly in the ‘Published Models’ tab within Machine Learning.



Knowi Machine Learning Model Publication


How to Apply a Model to a Query

Now you have officially created a machine learning model that can seamlessly be applied to any query. To integrate it into a dataset simply press ‘Apply Model’ while performing a query and this will add a field where all the machine learning models will be available to be selected and used. Pressing the preview button on the screen will show the data along with the predictions made by the model.


Moving Knowi Machine Learning Model into Production

Knowi Predictive Analytics








Actions from Insight Made Easy

With those six steps, you have a machine learning model that can be integrated into any workflow and create new visualizations and insights that will drive downstream actions. The applications of the machine learning model are endless and can be tailored to the individual need. Once a model is made and put in place, there are many actions that can be performed to gain meaning and spark reactions. This is done through trigger notifications. A trigger notification is a notification that fires when a certain condition is met. In the scope of the breast cancer machine learning model, an alert can be set to email a doctor the patient’s information whenever the model finds a tumor to be malignant. This enables more than just insights; it generates action.
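The trigger idea can be sketched in a few lines: check each scored row against a condition and produce a notification when it matches. This is only a conceptual illustration, not Knowi's trigger API; the function names, the row fields `patient_id` and `prediction`, and the email address are all hypothetical, and the sketch builds the message without actually sending it.

```python
def evaluate_trigger(rows, condition, notify):
    """Call notify(row) for every row that satisfies condition(row)."""
    notifications = []
    for row in rows:
        if condition(row):
            notifications.append(notify(row))
    return notifications


def is_malignant(row):
    """Trigger condition: the model predicted a malignant tumor."""
    return row["prediction"] == "malignant"


def draft_email(row):
    """Build the alert message; sending it is left out of this sketch."""
    return {
        "to": "doctor@example.com",  # hypothetical recipient
        "subject": f"Malignant prediction for patient {row['patient_id']}",
    }


# Hypothetical model output rows
scored_rows = [
    {"patient_id": "P-001", "prediction": "benign"},
    {"patient_id": "P-002", "prediction": "malignant"},
]
alerts = evaluate_trigger(scored_rows, is_malignant, draft_email)
```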

Summary

The process of creating a model within Knowi is so easy that anyone can do it, and it starts by simply uploading a dataset. Data can be uploaded from a file, SQL, and NoSQL sources, along with REST-APIs. Following the uploading of a file, Knowi has built-in algorithms available, or the option to create your own, along with a designated page to review multiple factors and evaluate the best algorithm for your situation. Using this method, the Breast Cancer training data was loaded from the UCI Machine Learning Repository into a Knowi workspace, then analyzed with the built-in data prepping tools. The resulting model was ready to be integrated into any workflow and autonomously perform actions based on the results, such as sending an alert to a doctor depending on the outcome of the test. Give Knowi a try and see how easy visualizing and learning from your data can be.


References

Dheeru, D., & Karra Taniskidou, E. (2017). UCI Machine Learning Repository. Retrieved from University of California, Irvine, School of Information and Computer Sciences: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Knowi. (2017). Adaptive Intelligence for Modern Data. Retrieved from Knowi Website: www.knowi.com









Wednesday, June 6, 2018

MongoDB Analytics Tips and Traps

Free Advice When Starting a MongoDB Analytics Project




MongoDB excels at analytics. Every week, we have customers building MongoDB analytics, queries and dashboards to help their business teams uncover new insights.  We have seen outstanding performance from good practices and have seen issues with common bad practices.

We have customers performing many different types of analytics, from performance monitoring and usage metrics to financial and business performance. These are general tips and traps that our customers mention when starting to do analytics on MongoDB. It’s not an exhaustive list, but it’s a start.

Arrays and Query Performance

There are some inherent difficulties involved in working with embedded documents and arrays, which can impact your query performance. It’s best to think simple when it comes to your documents so you can optimally leverage indexing for query performance. That’s not to say you shouldn’t use nested arrays, but avoid deeply nested arrays if possible, or at least be aware that your query performance may be impacted.
Knowi has a handy tool called Data Explorer to allow you to explore your nested arrays so you can see what data is available to query.  This can help you quickly determine if you might need to set up additional indexes to more efficiently access that data.

MongoDB Join Collections or, well, Anything.

Some people think joins are the root of all evil. We don’t, but you can’t take a willy-nilly approach to joins in Mongo either. Think about joins early to ensure you don’t kill your query performance, especially if joining across multiple sources. For example, if you have a logical key that would join Mongo collections, you probably don’t want it buried deep in a nested array, and you’ll want to index it.
Knowi allows you to join across collections as well as with other NoSQL, SQL or REST-API sources. Just tell us the join key(s) and the type of join you want to perform and we take care of the rest. We optimize join performance to allow joins across large collections/datasets.
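For comparison, here is a minimal sketch of what a join on a top-level key looks like in MongoDB's own aggregation framework, using the `$lookup` and `$unwind` stages. It is not how Knowi performs joins. The pipeline is built as plain Python data so it can be inspected without a database; the collection names `orders` and `customers` and the field names are hypothetical.

```python
def build_join_pipeline(foreign_collection, local_key, foreign_key, as_field):
    """Build an aggregation pipeline that joins on a top-level key.

    Keeping the key at the top level of the document (not inside a nested
    array) is what makes it straightforward to index and to join on.
    """
    return [
        {
            "$lookup": {
                "from": foreign_collection,
                "localField": local_key,
                "foreignField": foreign_key,
                "as": as_field,
            }
        },
        # $lookup returns matches as an array; $unwind flattens it to one
        # output document per match
        {"$unwind": "$" + as_field},
    ]


# Hypothetical: join orders to customers on a customer id
pipeline = build_join_pipeline("customers", "customer_id", "_id", "customer")
# With a driver such as PyMongo, this list would be passed to the
# aggregate() method of the "orders" collection.
```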

Use Mongo Query Generators

MongoQL is powerful but comes with a pretty steep learning curve. An intuitive query builder will not only simplify the query building process but also enable more people to build queries. Drag-and-drop query builders allow data engineers to leverage MongoDB query functionality without necessarily knowing MongoQL.
Knowi auto-generates MongoQL through an intuitive drag and drop interface.  It’s designed for data engineers who understand the data but may not be fluent in MongoQL. Easily group, filter and sort data without writing any code.

Trap: ODBC Drivers, SQL Wrappers, Mongo BI Connector

Don’t be fooled into thinking you can simply add a SQL layer on top of your MongoDB and your existing analytics tools will work.  You are about to perform unnatural acts on your MongoDB data because SQL-ifying your MongoDB data is as bad as it sounds. You will be defining schemas, moving data and building complex ETL or ELT processes.  Potentially, worst of all, you will have to determine very early on what data is “important”. Important enough to move, transform and duplicate.
There is a significant cost to adding a SQL layer on top of your Mongo data starting with your initial project costs and timeline. Post implementation, changes are extremely expensive because you have a lot of moving parts between your end users and the data they are trying to analyze.  For example, adding a new field requires schema changes and changes to ETL processes which can be weeks of work.

The real cost here is the ability for the business to experiment with analytics on their Mongo data.  Delivering the first visualization usually just spawns more questions so enabling experimentation is important.
Native integration matters when it comes to enabling business self-service and experimentation.  Knowi natively integrates to MongoDB so there is no need to move data out of MongoDB and no transforming data back into relational structures.  By eliminating the ETL step, you effectively accelerate delivery of your MongoDB analytics project by 10x. Changes are as easy as modifying the query to pull the new field and it is immediately available for use in visualizations.

Managing Visualization Performance and Long Running Queries

Long-running queries are a common issue when it comes to analytics, and MongoDB is not immune. No matter how much you optimize your queries and Mongo, if you are dealing with large datasets or complex joins, your queries will take time to execute. Your end users will not be patient enough to wait if it’s more than a few seconds, which will make them less likely to adopt your analytics platform.

To provide an excellent end-user visualization experience, consider adding a layer to persist query results. This persistence layer could be another MongoDB instance that is used to cache query results for the purpose of powering reporting, for example.
Knowi comes with an optional persistence layer which is essentially a schema-less data warehouse called Elastic Store.  Queries can then be scheduled to run every x min/hours/days and visualizations automatically go against data in Elastic Store and not directly against your Mongo database.
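The general idea of persisting query results can be sketched with a small in-memory cache that reruns a slow query only when its stored result is older than a chosen interval. This is a simplified illustration of the caching concept, not a description of how Elastic Store works; the class name, the `ttl_seconds` parameter, and the sample query are hypothetical.

```python
import time


class QueryResultCache:
    """Keep query results for ttl_seconds so dashboards read the stored
    copy instead of hitting the database on every view."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, result)

    def get(self, key, run_query):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: serve the stored result
        result = run_query()  # stale or missing: run the slow query once
        self._entries[key] = (now, result)
        return result


def slow_query():
    # Stand-in for a long-running aggregation against the database
    return [{"day": "2018-06-01", "signups": 42}]


cache = QueryResultCache(ttl_seconds=300)
first = cache.get("daily_signups", slow_query)
second = cache.get("daily_signups", slow_query)  # served from the cache
```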

Conclusion

MongoDB is an excellent data source for analytics, but most traditional analytics tools like Tableau, Qlik, Looker, etc. do not work directly against MongoDB. This is the biggest trap when implementing MongoDB analytics projects. You will have to introduce a number of intermediary steps to extract data from MongoDB, transform it and load it into a pre-defined relational schema for any of these tools to work. While this sounds straightforward, the devil is in the details: you are adding significant time and cost to your MongoDB analytics projects. On top of that, you are immediately limiting what data business teams can use for analytics and therefore limiting their ability to experiment to gain new insights.
Instead, look at new analytics tools that are purpose-built for performing analytics on NoSQL databases, like MongoDB.  Native integration matters when it comes to accelerating delivery of your analytics project and building a data architecture that is sustainable and scalable over time.  Knowi is one such tool but there are others.

Resources: