Monday, September 26, 2016

Analyzing 1.2 Billion NYC Taxi Rides

Recently, we held a webinar with our friends at Ocean9 focused on on-demand, self-serve analytics on large-scale datasets. We needed a large dataset for our demo, so we turned to the New York City taxi data, nicely put together on GitHub by toddwschneider. Plus, it was an opportune time for me, as I'm headed to New York City this week with lots of cab rides in between meetings.
The dataset consists of trip details for 1.237 billion rides (237 GB of raw CSV data on disk).


We used the following:
  • SAP HANA: An in-memory, relational database. Given its in-memory architecture, it provides fast lookups on large-scale datasets (but is by no means cheap).
  • Ocean9: A Database-as-a-Service for SAP HANA that enables one-click deployment in the cloud.
  • Cloud9 Charts: Our analytics platform.
This setup let us do the analysis entirely in the cloud, with nothing to install (and everything torn down post-analysis).
Frank Stienhans, CTO at Ocean9, has put together a nice blog post on the data import process into HANA. It’s definitely worth a read if you are thinking of using HANA in the cloud.
The import process into an r3.8xlarge instance on AWS took approximately 60 minutes using Ocean9.


After the data was loaded, we connected to it directly using Cloud9 Charts, which features a full HANA integration, including point-and-click generation of HANA-specific SQL queries.
Example query:
SELECT avg(total_amt), avg(tip_amt) FROM nyc.yellow
It took 1 second to return. Not too shabby for a query touching 1.2 billion records.
So let’s turn to some analysis. You can find the full interactive dashboard here:

Trip Geo Clusters

Pickups are heavily concentrated around Manhattan (midtown in particular), as well as JFK and La Guardia.

Day/Hour Trends

Heatmap of trips by day by the hour of day:
A few observations:
Peak times are 6–10 PM and 8–10 AM on weekdays, but notice the dip between 4–5 PM. This was a bit puzzling to me at first, as you'd expect more trips during rush hour. It turns out that around 5 PM is when the shift change occurs and cab drivers head back to the garage. It might be a bit harder to hail a yellow cab during that time.

Monthly Trends & Predictions

The chart below plots the following:
  • Total Monthly Rides since 2009 (Blue)
  • 3 Month Moving Average (Green)
  • Predicted Values (Yellow). This uses Cloud9 Charts's out-of-the-box prediction models, which automatically backtest the data to select the best model.
According to the data, trips have fallen from a peak of almost 16 million in May 2012 to a low of 11 million in February 2016, which coincides with the rise of Uber, Lyft and green cabs in NYC.
However, the prediction model values (in yellow) indicate that the downtrend has stabilized, with a slight uptick expected over the next year.
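As an aside, a trailing moving average like the green series above is simple to reproduce. A minimal sketch in Python (the monthly figures here are illustrative placeholders, not actual values from the dataset):

```python
def moving_average(values, window=3):
    """Trailing moving average; emits None until the window has filled."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i - window + 1 : i + 1]) / window)
    return out

# Illustrative monthly ride counts, in millions (not real data)
rides = [14.2, 13.9, 15.1, 14.8, 13.5]
print(moving_average(rides))  # last three entries are ~14.4, ~14.6, ~14.47
```

The prediction series is more involved (model selection via backtesting), but the smoothing step is exactly this kind of windowed average.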


This particular taxi analysis just scratches the surface: ride data is not just about going from point A to B, but in some ways provides a pulse of the city itself. Time permitting, I’ll put together more detailed analysis of the data in future posts.
The convergence of the cloud, a choice of database types matched to the workload, infrastructure provisioning through Database-as-a-Service, and Analytics-as-a-Service drastically accelerates time to insight and makes self-service analytics a reality.


Guest Post: NYC Cab Rides using SAP HANA and Ocean9

This is a guest blog post from Frank Stienhans, CTO and co-founder of Ocean9, which provides one-click SAP HANA as a service. This post discusses the setup of a HANA cluster for analysis of NYC yellow cab taxi rides from 2009–2016.

We recently held a BrightTALK webinar around self-service analytics on SAP HANA with our friends from Knowi (formerly Cloud9 Charts). We selected a public dataset of NYC yellow taxi rides from 2009 to 2016, nicely stored in Amazon S3 like so many public datasets.

Properties: 217 GB of raw CSV data with 1.231 billion rows.

A word on data oceans

I think it is no longer a matter of debate that the data persistence layer for data oceans is cloud object storage such as Amazon S3, Azure Blob Storage or Google Cloud Storage.
Standard price points are 2.4 to 3.0 cents/GB/month across the three providers.
Each of the three providers also offers an equivalent of infrequently accessed object storage, which comes in at 1.0 to 1.25 cents/GB/month.
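At those price points, keeping even this entire dataset in object storage is trivially cheap. A quick back-of-the-envelope calculation in Python, using the quoted price ranges:

```python
dataset_gb = 217  # raw CSV size of the NYC taxi dataset

# Quoted price ranges, in USD per GB per month
tiers = {
    "standard": (0.024, 0.030),
    "infrequent access": (0.010, 0.0125),
}

for label, (low, high) in tiers.items():
    print(f"{label}: ${dataset_gb * low:.2f} to ${dataset_gb * high:.2f} / month")
```

So the full 217 GB raw dataset costs at most about $6.51 a month in standard storage, and under $3 in the infrequent-access tiers.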
AWS and Google both promise a durability of 99.999999999%. Azure has no published durability statement at this point.
Don't think, though, that object storage is a commodity. Among the three providers there are massive differences in performance, security and best practices.

A word on data lakes

There might also be no better storage for your corporate datasets, including your most confidential ones. That is because of:
  • Strong security, data protection and compliance capabilities
  • Durability and availability by design
  • Unlimited, immediate scaling
  • The price points above
Security options let you lock down access in a number of ways and then specify strong encryption methods. However, as mentioned earlier, the three providers are not equals on security and data protection.

SAP HANA and data lakes and oceans

In the following, I will describe a low-tech and a high-tech approach to loading datasets from the data ocean into SAP HANA.
First, let me describe some general rules you should consider.

General Rules

1) Co-Location

You will want to put your SAP HANA system next to the dataset. This gives you maximum data-load performance at minimum cost (outbound dataset traffic to your HANA system is what you pay for).
In the NYC yellow taxi case, we can deduce from the dataset URL that it is stored in Amazon S3 in us-east-1, so you will want to place your HANA system there.

2) Private Network Optimization

If your HANA system runs in a private subnet, you should configure an AWS VPC endpoint for S3 to get high-performance access at minimum cost. Otherwise, all S3 traffic will go through your NAT layer, which will certainly not bring you higher performance but will lead to higher cost.
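Creating a gateway endpoint for S3 is a one-liner with the AWS CLI. A sketch (the VPC and route-table IDs below are placeholders you would replace with your own):

```shell
# Gateway VPC endpoint for S3 in us-east-1; IDs are placeholders.
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0
```

Once the endpoint is associated with the route table of your private subnet, S3 traffic bypasses the NAT layer entirely.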

3) Use an instance type with 10 Gbit Networking for High Performance Data Loading

We will see later what Amazon S3 can provide in terms of throughput. If you care about data-load speed, then you should select one of:
  • r3.8xlarge (244 GB RAM)
  • m4.10xlarge (160 GB RAM)
  • c3.8xlarge / c4.8xlarge (60 GB RAM)
Why am I not listing X1 with 2 TB of RAM? We are still waiting for the high-performance network drivers for SUSE Linux to activate the 20 Gbit mode. Until then, X1 will not reach the throughput performance of the instances above.

Manual Approach

After taking care of the above you can continue with

4) Copy the Dataset to the HANA Machine.

You should use the AWS Command Line Interface to achieve a decent storage throughput of around 200 MB/second. I would also suggest storing the dataset on a different storage device than the one underneath /hana/data and /hana/log; otherwise, SAP HANA will compete for storage bandwidth during the data load.
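The copy itself is a single AWS CLI command. A sketch (the bucket path and target directory are placeholders; point them at the real dataset location and a dedicated volume):

```shell
# Recursively copy the CSV files to a dedicated staging volume
# (bucket name and paths are placeholders).
aws s3 cp s3://example-taxi-bucket/trip-data/ /staging/nyc-taxi/ --recursive
```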

5) Load Data

Now you can load the data into SAP HANA, either using HANA Studio or the HANA Command Line. This blog post describes nicely how to do this.
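For reference, a CSV load from the HANA SQL console looks roughly like this. A sketch, assuming the target table nyc.yellow_taxi already exists and the files sit under a /staging/nyc-taxi directory; the filename is illustrative, and THREADS and BATCH should be tuned to your instance:

```sql
IMPORT FROM CSV FILE '/staging/nyc-taxi/yellow_tripdata_2016-01.csv'
INTO nyc.yellow_taxi
WITH RECORD DELIMITED BY '\n'
     FIELD DELIMITED BY ','
     THREADS 8
     BATCH 10000;
```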

Cloud Native Approach

Well, a cloud-native approach is to do all of the above in two clicks: one for provisioning the system and one for loading the data.
Ocean9 provides this to you and more.
At Ocean9, we are constantly seeking to get the maximum out of the cloud. We have realized a direct path for loading data from S3 into SAP HANA with no intermediate persistence.
CPU stayed consistently in the 80–90% range. Disk I/O was not an issue, at less than 170 MB/second.
The data load took exactly 60 minutes (and yes, there is further room for improvement).
On x1.32xlarge, data loading took 125 minutes (because of the driver status described above).

HANA Performance

I just ran one statement to get an impression.
select sum(total_amt) from nyc.yellow_taxi
The SQL statement needs to touch all "rows" and can theoretically use all vCPUs in parallel.
The command completes in
  • r3.8xlarge :    1.0 second  (32 vCPUs)
  • r3.4xlarge :    1.8 seconds (16 vCPUs)
  • x1.32xlarge : 0.4 seconds (128 vCPUs)

Backup and Restore

It is always a good idea to perform a data backup to S3 after the data load, using our advanced Backup and Restore implementation for SAP HANA.
For this dataset backup takes 10 minutes and restore takes 8 minutes.
Now we can create a brand new system including this dataset in 20 minutes, with the data already loaded in memory!
For the BrightTALK webinar, we will launch the system on Wednesday morning and terminate it before noon.
This is how the cloud should be used.

What's next?

See it for yourself! Watch the BrightTALK webinar to see this in action, combined with an analytics service from Knowi.
Try it yourself! Use my HANA SQL schema for the NYC Taxi dataset.