Monday, September 26, 2016

Guest Post: NYC Cab Rides using SAP HANA and Ocean9

This is a guest blog post from Frank Stienhas, CTO and co-founder of Ocean9 that provides a one-click SAP HANA as-a-service. This post discusses the setup of a HANA cluster for analysis of NYC Yellow Cab Taxi Rides from 2009-2016.

We recently held a BrightTALK webinar around self-service analytics on SAP HANA with our friends from Knowi (formerly Cloud9 Charts). We selected a public dataset with NYC Yellow Taxi Rides from 2009 to 2016. Nicely stored in Amazon S3 like so many public datasets.

Properties: 217 GB RAW CSV data with 1.231 billion rows.

A word on data oceans

I think it is no longer a discussion that the data persistence layer for data oceans is cloud object storage such as Amazon S3, Azure Blob Storage or Google Cloud Storage.
Standard price points are 2.4 to 3.0 cents / GB / month across the 3 providers.
Also each of the 3 providers have the equivalent of infrequently accessed object storage, which comes in at 1.0 to 1.25 cents / GB / month.
AWS and Google both promise a durability of 99.999999999 %. Azure has no published durability statement at this point.
Don't think though that object storage is a commodity. Between the 3 providers there are massive differences in the domains of performance, security and best practices.

A word on data lakes

There might also be no better storage for your corporate datasets, including your most confidential ones. That is because
  • Strong Security, Data Protection and Compliance capabilities
  • Durability, Availability design
  • unlimited immediate scaling
And the price points above.
Security Options include that you can lock down access in a number of ways and the specify strong encryption methods. However as mentioned earlier the 3 providers are not equals on Security and Data Protection.

SAP HANA and data lakes and oceans

In the following I will describe a low tech and high tech approach to load datasets from the data ocean into SAP HANA.
Let me describe some general rules that you should consider.

General Rules

1) Co-Location

You will want to put your SAP HANA System next to the data set. This will provide you with maximum performance of the data load and minimum cost (outbound dataset traffic to your HANA system).
In the NYC Yellow Taxi case, we can deduce from the Dataset URL that it is stored in Amazon S3 - us-east-1. So you will want to place your HANA System there.

2) Private Network Optimization

If your HANA System runs in a private subnet then you should configure an AWS VPC Endpoint for S3, to provide you with High Performance access at minimum cost. Otherwise all S3 traffic will go through your NAT layer which will certainly not bring you higher performance but it will lead to higher cost.

3) Use an instance type with 10 Gbit Networking for High Performance Data Loading

We will see later what Amazon S3 can provide in terms of throughput. If you care about data load speed then you should select
  • R3.8xlarge (244 GB RAM)
  • M4.10xlarge (160 GB RAM)
  • C3 / C4.8xlarge (60 GB RAM)
Why am I not listing X1 with 2 TB RAM? We are still waiting for the High Performance Network Drivers for SUSE Linux to activate the 20 Gbit mode. Until then X1 will not reach the throughput performance of the instances above.

Manual Approach

After taking care of the above you can continue with

4) Copy the Dataset to the HANA Machine.

You should use AWS Command Line Interface for achieving a decent storage throughput of 200 MB / second. I would also suggest to store the dataset on a different storage device than what is underneath /hana/data and /hana/log. Otherwise SAP HANA will compete for Storage Bandwidth during data load.

5) Load Data

Now you can load the data into SAP HANA, either using HANA Studio or the HANA Command Line. This blog post describes nicely how to do this.

Cloud Native Approach

Well a cloud native approach is to do the above in two clicks. One for Provisioning the system and one for loading the data.
Ocean9 provides this to you and more.
At Ocean9 we are permanently seeking to get the maximum out of the cloud. We have realized a direct path for data loading from S3 to SAP HANA without persistence in between.
CPU was permanently in the corridor of 80-90%. Disk I/O was not an issue with less than 170 MB / second.
The data load took exactly 60 minutes. (and yes there is further room for improvement)
On X1.32xlarge data loading took 125 minutes (because of the Driver status described above)

HANA Performance

I just ran one statement to get an impression.
select sum(total_amt) from nyc.yellow_taxi
The SQL statement needs to touch all "rows" and can theoretically use all vCPUs in parallel.
The command completes in
  • r3.8xlarge :    1.0 second  (32 vCPUs)
  • r3.4xlarge :    1.8 seconds (16 vCPUs)
  • x1.32xlarge : 0.4 seconds (128 vCPUs)

Backup and Restore

Using our Advanced Backup and Restore Implementation for SAP HANA, it is always a good idea to perform a data backup to S3 after data load.
For this dataset backup takes 10 minutes and restore takes 8 minutes.
Now we can create a new brand new system including this dataset in 20 minutes, with data loaded in Memory !
For the brighttalk webinar we will launch the system on Wednesday morning and terminate it before noon.
This is how the cloud should be used.

What's next ?

See it yourself ! Watch the BrightTALK webinar to see this in action combined with an Analytics Service from Knowi !
Try it yourself !  Use my HANA SQL Schema for the NYC Taxi Dataset.