Alluxio Community Office Hour
June 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Lu Qiu, Alluxio
Bin Fan, Alluxio
The hybrid cloud model, where cloud resources run Spark or Presto jobs against data stored on-premises, is an appealing way to reduce resource contention in on-premises environments while also lowering overall costs. One key drawback of a hybrid model is the overhead of transferring data between the two environments. To keep analytics jobs performing as if the entire workload ran on-premises, the compute application needs data and metadata locality.
In this office hour, we demonstrate how a “zero-copy burst” solution helps to speed up Spark and Presto queries in the public cloud while eliminating the process of manually copying and synchronizing data from the on-premise data lake to cloud storage. This approach allows compute frameworks to decouple from on-premise data sources and scale efficiently by leveraging Alluxio and public cloud resources such as AWS.
We will cover:
- Typical challenges of moving data to the cloud and expanding compute capacity.
- Details about the “zero-copy” hybrid cloud solution for burst computing.
- A demo of running Presto analytic queries using remote on-prem HDFS data with Alluxio deployed in AWS EMR.
Bursting Spark or Presto Jobs to AWS using Alluxio
1. Bursting Spark or Presto Jobs to AWS using Alluxio
Lu Qiu
Bin Fan
06/23/2020
2. Goals
● The birth of the hybrid cloud solution
● Details about the “zero-copy” hybrid cloud solution for burst computing
● A demo of running Presto analytic queries using remote on-prem HDFS data with Alluxio deployed in AWS EMR
3. Problem in Single On-prem Cluster
● Hadoop cluster is often compute-bound
● Complex to maintain
[Diagram: a single datacenter running Spark, Presto, Hive, and TensorFlow]
4. Solution in a Nutshell
Separate and bridge compute & storage with Alluxio
5. Alluxio Overview
Data Orchestration for the Cloud
[Diagram: Lines of Business on top; Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface; HDFS Driver, Swift Driver, S3 Driver, NFS Driver]
7. Embrace Public Cloud?
● Independent scaling and on-demand provisioning of both compute and storage
● Flexibility, any amount in any region
● Reduced cost
● Reduced overhead on existing infrastructure
Many benefits, but ...
8. How to Move Everything to the Cloud?
● Existing infrastructure on-premises
● Existing data ingestion pipeline
● Regulatory restrictions
● Migration is time-consuming and affects existing workloads
9. Our Solution: “Zero-copy” bursting compute for hybrid cloud
Alluxio
▪ Orchestrates compute access to on-prem data
▪ Working set of data, not FULL set of data
▪ Local performance without manually copying and synchronizing data
Benefits:
▪ Ease of deployment and manageability: provisioned & managed by EMR
▪ Elasticity: adding deployments & capacity on the fly by EMR
▪ Cost effectiveness: reduced load in On-Prem cluster & reduced data transfer
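The "working set, not full set" idea can be illustrated with a toy read-through cache. This is only a sketch of the general technique, not Alluxio's implementation: the class name `WorkingSetCache`, the `fetch_remote` callback standing in for an on-prem HDFS read, and the least-recently-used eviction are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable


class WorkingSetCache:
    """Toy read-through cache: the remote store is read only on a miss."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity_bytes: int):
        self._fetch_remote = fetch_remote  # hypothetical stand-in for an on-prem read
        self._capacity = capacity_bytes
        self._used = 0
        self._blocks = OrderedDict()       # path -> bytes, oldest entry first
        self.remote_reads = 0

    def read(self, path: str) -> bytes:
        if path in self._blocks:
            self._blocks.move_to_end(path)  # mark as most recently used
            return self._blocks[path]
        data = self._fetch_remote(path)     # miss: one trip to the remote store
        self.remote_reads += 1
        self._admit(path, data)
        return data

    def _admit(self, path: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return                          # too large to cache; serve uncached
        while self._used + len(data) > self._capacity:
            _, evicted = self._blocks.popitem(last=False)  # drop least recently used
            self._used -= len(evicted)
        self._blocks[path] = data
        self._used += len(data)


# Hypothetical on-prem data set: 12 bytes total, but the cache holds only 8.
on_prem = {"/warehouse/a": b"x" * 4, "/warehouse/b": b"y" * 4, "/warehouse/c": b"z" * 4}
cache = WorkingSetCache(on_prem.__getitem__, capacity_bytes=8)
for p in ["/warehouse/a", "/warehouse/a", "/warehouse/b", "/warehouse/a"]:
    cache.read(p)
```

In this run only the first read of each path crosses to the remote store; repeated reads are served locally, and nothing is copied ahead of time.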
10. Today’s Demo
Launch two connected clusters using Terraform
a. An Apache storage cluster (mocking on-prem) in VPC A
b. An EMR compute cluster with Alluxio in VPC B
c. VPC creation and peering are part of the Terraform configuration
d. Alluxio is configured to mount the remote HDFS
e. Presto is configured with the remote Hive metastore and accesses the remote HDFS via Alluxio
Burst workloads to EMR without manual copying and synchronization
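Steps d and e of the demo amount to two small pieces of configuration. The sketch below builds them as strings so the shape is visible; it does not run anything against a cluster. The host names, ports, and mount point are hypothetical, and the slides do not show the exact settings used in the demo. The `alluxio fs mount` subcommand and the `connector.name` and `hive.metastore.uri` catalog properties are the standard Alluxio CLI and Presto Hive connector names.

```python
from typing import List


def alluxio_mount_command(alluxio_path: str, hdfs_uri: str) -> List[str]:
    """Argument list for mounting a remote HDFS path into the Alluxio namespace."""
    return ["alluxio", "fs", "mount", alluxio_path, hdfs_uri]


def presto_hive_catalog(metastore_host: str, port: int = 9083) -> str:
    """Contents of a Presto Hive catalog file pointing at a remote metastore."""
    props = {
        "connector.name": "hive-hadoop2",
        "hive.metastore.uri": f"thrift://{metastore_host}:{port}",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Hypothetical endpoints in the "on-prem" VPC.
mount_cmd = alluxio_mount_command("/on_prem", "hdfs://onprem-namenode:8020/")
catalog_text = presto_hive_catalog("onprem-metastore")
```

How table locations are routed through Alluxio rather than directly to HDFS is not described on the slide, so that part is left out of the sketch.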
11. Tutorials
● About 20 minutes to set up both clusters using Terraform and run an actual query
● AWS tutorial can be found here
● GCP tutorial can be found here
12. Extension: Moving data to the cloud
● Flexible timing to migrate, with fewer dependencies
● Instead of a hard switchover, migrate at your own pace
● Moves the data per policy, e.g. migrate data that has not been used for 7 days from HDFS to S3
Alluxio policy-driven data management
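The 7-day example policy can be modeled as a simple filter over last-access times. This is a plain Python illustration of the rule, not Alluxio's policy syntax or engine, which the slides do not show; `FileInfo` and `select_for_migration` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class FileInfo:
    path: str
    last_access: datetime


def select_for_migration(
    files: Iterable[FileInfo],
    now: datetime,
    idle_for: timedelta = timedelta(days=7),
) -> List[str]:
    """Return paths whose last access is older than the idle threshold."""
    cutoff = now - idle_for
    return [f.path for f in files if f.last_access < cutoff]


# Hypothetical files in the on-prem HDFS namespace.
now = datetime(2020, 6, 23)
files = [
    FileInfo("/warehouse/events/2020-06-01", datetime(2020, 6, 2)),
    FileInfo("/warehouse/events/2020-06-20", datetime(2020, 6, 22)),
]
to_migrate = select_for_migration(files, now)
```

Only the first file is selected here, since its last access falls before the 7-day cutoff; the recently used file stays where it is.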