Bursting Spark or Presto Jobs to AWS using Alluxio
Lu Qiu
Bin Fan
06/23/2020
Goals
● The birth of the hybrid cloud solution
● Details of the “zero-copy” hybrid cloud solution for burst computing
● A demo of running Presto analytic queries on remote on-prem HDFS data with Alluxio deployed in AWS EMR
Problem in Single On-prem Cluster
● Hadoop cluster is often compute-bound
● Complex to maintain
[Diagram: a single datacenter running Spark, Presto, Hive, and TensorFlow]
Solution in a Nutshell
Separate and bridge compute & storage with Alluxio
Alluxio Overview
Data Orchestration for the Cloud
[Diagram: Lines of Business on top; application-facing interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface; storage-facing drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver]
A New Challenge
How to handle bursty & transient compute workloads?
Embrace Public Cloud?
● Independent scaling and on-demand provisioning of both compute and storage
● Flexibility, any amount in any region
● Reduced cost
● Reduced overhead on existing infrastructure
Many benefits, but…
How to Move Everything to the Cloud?
● Existing infrastructure on-premises
● Existing data ingestion pipeline
● Regulatory restrictions
● Time-consuming and affects existing workloads
Our Solution: “Zero-copy” bursting compute for hybrid cloud
Alluxio:
▪ Orchestrates compute access to on-prem data
▪ Working set of data, not FULL set of data
▪ Local performance without manually copying and synchronizing data
Benefits:
▪ Ease of deployment and manageability: provisioned & managed by EMR
▪ Elasticity: adding deployments & capacity on the fly by EMR
▪ Cost effectiveness: reduced load in On-Prem cluster & reduced data transfer
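The “working set, not FULL set” idea can be illustrated with a toy read-through cache. This is only a sketch of the concept, not Alluxio's implementation or API: the class names `RemoteStore` and `ReadThroughCache` and the `/warehouse/...` paths are hypothetical. Data is fetched from the remote store the first time it is read and served locally afterwards, so datasets no query touches are never transferred.

```python
class RemoteStore:
    """Stand-in for on-prem storage; counts how often it is actually read."""

    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]


class ReadThroughCache:
    """Toy cache next to the compute cluster: fetch on first access, then serve locally."""

    def __init__(self, remote):
        self.remote = remote
        self.local = {}

    def read(self, path):
        if path not in self.local:
            # Cache miss: this is the only time data crosses the network.
            self.local[path] = self.remote.read(path)
        return self.local[path]


# Hypothetical datasets living in the on-prem cluster.
remote = RemoteStore({
    "/warehouse/orders": b"orders-data",
    "/warehouse/customers": b"customers-data",
    "/warehouse/archive": b"archive-data",
})
cache = ReadThroughCache(remote)

# A workload touches two of the three datasets, one of them twice.
for path in ["/warehouse/orders", "/warehouse/customers", "/warehouse/orders"]:
    cache.read(path)

print(remote.reads)         # remote fetches performed
print(sorted(cache.local))  # datasets now held near the compute cluster
```

In this toy model the archive dataset never leaves the remote store, and the repeated read of the orders dataset is served from the local copy.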
Today’s Demo
Launch 2 connected clusters using Terraform:
a. An Apache Storage Cluster (mocking on-prem) in VPC A
b. An EMR Compute Cluster w/ Alluxio in VPC B
c. VPC creation and peering are part of the Terraform setup
d. Alluxio is configured to mount the remote HDFS
e. Presto is configured with the remote Hive metastore and accesses remote HDFS via Alluxio
Burst workloads to EMR without manual copies and synchronization
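To make the mounting step more concrete, here is a small sketch of how a mount table can translate a path in a unified namespace into a URI on the mounted storage. It is a conceptual model only, not Alluxio code: the `resolve` function, the `/mnt/hdfs` mount point, and the `onprem-namenode:8020` address are all assumptions made for illustration.

```python
def resolve(mount_table, path):
    """Translate a namespace path to the URI of the storage mounted there."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest (most specific) matching mount point.
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount covers {path}")
    relative = path[len(best):].lstrip("/")
    base = mount_table[best].rstrip("/")
    return f"{base}/{relative}" if relative else base


# Hypothetical mount table: the host name and port are placeholders.
MOUNTS = {"/mnt/hdfs": "hdfs://onprem-namenode:8020"}

print(resolve(MOUNTS, "/mnt/hdfs/user/hive/warehouse/orders"))
```

A query engine in the compute cluster only needs to know the namespace path; where the bytes actually live is decided by the mount table.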
Tutorials
● 20 minutes to set up both clusters using Terraform and run an actual query
● AWS tutorial can be found here
● GCP tutorial can be found here
Extension: Moving data to the cloud
● Flexible timing to migrate, with fewer dependencies
● Instead of a hard switchover, migrate at your own pace
● Moves data per policy, e.g. migrate data that has not been used for 7 days from HDFS to S3
Alluxio policy-driven data management
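The 7-day example policy can be sketched as a simple selection rule over last-access times. This is a hedged illustration of the rule itself, not Alluxio's policy engine or its configuration syntax: the function names, the `hdfs://onprem` and `s3://example-bucket` prefixes, and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def select_for_migration(last_access, now, max_idle=timedelta(days=7)):
    """Return paths idle for longer than max_idle, sorted for stable output."""
    return sorted(p for p, t in last_access.items() if now - t > max_idle)


def plan_moves(paths, src_prefix, dst_prefix):
    """Map each selected source path to its destination under dst_prefix."""
    return [(p, dst_prefix + p[len(src_prefix):])
            for p in paths if p.startswith(src_prefix)]


# Hypothetical last-access times, evaluated at a fixed "now".
now = datetime(2020, 6, 23)
last_access = {
    "hdfs://onprem/data/a": datetime(2020, 6, 22),  # used yesterday
    "hdfs://onprem/data/b": datetime(2020, 6, 10),  # idle for 13 days
    "hdfs://onprem/data/c": datetime(2020, 6, 16),  # idle for exactly 7 days
}

stale = select_for_migration(last_access, now)
for src, dst in plan_moves(stale, "hdfs://onprem", "s3://example-bucket"):
    print(src, "->", dst)
```

With a strict "longer than 7 days" comparison, data idle for exactly 7 days stays where it is; whether a real policy treats that boundary as inclusive is a design choice this sketch does not settle.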
Thank you!