Integrating Alluxio and Presto for Fast and Interactive Data Queries on Azure

Background

Alluxio, formerly known as Tachyon, is a virtual distributed storage system. It unifies data access and bridges computation frameworks and underlying storage systems. Applications can connect to Alluxio and access data from underlying file systems.

Figure 1: Alluxio architectural diagram (source: Alluxio official website)

Presto is attractive to organizations that want to query multiple data sources through a single SQL interface. This use case is common in the business world, where many organizations have distributed, diverse data storage technologies and architectures but have yet to realize full value from their data.

TheDataTeam is a premium Microsoft partner for data engineering and data science engagements on Azure. In this exercise, we worked together with Microsoft to configure Presto with Alluxio on Azure, using Azure Blob Storage as the underlying storage. The objective was to integrate Alluxio and Presto on Azure VMs and run a few gated queries with Presto on data sets stored in Azure Blob Storage. The setup and steps described below were not performed on a production environment. This post is technical in nature, and outlines the steps and some of our learnings from the exercise. In the rest of this post, we will cover component installation and setting up and configuring Alluxio and Presto on Azure virtual machines, using Azure Blob Storage.

Component Installation and Configuration

In this section, we will look at the key components we’ve used for this big data system integration exercise, and how we went about configuring and installing them.

Technology or Framework              Installed Version
Apache Hive                          2.0.1
MySQL Database (community edition)   5.5.33
Alluxio                              1.4.0
Presto                               0.163
Hadoop                               2.7.2

Hive, Hadoop and MySQL… and Alluxio

Let’s now look at how these components are installed and configured. We get started with the basic database environment within which queries can be run – we choose Hive, Hadoop and MySQL here.

  1. Download Hive, Hadoop and MySQL. In a production environment, it is recommended to use MySQL as the metastore, so we configure a MySQL metastore for Hive instead of the default Derby metastore; a sketch of the relevant hive-site.xml properties follows this list. At this stage, it is important to add the Hadoop path in your .bashrc file, since the Hive shell requires the Hadoop path to be set before it starts. This link is a good resource to understand the configuration process for the MySQL metastore with Hive.
  2. Now that we have installed and configured Hive, Hadoop and MySQL, we can download and configure Alluxio, a memory-speed virtual distributed storage system. Please follow this link to set up Alluxio locally on a single machine. Alluxio is distributed by default, and this install of Alluxio can act as both the master node and the worker node. Once the Alluxio setup is done and tested, we need to configure Alluxio; in our case, we integrated Alluxio with Azure Blob Storage. A sketch of this configuration also follows below.
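For step 1, here is a minimal sketch of the hive-site.xml properties (placed inside the <configuration> element) that point Hive at a MySQL metastore. The database name, user and password below are hypothetical placeholders:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>

Remember that the MySQL JDBC driver JAR (mysql-connector-java) must also be placed in Hive's lib directory for this wiring to work.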
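For step 2, pointing Alluxio's under storage at Azure Blob Storage comes down to a couple of entries in alluxio/conf/alluxio-site.properties. The sketch below follows the Alluxio documentation for the Azure Blob Store under storage; the container, account and access key are placeholders:

    # alluxio/conf/alluxio-site.properties
    alluxio.underfs.address=wasb://<container>@<account>.blob.core.windows.net/<folder>/

    # Storage account credentials; depending on the Alluxio build, this key may
    # instead belong in a core-site.xml referenced via alluxio.underfs.hdfs.configuration
    fs.azure.account.key.<account>.blob.core.windows.net=<access_key>

Depending on your Alluxio build, the hadoop-azure and azure-storage client JARs may also need to be on Alluxio's classpath for the wasb:// scheme to resolve.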

Working with Alluxio On a Cluster

Setting up Alluxio on a cluster requires us to designate master and worker nodes. Additionally, we have to configure password-less logins across the cluster. These steps are described below, with a condensed sketch after the notes:

  1. Add the Inet address (private IP address) of the master and worker machines to the /etc/hosts file of the master machine.
  2. Add the Inet address (private IP address) of the master machine to the /etc/hosts file of all worker machines.
  3. Add the private IP addresses of all the worker machines to the alluxio/conf/workers file of the master machine.
  4. Add the master's private IP address to alluxio/conf/alluxio-env.sh.
  5. Enable password-less login between the master and all worker machines, and from the master to itself as well.
  6. Add the below property in the hive/conf/hive-site.xml file:
name: fs.defaultFS
value: alluxio://<master_hostname>:19998

Note:

  1. Alluxio can be configured to run in High Availability (HA) mode using ZooKeeper or a shared file system for storing journals. In that case, Steps 4 and 5 would change accordingly.
  2. If you don't wish to use the master as a worker, then don't add the master's IP address to the alluxio/conf/workers file.
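Condensing steps 1 to 5 above, the wiring on the cluster might look like the following sketch; all IP addresses, host names and the azureuser account are placeholders:

    # On the master: make each worker resolvable (repeat per worker)
    echo "10.0.0.5  worker1" | sudo tee -a /etc/hosts
    # On each worker: make the master resolvable
    echo "10.0.0.4  master" | sudo tee -a /etc/hosts

    # Register every worker's private IP with the master
    printf "10.0.0.5\n10.0.0.6\n" >> alluxio/conf/workers

    # Point the cluster at the master (variable name per the Alluxio 1.x docs)
    echo "ALLUXIO_MASTER_HOSTNAME=10.0.0.4" >> alluxio/conf/alluxio-env.sh

    # Password-less SSH from the master to all workers, and to itself
    ssh-keygen -t rsa                 # accept defaults, empty passphrase
    ssh-copy-id azureuser@10.0.0.5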

Once configured, Alluxio's web interface can be opened using this URL: http://<master_hostname>:19999

Installing and Configuring Presto

Presto is an open source SQL query engine with a distributed architecture. Presto's key benefit is that it was designed for interactive query workloads, which suit data science and fast-querying use cases well, such as the Hive and LLAP use case described in this earlier post by our colleague, Divya. Now that we understand what Presto is for, let's look at how to download, install and configure it. Presto is available at this link, and in this case, it has to be installed on all machines in the cluster. A detailed treatment of the Presto configuration process is available at this link.

While we’re on Presto configuration, it is good to remember a couple of things:

  • Make sure that every worker node has a different value for the node.id property in the presto/etc/node.properties file
  • Presto requires Java version 1.8.0_60 or above

Once the base Presto configuration is done, we can configure the Hive connector and then, optionally, the JMX connector. A minimal sketch of the configuration files involved follows.
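Per the Presto deployment documentation, the minimal configuration involves three files under presto/etc; the values below are illustrative:

    # etc/node.properties -- node.id must be unique on every node
    node.environment=production
    node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
    node.data-dir=/var/presto/data

    # etc/config.properties -- coordinator shown; workers set coordinator=false
    coordinator=true
    node-scheduler.include-coordinator=false
    http-server.http.port=8080
    query.max-memory=5GB
    query.max-memory-per-node=1GB
    discovery-server.enabled=true
    discovery.uri=http://<coordinator_hostname>:8080

    # etc/catalog/hive.properties -- the Hive connector, pointed at the Hive metastore
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://<metastore_hostname>:9083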

Azure Virtual Machine Settings

So far we’ve dealt with the configurations and settings for Hadoop, Hive, MySQL, Alluxio and Presto. Now, we can make changes to the Azure VMs. Let’s look at what properties to change:

  1. Enable swap memory in the /etc/waagent.conf file, and restart the VM:
     ResourceDisk.EnableSwap=y
  2. Increase the max user processes in the /etc/security/limits.d/90-nproc.conf file and restart the VM. The default value is 1024. Note: this step is mandatory for the master node of Presto.
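A sketch of both VM-level changes; the swap size and process limit shown are illustrative values:

    # /etc/waagent.conf -- have the Azure agent create a swap file on the resource disk
    ResourceDisk.EnableSwap=y
    ResourceDisk.SwapSizeMB=4096

    # /etc/security/limits.d/90-nproc.conf -- raise the max user processes from 1024
    *       soft    nproc   65535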

Setting up the Presto JAR

Our Azure virtual machines are now configured to run Presto. We now need to download and deploy the Presto CLI JAR. Download presto-cli-0.177-executable.jar to the master VM, rename it as presto, and make it executable. Once this is done, the Presto shell can be opened using this command:

./presto --server <coordinator_hostname>:8080 --catalog hive --schema default
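For reference, fetching and preparing the CLI might look like the sketch below; the Maven Central path is the usual home of Presto CLI artifacts, but verify it for your version:

    wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.177/presto-cli-0.177-executable.jar
    mv presto-cli-0.177-executable.jar presto
    chmod +x presto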

Once configured, Presto's web interface can be opened using: http://<coordinator_hostname>:8080

Querying Data from Azure Blob Storage

In our installation and configuration process so far, we've set up a system capable of fast and interactive queries using Alluxio and Presto. Now we will see how to query data on Azure Blob Storage using this system:

  1. First, mount the underlying file system location "uri" into the Alluxio namespace at "path" using the command:
    alluxio/bin/alluxio fs mount "path" "uri"
  2. Open the Hive shell
  3. Then mention the "path" in the CREATE query statement, including the path to the data directory of interest:
    E.g.

    CREATE EXTERNAL TABLE IF NOT EXISTS person_details(f_name string, l_name string)
    STORED AS ORC
    LOCATION 'alluxio://<master_hostname>:19998/path/'
    tblproperties ("orc.compress"="ZLIB");
  4. If the table is partitioned, run an msck repair query to add the partition metadata to the Hive catalog: msck repair table person_details
  5. Now, exit the Hive shell. You should be able to open the Presto shell and run your query; an end-to-end sketch follows.
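Putting the whole flow together, an end-to-end pass might look like this sketch; the container, account, host names and paths are all hypothetical:

    # 1. Mount a Blob Storage folder into the Alluxio namespace
    alluxio/bin/alluxio fs mount /data wasb://<container>@<account>.blob.core.windows.net/data

    # 2. Create the external table over the mounted path
    hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS person_details(f_name string, l_name string)
             STORED AS ORC
             LOCATION 'alluxio://<master_hostname>:19998/data/person_details'
             tblproperties ('orc.compress'='ZLIB');"

    # 3. Query it interactively from Presto
    ./presto --server <coordinator_hostname>:8080 --catalog hive --schema default \
      --execute "SELECT f_name, l_name FROM person_details LIMIT 10;"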

Figure 2: Screenshot of the Presto web interface for a particular query

Concluding Remarks

In this post, we've seen how Presto can be configured with Alluxio, and how the resulting system can be used to query data on Azure Blob Storage. We first looked at installing and configuring the base Hadoop, Hive and MySQL components, and then at Alluxio and Presto configuration. We've also covered some of the tweaks and gotchas along the way, including how to configure the Presto master node and how to set up the Presto runtime files.

Once the above setup is completed, one can analyze query performance on a cluster using Presto's handy web interface. It displays multiple parameters, such as degree of parallelism, total memory utilization in the cluster, and the number of rows processed per second. Together, these parameters give data engineering teams a fair sense of how resources are being utilized in such low-latency, interactive querying use cases.

In real-world querying workloads, such as data science teams using Apache Zeppelin and Apache Spark to run workloads spread across different machines, a setup like this featuring Alluxio and Presto can be a real force multiplier for the data engineering teams who support such an infrastructure and enable it to deliver value.
