Apache Presto

Ramesh Gangineni
4 min read · Sep 16, 2019

Ever thought of joining a table in SQL Server with a table in MySQL, or a table in an RDBMS with a table in a NoSQL store? That sounds like a strange requirement. Can we do it?

Yes, we can, using Presto.

What is Presto:

Apache Presto is a distributed parallel query execution engine, optimised for low latency and interactive query analysis. Presto runs queries easily and scales from gigabytes to petabytes without downtime.

A single Presto query can process data from multiple sources such as HDFS, MySQL, Cassandra, Hive and many more. Presto is written in Java and is easy to integrate with other data infrastructure components.
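To make that concrete, here is a minimal sketch of one query reading from two sources at once. The catalog, schema and table names (mysql.shop.orders, hive.warehouse.customers) are hypothetical placeholders, not from any real deployment:

    -- one query, two data sources: MySQL and Hive
    SELECT o.order_id, c.customer_name
    FROM mysql.shop.orders AS o          -- table living in MySQL
    JOIN hive.warehouse.customers AS c   -- table living in Hive/HDFS
      ON o.customer_id = c.customer_id;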

History:

Presto started as a project at Facebook, to run interactive analytic queries against a 300PB data warehouse built on large Hadoop/HDFS-based clusters. Prior to building Presto, Facebook used Apache Hive, which it created and rolled out in 2008 to bring the familiarity of SQL syntax to the Hadoop ecosystem. Hive had a significant impact on the Hadoop ecosystem by simplifying complex Java MapReduce jobs into SQL-like queries while still executing jobs at high scale. However, it wasn't optimised for the fast response times that interactive queries need.

In 2012, the Facebook Data Infrastructure group built Presto, an interactive query system that could operate quickly at petabyte scale. It was rolled out company-wide in the spring of 2013. In November 2013, Facebook open-sourced Presto under the Apache License and made it available for anyone to download on GitHub. Today, Presto has become a popular choice for interactive queries on Hadoop, with contributions from Facebook and many other organisations. Facebook's own deployment of Presto is used by over a thousand employees, who run more than 30,000 queries processing one petabyte of data daily.

Architecture overview with Hive Connector:

The architecture of Presto is similar to a classic MPP (massively parallel processing) DBMS architecture.

[Architecture diagram: Presto CLI client → coordinator → worker nodes → Hive connector (metastore + HDFS)]

Client:

The client (for example, the Presto CLI) submits SQL statements to the coordinator and receives the results.
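Connecting with the command-line client looks roughly like this. The host and port are an assumption based on the port used later in this article; adjust them to your coordinator:

    # connect the CLI to the coordinator (host/port assumed)
    ./presto --server localhost:9988 --catalog hive --schema default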

Coordinator:

The coordinator is the master daemon. It first parses the SQL query, then analyses it and plans the execution. The scheduler pipelines the execution, assigns work to the nodes closest to the data and monitors progress.
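You can inspect the plan the coordinator builds with EXPLAIN. The table name below is a hypothetical placeholder:

    -- show the distributed query plan without executing the query
    EXPLAIN SELECT customer_id, count(*)
    FROM mysql.shop.orders
    GROUP BY customer_id;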

Connector:

Storage plugins are called connectors. Hive, HBase, MySQL, Cassandra and many more are available as connectors, and you can also implement a custom one. The connector provides the metadata and the data for queries; the coordinator uses it to fetch the metadata needed to build a query plan.
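Each connector is mounted as a catalog, and tables are addressed as catalog.schema.table. A quick way to explore what is mounted (the mysql catalog and shop schema here are assumptions matching the demo below):

    SHOW CATALOGS;               -- one catalog per properties file in etc/catalog
    SHOW SCHEMAS FROM mysql;     -- schemas exposed by the MySQL connector
    SHOW TABLES FROM mysql.shop; -- tables inside one schema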

Worker:

The coordinator assigns tasks to the worker nodes. The workers fetch the actual data through the connector. Finally, a worker node delivers the result back to the client.

Who uses Presto:

Besides Facebook, well-known adopters include Netflix, Airbnb, Uber and Twitter, and Amazon's Athena service is built on Presto.

Demo:

Move to the catalog directory under the Presto installation (etc/catalog) and create a properties file for each source, as shown below.
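For example, to mount a MySQL catalog and a Hive catalog, you could create the two files below. The hostnames, ports and credentials are placeholders for your own environment:

    # etc/catalog/mysql.properties
    connector.name=mysql
    connection-url=jdbc:mysql://localhost:3306
    connection-user=root
    connection-password=secret

    # etc/catalog/hive.properties
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://localhost:9083

Restart the Presto server after adding catalog files so the new catalogs are picked up.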

Log in to the Presto CLI and run the queries:
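A session joining the two catalogs defined above might look like the following; the customers and orders tables are hypothetical stand-ins for your own data:

    presto> SELECT c.customer_name, sum(o.amount) AS total_spent
         -> FROM hive.default.customers AS c
         -> JOIN mysql.shop.orders AS o
         ->   ON c.customer_id = o.customer_id
         -> GROUP BY c.customer_name;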

We can see the query stats in the Presto UI at https://localhost:9988 (the coordinator's HTTP port configured via http-server.http.port in etc/config.properties).

Originally published at https://blog.imaginea.com.
