Little Things about AWS Glue

It happened and it was not a smooth drive.

Ava Chen
Published in Ephod Technology
6 min read · Oct 6, 2020


This year we took on a case where we were asked to help build a data pipeline on AWS Glue. As heavy EMR users, Glue was not a familiar concept to us. After months of struggling, digging, and thinking, here are some of our observations.

AWS Glue was first announced in 2017. AWS released Glue in order to provide what EMR did not: Glue is a serverless ETL service built on top of AWS EMR.

AWS breaks the ETL service down into distinct modules that work both independently and together. Although AWS Glue is labeled an ETL service, it covers the full data life cycle. From the Data Catalog to security settings, AWS Glue intends to provide a fully managed data service.

AWS Glue provides a set of functions to meet data processing needs at different stages. A handful of fundamental components are enough to draw the full picture of a data pipeline, including data application development.

Fundamental Glue components that compose the full picture of data processing.

Databases & Tables

Under the Data Catalog section, you will find the database-and-table combo. Although AWS Glue is a fully managed service for big data processing, its databases and tables do not come from the relational database world. Both AWS Glue and EMR are built upon Hadoop, a framework whose distributed file system stores and processes massive amounts of data across multiple servers. It is obvious that AWS intends to separate the data source from its interface, and this is a decent design for big data, where data keeps piling up in distinct shapes and formats.

However, here is the question we would ask: do we really access the entire dataset on a regular basis? If not, why store the data in managed storage and pay a high maintenance fee for it? Hence, tables in Glue hold definitions of a dataset that are decoupled from the data source itself. The data is then accessed through these pre-defined definitions, with the all-time favorite SQL-like syntax, only when it is needed.
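To make this concrete, here is a minimal sketch of reading a catalog-defined table from inside a Glue Spark job. The database and table names are placeholders of mine, and glueContext and spark are assumed to be an already-initialized GlueContext and its Spark session (see the Jobs section below).

    # Read a table defined in the Glue Data Catalog; the data itself stays in S3.
    # "sales_db" and "orders" are hypothetical names, not from this project.
    orders = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="orders",
    )

    # Convert to a Spark DataFrame and query it with plain SQL.
    orders.toDF().createOrReplaceTempView("orders")
    spark.sql("SELECT order_id, amount FROM orders WHERE amount > 100").show()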

Connections

In addition to its own file system, AWS provides one of its oldest services, S3, as a source and destination for a data process. Glue also provides an easy interface for accessing multiple data provider services through the Connections function. By setting up a connection in Glue, you can access data through the pre-configured connection in a Glue job, without hard-coding secrets and passwords in a connection string.
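For example, pulling credentials from a pre-configured connection inside a Glue Spark job might look roughly like the sketch below. The connection and table names are hypothetical, and glueContext and spark are assumed to be initialized as in the Jobs section below.

    # Fetch the connection details (URL, user, password) stored in Glue,
    # instead of hard-coding them in the script.
    jdbc_conf = glueContext.extract_jdbc_conf("my-postgres-connection")

    customers = (
        spark.read.format("jdbc")
        .option("url", jdbc_conf["url"])        # e.g. jdbc:postgresql://host:5432
        .option("dbtable", "public.customers")  # hypothetical table
        .option("user", jdbc_conf["user"])
        .option("password", jdbc_conf["password"])
        .load()
    )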

Crawlers

Crawler is a new concept in data processing. It reminds me of a scene in the Star Trek series, when Voyager meets aliens and needs to use different language patterns to adapt data from the alien spaceship. In the good old days, we data developers needed to know what data to expect before the import process began. With CREATE TABLE syntax, the data schema had to be discussed and planned beforehand, and the import process broke when unexpected data was delivered to our system. Now Glue is smart enough to analyze new data as it arrives. A crawler produces a schema automatically, on a schedule or on a triggered event, and after it finishes analyzing a dataset, the schema is turned into a table in Glue.
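For reference, a crawler can also be created and started outside the console, for example with boto3. Everything in this sketch, including the role ARN, database name, S3 path, and schedule, is a placeholder of mine rather than a value from this project.

    import boto3

    glue = boto3.client("glue")

    # Create a crawler that scans an S3 prefix and writes the inferred schema
    # into a catalog database, running every night at 02:00 UTC.
    glue.create_crawler(
        Name="raw-events-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="raw_events_db",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/events/"}]},
        Schedule="cron(0 2 * * ? *)",
    )

    # Or kick it off immediately.
    glue.start_crawler(Name="raw-events-crawler")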

The crawler story, however, has two sides. On the one hand, it is smart enough to analyze data in real time: if unexpected data comes into the system, the data flow keeps working even when we do not know the quality of the incoming data. On the other hand, the crawler is slow. It does not infer the data format from a random sample of the dataset, as we humans usually would for gigabytes of data; it reads through the entire dataset before drawing a conclusion about the format. That analysis can take a long time and is not worthwhile when you find that most of the column types end up as strings.

Jobs

The Glue job is the center of AWS Glue; most of the coding is done in a job. AWS Glue jobs currently support Python and Scala in three modes: Python shell, Spark (Scala), and PySpark. Python shell is a bit tricky, and it took me a while to understand the difference between PySpark and Python shell. Python shell is used for non-Spark jobs. If you want to copy one data file from an S3 bucket to another, creating a Spark job for that scenario is clearly overkill, and AWS Glue provides an alternative to avoid overhead like that. A Python shell job is simply a serverless computing unit that executes a Python script. Python shell provides basic Python libraries, but you will need to zip your own Python files and point the Python library path setting on the Glue console to that zip file. It is worth noting that a Glue Spark job usually takes minutes to initialize before running the actual work, while a Python shell job starts almost instantaneously.
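To give a sense of what a job script looks like, here is the minimal boilerplate a PySpark Glue job typically starts from; it is only a skeleton sketch, with the actual read, transform, and write logic left out.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Glue passes --JOB_NAME (plus any custom arguments) to the script.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session

    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # ... read, transform, and write DynamicFrames or DataFrames here ...

    job.commit()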

Triggers

AWS also provides a way to kick off a Glue job: the trigger. A trigger can be an event, a scheduled time, or a manual action. Triggers are important in a workflow, as they connect a series of jobs together to build up an ETL service.
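As an illustration, a scheduled trigger can be created with boto3 roughly as follows; the trigger and job names are placeholders of mine.

    import boto3

    glue = boto3.client("glue")

    # Start the (hypothetical) "load-orders" job every day at 03:00 UTC.
    glue.create_trigger(
        Name="nightly-load",
        Type="SCHEDULED",
        Schedule="cron(0 3 * * ? *)",
        Actions=[{"JobName": "load-orders"}],
        StartOnCreation=True,
    )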

Workflows

AWS Glue Workflows is a real breakthrough for an ETL service. A data pipeline can be as simple as loading data, or as complicated as ten distinct data processes handling multiple types of data with different kinds of triggers at the same time. A workflow can be implemented with a simple drag and drop of triggers, jobs, and conditions. AWS simplifies a data pipeline down to the concepts of triggers and jobs: each Glue job by default returns a status that Glue recognizes, and a Glue trigger can then initiate the next job based on the execution status of the previous one.
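The same wiring can also be scripted. Below is a rough sketch, with placeholder names, of a workflow in which an on-demand trigger starts a first job and a conditional trigger starts a second job only if the first one succeeds.

    import boto3

    glue = boto3.client("glue")

    glue.create_workflow(Name="orders-pipeline")

    # Entry point of the workflow: start the first job on demand.
    glue.create_trigger(
        Name="start-orders-pipeline",
        WorkflowName="orders-pipeline",
        Type="ON_DEMAND",
        Actions=[{"JobName": "load-orders"}],
    )

    # Run the second job only after the first one succeeds.
    glue.create_trigger(
        Name="transform-after-load",
        WorkflowName="orders-pipeline",
        Type="CONDITIONAL",
        StartOnCreation=True,
        Predicate={
            "Conditions": [
                {
                    "JobName": "load-orders",
                    "LogicalOperator": "EQUALS",
                    "State": "SUCCEEDED",
                }
            ]
        },
        Actions=[{"JobName": "transform-orders"}],
    )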

AWS Glue Studio

AWS Glue Studio is a new service that came out in October 2020. It offers an interface serving a similar purpose to AWS Glue Workflows, but adds S3 storage to the data pipeline configuration, bringing more flexibility through a user-friendly, Glue-job-oriented dashboard.

Dev Endpoint

Dev Endpoint is a helpful development tool for working with Glue. There are different ways to develop on Glue, but running a Spark job may take a while for each small try. If you are a Jupyter Notebook type of person, a dev endpoint will be at your service to speed up your development. A dev endpoint is a computing service that sits on top of EMR, and with Jupyter Notebook support, development is about as comfortable as it can be.

Notebook

AWS has integrated Jupyter Notebook as part of its big data services. Although different options for Glue development are available, Jupyter Notebook on AWS offers an easy-to-access interface without any installation. Jupyter Notebook can also be installed locally on your own computer with a persistent connection to a Glue dev endpoint. AWS also recommends Zeppelin Notebook in combination with Apache Livy and a dev endpoint, but personally I prefer Jupyter Notebook to Zeppelin.

Overall, AWS Glue provides a more convenient way to engage in the data processing field. Starting from a dev endpoint and Jupyter Notebook, the AWS Glue job development cycle can be greatly shortened. AWS Glue Workflows lets users set up both simple and complicated data pipelines in a few simple steps. Nowadays, when manual operation is expected to be minimized, AWS Glue is surely good tidings for the big data industry.

There are downsides to AWS Glue. First of all, AWS Glue costs more than EMR. AWS charges Glue per DPU-hour (Data Processing Unit hour; one DPU maps to 4 vCPUs and 16 GB of memory). At the time of writing, Glue costs $0.44 per DPU-hour. By comparison, on EMR an m5.xlarge in the US East (Ohio) region costs $0.192 per EC2-hour plus $0.048 per EMR-hour. Based on these list prices, Glue costs roughly twice as much as comparable EMR capacity. If price is not a big concern, Glue might be a good choice.
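As a rough illustration of the gap, take a hypothetical job that needs ten workers for one hour: on Glue, 10 DPUs × $0.44 comes to $4.40, while ten m5.xlarge instances on EMR (each roughly matching a DPU at 4 vCPUs and 16 GB of memory) come to 10 × ($0.192 + $0.048) = $2.40, before storage and other charges.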

AWS Glue wraps Spark and Hadoop in order to position data programming in the context of a data pipeline on AWS. Spark itself is not a mystery, as supporting material can be found online in abundance. Glue's own version of Spark, mixed with multiple AWS services, can be quite a mystery to the general public, though. There are a few traps I stepped on when working with Glue, and I ended up using plain Spark DataFrames instead. I would recommend Glue for simple data conversion, such as CSV to Parquet format conversion, but not for complicated data analysis and computation at this stage.
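To close with the kind of task I do recommend Glue for, here is a minimal sketch of a CSV-to-Parquet conversion; the bucket paths are placeholders, and glueContext is assumed to be initialized as in the job skeleton above.

    # Read raw CSV files from S3 (header row included).
    csv_frame = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/raw/orders/"]},
        format="csv",
        format_options={"withHeader": True},
    )

    # Write them back out as Parquet to a different prefix.
    glueContext.write_dynamic_frame.from_options(
        frame=csv_frame,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/orders/"},
        format="parquet",
    )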
