Skip to main content

AWS Glue

What is it

A serverless, fully managed data integration service that makes it easy to discover, prepare, move, and integrate data from various sources for analytics, machine learning, and application development.

What it's for

Perform ETL (Extract, Transform, Load) operations at scale, automate data pipelines, and manage data metadata.

Use cases

  • Data preparation for data warehousing and data lakes
  • Data transformation for machine learning
  • Data catalog for data discovery and governance
  • Integration of data from different sources
  • Automation of ETL tasks

Key points

  • Serverless: No infrastructure to provision or manage
  • Data Catalog: The AWS Glue Data Catalog is a central repository for metadata about all your data assets
  • ETL: Supports visual ETL with AWS Glue Studio, as well as Python and Scala scripts
  • Crawlers: Automatically discovers data schema and metadata
  • Integration: Integrates with other AWS services, such as S3, Redshift, Athena, EMR, etc.

Comparison with Amazon EMR

  • AWS Glue: A serverless, fully managed ETL service, ideal for simpler and automated data transformations, and for cataloging data. More suitable for developers who prefer a no-code or low-code approach.
  • Amazon EMR: More flexible for complex big data workloads that require control over the cluster and the ability to use different frameworks (Hadoop, Spark, etc.). Requires more management of the underlying infrastructure.