In this article, I’ll show you how to replicate a cloud-based or on-premise infrastructure locally using Docker, Dremio, LocalStack, and Spark. Whether you’re debugging complex configurations or experimenting with new features, this hands-on tutorial provides everything you need to get started.
Shopping list:
* Docker and Docker Compose
* A LocalStack Pro auth token (Glue emulation is a Pro feature)
* A local Apache Spark installation (we'll use spark-shell later on)
To start, we’ll create a Docker composition that includes both a Dremio instance and LocalStack. This setup will allow us to emulate a local environment similar to our cloud setup, making testing and experimentation easy and safe.
Here’s a reference docker-compose.yml file you can use:
version: '3.8'
services:
  dremio:
    image: dremio/dremio-oss:25.2
    container_name: dremio
    ports:
      - "9047:9047"
      - "31010:31010"
      - "45678:45678"
      - "32010:32010"
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    depends_on:
      - localstack
    restart: always
  localstack:
    image: localstack/localstack-pro
    container_name: localstack
    ports:
      - "4566:4566"
    environment:
      - SERVICES=s3,glue
      - LOCALSTACK_AUTH_TOKEN=redacted :)
      - USE_SSL=1
    volumes:
      - localstack_data:/var/lib/localstack
      - /var/run/docker.sock:/var/run/docker.sock
      - ./data:/data
    restart: always
volumes:
  dremio_data:
  localstack_data:
As you can see, the docker-compose.yml file defines two services: dremio and localstack.
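Note that the LOCALSTACK_AUTH_TOKEN value is redacted above. Rather than hard-coding your own token, one option is to let Compose read it from your shell environment via variable substitution (a small sketch, assuming you have already exported LOCALSTACK_AUTH_TOKEN locally):

environment:
  - SERVICES=s3,glue
  # picked up from the host shell when you run docker-compose up
  - LOCALSTACK_AUTH_TOKEN=${LOCALSTACK_AUTH_TOKEN}
  - USE_SSL=1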
Once you have your docker-compose.yml file set up, it’s time to launch the environment.
Run the following command to start the Docker composition:
docker-compose up
This will spin up both the Dremio and LocalStack instances. Once the services are up, navigate to http://localhost:9047 in your browser to access your fresh local installation of Dremio.
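Before configuring anything, it can also help to verify that LocalStack came up with the right services. A quick check against its health endpoint (a standard LocalStack endpoint; the port assumes the mapping above) should list s3 and glue as available:

curl http://localhost:4566/_localstack/health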
With Dremio and LocalStack up and running, let's connect and configure them to emulate a production-like environment. In Dremio, create a new AWS Glue Data Catalog source and point it at LocalStack by adding the following connection properties. I'm sharing my findings here, but feel free to experiment with different configurations; these should provide a solid starting point!
* aws.glue.endpoint = http://localstack:4566
* fs.s3a.endpoint = http://localstack:4566
* fs.s3a.path.style.access = true
* dremio.s3.compat = true
Click Save and we will have a fresh new — but still empty — source in Dremio 🚀
Now, let’s populate our Dremio Source with Parquet and Delta tables.
Let’s start with the Parquet one. Open a spark-shell on your machine and write a table:
> spark-shell
In your Spark shell, define some sample data and schema:
import org.apache.spark.sql.Row

val data = Seq(
  Row("P001", "Smartphone X", 699L, "USD", "Electronics", 1.675e12),
  Row("P002", "Laptop Y", 999L, "USD", "Electronics", 1.675e12),
  Row("P003", "Headphones Z", 199L, "EUR", "Accessories", 1.675e12),
  Row("P004", "Gaming Console", 399L, "USD", "Gaming", 1.675e12),
  Row("P005", "Smartwatch W", 299L, "GBP", "Wearables", 1.675e12)
)

import org.apache.spark.sql.types._

val schema = StructType(
  List(
    StructField("product_id", StringType, nullable = true),
    StructField("product_name", StringType, nullable = true),
    StructField("price", LongType, nullable = true),
    StructField("CURRENCY", StringType, nullable = true),
    StructField("category", StringType, nullable = true),
    StructField("updated_at", DoubleType, nullable = true)
  )
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.write.parquet("data/parquet")
🔍 Here, we’re writing the data into the local data/parquet directory, which is bind-mounted into the LocalStack container as /data in docker-compose.yml.
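Before moving the files into LocalStack, you can optionally read the table back in the same shell to confirm that the schema and rows look as expected:

spark.read.parquet("data/parquet").printSchema()
spark.read.parquet("data/parquet").show()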
Let’s log into our LocalStack instance and create the Glue database and table (the metadata layer).
docker exec -it --user root localstack /bin/bash
awslocal glue create-database --database-input '{"Name": "database"}'
awslocal glue create-table \
  --database-name database \
  --table-input '{
    "Name": "parquet",
    "StorageDescriptor": {
      "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
      "SortColumns": [],
      "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
      "SerdeInfo": {
        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
        "Parameters": {
          "serialization.format": "1"
        }
      },
      "Location": "s3://parquet/",
      "NumberOfBuckets": -1,
      "StoredAsSubDirectories": false,
      "Columns": [
        {"Name": "product_id", "Type": "string"},
        {"Name": "product_name", "Type": "string"},
        {"Name": "price", "Type": "bigint"},
        {"Name": "CURRENCY", "Type": "string"},
        {"Name": "category", "Type": "string"},
        {"Name": "updated_at", "Type": "double"}
      ],
      "Compressed": false
    },
    "PartitionKeys": [],
    "Parameters": {
      "transient_lastDdlTime": "1728663697",
      "spark.sql.sources.provider": "parquet",
      "spark.sql.create.version": "3.3.0-amzn-1"
    },
    "TableType": "EXTERNAL_TABLE",
    "Retention": 0
  }'
This creates a Glue table called “parquet”, with its S3 location set to s3://parquet/. But that bucket does not exist yet! Let’s create it and upload the Parquet data:
awslocal s3 mb s3://parquet/
awslocal s3 cp --recursive /data/parquet/ s3://parquet
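To make sure both the data and the metadata are in place, you can list the bucket and fetch the table definition back from Glue (both are standard awslocal/AWS CLI calls, run from the same shell inside the container):

awslocal s3 ls s3://parquet/
awslocal glue get-table --database-name database --name parquet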
Let’s go back to Dremio and update the source. After refreshing it, you should see the newly created Parquet table under the configured Glue catalog — all running in your local environment!
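If you want to query it straight away, the table is addressable through the source path. For example (assuming you named the Glue source glue; adjust to whatever name you picked):

SELECT * FROM "glue"."database"."parquet";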
We can replicate the process for Delta Lake tables with just one minor difference: Dremio requires the table to be registered in Glue as a native Delta table. To achieve this, we need to create the Glue table entry in a specific way, ensuring it is compatible with the Glue metastore. More about this here and here.
🔥 Let’s fire up Spark again and create the Delta table first.
spark-shell --packages io.delta:delta-core_2.12:2.4.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
import org.apache.spark.sql.Row

val data = Seq(
  Row("P001", "Smartphone X", 699L, "USD", "Electronics", 1.675e12),
  Row("P002", "Laptop Y", 999L, "USD", "Electronics", 1.675e12),
  Row("P003", "Headphones Z", 199L, "EUR", "Accessories", 1.675e12),
  Row("P004", "Gaming Console", 399L, "USD", "Gaming", 1.675e12),
  Row("P005", "Smartwatch W", 299L, "GBP", "Wearables", 1.675e12)
)

import org.apache.spark.sql.types._

val schema = StructType(
  List(
    StructField("product_id", StringType, nullable = true),
    StructField("product_name", StringType, nullable = true),
    StructField("price", LongType, nullable = true),
    StructField("CURRENCY", StringType, nullable = true),
    StructField("category", StringType, nullable = true),
    StructField("updated_at", DoubleType, nullable = true)
  )
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.write.format("delta").save("data/delta")
So we have our Delta table. Now, let’s create the corresponding Glue table (which is going to be a native Delta table):
> docker exec -it --user root localstack /bin/bash
awslocal glue create-table \
  --database-name database \
  --table-input '{
    "Name": "delta",
    "Retention": 0,
    "StorageDescriptor": {
      "Columns": [
        {"Name": "product_id", "Type": "string"},
        {"Name": "product_name", "Type": "string"},
        {"Name": "price", "Type": "bigint"},
        {"Name": "currency", "Type": "string"},
        {"Name": "category", "Type": "string"},
        {"Name": "updated_at", "Type": "double"}
      ],
      "Location": "s3://delta/",
      "AdditionalLocations": [],
      "InputFormat": "org.apache.hadoop.mapred.SequenceFileInputFormat",
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat",
      "Compressed": false,
      "NumberOfBuckets": -1,
      "SerdeInfo": {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        "Parameters": {
          "serialization.format": "1",
          "path": "s3://delta/"
        }
      },
      "BucketColumns": [],
      "SortColumns": [],
      "Parameters": {
        "EXTERNAL": "true",
        "UPDATED_BY_CRAWLER": "delta-lake-native-connector",
        "spark.sql.sources.schema.part.0": "{\"type\":\"struct\",\"fields\":[{\"name\":\"product_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"product_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CURRENCY\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"category\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"updated_at\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}",
        "CrawlerSchemaSerializerVersion": "1.0",
        "CrawlerSchemaDeserializerVersion": "1.0",
        "spark.sql.partitionProvider": "catalog",
        "classification": "delta",
        "spark.sql.sources.schema.numParts": "1",
        "spark.sql.sources.provider": "delta",
        "delta.lastCommitTimestamp": "1653462383292",
        "delta.lastUpdateVersion": "6",
        "table_type": "delta"
      },
      "StoredAsSubDirectories": false
    },
    "PartitionKeys": [],
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {
      "EXTERNAL": "true",
      "UPDATED_BY_CRAWLER": "delta-lake-native-connector",
      "spark.sql.sources.schema.part.0": "{\"type\":\"struct\",\"fields\":[{\"name\":\"product_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"product_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CURRENCY\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"category\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"updated_at\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}",
      "CrawlerSchemaSerializerVersion": "1.0",
      "CrawlerSchemaDeserializerVersion": "1.0",
      "spark.sql.partitionProvider": "catalog",
      "classification": "delta",
      "spark.sql.sources.schema.numParts": "1",
      "spark.sql.sources.provider": "delta",
      "delta.lastCommitTimestamp": "1653462383292",
      "delta.lastUpdateVersion": "6",
      "table_type": "delta",
      "sourceType": "api"
    }
  }'
Again, this creates a Glue table called “delta”, with the S3 location set to s3://delta/. In addition, the table parameters embed the Spark-style schema (spark.sql.sources.schema.part.0) and the delta classification, which is what makes Dremio recognise it as a native Delta table.
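As with the Parquet table, the s3://delta/ bucket referenced by the Glue entry does not exist yet. Mirroring the earlier steps (and assuming the same /data bind mount), we still need to create it and upload the Delta files, including the _delta_log directory, from inside the LocalStack container:

awslocal s3 mb s3://delta/
awslocal s3 cp --recursive /data/delta/ s3://delta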
Let’s run a query in Dremio on our delta table!
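For example, after refreshing the source again (still assuming it is called glue), something like this should return our five products:

SELECT product_id, product_name, price FROM "glue"."database"."delta";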
With this setup, you can test Dremio’s interactions with Glue locally. Experiment with additional configurations or use this setup as a base for cloud deployment.
Technical note: As you may have noticed, we didn’t write a native Delta table in Glue through Spark in this current setup. This is because we’re not using Glue as the catalog for our local Spark instance. I’m actively working on incorporating Glue as the catalog for local Spark in upcoming developments 🧑💻