02-containerized-environment

Warning: Before you start this section, make sure that you have the required prerequisites to run the workshop.

2.0 Overview

In this section, we will set up the containerized environment for our data pipeline. We'll create a Dockerfile to define our container image, a docker-compose.yml file to orchestrate our services, and a requirements.txt file to specify our Python dependencies. We'll also set up environment variables for database configuration.

By the end of this section, you will have:

  1. A Dockerfile that sets up a multi-stage build for extract, load, and transform operations.
  2. A docker-compose.yml file that defines services for PostgreSQL and our ETL processes.
  3. A requirements.txt file listing all necessary Python packages.
  4. An .env file for environment variable configuration.

These files will form the foundation of our containerized data pipeline, allowing for consistent and reproducible deployments across different environments.

Prerequisites

Before starting this lesson, please ensure that you have:

  1. Completed the prerequisites for this workshop
  2. Read the data processing fundamentals document
  3. Docker and Docker Compose installed on your system (see the quick check after this list)
  4. A basic understanding of Python and SQL
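
If you'd like to confirm item 3 before continuing, the following commands should print version numbers. Depending on how Docker was installed, Compose is invoked either as docker compose (the plugin) or docker-compose (the standalone binary):

    # Check that Docker and Docker Compose are available
    docker --version
    docker compose version    # or: docker-compose --version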

Lesson Content

2.1 Creating the Dockerfile

We'll start by creating a Dockerfile that defines a multi-stage build process for our data pipeline.

  1. Navigate to the data-pipeline folder in your project directory.

  2. Open up the file named Dockerfile in your text editor.

  3. Copy the following content into the Dockerfile:

    # Base image
    FROM python:3.9-slim AS base
    WORKDIR /usr/src/app
    COPY requirements.txt ./
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Extract stage
    FROM base AS extract
    WORKDIR /usr/src/app
    COPY scripts/extract.py ./
    COPY entrypoints/extract.sh ./
    RUN chmod +x extract.sh
    ENTRYPOINT ["sh", "extract.sh"]
    
    # Load stage
    FROM base AS load
    WORKDIR /usr/src/app
    COPY scripts/load.py ./
    COPY entrypoints/load.sh ./
    RUN chmod +x load.sh
    ENTRYPOINT ["sh", "load.sh"]
    
    # Transform stage
    FROM base AS transform
    WORKDIR /usr/src/app
    COPY scripts/transform.py ./
    COPY entrypoints/transform.sh ./
    RUN chmod +x transform.sh
    ENTRYPOINT ["sh", "transform.sh"]

This Dockerfile uses a multi-stage build process to create separate, lightweight images for each stage of our ETL pipeline:

  1. The base stage:

    • Uses the official Python 3.9 slim image as a starting point
    • Sets the working directory to /usr/src/app
    • Copies the requirements.txt file and installs the Python dependencies
  2. The extract, load, and transform stages:

    • Each stage builds upon the base stage
    • Copies the respective Python script (extract.py, load.py, or transform.py) into the container
    • Copies the corresponding shell script entrypoint
    • Makes the entrypoint script executable
    • Sets the entrypoint to run the shell script

This approach allows us to create separate, specialized containers for each part of our ETL process, improving modularity and reducing the overall image size.
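
If you want to test a single stage on its own before wiring everything together with Docker Compose, you can build it directly with the --target flag. The image names below (data-pipeline-extract and so on) are just illustrative tags, and the build will only succeed once requirements.txt (created in the next step) and the scripts/ and entrypoints/ directories are in place:

    # Build only the extract stage; run from the data-pipeline folder
    docker build --target extract -t data-pipeline-extract .

    # The load and transform stages can be built the same way
    docker build --target load -t data-pipeline-load .
    docker build --target transform -t data-pipeline-transform .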

2.2 Managing Python Dependencies

Next, we'll create a requirements.txt file to manage our Python dependencies.

  1. In the data-pipeline folder, open the file named requirements.txt in your text editor.

  2. Add the following content to the requirements.txt file:

    boto3==1.26.0
    botocore==1.29.0
    pandas==2.0.0
    numpy==1.25.0
    sqlalchemy==2.0.0
    psycopg2-binary==2.9.6
    

This requirements.txt file specifies the exact versions of the Python packages our project needs:

  • boto3 and botocore: AWS SDK for Python, used for interacting with AWS services
  • pandas and numpy: Data manipulation and numerical computing libraries
  • sqlalchemy: SQL toolkit and Object-Relational Mapping (ORM) library for database operations
  • psycopg2-binary: PostgreSQL adapter for Python

By specifying exact versions, we ensure that our environment is consistent and reproducible across different setups.
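
If you'd like to sanity-check these pins outside the container, one option is to install them into a throwaway virtual environment on your host; this is entirely optional and not required for the workshop:

    # Optional: install the pinned dependencies into a local virtual environment
    python3 -m venv .venv
    . .venv/bin/activate
    pip install --no-cache-dir -r requirements.txt
    pip freeze    # confirm the pinned versions were installed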

2.3 Creating the Docker Compose File

Now, we'll create a docker-compose.yml file to orchestrate our services.

  1. In the data-pipeline folder, open the file named docker-compose.yml in your text editor.

  2. Add the following content to the docker-compose.yml file:

    version: '3'
    services:
      postgres:
        image: postgres:13
        environment:
          POSTGRES_USER: ${DB_USER}
          POSTGRES_PASSWORD: ${DB_PASSWORD}
          POSTGRES_DB: ${DB_NAME}
        networks:
          - mynetwork
        ports:
          - "5432:5432"
        volumes:
          - pgdata:/var/lib/postgresql/data

      extract:
        build:
          context: .
          target: extract
        environment:
          - DB_NAME=${DB_NAME}
          - DB_USER=${DB_USER}
          - DB_PASSWORD=${DB_PASSWORD}
          - DB_HOST=${DB_HOST}
          - DB_PORT=${DB_PORT}
        networks:
          - mynetwork
        volumes:
          - shared-data:/data
        depends_on:
          - postgres

      load:
        build:
          context: .
          target: load
        environment:
          - DB_NAME=${DB_NAME}
          - DB_USER=${DB_USER}
          - DB_PASSWORD=${DB_PASSWORD}
          - DB_HOST=${DB_HOST}
          - DB_PORT=${DB_PORT}
        networks:
          - mynetwork
        volumes:
          - shared-data:/data
        depends_on:
          - extract

      transform:
        build:
          context: .
          target: transform
        environment:
          - DB_NAME=${DB_NAME}
          - DB_USER=${DB_USER}
          - DB_PASSWORD=${DB_PASSWORD}
          - DB_HOST=${DB_HOST}
          - DB_PORT=${DB_PORT}
        networks:
          - mynetwork
        volumes:
          - shared-data:/data
        depends_on:
          - load

    volumes:
      pgdata:
      shared-data:

    networks:
      mynetwork:
        driver: bridge

This docker-compose.yml file defines our multi-container application:

  1. The postgres service:

    • Uses the official PostgreSQL 13 image
    • Sets up environment variables for the database using values from the .env file
    • Exposes port 5432 for database connections
    • Uses a named volume (pgdata) for data persistence
  2. The extract, load, and transform services:

    • Build from the Dockerfile, targeting the respective stage
    • Share environment variables for database connection
    • Mount the shared-data volume at /data so the stages can pass files to each other
    • Depend on the postgres service (and each other, in sequence)
  3. Volumes:

    • pgdata: For PostgreSQL data persistence
    • shared-data: For sharing data between ETL stages
  4. Networks:

    • Creates a bridge network mynetwork for inter-container communication

This setup allows our services to run in isolated containers while still being able to communicate with each other and share necessary data.
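
A quick way to check your work is docker compose config, which prints the fully resolved configuration with the variables from your .env file substituted in (you'll create that file in the next step). Once the .env file exists, you can build and start the whole stack; depending on your installation, the command is docker compose or docker-compose:

    # Validate the compose file and show it with environment variables resolved
    docker compose config

    # Build the images and start postgres, extract, load, and transform
    docker compose up --build

    # In another terminal, follow the logs of a single service
    docker compose logs -f extract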

2.4 Configuring Environment Variables

Lastly, we'll create an .env file to store our environment variables.

  1. In the data-pipeline folder, open the file named .env in your text editor.

  2. Add the following content to the .env file:

    DB_NAME=weather_data
    DB_USER=your_user
    DB_PASSWORD=your_password
    DB_HOST=postgres
    DB_PORT=5432
    

This .env file contains key-value pairs for our environment variables:

  • DB_NAME: The name of our PostgreSQL database
  • DB_USER and DB_PASSWORD: Credentials for accessing the database
  • DB_HOST: Set to postgres, which is the service name of our PostgreSQL container
  • DB_PORT: The port on which PostgreSQL is running (default is 5432)

Using a .env file allows us to keep sensitive information out of our version-controlled files and easily change configurations without modifying our Docker Compose file.
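
To confirm that Docker Compose is picking these values up, you can open a psql session inside the running postgres container using the same credentials. The user and database names below are the sample values from the .env file above, so substitute your own if you changed them:

    # Start the database service and connect with the credentials from .env
    docker compose up -d postgres
    docker compose exec postgres psql -U your_user -d weather_data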

Warning: While .env files are convenient for development, it's important to follow best practices, especially when moving towards production. We're using .env files for simplicity in this workshop, but production deployments often require more robust security measures. Take a look at .env Best Practices in the appendix for more guidance.

Conclusion

In this lesson, you learned how to set up a containerized environment for a data pipeline using Docker. You created a Dockerfile for multi-stage builds, wrote a docker-compose.yml file to orchestrate multiple services, managed Python dependencies using a requirements.txt file, and configured environment variables for database connections.

In the next lesson, we'll dive into the data processing scripts that will run inside these containers.

Key Points

  • Dockerfiles allow us to define the environment and dependencies for our application
  • Docker Compose helps orchestrate multiple containers and their interactions
  • Using environment variables increases the portability and security of our setup
  • Multi-stage builds in Docker allow for more efficient and organized container images

Appendix

.env Best Practices

While .env files are convenient for development, it's important to follow best practices, especially when moving towards production:

  1. Never commit .env files to version control:

    • Add .env to your .gitignore file.
    • Provide an .env.example file with dummy values as a template (see the sketch after this list).
  2. Use different .env files for different environments:

    • Create separate files like .env.development, .env.staging, and .env.production.
  3. Limit access to production .env files:

    • Only give access to team members who absolutely need it.
  4. Regularly rotate credentials:

    • Change passwords and API keys periodically.
  5. Use strong, unique values:

    • Avoid simple or reused passwords.
  6. Consider using a secrets management system for production:

    • Tools like AWS Secrets Manager or HashiCorp Vault can be more secure for production environments.
  7. Validate your .env files:

    • Ensure all required variables are present before running your application.
  8. Use specific variable names:

    • Prefix variables with the service they relate to, e.g., DB_PASSWORD instead of just PASSWORD.
  9. Don't store sensitive data in .env if not necessary:

    • Only keep truly sensitive or configurable data in .env files.
  10. Encrypt .env files when transmitting:

    • If you need to send .env files, use secure, encrypted channels.
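
As mentioned in point 1, committing an .env.example file with dummy values lets collaborators see which variables are required without exposing real credentials. A minimal template for this workshop might look like this:

    # .env.example -- copy to .env and fill in real values
    DB_NAME=weather_data
    DB_USER=changeme
    DB_PASSWORD=changeme
    DB_HOST=postgres
    DB_PORT=5432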

Remember, while we're using .env files for simplicity in this workshop, production deployments often require more robust security measures.