02-containerized-environment

Warning: Before you start this section, make sure that you have the required prerequisites to run the workshop.

2.0 Overview

In this section, we will set up the containerized environment for our data pipeline. We'll create a Dockerfile to define our container image, a docker-compose.yml file to orchestrate our services, and a requirements.txt file to specify our Python dependencies. We'll also set up environment variables for database configuration.

By the end of this section, you will have:

  1. A Dockerfile that sets up a multi-stage build for extract, load, and transform operations.
  2. A docker-compose.yml file that defines services for PostgreSQL and our ETL processes.
  3. A requirements.txt file listing all necessary Python packages.
  4. An .env file for environment variable configuration.

These files will form the foundation of our containerized data pipeline, allowing for consistent and reproducible deployments across different environments.

Prerequisites

Before starting this lesson, please ensure that you have:

  1. Completed the prerequisites for this workshop
  2. Read the data processing fundamentals document
  3. Docker and Docker Compose installed on your system (see the quick check after this list)
  4. A basic understanding of Python and SQL
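
If you'd like to confirm item 3 before continuing, the following commands should print version numbers. Depending on how Docker was installed, Compose is invoked either as docker compose (the plugin) or docker-compose (the standalone binary):

    # Check that Docker and Docker Compose are available
    docker --version
    docker compose version    # or: docker-compose --version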

Lesson Content

2.1 Creating the Dockerfile

We'll start by creating a Dockerfile that defines a multi-stage build process for our data pipeline.

  1. Navigate to the data-pipeline folder in your project directory.

  2. Open up the file named Dockerfile in your text editor.

  3. Copy the following content into the Dockerfile:

    # Base image
    FROM python:3.9-slim AS base
    WORKDIR /usr/src/app
    COPY requirements.txt ./
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Extract stage
    FROM base AS extract
    WORKDIR /usr/src/app
    COPY scripts/extract.py ./
    COPY entrypoints/extract.sh ./
    RUN chmod +x extract.sh
    ENTRYPOINT ["sh", "extract.sh"]
    
    # Load stage
    FROM base AS load
    WORKDIR /usr/src/app
    COPY scripts/load.py ./
    COPY entrypoints/load.sh ./
    RUN chmod +x load.sh
    ENTRYPOINT ["sh", "load.sh"]
    
    # Transform stage
    FROM base AS transform
    WORKDIR /usr/src/app
    COPY scripts/transform.py ./
    COPY entrypoints/transform.sh ./
    RUN chmod +x transform.sh
    ENTRYPOINT ["sh", "transform.sh"]

This Dockerfile uses a multi-stage build process to create separate, lightweight images for each stage of our ETL pipeline:

  1. The base stage:

    • Uses the official Python 3.9 slim image as a starting point
    • Sets the working directory to /usr/src/app
    • Copies the requirements.txt file and installs the Python dependencies
  2. The extract, load, and transform stages:

    • Each stage builds upon the base stage
    • Copies the respective Python script (extract.py, load.py, or transform.py) into the container
    • Copies the corresponding shell script entrypoint
    • Makes the entrypoint script executable
    • Sets the entrypoint to run the shell script

This approach allows us to create separate, specialized containers for each part of our ETL process, improving modularity and reducing the overall image size.
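
If you want to test a single stage on its own before wiring everything together with Docker Compose, you can build it directly with the --target flag. The image names below (data-pipeline-extract and so on) are just illustrative tags, and the build will only succeed once requirements.txt (created in the next step) and the scripts/ and entrypoints/ directories are in place:

    # Build only the extract stage; run from the data-pipeline folder
    docker build --target extract -t data-pipeline-extract .

    # The load and transform stages can be built the same way
    docker build --target load -t data-pipeline-load .
    docker build --target transform -t data-pipeline-transform .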

2.2 Managing Python Dependencies

Next, we'll create a requirements.txt file to manage our Python dependencies.

  1. In the data-pipeline folder, open the file named requirements.txt in your text editor.

  2. Add the following content to the requirements.txt file:

    boto3==1.26.0
    botocore==1.29.0
    pandas==2.0.0
    numpy==1.25.0
    sqlalchemy==2.0.0
    psycopg2-binary==2.9.6
    

This requirements.txt file specifies the exact versions of the Python packages our project needs:

  • boto3 and botocore: AWS SDK for Python, used for interacting with AWS services
  • pandas and numpy: Data manipulation and numerical computing libraries
  • sqlalchemy: SQL toolkit and Object-Relational Mapping (ORM) library for database operations
  • psycopg2-binary: PostgreSQL adapter for Python

By specifying exact versions, we ensure that our environment is consistent and reproducible across different setups.
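
If you'd like to sanity-check these pins outside the container, one option is to install them into a throwaway virtual environment on your host; this is entirely optional and not required for the workshop:

    # Optional: install the pinned dependencies into a local virtual environment
    python3 -m venv .venv
    . .venv/bin/activate
    pip install --no-cache-dir -r requirements.txt
    pip freeze    # confirm the pinned versions were installed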

2.3 Creating the Docker Compose File

Now, we'll create a docker-compose.yml file to orchestrate our services.

  1. In the data-pipeline folder, open the file named docker-compose.yml in your text editor.

  2. Add the following content to the docker-compose.yml file:

    version: '3'
    services:
      postgres:
        image: postgres:13
        environment:
          POSTGRES_USER: ${DB_USER}
          POSTGRES_PASSWORD: ${DB_PASSWORD}
          POSTGRES_DB: ${DB_NAME}
        networks:
          - mynetwork
        ports:
          - "5432:5432"
        volumes:
          - pgdata:/var/lib/postgresql/data

      extract:
        build:
          context: .
          target: extract
        environment:
          - DB_NAME=${DB_NAME}
          - DB_USER=${DB_USER}
          - DB_PASSWORD=${DB_PASSWORD}
          - DB_HOST=${DB_HOST}
          - DB_PORT=${DB_PORT}
        networks:
          - mynetwork
        volumes:
          - shared-data:/data
        depends_on:
          - postgres

      load:
        build:
          context: .
          target: load
        environment:
          - DB_NAME=${DB_NAME}
          - DB_USER=${DB_USER}
          - DB_PASSWORD=${DB_PASSWORD}
          - DB_HOST=${DB_HOST}
          - DB_PORT=${DB_PORT}
        networks:
          - mynetwork
        volumes:
          - shared-data:/data
        depends_on:
          - extract

      transform:
        build:
          context: .
          target: transform
        environment:
          - DB_NAME=${DB_NAME}
          - DB_USER=${DB_USER}
          - DB_PASSWORD=${DB_PASSWORD}
          - DB_HOST=${DB_HOST}
          - DB_PORT=${DB_PORT}
        networks:
          - mynetwork
        volumes:
          - shared-data:/data
        depends_on:
          - load

    volumes:
      pgdata:
      shared-data:

    networks:
      mynetwork:
        driver: bridge

This docker-compose.yml file defines our multi-container application:

  1. The postgres service:

    • Uses the official PostgreSQL 13 image
    • Sets up environment variables for the database using values from the .env file
    • Exposes port 5432 for database connections
    • Uses a named volume (pgdata) for data persistence
  2. The extract, load, and transform services:

    • Build from the Dockerfile, targeting the respective stage
    • Share environment variables for database connection
    • Mount the shared-data volume at /data so the stages can pass files to each other
    • Depend on the postgres service (and each other, in sequence)
  3. Volumes:

    • pgdata: For PostgreSQL data persistence
    • shared-data: For sharing data between ETL stages
  4. Networks:

    • Creates a bridge network mynetwork for inter-container communication

This setup allows our services to run in isolated containers while still being able to communicate with each other and share necessary data.
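
A quick way to check your work is docker compose config, which prints the fully resolved configuration with the variables from your .env file substituted in (you'll create that file in the next step). Once the .env file exists, you can build and start the whole stack; depending on your installation, the command is docker compose or docker-compose:

    # Validate the compose file and show it with environment variables resolved
    docker compose config

    # Build the images and start postgres, extract, load, and transform
    docker compose up --build

    # In another terminal, follow the logs of a single service
    docker compose logs -f extract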

2.4 Configuring Environment Variables

Lastly, we'll create an .env file to store our environment variables.

  1. In the data-pipeline folder, open the file named .env in your text editor.

  2. Add the following content to the .env file:

    DB_NAME=weather_data
    DB_USER=your_user
    DB_PASSWORD=your_password
    DB_HOST=postgres
    DB_PORT=5432
    

This .env file contains key-value pairs for our environment variables:

  • DB_NAME: The name of our PostgreSQL database
  • DB_USER and DB_PASSWORD: Credentials for accessing the database
  • DB_HOST: Set to postgres, which is the service name of our PostgreSQL container
  • DB_PORT: The port on which PostgreSQL is running (default is 5432)

Using a .env file allows us to keep sensitive information out of our version-controlled files and easily change configurations without modifying our Docker Compose file.
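
To confirm that Docker Compose is picking these values up, you can open a psql session inside the running postgres container using the same credentials. The user and database names below are the sample values from the .env file above, so substitute your own if you changed them:

    # Start the database service and connect with the credentials from .env
    docker compose up -d postgres
    docker compose exec postgres psql -U your_user -d weather_data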

Warning: While .env files are convenient for development, it's important to follow best practices, especially when moving towards production. We're using .env files for simplicity in this workshop, but production deployments often require more robust security measures. Take a look at .env Best Practices in the appendix for more guidance.

Conclusion

In this lesson, you learned how to set up a containerized environment for a data pipeline using Docker. You created a Dockerfile for multi-stage builds, wrote a docker-compose.yml file to orchestrate multiple services, managed Python dependencies using a requirements.txt file, and configured environment variables for database connections.

In the next lesson, we'll dive into the data processing scripts that will run inside these containers.

Key Points

  • Dockerfiles allow us to define the environment and dependencies for our application
  • Docker Compose helps orchestrate multiple containers and their interactions
  • Using environment variables increases the portability and security of our setup
  • Multi-stage builds in Docker allow for more efficient and organized container images

Appendix

.env Best Practices

While .env files are convenient for development, it's important to follow best practices, especially when moving towards production:

  1. Never commit .env files to version control:

    • Add .env to your .gitignore file.
    • Provide an .env.example file with dummy values as a template (see the sketch after this list).
  2. Use different .env files for different environments:

    • Create separate files like .env.development, .env.staging, and .env.production.
  3. Limit access to production .env files:

    • Only give access to team members who absolutely need it.
  4. Regularly rotate credentials:

    • Change passwords and API keys periodically.
  5. Use strong, unique values:

    • Avoid simple or reused passwords.
  6. Consider using a secrets management system for production:

    • Tools like AWS Secrets Manager or HashiCorp Vault can be more secure for production environments.
  7. Validate your .env files:

    • Ensure all required variables are present before running your application.
  8. Use specific variable names:

    • Prefix variables with the service they relate to, e.g., DB_PASSWORD instead of just PASSWORD.
  9. Don't store sensitive data in .env if not necessary:

    • Only keep truly sensitive or configurable data in .env files.
  10. Encrypt .env files when transmitting:

    • If you need to send .env files, use secure, encrypted channels.
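
As mentioned in point 1, committing an .env.example file with dummy values lets collaborators see which variables are required without exposing real credentials. A minimal template for this workshop might look like this:

    # .env.example -- copy to .env and fill in real values
    DB_NAME=weather_data
    DB_USER=changeme
    DB_PASSWORD=changeme
    DB_HOST=postgres
    DB_PORT=5432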

Remember, while we're using .env files for simplicity in this workshop, production deployments often require more robust security measures.