02-containerized-environment
Warning: Before you start this section, make sure that you have the required prerequisites to run the workshop.
2.0 Overview
In this section, we will set up the containerized environment for our data pipeline. We'll create a Dockerfile to define our container image, a docker-compose.yml file to orchestrate our services, and a requirements.txt file to specify our Python dependencies. We'll also set up environment variables for database configuration.
By the end of this section, you will have:
- A Dockerfile that sets up a multi-stage build for extract, load, and transform operations.
- A docker-compose.yml file that defines services for PostgreSQL and our ETL processes.
- A requirements.txt file listing all necessary Python packages.
- An .env file for environment variable configuration.
These files will form the foundation of our containerized data pipeline, allowing for consistent and reproducible deployments across different environments.
Prerequisites
Before starting this lesson, please ensure that you have:
- Completed the prerequisites for this workshop
- Read the data processing fundamentals document
- Docker and Docker Compose installed on your system
- A basic understanding of Python and SQL
Lesson Content
2.1 Creating the Dockerfile
We'll start by creating a Dockerfile that defines a multi-stage build process for our data pipeline.
- Navigate to the `data-pipeline` folder in your project directory.
- Open up the file named `Dockerfile` in your text editor.
- Copy the following content into the `Dockerfile`:

```dockerfile
# Base image
FROM python:3.9-slim AS base
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# Extract stage
FROM base AS extract
WORKDIR /usr/src/app
COPY scripts/extract.py ./
COPY entrypoints/extract.sh ./
RUN chmod +x extract.sh
ENTRYPOINT ["sh", "extract.sh"]

# Load stage
FROM base AS load
WORKDIR /usr/src/app
COPY scripts/load.py ./
COPY entrypoints/load.sh ./
RUN chmod +x load.sh
ENTRYPOINT ["sh", "load.sh"]

# Transform stage
FROM base AS transform
WORKDIR /usr/src/app
COPY scripts/transform.py ./
COPY entrypoints/transform.sh ./
RUN chmod +x transform.sh
ENTRYPOINT ["sh", "transform.sh"]
```
This Dockerfile uses a multi-stage build process to create separate, lightweight images for each stage of our ETL pipeline:
- The `base` stage:
  - Uses the official Python 3.9 slim image as a starting point
  - Sets the working directory to `/usr/src/app`
  - Copies the `requirements.txt` file and installs the Python dependencies
- The `extract`, `load`, and `transform` stages:
  - Each stage builds upon the `base` stage
  - Copies the respective Python script (`extract.py`, `load.py`, or `transform.py`) into the container
  - Copies the corresponding shell script entrypoint
  - Makes the entrypoint script executable
  - Sets the entrypoint to run the shell script

This approach allows us to create separate, specialized containers for each part of our ETL process, improving modularity and reducing the overall image size.
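If you'd like to sanity-check a single stage on its own, you can build it directly with `docker build` and the `--target` flag (Docker Compose will handle this for us later). The image tag below is just an example name, and note that the build expects the `scripts/` and `entrypoints/` files referenced by the `COPY` instructions to exist; we create those in the next lesson.

```bash
# Build only the "extract" stage of the multi-stage Dockerfile.
# Run this from the data-pipeline folder, where the Dockerfile lives.
docker build --target extract -t data-pipeline-extract .

# List the resulting image to confirm it was built and check its size.
docker image ls data-pipeline-extract
```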
2.2 Managing Python Dependencies
Next, we'll create a `requirements.txt` file to manage our Python dependencies.
- In the `data-pipeline` folder, open the file named `requirements.txt` in your text editor.
- Add the following content to the `requirements.txt` file:

```text
boto3==1.26.0
botocore==1.29.0
pandas==2.0.0
numpy==1.25.0
sqlalchemy==2.0.0
psycopg2-binary==2.9.6
```
This `requirements.txt` file specifies the exact versions of the Python packages our project needs:

- `boto3` and `botocore`: AWS SDK for Python, used for interacting with AWS services
- `pandas` and `numpy`: Data manipulation and numerical computing libraries
- `sqlalchemy`: SQL toolkit and Object-Relational Mapping (ORM) library for database operations
- `psycopg2-binary`: PostgreSQL adapter for Python

By specifying exact versions, we ensure that our environment is consistent and reproducible across different setups.
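If you want to confirm that these pinned versions resolve and install cleanly before building any images, one optional check is to install them into a throwaway virtual environment on your host; the container build runs the same `pip install` for you, so this step is purely for peace of mind.

```bash
# Optional local check: install the pinned dependencies into a
# temporary virtual environment, then remove it.
python3 -m venv /tmp/req-check
. /tmp/req-check/bin/activate
pip install --no-cache-dir -r requirements.txt
pip list        # confirm the pinned versions were installed
deactivate
rm -rf /tmp/req-check
```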
2.3 Creating the Docker Compose File
Now, we'll create a `docker-compose.yml` file to orchestrate our services.
- In the `data-pipeline` folder, open the file named `docker-compose.yml` in your text editor.
- Add the following content to the `docker-compose.yml` file:

```yaml
version: '3'
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: ${DB_NAME}
    networks:
      - mynetwork
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
  extract:
    build:
      context: .
      target: extract
    environment:
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_HOST=${DB_HOST}
      - DB_PORT=${DB_PORT}
    networks:
      - mynetwork
    volumes:
      - shared-data:/data
    depends_on:
      - postgres
  load:
    build:
      context: .
      target: load
    environment:
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_HOST=${DB_HOST}
      - DB_PORT=${DB_PORT}
    networks:
      - mynetwork
    volumes:
      - shared-data:/data
    depends_on:
      - extract
  transform:
    build:
      context: .
      target: transform
    environment:
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_HOST=${DB_HOST}
      - DB_PORT=${DB_PORT}
    networks:
      - mynetwork
    volumes:
      - shared-data:/data
    depends_on:
      - load

volumes:
  pgdata:
  shared-data:

networks:
  mynetwork:
    driver: bridge
```
This `docker-compose.yml` file defines our multi-container application:

- The `postgres` service:
  - Uses the official PostgreSQL 13 image
  - Sets up environment variables for the database using values from the `.env` file
  - Exposes port 5432 for database connections
  - Uses a named volume (`pgdata`) for data persistence
- The `extract`, `load`, and `transform` services:
  - Build from the Dockerfile, targeting the respective stage
  - Share environment variables for database connection
  - Mount the shared data volume (`shared-data`) at `/data` so the stages can pass files to one another
  - Depend on the `postgres` service (and each other, in sequence)
- Volumes:
  - `pgdata`: For PostgreSQL data persistence
  - `shared-data`: For sharing data between ETL stages
- Networks:
  - Creates a bridge network `mynetwork` for inter-container communication

This setup allows our services to run in isolated containers while still being able to communicate with each other and share necessary data.
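Once the pipeline scripts from the next lesson are in place, a typical way to build and run the whole stack from the `data-pipeline` folder looks like the sketch below; if your installation provides the standalone `docker-compose` binary rather than the `docker compose` plugin, substitute it accordingly.

```bash
# Build the images for all services and start the stack in the background.
docker compose up --build -d

# Check service status and follow the logs of a single service.
docker compose ps
docker compose logs -f extract

# Stop and remove the containers (add -v to also remove the named volumes).
docker compose down
```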
2.4 Configuring Environment Variables
Lastly, we'll create an `.env` file to store our environment variables.
- In the `data-pipeline` folder, open the file named `.env` in your text editor.
- Add the following content to the `.env` file:

```text
DB_NAME=weather_data
DB_USER=your_user
DB_PASSWORD=your_password
DB_HOST=postgres
DB_PORT=5432
```
This `.env` file contains key-value pairs for our environment variables:

- `DB_NAME`: The name of our PostgreSQL database
- `DB_USER` and `DB_PASSWORD`: Credentials for accessing the database
- `DB_HOST`: Set to `postgres`, which is the service name of our PostgreSQL container
- `DB_PORT`: The port on which PostgreSQL is running (default is 5432)

Using a `.env` file allows us to keep sensitive information out of our version-controlled files and easily change configurations without modifying our Docker Compose file.
Warning: While `.env` files are convenient for development, it's important to follow best practices, especially when moving towards production. We're using `.env` files for simplicity in this workshop, but production deployments often require more robust security measures. Take a look at the .env Best Practices appendix for more guidance.
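Docker Compose automatically reads a `.env` file that sits next to `docker-compose.yml`. A quick way to confirm your variables are being picked up is to render the resolved configuration; keep in mind that the credentials will appear in plain text in the output.

```bash
# Print the compose configuration with ${...} placeholders substituted
# from the .env file in the current folder.
docker compose config

# Older standalone Compose binary:
# docker-compose config
```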
Conclusion
In this lesson, you learned how to set up a containerized environment for a data pipeline using Docker. You created a Dockerfile for multi-stage builds, wrote a docker-compose.yml file to orchestrate multiple services, managed Python dependencies using a requirements.txt file, and configured environment variables for database connections.
In the next lesson, we'll dive into the data processing scripts that will run inside these containers.
Key Points
- Dockerfiles allow us to define the environment and dependencies for our application
- Docker Compose helps orchestrate multiple containers and their interactions
- Using environment variables increases the portability and security of our setup
- Multi-stage builds in Docker allow for more efficient and organized container images
Further Reading
Appendix
.env Best Practices
While .env files are convenient for development, it's important to follow best practices, especially when moving towards production:
- Never commit .env files to version control:
  - Add `.env` to your `.gitignore` file.
  - Provide an `.env.example` file with dummy values as a template.
- Use different .env files for different environments:
  - Create separate files like `.env.development`, `.env.staging`, and `.env.production`.
- Limit access to production .env files:
  - Only give access to team members who absolutely need it.
- Regularly rotate credentials:
  - Change passwords and API keys periodically.
- Use strong, unique values:
  - Avoid simple or reused passwords.
- Consider using a secrets management system for production:
  - Tools like AWS Secrets Manager or HashiCorp Vault can be more secure for production environments.
- Validate your .env files:
  - Ensure all required variables are present before running your application.
- Use specific variable names:
  - Prefix variables with the service they relate to, e.g., `DB_PASSWORD` instead of just `PASSWORD`.
- Don't store sensitive data in .env if not necessary:
  - Only keep truly sensitive or configurable data in .env files.
- Encrypt .env files when transmitting:
  - If you need to send .env files, use secure, encrypted channels.

Remember, while we're using .env files for simplicity in this workshop, production deployments often require more robust security measures.
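As a concrete sketch of the first point above, here is one way to keep the real `.env` out of Git while still committing a sanitized template. The `sed` expression is just an example and assumes simple `KEY=value` lines with no quoting or comments.

```bash
# Ignore the real .env file in version control.
echo ".env" >> .gitignore

# Create a template that keeps the keys but blanks out the values.
sed 's/=.*/=/' .env > .env.example

git add .gitignore .env.example
```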