Data Pipeline Flow

```mermaid
graph TD
    subgraph AWS
        A[AWS Data Source]
    end

    subgraph "Container: Extract"
        B[extract.py]
    end

    subgraph "Container: Load"
        C[load.py]
    end

    subgraph "Container: Transform"
        D[transform.py]
    end

    subgraph "PostgreSQL Database"
        E[Loading Table]
        F[Final Table]
    end

    subgraph "Container: Visualize"
        G[Flask App]
    end

    A -->|Data| B
    B -->|Extracted Data| C
    C -->|Load Data| E
    E -->|Read Data| D
    D -->|Transformed Data| F
    F -->|Read Data| G

    classDef container fill:#e6f3ff,stroke:#333,stroke-width:2px,color:black;
    class B,C,D,G container;
```

Flow Explanation

The entire process is orchestrated by shell scripts (extract.sh, load.sh, transform.sh), which manage the execution of each step in the pipeline (illustrative sketches of each step follow the list below):

  1. Extract: Data is sourced from AWS and extracted using extract.py in the Extract container.

  2. Load: The extracted data is then loaded into the Loading Table of the PostgreSQL database using load.py in the Load container.

  3. Transform: Data from the Loading Table is read, transformed using transform.py in the Transform container, and then stored in the Final Table.

  4. Visualize: Finally, a Flask app in the Visualize container reads data from the Final Table to create visualizations or serve data via an API.
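
A minimal sketch of what extract.py might do for step 1, assuming the AWS data source is an S3 bucket; the bucket name, key, and local path below are placeholders, not taken from the repo:

```python
# extract.py (sketch) -- pull a raw file from S3 into the Extract container.
# Bucket, key, and output path are illustrative placeholders.
import boto3

S3_BUCKET = "example-pipeline-bucket"   # placeholder
S3_KEY = "raw/data.csv"                 # placeholder
LOCAL_PATH = "/data/raw_data.csv"       # placeholder

def extract():
    s3 = boto3.client("s3")
    s3.download_file(S3_BUCKET, S3_KEY, LOCAL_PATH)
    print(f"Extracted s3://{S3_BUCKET}/{S3_KEY} -> {LOCAL_PATH}")

if __name__ == "__main__":
    extract()
```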
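
A similar sketch for step 2's load.py, assuming the extracted file is a CSV and that PostgreSQL's COPY is used to fill the Loading Table; the connection string and table name are assumptions:

```python
# load.py (sketch) -- bulk-load the extracted CSV into the Loading Table.
# DSN, file path, and table name are illustrative placeholders.
import psycopg2

DB_DSN = "host=postgres dbname=pipeline user=pipeline password=pipeline"  # placeholder
LOCAL_PATH = "/data/raw_data.csv"                                         # placeholder

def load():
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur, open(LOCAL_PATH) as f:
        # COPY streams the whole file into the staging ("loading") table in one pass.
        cur.copy_expert("COPY loading_table FROM STDIN WITH CSV HEADER", f)
    print("Loaded raw rows into loading_table")

if __name__ == "__main__":
    load()
```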
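
Step 3's transform.py could be little more than an INSERT ... SELECT from the Loading Table into the Final Table; the SQL and column names below are stand-ins for whatever the real transformation does:

```python
# transform.py (sketch) -- read from the loading table, write cleaned rows to the final table.
# The transformation logic, table names, and columns are illustrative placeholders.
import psycopg2

DB_DSN = "host=postgres dbname=pipeline user=pipeline password=pipeline"  # placeholder

TRANSFORM_SQL = """
INSERT INTO final_table (id, value, processed_at)
SELECT id, UPPER(value), NOW()
FROM loading_table
WHERE value IS NOT NULL;
"""

def transform():
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute(TRANSFORM_SQL)
        print(f"Transformed {cur.rowcount} rows into final_table")

if __name__ == "__main__":
    transform()
```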
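
Finally, for step 4, the Flask app in the Visualize container might expose the Final Table as JSON along these lines; the route, DSN, and query are assumptions:

```python
# app.py (sketch) -- Flask app serving rows from the final table as JSON.
# DSN, table, and route are illustrative placeholders.
import psycopg2
import psycopg2.extras
from flask import Flask, jsonify

DB_DSN = "host=postgres dbname=pipeline user=pipeline password=pipeline"  # placeholder
app = Flask(__name__)

@app.route("/api/data")
def get_data():
    with psycopg2.connect(DB_DSN) as conn, conn.cursor(
        cursor_factory=psycopg2.extras.RealDictCursor
    ) as cur:
        cur.execute("SELECT * FROM final_table LIMIT 100;")
        return jsonify(cur.fetchall())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```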

This pipeline ensures a structured flow of data from its source to a visualized or API-accessible format, with clear separation of concerns at each stage. The use of shell scripts for orchestration allows for flexible and controllable execution of the pipeline steps.
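
For illustration only, the sequencing that extract.sh, load.sh, and transform.sh provide could be mirrored in a few lines of Python; this is not the repo's actual orchestration, just a sketch of the order in which the steps run:

```python
# run_pipeline.py (sketch) -- sequential orchestration equivalent to running
# the extract, load, and transform steps in order. The repo itself uses shell scripts.
import subprocess

STEPS = ["extract.py", "load.py", "transform.py"]

def run_pipeline():
    for script in STEPS:
        print(f"Running {script} ...")
        # check=True stops the pipeline as soon as any step fails.
        subprocess.run(["python", script], check=True)

if __name__ == "__main__":
    run_pipeline()
```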