Data Pipeline Flow

```mermaid
graph TD
    subgraph AWS
        A[AWS Data Source]
    end

    subgraph "Container: Extract"
        B[extract.py]
    end

    subgraph "Container: Load"
        C[load.py]
    end

    subgraph "Container: Transform"
        D[transform.py]
    end

    subgraph "PostgreSQL Database"
        E[Loading Table]
        F[Final Table]
    end

    subgraph "Container: Visualize"
        G[Flask App]
    end

    A -->|Data| B
    B -->|Extracted Data| C
    C -->|Load Data| E
    E -->|Read Data| D
    D -->|Transformed Data| F
    F -->|Read Data| G

    classDef container fill:#e6f3ff,stroke:#333,stroke-width:2px,color:black;
    class B,C,D,G container;
```

Flow Explanation

The entire process is orchestrated by shell scripts (extract.sh, load.sh, transform.sh), which manage the execution of each step in the pipeline (illustrative sketches of each step follow the list below):

  1. Extract: Data is sourced from AWS and extracted using extract.py in the Extract container.

  2. Load: The extracted data is then loaded into the Loading Table of the PostgreSQL database using load.py in the Load container.

  3. Transform: Data from the Loading Table is read, transformed using transform.py in the Transform container, and then stored in the Final Table.

  4. Visualize: Finally, a Flask app in the Visualize container reads data from the Final Table to create visualizations or serve data via an API.
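
A minimal sketch of what extract.py might do for step 1, assuming the AWS data source is an S3 bucket; the bucket name, key, and local path below are placeholders, not taken from the repo:

```python
# extract.py (sketch) -- pull a raw file from S3 into the Extract container.
# Bucket, key, and output path are illustrative placeholders.
import boto3

S3_BUCKET = "example-pipeline-bucket"   # placeholder
S3_KEY = "raw/data.csv"                 # placeholder
LOCAL_PATH = "/data/raw_data.csv"       # placeholder

def extract():
    s3 = boto3.client("s3")
    s3.download_file(S3_BUCKET, S3_KEY, LOCAL_PATH)
    print(f"Extracted s3://{S3_BUCKET}/{S3_KEY} -> {LOCAL_PATH}")

if __name__ == "__main__":
    extract()
```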
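
A similar sketch for step 2's load.py, assuming the extracted file is a CSV and that PostgreSQL's COPY is used to fill the Loading Table; the connection string and table name are assumptions:

```python
# load.py (sketch) -- bulk-load the extracted CSV into the Loading Table.
# DSN, file path, and table name are illustrative placeholders.
import psycopg2

DB_DSN = "host=postgres dbname=pipeline user=pipeline password=pipeline"  # placeholder
LOCAL_PATH = "/data/raw_data.csv"                                         # placeholder

def load():
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur, open(LOCAL_PATH) as f:
        # COPY streams the whole file into the staging ("loading") table in one pass.
        cur.copy_expert("COPY loading_table FROM STDIN WITH CSV HEADER", f)
    print("Loaded raw rows into loading_table")

if __name__ == "__main__":
    load()
```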
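
Step 3's transform.py could be little more than an INSERT ... SELECT from the Loading Table into the Final Table; the SQL and column names below are stand-ins for whatever the real transformation does:

```python
# transform.py (sketch) -- read from the loading table, write cleaned rows to the final table.
# The transformation logic, table names, and columns are illustrative placeholders.
import psycopg2

DB_DSN = "host=postgres dbname=pipeline user=pipeline password=pipeline"  # placeholder

TRANSFORM_SQL = """
INSERT INTO final_table (id, value, processed_at)
SELECT id, UPPER(value), NOW()
FROM loading_table
WHERE value IS NOT NULL;
"""

def transform():
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute(TRANSFORM_SQL)
        print(f"Transformed {cur.rowcount} rows into final_table")

if __name__ == "__main__":
    transform()
```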
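
Finally, for step 4, the Flask app in the Visualize container might expose the Final Table as JSON along these lines; the route, DSN, and query are assumptions:

```python
# app.py (sketch) -- Flask app serving rows from the final table as JSON.
# DSN, table, and route are illustrative placeholders.
import psycopg2
import psycopg2.extras
from flask import Flask, jsonify

DB_DSN = "host=postgres dbname=pipeline user=pipeline password=pipeline"  # placeholder
app = Flask(__name__)

@app.route("/api/data")
def get_data():
    with psycopg2.connect(DB_DSN) as conn, conn.cursor(
        cursor_factory=psycopg2.extras.RealDictCursor
    ) as cur:
        cur.execute("SELECT * FROM final_table LIMIT 100;")
        return jsonify(cur.fetchall())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```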

This pipeline ensures a structured flow of data from its source to a visualized or API-accessible format, with clear separation of concerns at each stage. The use of shell scripts for orchestration allows for flexible and controllable execution of the pipeline steps.
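
For illustration only, the sequencing that extract.sh, load.sh, and transform.sh provide could be mirrored in a few lines of Python; this is not the repo's actual orchestration, just a sketch of the order in which the steps run:

```python
# run_pipeline.py (sketch) -- sequential orchestration equivalent to running
# the extract, load, and transform steps in order. The repo itself uses shell scripts.
import subprocess

STEPS = ["extract.py", "load.py", "transform.py"]

def run_pipeline():
    for script in STEPS:
        print(f"Running {script} ...")
        # check=True stops the pipeline as soon as any step fails.
        subprocess.run(["python", script], check=True)

if __name__ == "__main__":
    run_pipeline()
```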