From 1b22c7778720135449127a98397bdf2ceda1347c Mon Sep 17 00:00:00 2001
From: tmanik <tmanik@internet2.edu>
Date: Tue, 17 Sep 2024 10:39:20 -0400
Subject: [PATCH] Added diagram

---
 diagram.md | 176 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 diagram.md

diff --git a/diagram.md b/diagram.md
new file mode 100644
index 0000000..dc5fcee
--- /dev/null
+++ b/diagram.md
@@ -0,0 +1,176 @@
+# Data Pipeline Flow
+
+<div align="center">
+
+```mermaid
+graph TD
+    subgraph AWS
+        A[AWS Data Source]
+    end
+
+    subgraph "Container: Extract"
+        B[extract.py]
+    end
+
+    subgraph "Container: Load"
+        C[load.py]
+    end
+
+    subgraph "Container: Transform"
+        D[transform.py]
+    end
+
+    subgraph "PostgreSQL Database"
+        E[Loading Table]
+        F[Final Table]
+    end
+
+    subgraph "Container: Visualize"
+        G[Flask App]
+    end
+
+    A -->|Data| B
+    B -->|Extracted Data| C
+    C -->|Load Data| E
+    E -->|Read Data| D
+    D -->|Transformed Data| F
+    F -->|Read Data| G
+
+    classDef container fill:#e6f3ff,stroke:#333,stroke-width:2px,color:black;
+    class B,C,D,G container;
+```
+
+</div>
+
+## Flow Explanation
+
+The entire process is orchestrated by shell scripts (`extract.sh`, `load.sh`, `transform.sh`), which manage the execution of each step in the pipeline (illustrative sketches of each stage follow this list):
+
+1. **Extract**: Data is sourced from AWS and extracted by `extract.py` in the Extract container.
+
+2. **Load**: The extracted data is then loaded into the Loading Table of the PostgreSQL database by `load.py` in the Load container.
+
+3. **Transform**: Data from the Loading Table is read, transformed by `transform.py` in the Transform container, and stored in the Final Table.
+
+4. **Visualize**: Finally, a Flask app in the Visualize container reads data from the Final Table to render visualizations or serve the data via an API.
+
+This pipeline ensures a structured flow of data from its source to a visualized or API-accessible format, with a clear separation of concerns at each stage. Orchestrating the steps with shell scripts keeps execution of the pipeline flexible and easy to control.
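+
+## Example Sketches
+
+The snippets below are minimal, illustrative sketches only; the module layout, bucket and table names, and connection parameters (`example-bucket`, `loading_table`, `pipeline_user`, and so on) are placeholders rather than the actual implementation.
+
+### Orchestration
+
+The shell scripts drive the pipeline; one way a thin Python wrapper could invoke them in order, assuming each script exits non-zero on failure:
+
+```python
+import subprocess
+import sys
+
+# Run each stage's shell script in order; stop as soon as one fails.
+STAGES = ["extract.sh", "load.sh", "transform.sh"]
+
+def run_pipeline() -> None:
+    for script in STAGES:
+        print(f"Running {script} ...")
+        result = subprocess.run(["bash", script])
+        if result.returncode != 0:
+            sys.exit(f"{script} failed with exit code {result.returncode}")
+    print("Pipeline completed successfully.")
+
+if __name__ == "__main__":
+    run_pipeline()
+```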
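+
+### Extract
+
+A sketch of what `extract.py` might look like, assuming the AWS data source is an S3 object (bucket, key, and local path are hypothetical):
+
+```python
+import boto3
+
+def extract(bucket: str = "example-bucket", key: str = "raw/data.csv",
+            dest: str = "/tmp/data.csv") -> str:
+    # Download the raw object from S3 to a local file for the next stage.
+    s3 = boto3.client("s3")
+    s3.download_file(bucket, key, dest)
+    return dest
+
+if __name__ == "__main__":
+    print(f"Extracted to {extract()}")
+```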
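+
+### Load
+
+A sketch of what `load.py` might look like, assuming the extracted file is a CSV and using PostgreSQL's COPY for a fast bulk load (connection details are placeholders):
+
+```python
+import psycopg2
+
+def load(csv_path: str = "/tmp/data.csv") -> None:
+    conn = psycopg2.connect(host="postgres", dbname="pipeline",
+                            user="pipeline_user", password="secret")
+    try:
+        # COPY streams the whole file in one round trip, which is far
+        # faster than row-by-row INSERTs for a bulk load.
+        with conn, conn.cursor() as cur, open(csv_path) as f:
+            cur.copy_expert("COPY loading_table FROM STDIN WITH CSV HEADER", f)
+    finally:
+        conn.close()
+
+if __name__ == "__main__":
+    load()
+```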
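+
+### Transform
+
+A sketch of what `transform.py` might look like, doing the transformation in SQL so the rows never leave the database (`UPPER(value)` stands in for the real logic):
+
+```python
+import psycopg2
+
+def transform() -> None:
+    conn = psycopg2.connect(host="postgres", dbname="pipeline",
+                            user="pipeline_user", password="secret")
+    try:
+        # Read from the loading table, transform, write to the final table.
+        with conn, conn.cursor() as cur:
+            cur.execute("""
+                INSERT INTO final_table (id, value)
+                SELECT id, UPPER(value)
+                FROM loading_table
+            """)
+    finally:
+        conn.close()
+
+if __name__ == "__main__":
+    transform()
+```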
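+
+### Visualize
+
+A sketch of what the Flask app might look like, serving the final table as JSON (the `/data` route and the two-column schema are hypothetical):
+
+```python
+from flask import Flask, jsonify
+import psycopg2
+
+app = Flask(__name__)
+
+@app.route("/data")
+def data():
+    conn = psycopg2.connect(host="postgres", dbname="pipeline",
+                            user="pipeline_user", password="secret")
+    try:
+        # Read the transformed rows and return them as JSON.
+        with conn.cursor() as cur:
+            cur.execute("SELECT id, value FROM final_table")
+            rows = [{"id": i, "value": v} for i, v in cur.fetchall()]
+    finally:
+        conn.close()
+    return jsonify(rows)
+
+if __name__ == "__main__":
+    app.run(host="0.0.0.0", port=5000)
+```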