diff --git a/diagram.md b/diagram.md
new file mode 100644
index 0000000..dc5fcee
--- /dev/null
+++ b/diagram.md
@@ -0,0 +1,59 @@
+# Data Pipeline Flow
+
+```mermaid
+graph TD
+    subgraph AWS
+        A[AWS Data Source]
+    end
+
+    subgraph "Container: Extract"
+        B[extract.py]
+    end
+
+    subgraph "Container: Load"
+        C[load.py]
+    end
+
+    subgraph "Container: Transform"
+        D[transform.py]
+    end
+
+    subgraph "PostgreSQL Database"
+        E[Loading Table]
+        F[Final Table]
+    end
+
+    subgraph "Container: Visualize"
+        G[Flask App]
+    end
+
+    A -->|Data| B
+    B -->|Extracted Data| C
+    C -->|Load Data| E
+    E -->|Read Data| D
+    D -->|Transformed Data| F
+    F -->|Read Data| G
+
+    classDef container fill:#e6f3ff,stroke:#333,stroke-width:2px,color:black;
+    class B,C,D,G container;
+```
+
+## Flow Explanation
+
+The process is orchestrated by shell scripts (`extract.sh`, `load.sh`, `transform.sh`), which manage the execution of each step in the pipeline:
+
+1. **Extract**: Data is pulled from the AWS source by `extract.py` in the Extract container.
+
+2. **Load**: The extracted data is loaded into the Loading Table of the PostgreSQL database by `load.py` in the Load container.
+
+3. **Transform**: `transform.py` in the Transform container reads the Loading Table, applies the transformations, and writes the results to the Final Table.
+
+4. **Visualize**: A Flask app in the Visualize container reads the Final Table to render visualizations or serve the data via an API.
+
+Data therefore moves along a single, structured path from source to a visualized or API-accessible form, with each stage isolated in its own container and handing off through the database tables. Orchestrating with shell scripts keeps each step easy to run, skip, or rerun on its own. Hypothetical sketches of each stage follow below.
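+## Stage Sketches
+
+The snippets below are minimal, hypothetical sketches, not this project's actual code: table names, column names, file paths, routes, and connection settings are all assumptions made for illustration.
+
+The real orchestration is done by the shell scripts named above; purely to keep every sketch in one language, here is the same run-in-order, stop-on-failure sequencing expressed in Python:
+
+```python
+import subprocess
+
+# Mirrors what extract.sh, load.sh, and transform.sh do from the shell:
+# run each pipeline step in order, aborting if one fails.
+for step in ["./extract.sh", "./load.sh", "./transform.sh"]:
+    subprocess.run([step], check=True)  # check=True raises on a non-zero exit
+```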
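+A possible `extract.py`, assuming the AWS source is an S3 object (the bucket, key, and shared-volume path are made up):
+
+```python
+# extract.py (sketch): download the source object to a path the other
+# containers can see. Bucket, key, and RAW_PATH are assumptions.
+import boto3
+
+RAW_PATH = "/data/raw.csv"
+
+def extract(bucket: str = "example-bucket", key: str = "input/raw.csv") -> str:
+    s3 = boto3.client("s3")
+    s3.download_file(bucket, key, RAW_PATH)
+    return RAW_PATH
+
+if __name__ == "__main__":
+    print(f"Extracted to {extract()}")
+```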
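+A possible `load.py`, using `psycopg2` and PostgreSQL's `COPY` (the `loading_table` name and the connection environment variables are assumptions):
+
+```python
+# load.py (sketch): bulk-copy the extracted CSV into the loading table.
+import os
+import psycopg2
+
+def load(csv_path: str = "/data/raw.csv") -> None:
+    conn = psycopg2.connect(
+        host=os.environ.get("POSTGRES_HOST", "postgres"),
+        dbname=os.environ.get("POSTGRES_DB", "pipeline"),
+        user=os.environ.get("POSTGRES_USER", "postgres"),
+        password=os.environ.get("POSTGRES_PASSWORD", ""),
+    )
+    try:
+        with conn, conn.cursor() as cur, open(csv_path) as f:
+            # copy_expert streams the file and skips the CSV header row
+            cur.copy_expert(
+                "COPY loading_table FROM STDIN WITH (FORMAT csv, HEADER true)", f
+            )
+    finally:
+        conn.close()
+
+if __name__ == "__main__":
+    load()
+```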
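+A possible `transform.py`; the actual transformation is not described here, so the `SELECT` is a placeholder:
+
+```python
+# transform.py (sketch): read the loading table, transform, write the
+# final table. Table and column names are assumptions.
+import os
+import psycopg2
+
+TRANSFORM_SQL = """
+INSERT INTO final_table (id, value)
+SELECT id, UPPER(value)  -- placeholder transformation
+FROM loading_table;
+"""
+
+def transform() -> None:
+    conn = psycopg2.connect(
+        host=os.environ.get("POSTGRES_HOST", "postgres"),
+        dbname=os.environ.get("POSTGRES_DB", "pipeline"),
+        user=os.environ.get("POSTGRES_USER", "postgres"),
+        password=os.environ.get("POSTGRES_PASSWORD", ""),
+    )
+    try:
+        with conn, conn.cursor() as cur:
+            cur.execute(TRANSFORM_SQL)  # runs inside a single transaction
+    finally:
+        conn.close()
+
+if __name__ == "__main__":
+    transform()
+```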
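+And a possible Flask app for the Visualize container, serving rows from the final table as JSON (the route and columns are assumptions):
+
+```python
+# app.py (sketch): Flask app reading the final table.
+import os
+import psycopg2
+from flask import Flask, jsonify
+
+app = Flask(__name__)
+
+def fetch_rows(limit: int = 100):
+    # Connection settings are assumptions; adjust to the real environment.
+    conn = psycopg2.connect(
+        host=os.environ.get("POSTGRES_HOST", "postgres"),
+        dbname=os.environ.get("POSTGRES_DB", "pipeline"),
+        user=os.environ.get("POSTGRES_USER", "postgres"),
+        password=os.environ.get("POSTGRES_PASSWORD", ""),
+    )
+    try:
+        with conn.cursor() as cur:
+            cur.execute("SELECT id, value FROM final_table LIMIT %s;", (limit,))
+            return [{"id": r[0], "value": r[1]} for r in cur.fetchall()]
+    finally:
+        conn.close()
+
+@app.route("/data")
+def data():
+    # Serve the transformed rows for a chart front end or API consumers
+    return jsonify(fetch_rows())
+
+if __name__ == "__main__":
+    app.run(host="0.0.0.0", port=5000)
+```
+
+Returning JSON keeps the container usable both by a chart front end and as a plain API, matching step 4 above.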