# Putting It All Together

:::{admonition} Overview
:class: tip

**Teaching:** 60 min

**Exercises:** 8 min

**Questions:**
  * Can you show me an example?

**Objectives:**
  * Create a simple workflow using a cloud VM and cloud object storage.
  * Update the software on a VM instance with important security updates.
  * Create a VM Instance with the appropriate storage scope.
  * Create a private regional storage bucket with appropriate security settings.
  * Use the CLI to install software.
  * Download source code using git.
  * Retrieve data from a bucket.
  * Run the Python analysis code.
  * Store results in a bucket.
  * View the results in the Cloud Storage browser.

:::

## A Research Computational and Data Workflow - Drew's story

Drew needs to do some analysis on the data.  They need data (satellite images stored in the cloud), computational resources (a virtual machine), some software (we will supply this), and a place to store the results (Cloud Storage).  We will assemble and process all these parts in the cloud with a simple example.


:::{tip}

If you get disconnected, you will need to reconnect to the virtual machine (`gcloud compute ssh essentials`) and re-run the following commands:
```
BUCKET="essentials-$USER-$(date +%F)"
REGION="us-west2"
echo "bucket: $BUCKET region: $REGION"
cd ~/CLASS-Examples/landsat
```
:::

## Create a VM

Since we only create resources in the cloud as we need them, we will now create a new virtual machine (VM) instance for Drew to use for their analysis.

We will do this as an exercise to give you practice in creating resources. Since the virtual machine will need access to storage on your behalf, you will need to change the **access scope** to give **Full** access to the **Storage** API to the virtual machine.

Before you do anything, think about (and check) your **who**, **where**, and **what**!


*Instructor: place the exercise instructions below on the screen*

*When you are done feel free to connect to the virtual machine on your own for additional practice.  Once everyone has created their VM we will connect to the machine as described below.*

:::{admonition} Exercise 6
:class: exercise

Using the console, navigate to the "Compute Engine" service and create a new VM with the following properties.
  * Set the VM instance **name** to "**essentials**".
  * Select a slightly larger VM instance by changing the machine **type** to "**e2-standard-2**".
  * Set the VM instance API access for "**Storage**" to "**Full**".  This can be found under "Identity and API access" on the "create an instance" page: select "**Set access for each API**" and change "Storage" to "Full". *This will allow the VM to create, read, write, and delete all storage buckets in the project.*
  
:::

Please verify that the virtual machine was created as above.  If you are unsure, delete the virtual machine instance and create it again.

Verify that the **Compute Engine default service account** is being used.
![compute-iam-service-account](img/compute-iam-service-account.png)

Change **Access scopes** to **Set access for each API**
![compute-iam-scope-top](img/compute-iam-scope-top.png)

And set **Storage** to **Full**.
![compute-iam-scope-storage-full](img/compute-iam-scope-storage-full.png)

## Connect to the VM

Now log in to the new virtual machine instance by opening the Cloud Shell (don't forget to check your **who**, **where**, and **what**) and then running the following command:
```
gcloud compute ssh essentials
```
If prompted for a zone, answer `n` to have it found automatically.  You can see an example session below.

```
learner@cloudshell:~ (just-armor-301114)$ gcloud compute ssh essentials
Did you mean zone [us-central1-b] for instance: [essentials] (Y/n)?  n

No zone specified. Using zone [us-west2-c] for instance: [essentials].
Linux essentials 4.19.0-18-cloud-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Nov  9 20:12:49 2021 from 34.133.99.196
learner@essentials:~$
```


## Secure the VM

We first make sure that the VM is up to date with the latest security patches by running the following commands. Note: the `sudo unattended-upgrades` command only installs important security packages and does not upgrade all packages.

In [1]:
sudo apt update
sudo unattended-upgrades

Hit:1 http://security.debian.org/debian-security buster/updates InRelease
Hit:2 http://deb.debian.org/debian buster InRelease
Hit:3 http://deb.debian.org/debian buster-updates InRelease
Hit:4 http://deb.debian.org/debian buster-backports InRelease
Hit:5 http://packages.cloud.google.com/apt cloud-sdk-buster InRelease
Get:6 http://packages.cloud.google.com/apt google-cloud-packages-archive-keyring-buster InRelease [5553 B]
Hit:7 http://packages.cloud.google.com/apt google-compute-engine-buster-stable InRelease
Get:8 http://packages.cloud.google.com/apt google-cloud-packages-archive-keyring-buster/main amd64 Packages [389 B]
Fetched 5942 B in 1s (7839 B/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
2 packages can be upgraded. Run 'apt list --upgradable' to see them.


## Setup Storage

Before we do any work, we will first create a bucket, with a reasonable set of options, in which to place the results.  We do this first to make sure we can store the results when we are done; it is easier to fix problems now than later.

We first store the bucket name in the `BUCKET` environment variable for future use.  This time we will specify a realistic set of options for a private bucket used for computation.

Options (run `gsutil mb --help` for more information):
 * `-b on` specifies uniform bucket-level access.
 * `-l $REGION` puts the data in a specific region for lower cost and lower latency.
 * `--pap enforced` turns on public access prevention to help keep data private.  
 
Uniform bucket-level access (shown as `Bucket Policy Only enabled: true`) applies access permissions to the entire bucket rather than to each object in it; per-object ACLs are disabled and access is controlled through bucket-level IAM permissions.  This makes the permissions obvious and makes security much more predictable.
 
As usual, we must set our environment.  In this case we also set a `REGION` environment variable to indicate where in the world we want the data to be stored.


In [2]:
BUCKET="essentials-$USER-$(date +%F)"
REGION="us-west2"
echo "bucket: $BUCKET region: $REGION"

bucket: essentials-learner-2022-02-14 region: us-west2


In [3]:
gsutil mb -b on -l $REGION --pap enforced gs://$BUCKET

Creating gs://essentials-learner-2022-02-14/...


Then verify that the bucket was created.

In [4]:
gsutil ls -b gs://$BUCKET

gs://essentials-learner-2022-02-14/
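
The naming and creation steps above can also be sketched in Python. This is a minimal illustration, not part of the lesson's example code: the function names `bucket_name` and `make_bucket_command` are hypothetical, and the sketch only builds the same `gsutil mb` argument list shown above without running it.

```python
from datetime import date


def bucket_name(user, day):
    """Mirror BUCKET="essentials-$USER-$(date +%F)"; %F is the ISO date YYYY-MM-DD."""
    return f"essentials-{user}-{day.isoformat()}"


def make_bucket_command(bucket, region):
    """Build the argument list for the `gsutil mb` call used in this section."""
    return [
        "gsutil", "mb",
        "-b", "on",            # uniform bucket-level access
        "-l", region,          # keep the data in one region
        "--pap", "enforced",   # public access prevention
        f"gs://{bucket}",
    ]


name = bucket_name("learner", date(2022, 2, 14))
print(name)
print(" ".join(make_bucket_command(name, "us-west2")))
```

To actually create the bucket from Python, the list could be passed to `subprocess.run(cmd, check=True)`, assuming `gsutil` is installed and authenticated as it is on the VM.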


## Get Example Code

We will now install `git` and use it to download the example code into your home directory.  For those of you who are unfamiliar with git, it is a tool for collaboratively managing files; here we will use it only to download the example code.

In [1]:
sudo apt-get install --yes git

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git is already the newest version (1:2.30.2-1).
0 upgraded, 0 newly installed, 0 to remove and 11 not upgraded.


In [2]:
cd ~

In [3]:
git clone https://github.internet2.edu/CLASS/CLASS-Examples.git

Cloning into 'CLASS-Examples'...
remote: Enumerating objects: 117, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 117 (delta 1), reused 9 (delta 1), pack-reused 107
Receiving objects: 100% (117/117), 20.66 KiB | 542.00 KiB/s, done.
Resolving deltas: 100% (44/44), done.


We now change the current directory to the `landsat` directory in the `CLASS-Examples` directory that was just created by the previous git command.

In [4]:
cd ~/CLASS-Examples/landsat/

Your prompt should now change to show the current directory, as follows.
```
learner@essentials:~/CLASS-Examples/landsat$
```

In [5]:
ls -l

total 32
-rw-r--r-- 1 learner learner 964 May 20 17:41 ReadMe.md
-rw-r--r-- 1 learner learner  72 May 20 17:41 clean.sh
-rw-r--r-- 1 learner learner 280 May 20 17:41 download.sh
-rw-r--r-- 1 learner learner 113 May 20 17:41 get-data.sh
-rw-r--r-- 1 learner learner 345 May 20 17:41 get-index.sh
-rw-r--r-- 1 learner learner 613 May 20 17:41 process_sat.py
-rw-r--r-- 1 learner learner  95 May 20 17:41 search.json
-rw-r--r-- 1 learner learner 851 May 20 17:41 search.py


## Access the Bucket

Now we need to verify that Drew has access to the analysis data. 

We do this by testing that our tools are working and that we can access the public bucket that we will be using.

In [6]:
gsutil ls gs://gcp-public-data-landsat/

gs://gcp-public-data-landsat/index.csv.gz
gs://gcp-public-data-landsat/LC08/
gs://gcp-public-data-landsat/LE07/
gs://gcp-public-data-landsat/LM01/
gs://gcp-public-data-landsat/LM02/
gs://gcp-public-data-landsat/LM03/
gs://gcp-public-data-landsat/LM04/
gs://gcp-public-data-landsat/LM05/
gs://gcp-public-data-landsat/LO08/
gs://gcp-public-data-landsat/LT04/
gs://gcp-public-data-landsat/LT05/
gs://gcp-public-data-landsat/LT08/


## Getting the Metadata

Since the Landsat data is *huge* we do not, and cannot, download everything to the virtual machine.  We will analyze only a subset of the data.

We will use the `index.csv.gz` file, which lists all the files in the bucket along with additional metadata, to search and filter the data.

We will first get the index and uncompress the file, placing it in the `data/` directory (this directory is ignored by git). This should take around 2 min with an `e2-medium` instance in the `us-west2` region.

In [7]:
mkdir -v data

mkdir: created directory 'data'


In [8]:
gsutil cp gs://gcp-public-data-landsat/index.csv.gz data/

Copying gs://gcp-public-data-landsat/index.csv.gz...
\ [1 files][731.9 MiB/731.9 MiB]   54.4 MiB/s                                   
Operation completed over 1 objects/731.9 MiB.                                    


We will now uncompress the index file to make it easier to use.  This may take some time depending on the machine type you are using, which is one reason it is good to write scripts that do the entire process.

In [9]:
gzip -df data/index.csv.gz

Again, check what happened.

In [10]:
ls -lh data

total 2.5G
-rw-r--r-- 1 learner learner 2.5G May 20 17:42 index.csv


We will now explore the data.  The `head` command simply displays the first few lines of the file.

In [11]:
head data/index.csv

SCENE_ID,PRODUCT_ID,SPACECRAFT_ID,SENSOR_ID,DATE_ACQUIRED,COLLECTION_NUMBER,COLLECTION_CATEGORY,SENSING_TIME,DATA_TYPE,WRS_PATH,WRS_ROW,CLOUD_COVER,NORTH_LAT,SOUTH_LAT,WEST_LON,EAST_LON,TOTAL_SIZE,BASE_URL
LE71800592011134ASN00,LE07_L1TP_180059_20110514_20161209_01_T1,LANDSAT_7,ETM,2011-05-14,01,T1,2011-05-14T08:50:07.5251363Z,L1TP,180,59,74.0,2.39913,0.50961,18.09062,20.2487,181281962,gs://gcp-public-data-landsat/LE07/01/180/059/LE07_L1TP_180059_20110514_20161209_01_T1
LT51360422008226BKT00,LT05_L1GS_136042_20080813_20161030_01_T2,LANDSAT_5,TM,2008-08-13,01,T2,2008-08-13T04:03:49.0450690Z,L1GS,136,42,92.0,26.9495,25.03915,91.40541,93.81099,141994748,gs://gcp-public-data-landsat/LT05/01/136/042/LT05_L1GS_136042_20080813_20161030_01_T2
LE71760312020339NSG00,LE07_L1TP_176031_20201204_20201230_01_T1,LANDSAT_7,ETM,2020-12-04,01,T1,2020-12-04T07:41:11.6084536Z,L1TP,176,31,3.0,42.75649,40.7935,33.66313,36.65653,188511155,gs://gcp-public-data-landsat/LE07/01/176/031/LE07_L1TP_176031_20201204_

::::{tip}

To run the above commands in one step run
```
bash get-index.sh
```
::::
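
The index can also be explored from Python with the standard `csv` module. The following is a minimal sketch rather than part of the lesson's code: the function name `count_by_spacecraft` is hypothetical, and the in-memory sample uses only a subset of the real columns so that it runs without the 2.5G file.

```python
import csv
import io
from collections import Counter


def count_by_spacecraft(lines):
    """Count index rows per SPACECRAFT_ID from an iterable of CSV text lines."""
    reader = csv.DictReader(lines)  # the first line supplies the column names
    return Counter(row["SPACECRAFT_ID"] for row in reader)


# Tiny illustrative sample: scene IDs taken from the `head` output above,
# with most columns left out.
sample = io.StringIO(
    "SCENE_ID,SPACECRAFT_ID,CLOUD_COVER\n"
    "LE71800592011134ASN00,LANDSAT_7,74.0\n"
    "LT51360422008226BKT00,LANDSAT_5,92.0\n"
    "LE71760312020339NSG00,LANDSAT_7,3.0\n"
)
print(count_by_spacecraft(sample))
```

On the VM the same function could be given an open file, for example `open("data/index.csv", newline="")`. Because rows are read one at a time, the whole index never has to fit in memory.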

:::{admonition} Break (Optional)
:class: exercise

Now our virtual machine instance is ready and we can access the code and data.  Now is a great time to take a short break.

:::

## Getting the Data

We can see that the data is well formed and matches what we expect.  We will now use this index to download the data for a specific point, limited to Landsat 8 only.  The following script does a simple filter.

In [12]:
cat search.py

#!/usr/bin/python3
import json
import csv
import sys

# Example: Burr Oak Tree
# 38.899313,-92.464562 (Lat north+, Long west-) ; Landsat Path 025, Row 033
config=json.load(open("search.json"))
lat,lon=config['lat'],config['lon']
landsat=config['landsat']
limit=config['limit']

reader=csv.reader(sys.stdin)
header=next(reader) # skip header
for l in reader:
    SCENE_ID,PRODUCT_ID,SPACECRAFT_ID,SENSOR_ID,DATE_ACQUIRED,COLLECTION_NUMBER,COLLECTION_CATEGORY,SENSING_TIME,DATA_TYPE,WRS_PATH,WRS_ROW,CLOUD_COVER,NORTH_LAT,SOUTH_LAT,WEST_LON,EAST_LON,TOTAL_SIZE,BASE_URL=l
    west,east=float(WEST_LON),float(EAST_LON)
    north,south=float(NORTH_LAT),float(SOUTH_LAT)
    if SPACECRAFT_ID==landsat and north >= lat and south <= lat and west <= lon and east >= lon:
        print(BASE_URL) # output BASE_URL
        limit and sys.exit(0) # limit results


We can see that the actual search parameters come from the file `search.json`.  The program reads the index from standard input and iterates over all rows in the CSV file.  It keeps the rows whose image contains the point and prints the bucket URL for each of them. We are interested in all products that contain the Burr Oak Tree.
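
The containment test at the heart of `search.py` can be isolated as a small function. This is a sketch for illustration only: the name `scene_contains` and the second set of bounds are hypothetical. Like the script, it assumes a simple rectangular comparison, which would not handle a scene that crosses the 180° meridian.

```python
def scene_contains(north, south, west, east, lat, lon):
    """Return True when the point (lat, lon) lies inside the scene's bounding box."""
    return south <= lat <= north and west <= lon <= east


# The point from the comment in search.py (Burr Oak Tree).
lat, lon = 38.899313, -92.464562

# Bounds of the first index row shown by `head` (a Landsat 7 scene near the equator).
print(scene_contains(2.39913, 0.50961, 18.09062, 20.2487, lat, lon))  # False

# Hypothetical bounds, chosen only so that they enclose the point.
print(scene_contains(40.0, 38.0, -94.0, -91.0, lat, lon))  # True
```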

In [13]:
cat search.json

{
    "lat": 38.899313,
    "lon": -92.464562,
    "landsat": "LANDSAT_8",
    "limit": true
}


Now let's test this on a subset of the data (note the 'limit' option).

In [14]:
python3 search.py < data/index.csv

gs://gcp-public-data-landsat/LC08/01/025/033/LC08_L1TP_025033_20201007_20201016_01_T1


Now that we have a list of the folders we are interested in, we will download them with a simple script that takes bucket addresses (URLs) and downloads them with the `gsutil` program.

In [15]:
cat download.sh

#!/bin/bash

# Read space separated URL from STDIN and download 
while read -r URL ; do
    echo "+++ $URL"
    # -m parallel
    # -n no-clobber (do not re-download data)
    # -r recursive (download all the data in the specified URL)
    gsutil -m cp -n -r "${URL}/" data/
done
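
For comparison, the same loop can be sketched in Python. This is not the lesson's script: the function name `download_commands` is hypothetical, and it only builds the `gsutil` argument lists rather than running them. Unlike `download.sh`, it also skips blank input lines.

```python
import io


def download_commands(lines, dest="data/"):
    """Yield one gsutil argument list per non-empty URL line."""
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        # -m (parallel) is a global gsutil option, so it goes before `cp`;
        # -n (no-clobber) and -r (recursive) are options of `cp` itself.
        yield ["gsutil", "-m", "cp", "-n", "-r", url + "/", dest]


urls = io.StringIO(
    "gs://gcp-public-data-landsat/LC08/01/025/033/"
    "LC08_L1TP_025033_20201007_20201016_01_T1\n"
)
for cmd in download_commands(urls):
    print(" ".join(cmd))
```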


Get the first dataset (using the limit option).

In [16]:
python3 search.py < data/index.csv | bash download.sh

+++ gs://gcp-public-data-landsat/LC08/01/025/033/LC08_L1TP_025033_20201007_20201016_01_T1
Copying gs://gcp-public-data-landsat/LC08/01/025/033/LC08_L1TP_025033_20201007_20201016_01_T1/LC08_L1TP_025033_20201007_20201016_01_T1_ANG.txt...
Copying gs://gcp-public-data-landsat/LC08/01/025/033/LC08_L1TP_025033_20201007_20201016_01_T1/LC08_L1TP_025033_20201007_20201016_01_T1_B10.TIF...
Copying gs://gcp-public-data-landsat/LC08/01/025/033/LC08_L1TP_025033_20201007_20201016_01_T1/LC08_L1TP_025033_20201007_20201016_01_T1_B1.TIF...
Copying gs://gcp-public-data-landsat/LC08/01/025/033/LC08_L1TP_025033_20201007_20201016_01_T1/LC08_L1TP_025033_20201007_20201016_01_T1_B3.TIF...
Copying gs://gcp-public-data-landsat/LC08/01/025/033/LC08_L1TP_025033_20201007_20201016_01_T1/LC08_L1TP_025033_20201007_20201016_01_T1_B4.TIF...
Copying gs://gcp-public-data-landsat/LC08/01/025/033/LC08_L1TP_025033_20201007_20201016_01_T1/LC08_L1TP_025033_20201007_20201016_01_T1_B11.TIF...
Copying gs://gcp-public-data-landsat/

Check that the data was downloaded

In [17]:
ls -l data

total 2564792
drwxr-xr-x 2 learner learner       4096 May 20 17:42 LC08_L1TP_025033_20201007_20201016_01_T1
-rw-r--r-- 1 learner learner 2626336574 May 20 17:42 index.csv


::::{tip}

To run the above analysis in one step run

```
bash get-data.sh
```

::::

## Processing the Data

We will now use a simple script to combine the color bands and export the result as a PNG.

In [18]:
cat process_sat.py

#!/usr/bin/python3
import os
import rasterio

# Open the first directory in data, could walk entire tree
for dirname, dirs, files in os.walk('data'):
    source = dirs[0]
    break

# Open band (B2/Blue) and copy metadata for result.
with rasterio.open("data/%s/%s_B2.TIF" % (source, source)) as band2:
    meta = band2.meta

# Combine bands into PNG
meta.update(count = 3, driver='PNG')
with rasterio.open("output/result-%s.png" % source, 'w+', **meta) as output:
    for i in range(1, 4):
        print(source, i)
        output.write_band(i,rasterio.open("data/%s/%s_B%d.TIF" % (source, source, i+3)).read(1))
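
To make the band arithmetic in the final loop explicit, the sketch below reproduces only the path construction, without `rasterio`. The function name `band_inputs` is hypothetical. It shows that output bands 1, 2, and 3 are read from the `B4`, `B5`, and `B6` files (index `i+3`), while `B2` is opened earlier only to copy its metadata.

```python
def band_inputs(source):
    """Map each output band index to the input file process_sat.py reads for it."""
    return {
        i: "data/%s/%s_B%d.TIF" % (source, source, i + 3)
        for i in range(1, 4)
    }


scene = "LC08_L1TP_025033_20201007_20201016_01_T1"
for out_band, path in band_inputs(scene).items():
    print(out_band, path)
```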


This code writes to the `output` folder, so let's create it.

In [19]:
mkdir -v output/

mkdir: created directory 'output/'


In [20]:
python3 process_sat.py

Traceback (most recent call last):
  File "/home/learner/CLASS-Examples/landsat/process_sat.py", line 3, in <module>
    import rasterio
ModuleNotFoundError: No module named 'rasterio'


Oops, let's install the library (note: the output will be slightly different due to how this Lesson is built).

In [21]:
sudo apt-get install python3-rasterio --yes

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  gdal-data libaec0 libaom0 libarmadillo10 libarpack2 libblas3 libcfitsio9
  libcharls2 libdap27 libdapclient6v5 libdav1d4 libde265-0 libepsilon1
  libfreexl1 libfyba0 libgdal28 libgeos-3.9.0 libgeos-c1v5 libgeotiff5
  libgfortran5 libgif7 libhdf4-0-alt libhdf5-103-1 libhdf5-hl-100 libheif1
  libicu67 libkmlbase1 libkmldom1 libkmlengine1 liblapack3 liblcms2-2 libltdl7
  libmariadb3 libminizip1 libnetcdf18 libnspr4 libnss3 libnuma1 libodbc1
  libogdi4.1 libopenjp2-7 libpoppler102 libpq5 libproj19 libqhull8.0
  librttopo1 libspatialite7 libsuperlu5 libsz2 liburiparser1 libx265-192
  libxerces-c3.2 libxml2 libxslt1.1 mariadb-common mysql-common odbcinst
  odbcinst1debian2 poppler-data proj-bin proj-data python3-affine python3-attr
  python3-bs4 python3-certifi python3-chardet python3-click
  python3-click-plugins python3-cligj python3-colora

In [22]:
python3 process_sat.py

LC08_L1TP_025033_20201007_20201016_01_T1 1
LC08_L1TP_025033_20201007_20201016_01_T1 2
LC08_L1TP_025033_20201007_20201016_01_T1 3


*Note: if you get an "ERROR 4" it is expected in earlier versions of rasterio.*
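
A missing module like the one in the traceback above can also be detected before running an analysis. The following is a minimal sketch using only the standard library; the function name `missing_modules` is hypothetical and not part of the lesson's code.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "os" ships with Python; "rasterio" is the third-party module used by process_sat.py.
print(missing_modules(["os", "rasterio"]))
```

An empty list means every module was found; any name that is printed still needs to be installed.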

Verify the results were created.

In [23]:
ls -lh output

total 192M
-rw-r--r-- 1 learner learner 192M May 20 17:43 result-LC08_L1TP_025033_20201007_20201016_01_T1.png
-rw-r--r-- 1 learner learner  941 May 20 17:43 result-LC08_L1TP_025033_20201007_20201016_01_T1.png.aux.xml


## Saving the Results

We will now store the results in the bucket we created at the beginning of the episode.  First we verify that the environment variable is set and that the bucket exists.


In [28]:
echo $BUCKET

essentials-learner-2022-02-14


In [29]:
gsutil ls -b gs://$BUCKET

gs://essentials-learner-2022-02-14/


Now copy the output data to the bucket. The `-r` flag recursively copies the output directory and `-m` copies the files in parallel.  Note the locations of the `-m` and `-r` switches as they apply globally and to the `cp` command respectively.

In [30]:
gsutil -m cp -r output gs://$BUCKET

Copying file://output/result-LC08_L1TP_025033_20201007_20201016_01_T1.png.aux.xml [Content-Type=application/xml]...
Copying file://output/result-LC08_L1TP_025033_20201007_20201016_01_T1.png [Content-Type=image/png]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

\ [2/2 files][191.6 MiB/191.6 MiB] 100% Done                                    
Operation compl

Verify that the output was uploaded.

In [31]:
gsutil ls gs://$BUCKET

gs://essentials-learner-2022-02-14/output/


In [32]:
gsutil ls -lh gs://$BUCKET/output

191.58 MiB  2022-02-14T20:36:53Z  gs://essentials-learner-2022-02-14/output/result-LC08_L1TP_025033_20201007_20201016_01_T1.png
     910 B  2022-02-14T20:36:51Z  gs://essentials-learner-2022-02-14/output/result-LC08_L1TP_025033_20201007_20201016_01_T1.png.aux.xml
TOTAL: 2 objects, 200890195 bytes (191.58 MiB)
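
The `TOTAL` line reports the same size twice, once in bytes and once in MiB. As a quick check of that arithmetic, here is a small sketch; the helper name `to_mib` is hypothetical.

```python
def to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


total_bytes = 200890195  # TOTAL reported by `gsutil ls -lh` above
print(f"{to_mib(total_bytes):.2f} MiB")  # prints "191.58 MiB"
```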


## Viewing the Results

You can now view the results by using the Google Cloud Web Console: navigate to "Cloud Storage", select the bucket, and then select the result object you wish to view (the `.png` file).  You will need to click the "Preview" button given the large size of the image.

:::{admonition} Exercise
:class: exercise

  * Try to find and view the results on your own

:::

Navigate to **Cloud Storage** -> **Buckets** -> **Bucket** -> **output** folder and then click on the **result** object.
![example-object](img/example-object.png)

Then press the **Preview** button below the object details (take a moment to check the details) and you should see something similar to the following:
![example-object-preview](img/example-object-preview.png)

## Cleanup

We will leave the resources running for now in order to learn more about sharing and monitoring costs, and will clean up all the resources at the end of the Lesson.

:::{admonition} Danger
:class: danger

**Do not forget to remove the Cloud Storage Bucket and the Compute Engine Instance (Virtual Machine) when you are done with this Lesson!** Running resources and stored data cost money! Clean up when you are done!

:::