Geospatial Data Processing using Python [Part 1-Accessing The Dataset via STAC]
Geospatial analysis has grown significantly over the past couple of years, and analyzing large datasets has become the new norm for geospatial processing. Just around 10 years ago, the geospatial sector was hugely dependent on licensed software. There were seeds of new applications, such as QGIS, whose development started in 2002, but it only came onto my radar around 2012, roughly 10 years after it began. Fast forward to today, and open-source tools, scripting, and programming are widely used in geospatial practice.
Maybe you are as worried as I am. As someone trying to transition to programming, I am a beginner, a super beginner. The wave came so fast that it started to feel overwhelming. Everyone else seems so good at dealing with it; people are doing lots of cool stuff with the available tools, and my circle seems so graceful and swift in transitioning into programming. To some extent, I felt left behind. (Oops, now I'm just venting.)
I have no option other than adapting and learning as much as I can. Thankfully, there are many tools and plenty of information available on the internet, even though reading an article on the internet is not the same as understanding it. Let's give it a try. Medium is my place to document the things I try to learn. Many of my previous articles are about how to install stuff, how to download stuff, and so on. Maybe it is time to do the thing I am most afraid of: documenting the steps for doing GIS work in Python (or whatever programming language I bump into).
A couple of weeks ago, I took a training from eScience on an introduction to geospatial raster and vector data with Python. I will be sharing the knowledge I acquired during that session.
In this article, I am trying to do clipping using a Jupyter Notebook. Clipping is a simple feature in QGIS or ArcGIS, but if you are managing a lot of data and do not want to watch the software crash, Python can be one of the options. The steps and setup can be found in this link from eScience, which is open to the public and accessible to all. They have many interesting trainings available on their website.
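As a rough preview of where this series is heading, clipping a raster with rioxarray and geopandas can look like the sketch below. This is only a minimal sketch under my own assumptions: the file names raster.tif and area.geojson are placeholders, not files from the eScience material.
import rioxarray
import geopandas
# placeholder inputs: any GeoTIFF and a vector file containing the clip area
raster = rioxarray.open_rasterio("raster.tif")
area = geopandas.read_file("area.geojson")
# clip the raster to the vector geometries, passing the vector CRS so rioxarray can match it to the raster
clipped = raster.rio.clip(area.geometry.values, crs=area.crs)
clipped.rio.to_raster("clipped.tif")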
Installation
The installation aims to set up the rioxarray package, a powerful tool for raster analysis, and geopandas for vector analysis.
- Download and install Anaconda
- Set up the directory and get the data
a. Open the terminal/shell (Anaconda Prompt on Windows)
b. Change the working directory
Create a new directory on your computer called geospatial-python and change into it:
mkdir geospatial-python
cd geospatial-python
mkdir data
cd data
c. Create a New Environment
First, test whether conda is working by typing conda:
conda
Create the new conda environment. This is the first approach to try; in my case, this command failed to run on my computer.
conda create -n geospatial -c conda-forge -y \
python=3.10 jupyterlab numpy matplotlib \
xarray rasterio geopandas rioxarray earthpy descartes xarray-spatial pystac-client python-graphviz
Then I moved to another approach suggested in the guidance: using an environment.yaml file. Download the yaml file, save it in geospatial-python/data, and run the following command:
conda env create -f environment.yaml
Once it is done installing, activate the new environment.
conda activate geospatial
Then launch JupyterLab:
jupyter lab
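Once JupyterLab is up, a quick sanity check I like to run in a notebook cell (my own habit, not part of the eScience instructions) is to confirm that the key packages import and report their versions:
import rioxarray
import geopandas
import pystac_client
# print the installed versions; the exact numbers will differ per machine
print(rioxarray.__version__, geopandas.__version__, pystac_client.__version__)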
Downloading the data for the exercise
Download the data provided by eScience through this link, and put the downloaded files into the geospatial-python/data folder we created.
Accessing the data
Let's create a notebook to write the script to access the data. First, we import the rioxarray package. Why rioxarray? It is a Python package that can read and process raster data, such as GeoTIFF and netCDF files, and it can clip, merge, and reproject rasters. It is an extension of the xarray package. We also import geopandas to work with vector data.
import rioxarray
import geopandas
# go to pystac to get the URL
# open https://pystac.readthedocs.io/en/stable/quickstart.html
# create a variable named api_url pointing to the STAC API endpoint
api_url = "https://earth-search.aws.element84.com/v0"
# PySTAC-Client (pystac-client) builds upon PySTAC library to add support for STAC APIs
from pystac_client import Client
# Open the client
client = Client.open(api_url)
# See the details of the data
client
# documentation for pystac client https://pystac-client.readthedocs.io/en/stable/
for collection in client.get_collections():
print(collection)
# create a variable called collection_id
# choose the third Sentinel collection
collection_id = "sentinel-s2-l2a-cogs"
# Create a point
# shapely is the Python library for working with vector geometries
from shapely.geometry import Point
# Create a variable called point and define its location
# First parameter is longitude, second parameter is latitude
point = Point(4.89, 52.37)
# Display the point
point
# intersect the collection with the point
# max_items = The maximum number of items to return from the search, even if there are more matching results
search = client.search(
    collections=[collection_id],
    intersects=point,
    max_items=10,
)
# inspect search results
search
# Inspect total matched results
search.matched()
# put the matched results into a variable called items
items = search.get_all_items()
# inspect the items
items
# Check the length of the items
len(items)
# loop over the items and print them
for item in items:
print(item)
# take the first item
item = items[0]
# check the date of the item
item.datetime
# see the geometry and the location of the item
item.geometry
# import shape, which converts a GeoJSON-like dict into a shapely geometry
from shapely.geometry import shape
# visualize the footprint of the data
shape(item.geometry)
# check the item properties
item.properties
# query: bbox, datetime, cloud cover
# define the bounding box by buffering the point by 0.01 degrees
bbox = point.buffer(0.01).bounds
# add datetime range
date_range = "2020-03-20/2020-03-30"
# cloud cover represented in percentage
query_cloud_cover = "eo:cloud_cover<10"
# repeat the search, this time using the bbox (a polygon) together with the date range and cloud cover filter
search = client.search(
    collections=[collection_id],
    bbox=bbox,
    datetime=date_range,
    query=[query_cloud_cover],
)
# see the number of matching images
search.matched()
# get all items (note that the previous items variable is overwritten)
items = search.get_all_items()
# check the length of items
len(items)
# save the search results in JSON format
items.save_object("search.json")
# can be saved in geojson as well if you wish
items.save_object("search.geojson")
# check the assets of the item
# result --> ['thumbnail', 'overview', 'info', 'metadata', 'visual', 'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12', 'AOT', 'WVP', 'SCL']
item.assets
# take the thumbnail asset
asset = item.assets["thumbnail"]
# show the asset's URL (href)
asset.href
# get the href (URL) of the B01 band asset and store it as b01_href
b01_href = item.assets["B01"].href
# open the data
b01 = rioxarray.open_rasterio(b01_href)
# open the b01 xarray details
b01
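# a few extra checks I find useful (my own addition, not part of the training material)
# coordinate reference system and array dimensions of band B01
print(b01.rio.crs)
print(b01.shape)
# quick-look plot of the single band (squeeze drops the band dimension)
b01.squeeze().plot(robust=True)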