Geospatial Data Processing using Python [Part 1-Accessing The Dataset via STAC]

6 min readMay 15, 2023

Geospatial analysis has been increasing significantly in the recent couple of years. Analyzing large datasets has been a new norm for geospatial processing. Just around 10 years ago, the geospatial sector is hugely dependent on licensed software. There are seeds of new applications, such as QGIS, which started to be developed in 2002. But it only came into my knowing around 2012ish, 10 years after its establishment. Fast forward, open-source, scripting and programming have been widely used in geospatial practice.

Maybe you are as worried as me. As someone who tries to transition to programming, I am a beginner, super beginner. The wave came so fast it starts to be overwhelming. Everyone seems so great at dealing with that. People start to do lots of cool stuff with the available tool. My circle seems so graceful and swift in transitioning into programming. To some extent, I felt left behind. (huhu malah curhat)

I have no option aside from adapting and keep learning as much as I can. There are many tools and information available on the internet, gladly. Even though reading the article on the internet is not equal to understanding the meaning of that article. Let’s give it a try. Hence, Medium is my only place to document things that I tried to learn. Many of my previous articles are on how to install stuff, how to download stuff, etc. Maybe it is the time to do something that I am afraid of the most. Let me document several steps for doing GIS stuff in Python (or maybe whatever programming language I bumped into).

A couple of weeks ago, I got training from eScience for an introduction to geospatial raster and vector data with Python. I will be sharing the knowledge that I acquired during the session.

In this article, I am trying to do clipping using Jupyter Notebook. This is a simple feature in QGIS or ArcGIS. But in case you’re managing a lot of data and do not want to see the software crashing, Python can be one of the options. The steps and setup can be seen in this link from eScience, which is open to the public and accessible to all. They have many interesting training available on their website.

Installation

The installation aims to have rioxarray package, which is a powerful tool for raster analysis and geopandas for vector analysis.

Download and install Anaconda
Setup the directory and get the data

a. Open the terminal/shell (Anaconda Prompt) on Windows

The Anaconda Prompt (type Anaconda Prompt directly from the Start Menu)

b. Change the work directory

Create a new directory on your computer called geospatial-python and change into it:

mkdir geospatial-python
 cd geospatial

mkdir data
 cd data

create the directory to geospatial-python\data

c. Create a New Environment

First, test whether the conda is working by typing conda

conda

Create the new conda environment. Well this is the first approach to try. In my case, this script failed to run in my computer.

conda create -n geospatial -c conda-forge -y \
   python=3.10 jupyterlab numpy matplotlib \
   xarray rasterio geopandas rioxarray earthpy descartes xarray-spatial pystac-client python-graphviz

Then I moved to another approach suggested in the guidance by using environment.yaml file. Download the yaml file and save it ingeospatial-python/data and run this following script.

conda env create -f environment.yaml

Once it is done installing, activate the new environment.

conda activate geospatial

If the installation successful, then the bracket on the left will change from “base” to “geospatial”. The (geospatial) on the left indicates that we are in the virtual environment

Then run the jupyter lab.

jupyter lab

Type jupyter lab to launch the jupyter notebook

The jupyter lab will be automatically opened in our browser

Downloading the data for exercise

Download the data that is provided by the eScience through this link. Put the downloaded files into the folder we created geospatial-python/data

Once you download and put the data in the directory, the files will be displayed automatically on the left corner of the Jupyter notebook

Accessing the data

Let’s create a notebook to write the script to access the data. First we import rioxarray package. Why rioxarray? It is a Python package that is able to read and process raster data, such as GeoTiff and netCDF files. It is capable to clip, merge and reproject rasters. It is an extension of xarray package. Also, import geopandas to work with vector data.

import rioxarray
import geopandas

I create a notebook named 01-data-access.ipynb and import rioxarray and geopandas

# go to pystac to get the URL
# open https://pystac.readthedocs.io/en/stable/quickstart.html
# create a variable named api_url to direct us to the STAC's API

api_url = "https://earth-search.aws.element84.com/v0"

# PySTAC-Client (pystac-client) builds upon PySTAC library to add support for STAC APIs 

from pystac_client import Client

# Open the client

client = Client.open(api_url)

# See the details of the data

client

# documentation for pystac client https://pystac-client.readthedocs.io/en/stable/

for collection in client.get_collections():
  print(collection)

# create variable called collection_id
# choose the third Sentinel collection

collection_id = "sentinel-s2-l2a-cogs"

# Create a point
# shapely is the library in python to work with a vector

from shapely.geometry import Point

# Create a variable called a point and define its location
# First parameter is longitude, second parameter is latitude

point = Point(4.89, 52.37)

# Display the point
point

# intersect the collection with the point
# max_items = The maximum number of items to return from the search, even if there are more matching results

search = client.search(
    collections= [collection_id],
    intersects=point,
    max_items=10,)

# inspect search results
search

# Inspect total matched results
search.matched()

# Put the matched result into variable called items

items = search.get_all_items()

# check up the items

items

# Check the length of the items

len(items)

# Looping

for item in items:
    print(item)

# called the first item

item = items[0]

# check the date of the item

item.datetime

# see the geometrynya and the location of the item

item.geometry

# import another shape (polygon)

from shapely.geometry import shape

#visualize the footprint of the data

shape(item.geometry)

# check the item properties 

item.properties

# query: bbox, datetime, cloud cover
# the bounding box defined with buffer 0.01

bbox = point.buffer(0.01).bounds

# add datetime range

date_range = "2020-03-20/2020-03-30"

# cloud cover represented in percentage

query_cloud_cover = "eo:cloud_cover<10"

# repeat the intersection with bbox (polygon)

search = client.search(
    collections=[collection_id],
    bbox=bbox,
    datetime=date_range,
    query=[query_cloud_cover]
)

# see the matching image

search.matched()

# get all items (beware the items are overwritten)

items = search.get_all_items()

# check the length of items

len(items)

# save the data that we have searched in json format

items.save_object("search.json")

# can be saved in geojson as well if you wish

items.save_object("search.geojson")

# check the asset of the item
# result --> ['thumbnail', 'overview', 'info', 'metadata', 'visual', 'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12', 'AOT', 'WVP', 'SCL'])

item.assets

# take the thumbnail attribute 

asset = item.assets["thumbnail"]

# show the result
asset.href

# href is the data image stored as b01_href

b01_href = item.assets["B01"].href

# open the data

b01 = rioxarray.open_rasterio(b01_href)

# open the b01 xarray details

b01

Geospatial Data Processing using Python [Part 1-Accessing The Dataset via STAC]

Installation

Downloading the data for exercise

Accessing the data

Written by Sry Handini Puteri