[WIP] Add satellite image processing benchmark #1550
base: main
@@ -0,0 +1,117 @@
import datetime

import fsspec
import geojson
import planetary_computer
import pystac_client
import stackstac
import xarray as xr
from distributed import wait

def harmonize_to_old(data):
    """
    Harmonize new Sentinel-2 data to the old baseline.

    Parameters
    ----------
    data: xarray.DataArray
        A DataArray with four dimensions: time, band, y, x

    Returns
    -------
    harmonized: xarray.DataArray
        A DataArray with all values harmonized to the old
        processing baseline.
    """
    cutoff = datetime.datetime(2022, 1, 25)
    offset = 1000
    bands = [
        "B01",
        "B02",
        "B03",
        "B04",
        "B05",
        "B06",
        "B07",
        "B08",
        "B8A",
        "B09",
        "B10",
        "B11",
        "B12",
    ]

    old = data.sel(time=slice(cutoff))

    to_process = list(set(bands) & set(data.band.data.tolist()))
    new = data.sel(time=slice(cutoff, None)).drop_sel(band=to_process)

    new_harmonized = data.sel(time=slice(cutoff, None), band=to_process).clip(offset)
    new_harmonized -= offset

    new = xr.concat([new, new_harmonized], "band").sel(band=data.band.data.tolist())
    return xr.concat([old, new], dim="time")

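The clip-and-subtract step above can be illustrated on synthetic data. This is a minimal sketch (the array values here are made up); it shows how post-cutoff values, which carry a fixed 1000-count offset, are mapped back onto the pre-2022 scale:

```python
import numpy as np
import xarray as xr

# Post-cutoff Sentinel-2 data carries a fixed offset of 1000 counts.
offset = 1000
new_era = xr.DataArray(np.array([0.0, 500.0, 1500.0]), dims="x")

# clip() floors values below the offset, then subtracting the offset
# maps the data back onto the pre-2022 processing baseline.
harmonized = new_era.clip(offset) - offset
print(harmonized.values)  # [  0.   0. 500.]
```

Clipping first ensures no harmonized value goes negative, which is why the function clips before subtracting.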
def test_satellite_filtering(
    scale,
    client_factory,
    cluster_kwargs={
        "workspace": "dask-engineering-azure",
        "region": "westeurope",
        "wait_for_workers": True,
    },
    scale_kwargs={
        "small": {"n_workers": 10},
        "large": {"n_workers": 100},
    },
):
    with client_factory(
        **scale_kwargs[scale], **cluster_kwargs
    ) as client:  # noqa: F841
        catalog = pystac_client.Client.open(
            "https://planetarycomputer.microsoft.com/api/stac/v1",
            modifier=planetary_computer.sign_inplace,
        )

        # GeoJSON for region of interest is from https://github.com/isellsoap/deutschlandGeoJSON/tree/main/1_deutschland

Review comment: How many Sentinel-2 tiles does that represent? Do you have an idea?
Reply: That was more a question about the number of Sentinel-2 zones in UTM format, but never mind, it's already a good answer. If my calculation is correct, this already means half a TB of data to load.

        with fsspec.open(
            "https://raw.githubusercontent.com/isellsoap/deutschlandGeoJSON/main/1_deutschland/3_mittel.geo.json"
        ) as f:
            gj = geojson.load(f)

        # Flatten MultiPolygon to single Polygon
        coordinates = []
        for x in gj.features[0]["geometry"]["coordinates"]:
            coordinates.extend(x)
        area_of_interest = {
            "type": "Polygon",
            "coordinates": coordinates,
        }

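The flattening step can be sketched on a toy geometry (synthetic coordinates, not the Germany GeoJSON). It concatenates the ring lists of every MultiPolygon part into one list, which the benchmark then presents as a single Polygon:

```python
# Synthetic MultiPolygon with two parts; the loop mirrors the
# flattening in the benchmark above.
multi = {
    "type": "MultiPolygon",
    "coordinates": [
        [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]],
        [[[2.0, 2.0], [3.0, 2.0], [3.0, 3.0], [2.0, 2.0]]],
    ],
}

coordinates = []
for part in multi["coordinates"]:
    coordinates.extend(part)

area_of_interest = {"type": "Polygon", "coordinates": coordinates}
print(len(area_of_interest["coordinates"]))  # 2 rings
```

Strictly speaking this yields a Polygon whose extra rings would normally be holes; it works here as a rough area-of-interest approximation for the STAC search.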
        # Get stack items
        if scale == "small":
            time_of_interest = "2024-01-01/2024-09-01"
        else:
            time_of_interest = "2015-01-01/2024-09-01"

        search = catalog.search(
            collections=["sentinel-2-l2a"],
            intersects=area_of_interest,
            datetime=time_of_interest,
        )
        items = search.item_collection()

Review comment: Does Coiled have a spot where we could store a small parquet file? As written, this hits the search endpoint every time the benchmark runs, which isn't exercising Dask at all and has the potential to throw errors. We could instead do the search once and use stac-geoparquet to cache that result.
Reply: Yeah, I may need to do a little setup to give the CI runner access to an Azure bucket, but we can definitely do that.
Reply: I assume the parquet file is small enough that we could also put it into S3 if that's easier?
Reply: Ah, yeah, fair point.

        # Construct Xarray Dataset from stack items
        ds = stackstac.stack(items, assets=["B08", "B11"], chunksize=(400, 1, 256, 256))

Review comment: Here you could add complexity and select other bands for other indices like NDVI, NDSI, NDWI.
Reply: Do you have a suggestion for which indices are most common? I see NDVI come up a lot. Is there any computational difference between these indices, or are they all relatively straightforward reductions like the one we already have here?
Reply: Yeah, I see NDVI as the "Hello, World" of satellite image processing. There is no computational difference among these indices; it's always the same formula, you just switch the bands. You probably already know, but NDVI is for vegetation, NDSI for snow, NDWI for water.
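Since all these indices share the same (a - b) / (a + b) formula, a generic helper is easy to sketch. This is illustrative only; `normalized_difference` and the synthetic array are not part of the PR:

```python
import numpy as np
import xarray as xr

def normalized_difference(ds, band_a, band_b):
    """Generic (a - b) / (a + b) index. For Sentinel-2:
    NDVI uses (B08, B04), NDSI (B03, B11), NDWI (B03, B08)."""
    a = ds.sel(band=band_a)
    b = ds.sel(band=band_b)
    return (a - b) / (a + b)

# Tiny synthetic array with the (time, band, y, x) layout used above.
data = xr.DataArray(
    np.array([[[[3.0]], [[1.0]]]]),
    dims=("time", "band", "y", "x"),
    coords={"band": ["B08", "B04"]},
)
ndvi = normalized_difference(data, "B08", "B04")
print(float(ndvi))  # 0.5
```

Swapping the band pair is the only change needed to compute a different index, which matches the reviewer's point that there is no computational difference between them.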
        # See https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a#Baseline-Change
        ds = harmonize_to_old(ds)

        # Compute humidity index
        humidity = (ds.sel(band="B08") - ds.sel(band="B11")) / (
            ds.sel(band="B08") + ds.sel(band="B11")
        )
        result = humidity.groupby("time.month").mean()
        result = result.persist()

Review comment: It would be prettier to have […]
Reply: Yeah, you're right, this has to do with what […]
Reply: Nope, there is a thread about that on Pangeo Discourse, I think. Both probably have advantages and drawbacks...

Review comment: I would also love to see the plot, and whether it shows something meaningful, typically seasonality at least.
Reply: +1
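The monthly reduction is a plain groupby-mean over calendar months. A minimal local sketch of the same operation, with synthetic values and no cluster:

```python
import pandas as pd
import xarray as xr

times = pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10"])
humidity = xr.DataArray([0.25, 0.75, 0.5], dims="time", coords={"time": times})

# Group by calendar month and average, as in the benchmark's
# humidity.groupby("time.month").mean().
monthly = humidity.groupby("time.month").mean()
print(monthly.values)  # [0.5 0.5]
```

The result is indexed by a `month` coordinate, so at full scale it would expose seasonality, which is what the plot requested above would show.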
        wait(result)

Review comment: Interesting, I knew there was some version change, but I didn't know it could affect the processing. This adds a realistic complication, but it also makes the benchmark more complex; it might not be needed...