Episode on data access and parallelization #86

fnattino
Contributor

This is work-in-progress to address #82 .

@rbavery: I have added a notebook with a first sketch of what the episode on data access/parallelization could look like. Any feedback is more than welcome!

@rbavery
Collaborator
rbavery commented Jan 25, 2022

This is looking awesome!

Accessing Data Episode

I like the Objectives of this lesson. I think we can potentially split out the "Process satellite images in 'chunk' to take advantage of parallelization" objective into its own lesson. This would mean we'd have a lesson focused on Data Access, which ends around this cell:

import rioxarray

# ... or we can open them directly (and stream content only when necessary)
blue_band_href = assets["B02"].href
blue_band = rioxarray.open_rasterio(blue_band_href)
blue_band

and a separate Parallelizing Raster Computation with Dask lesson.

I think the final cell for the Access Data Episode could be saving out the raster with rioxarray. This would involve reassigning the CRS to the mosaicked xarray DataArray we produced with stackstac and then using the .rio.to_raster method. We can borrow from this example my colleague @alexmandel worked on: https://github.com/PacificCommunity/DigitalEarthPacific/blob/demo/writeraster/notebooks/demo/cloudless-mosaic-sentinel2.ipynb

Parallelizing Raster Computation with Dask

I love that you already cover guidelines on how to set the chunk size! An additional topic to cover here could be how to tell if your code is running faster with dask or without dask. For this we could cover using time, the dask profiler, or some other easily accessible profiling tool in jupyter notebooks. I think we should also have a section describing dask's lazy computation mode and how to take advantage of that to inspect metadata prior to downloading the actual scene data.
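
As a minimal sketch of the timing idea, the helper below wraps any callable with time.perf_counter and reports the best of a few runs. The names timed and sum_of_squares are hypothetical, and the workload is only a stand-in: in the episode, the timed call would be the raster computation run once with Dask and once without.

```python
import time


def timed(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times; return (result, best_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


def sum_of_squares(n):
    # Stand-in workload (hypothetical); replace with the raster computation to compare.
    return sum(i * i for i in range(n))


result, seconds = timed(sum_of_squares, 100_000)
print(f"result={result}, best of 3 runs: {seconds:.4f} s")
```

Taking the best of several runs reduces noise from other processes. In a notebook, the %%time cell magic gives a similar one-off measurement.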

For the Raster calculations portion, instead of Raster calculations using stackstac, I suggest we show how to mosaic a collection of scenes. There's stackstac's internal method, which just flattens: https://stackstac.readthedocs.io/en/latest/api/main/stackstac.mosaic.html#stackstac.mosaic

I think it would be valuable to show that solution as well as one for a median composite.
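
To make the difference between the two approaches concrete, here is a small NumPy sketch on made-up data. It is not stackstac's implementation: first_valid_mosaic and median_composite are hypothetical helpers, the three 2x2 "scenes" are invented, and NaN is assumed to mark nodata pixels.

```python
import numpy as np

# Three hypothetical 2x2 scenes of the same area; NaN marks nodata (e.g. clouds).
scenes = np.array([
    [[np.nan, 2.0], [3.0, np.nan]],
    [[1.0, 4.0], [np.nan, np.nan]],
    [[5.0, 6.0], [9.0, 7.0]],
])


def first_valid_mosaic(stack):
    """Flatten along axis 0, keeping the first non-NaN value per pixel."""
    mosaic = np.full(stack.shape[1:], np.nan)
    for scene in stack:
        # Fill only the pixels that are still empty.
        mosaic = np.where(np.isnan(mosaic), scene, mosaic)
    return mosaic


def median_composite(stack):
    """Per-pixel median over axis 0, ignoring NaN values."""
    return np.nanmedian(stack, axis=0)


mosaic = first_valid_mosaic(scenes)
median = median_composite(scenes)
print(mosaic)
print(median)
```

The flattening mosaic depends on scene order, while the median composite does not, which is one reason to show both to learners.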

Setup instructions will also need to be updated with new dependencies. I've seen the most success with not pinning specific versions to allow a more flexible solve for different machines: https://carpentries-incubator.github.io/geospatial-python/setup.html

A third episode focused on working with a cool looking mosaic could focus on xarray-spatial's raster calc funcs. One idea: computing spectral indices, thresholding them, and polygonizing the result (maybe areas with especially high NDVI): https://github.com/makepath/xarray-spatial
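
A rough sketch of the spectral index and thresholding steps, using plain NumPy rather than xarray-spatial so that it runs without extra dependencies. The reflectance values and the 0.5 threshold are assumptions chosen for illustration, and polygonizing the mask is left out.

```python
import numpy as np

# Hypothetical reflectance values for a 2x3 tile (red and near-infrared bands).
red = np.array([[0.10, 0.30, 0.05], [0.20, 0.25, 0.40]])
nir = np.array([[0.50, 0.30, 0.45], [0.40, 0.25, 0.10]])

# NDVI = (NIR - Red) / (NIR + Red), computed per pixel.
ndvi = (nir - red) / (nir + red)

# Boolean mask of pixels with especially high NDVI (threshold is an arbitrary choice).
high_ndvi = ndvi > 0.5

print(ndvi.round(2))
print(high_ndvi)
```

The resulting boolean mask is the kind of array that a polygonizing step would then turn into vector features.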

@rbavery
Collaborator
rbavery commented Jan 25, 2022

I also like the inclusion of the Dask task graph image. Other images of intermediate results, such as plots of the blue band, could be good to include prior to the final challenge. Also, when this gets formatted to the lesson markdown, I think we can create a set of tooltips that refer to other sources for folks to read up on COG, STAC, and Dask, while also briefly summarizing their utility for geospatial.

@fnattino fnattino marked this pull request as ready for review January 26, 2022 22:21
@fnattino
Contributor Author

Hi @rbavery , I have created a first version of a full data access episode. Basically, I have converted the Jupyter notebook that you already had a look at into a .md file and I have added some explanatory text in between the code blocks. Whenever you have time to review it, I would be happy to have any kind of feedback - thanks in advance!

I have also added a first exercise following up on your idea to have participants exploring a STAC catalog even before having the search tool introduced - what do you think about having it formulated in this way?

Still working on the second episode (on parallel raster computation with Dask).

@fnattino fnattino requested a review from rbavery January 26, 2022 22:21
@rbavery
Collaborator
rbavery commented Jan 26, 2022

@fnattino thanks I'll give this a review this evening

Collaborator
@rbavery rbavery left a comment

thanks for all your work on this @fnattino. I think it's close to being able to merge. I'm meeting with NASA DEVELOP folks next week and I think this is already in great shape to teach if there's time.

_episodes/XX-access-data.md (outdated, resolved)
Comment on lines 68 to 74
> ## Exercise: Discover a STAC catalog
> Open the following STAC API link using your web browser: https://earth-search.aws.element84.com/v0.
> Navigate through the links to find out which collections are available and how many scenes are indexed. Where may one
> find information on how to query the API for the desired scenes? Can you find out which parameters can be provided
> in the queries?
{: .challenge}

Collaborator

I think we might want to show learners a graphical tool to browse STAC Catalogs. This one shows the spatial extent and summarizes the information about any STAC catalog url you paste into it.

https://radiantearth.github.io/stac-browser

Contributor Author

Sounds great, and thanks for the tips - I hadn't yet seen the new STAC browser! Unfortunately the filtering tools do not seem to work with the Earth Search STAC API (maybe because this is an older STAC API version, 0.9), but I have read this is still a "demo" version, so things might be fixed soon. Anyway, for the purpose of the exercise, i.e. browsing through the items, it works very well!

# save processed image to disk
visual_clip.rio.to_raster("amsterdam_tci.tif", driver="COG")
~~~
{: .language-python}
Collaborator

this is awesome. I think we should end with a challenge so they can reproduce these steps and build some muscle memory for how interacting with a STAC API via pystac and then working with the result in rioxarray feels.

I think a good option would be to direct them to this STAC catalog and have them download data that intersects a specific lat, lon and date (specified in the challenge text): https://radiantearth.github.io/stac-browser/#/external/earth-search.aws.element84.com/v0/collections/landsat-8-l1-c1

the solution to the challenge could be to save a single band at that location and date to disk with rioxarray
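
To illustrate what "intersects a specific lat, lon and date" could translate to, here is a sketch that only builds the body of a STAC API item-search request, without contacting any server. The helper build_search_body, the coordinates, and the date are hypothetical; the collections, intersects, datetime, and limit fields are part of the STAC API item-search specification, and pystac_client's search accepts similarly named keyword arguments.

```python
import json


def build_search_body(collection, lon, lat, date, limit=10):
    """Return a STAC item-search body for scenes intersecting a point on one day."""
    return {
        "collections": [collection],
        # GeoJSON coordinates are ordered (longitude, latitude).
        "intersects": {"type": "Point", "coordinates": [lon, lat]},
        # Closed interval covering the whole day (UTC).
        "datetime": f"{date}T00:00:00Z/{date}T23:59:59Z",
        "limit": limit,
    }


# Hypothetical point and date; the challenge text would specify the real ones.
body = build_search_body("landsat-8-l1-c1", lon=4.89, lat=52.37, date="2021-02-01")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the API's search endpoint; the returned items then provide the asset links to open with rioxarray.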

Contributor Author

Good point! I have added a challenge using the Landsat 8 dataset. This collection unfortunately seems not to be continuously updated here (and at a certain point might be dropped?) so we might have to find new sources in future!

~~~
{: .language-python}

> ## Exercise: Discover a STAC catalog
Collaborator

Before this exercise, I think we can show an image of the Radiant Earth STAC browser to give people a visual of what information a STAC catalog contains. Looking at the lesson webpage, there is a lot of text before the first image, so I think this will make the first part of the lesson more engaging for someone who is browsing the lesson material or seeking out guidance on STAC.

Contributor Author

I have added a figure - the best "composition" I could come up with. If you have suggestions for improvements, let me know!

@rbavery
Collaborator
rbavery commented Jan 27, 2022

@fnattino thanks for addressing these reviews! Once this data access episode is finished, can we merge this PR and finish the parallelization episode in a separate PR? Feel free to merge this as is now; I or somebody else could add a challenge later unless you are already working on it.

@fnattino
Contributor Author

Hi @rbavery - thanks a lot for already having a look. I am finishing up the last challenge, I'll ping you as soon as I have pushed it!

@fnattino
Contributor Author

Hi @rbavery, this is it - I have added the final challenge.

I have also updated the setup instructions and the environment.yaml file, adding pystac_client to the dependencies.

Merging this first and opening a second PR for the parallelisation episode sounds good - I have removed the corresponding notebook from this branch.

One last thing: should this become episode 19? I could set the number and merge if this is alright with you. Really thanks a lot for all the feedback and suggestions!

@rbavery
Collaborator
rbavery commented Jan 27, 2022

Fantastic!!! Yes let's make this episode 19 for now. Really looking forward to teaching this! Lgtm feel free to merge.

@fnattino fnattino merged commit c57024e into carpentries-incubator:gh-pages Jan 28, 2022
@fnattino fnattino deleted the data-access branch January 28, 2022 07:50
@fnattino fnattino mentioned this pull request Apr 1, 2022
rogerkuou pushed a commit that referenced this pull request Aug 8, 2023