Support FLAIR Datamodule for semantic segmentation #2303

rbavery · 2024-09-18T16:42:01Z

Summary

I just stumbled across FLAIR, a high-resolution (0.2 m) semantic segmentation dataset with 19 land cover categories. The full description is at https://ignf.github.io/FLAIR/#FLAIR1. That page also references a U-Net baseline model trained on FLAIR.

It is described in an OGC Report on ML Engineering as "a comprehensive and high-quality collection of labeled satellite imagery aimed at advancing land cover classification and geospatial analysis tasks". It's maintained by the French National Institute of Geographic and Forest Information (IGN).

Rationale

I'm interested in composing a list of high quality, challenging datasets for benchmarking semantic segmentation and object detection models and at a glance FLAIR seems like one of them. It seems like the rigor of maintenance and description of the FLAIR dataset is high compared to other datasets and so I would like to see this available in torchgeo.

Also, I think I recall seeing that torchgeo would like to offer models that are inference-ready, not just pretrained with self-supervision but fine-tuned to address popular tasks in remote sensing. This U-Net model seems like a good start, but I can raise a separate issue for adding the model.

Implementation

I haven't contributed to torchgeo before but I would check out other PRs that added datamodules and pretrained models and follow that example.

Alternatives

No response

Additional information

This comment is on my mind: Clay-foundation/model#269 (comment)

I'd love for more challenging datasets than EuroSAT to be front and center when benchmarking models. I'd also like us to use datasets whose geographic and class distributions we understand well, and I like that FLAIR lays this out on their site.

I might be a bit slow to implement this but would like to submit a PR when I have time if this sounds like a good idea.

adamjstewart · 2024-09-20T11:41:11Z

Yes, we would love to have FLAIR in TorchGeo!

If you would like to take a stab at this, see https://torchgeo.readthedocs.io/en/stable/user/contributing.html#datasets for a list of files you'll need to modify. We have dozens of other semantic segmentation datasets you can base your code on. Let us know if you have any trouble with the testing and we would be happy to help!

MathiasBaumgartinger · 2024-09-30T17:29:52Z

Interestingly, I have just been working with FLAIR, and the maintainers happily gave permission to use their dataset in torchgeo. I am currently working on releasing it as a datamodule; however, there are some flaws in the current state of the FLAIR dataset that make it somewhat hard to integrate (see #2292).

adamjstewart · 2024-09-30T18:38:03Z

I thought we solved all the problems in #2292?

MathiasBaumgartinger · 2024-09-30T19:41:33Z

Absolutely my bad! I linked the wrong issue. I meant rasterio/rasterio#3178 and OSGeo/gdal#10820.

As discussed there, the FLAIR dataset currently provides a geographically underspecified mask dataset. While there is a workaround for this (for some reason I found that both a gdal_edit.py and a subsequent gdalwarp with a correct CRS are necessary), I spoke to one of the maintainers of the dataset about a possible fix on their side.
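A minimal sketch of that two-step workaround, assuming the masks should carry the same CRS as the aerial imagery (EPSG:2154, Lambert-93). The helper name `build_crs_fix_commands` and the file names are hypothetical. `-a_srs` and `-t_srs` are the standard `gdal_edit.py` and `gdalwarp` options, but the exact invocation discussed in the linked issues may differ.

```python
from pathlib import Path


def build_crs_fix_commands(mask: Path, out: Path, crs: str = "EPSG:2154") -> list[list[str]]:
    """Return the two GDAL command lines (hypothetical helper, not part of torchgeo)."""
    return [
        # Step 1: assign the CRS to the existing file in place.
        ["gdal_edit.py", "-a_srs", crs, str(mask)],
        # Step 2: write a new file through gdalwarp with the same target CRS.
        ["gdalwarp", "-t_srs", crs, str(mask), str(out)],
    ]


# File names are illustrative placeholders, not actual FLAIR file names.
commands = build_crs_fix_commands(Path("MSK_000001.tif"), Path("MSK_000001_fixed.tif"))
for cmd in commands:
    print(" ".join(cmd))
```

On a machine with GDAL installed, each list could be passed to `subprocess.run(cmd, check=True)`.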

My initial plan was to wait for the maintainers to fix the problem and release the module afterwards. However, if you think that would be appropriate, I could integrate the GDAL operations into the DataModule and create a PR right away.

adamjstewart · 2024-10-01T06:20:09Z

The images are already pre-chipped. Any reason you don't want to make a NonGeoDataset?

MathiasBaumgartinger · 2024-10-01T07:10:06Z

Not necessarily. I just thought if there is geo-information, I might as well make it accessible. Do you prefer it as NonGeoDataset?

adamjstewart · 2024-10-01T07:47:28Z

I find NonGeoDataset easier to use, especially if the geo information is corrupted, or if there are multiple CRSs in use. The only reason to use GeoDataset is if the images are not pre-chipped or if you need to combine them with other GeoDatasets.

MathiasBaumgartinger · 2024-10-01T08:30:50Z

Alright! Another thing that comes to mind is that the most recent release (FLAIR#2) includes preprocessed Sentinel-2 data alongside the aerial and mask images. A short summary of the changes mentioned in the datapaper:

> [There is a] strong difference in spatial resolution [...]. Therefore, in order to also provide a minimum of context from the satellite data, a buffer was applied to create super-areas.

Use of super-areas only

> in order to limit the size of the data and due to the wide extent of the dataset, only the super-areas were downloaded

Resampling

> the 20 m spatial resolution bands are first resampled during data retrieval to 10 m by the nearest interpolation method. Same approach is adopted for the cloud and snow masks

Removal of nodata pixels

> nodata pixels (reflectances at 0) [...] were removed

Reprojection

> subsequently reprojected into the Lambert-93 projection (EPSG:2154) which is the one of the aerial imagery.
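To make the resampling step concrete, here is a toy sketch of nearest-neighbour upsampling by an integer factor, as would apply when going from 20 m to 10 m. It is written in plain Python for illustration only and is not the FLAIR processing code; the function name `upsample_nearest` is hypothetical.

```python
def upsample_nearest(band: list[list[int]], factor: int = 2) -> list[list[int]]:
    """Repeat every pixel `factor` times along both axes (nearest-neighbour)."""
    out = []
    for row in band:
        # Repeat each value horizontally, then repeat the widened row vertically.
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


band_20m = [[1, 2], [3, 4]]            # toy 2x2 band at 20 m
band_10m = upsample_nearest(band_20m)  # 4x4 band at 10 m
```

Because no new values are interpolated, the same approach is safe for categorical layers such as the cloud and snow masks.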

Additional information

| Data Type | Naming | Shape |
| --- | --- | --- |
| ground truth | `SEN2_xxxx_data.npy` | $T \times C \times H \times W$ |
| snow/cloud masks | `SEN2_xxxx_masks.npy` | $T \times C \times H \times W$ |
| time series products | `SEN2_xxxx_products.txt` | - |
| json mapping | `flair-2_centroids_sp_to_patch.json` | - |

> Sentinel-2 super-areas (SEN2) data is composed of several elements - data, masks, products and a JSON file to match aerial and satellite imagery.
>
> [The JSON file] uses the aerial patch name (e.g., IMG 077413) as the key and provides a list of two indexes (e.g., [13,25]) that represent the data-coordinates of the aerial patch centroids
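As a rough sketch of how such a centroid mapping could be used, the snippet below computes a clamped crop window around a centroid inside a super-area. Several things here are assumptions: the `[row, col]` index order, the window size of 10, the super-area size of 40, and the exact key format. The helper `crop_bounds` is hypothetical.

```python
import json


def crop_bounds(centroid: list[int], size: int, height: int, width: int) -> tuple[int, int, int, int]:
    """Return (row0, row1, col0, col1) of a size x size window around the centroid.

    Assumes the centroid is stored as [row, col]; swap the indexes if the
    file actually uses [col, row].
    """
    row, col = centroid
    half = size // 2
    # Clamp the window so it stays inside the super-area.
    row0 = min(max(row - half, 0), height - size)
    col0 = min(max(col - half, 0), width - size)
    return row0, row0 + size, col0, col0 + size


# Toy stand-in for flair-2_centroids_sp_to_patch.json (key format is illustrative).
mapping = json.loads('{"IMG_077413": [13, 25]}')
bounds = crop_bounds(mapping["IMG_077413"], size=10, height=40, width=40)
```

With a $T \times C \times H \times W$ array `data`, the window would then be taken as `data[:, :, row0:row1, col0:col1]`.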

Given the considerations above, I suppose it would be better to include the Sentinel-2 data provided by the FLAIR maintainers rather than combining torchgeo's existing Sentinel-2 dataset with FLAIR through an intersection dataset. Any thoughts on that?

adamjstewart · 2024-10-01T09:15:47Z

Since the filenames and file formats are completely different from raw Sentinel-2 data, we would either have to create a new class for FLAIRSentinel2(Sentinel2) and use an intersection, or just use a NonGeoDataset to avoid all of that complexity.

agarioud · 2024-10-03T08:56:35Z

Hello,

As a maintainer of the FLAIR dataset at IGN, I greatly appreciate your effort in integrating our dataset.

Seeing this issue, I would like to give you some information about the release of a new version of FLAIR in the coming weeks. Among other things, it will spatially align multiple modalities, including Sentinel-2, patch-wise (so no more super-areas) and with common file formats. This new release will also be larger in scale (about 3 times the current size).

You might consider waiting until this new version is released before putting in the effort of integrating the FLAIR#2 Sentinel-2 imagery. Aerial imagery will stay in the same format.

We are happy to provide any information or support that can help in this effort.

MathiasBaumgartinger · 2024-10-03T09:05:02Z

@agarioud, nice to hear directly from you! I've already finished a reasonably working version, so the effort to finish everything (mainly documentation/cleanup) would be very small. Can you give me a more specific time frame in which you plan to release?

Also, are you willing to share the pre-processing steps applied on the Sentinel-2 data? In a released version I would like to be able to perform those processing steps on other Sentinel-2 data as well to achieve maximum performance during prediction in unseen areas.

agarioud · 2024-10-03T09:26:13Z

Unfortunately, I cannot give you an exact time frame. We are currently preparing the data and, as with the previous releases, we would like to add some documentation to it. I'll notify you as soon as we have more visibility.

Regarding pre-processing of Sentinel-2, we apply the following steps: we use BOA L2A data, cropped to the aerial patch extent (512 px at 0.2 m, so 102.4 × 102.4 m), and resample the Sentinel-2 data to 10.24 m to obtain 10 × 10 pixel patches. For each patch, we stack the 10 spectral bands of every date together (i.e., with 38 acquisitions, the patch has 380 channels) to reduce the inode footprint of the dataset. If you need more precise information you can contact me: [email protected]
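The numbers in this description can be checked with a short sketch. It assumes the stacked channel axis is ordered date by date (all 10 bands of the first acquisition, then the second, and so on), which is not stated above. The helper names are hypothetical.

```python
BANDS_PER_DATE = 10  # spectral bands stacked per acquisition, as described above


def patch_side_pixels(aerial_pixels: int, aerial_res_m: float, target_res_m: float) -> int:
    """Number of Sentinel-2 pixels along one side of an aerial patch."""
    extent_m = aerial_pixels * aerial_res_m  # 512 px * 0.2 m = 102.4 m
    return round(extent_m / target_res_m)


def split_dates(stacked: list, bands: int = BANDS_PER_DATE) -> list[list]:
    """Regroup a flat channel axis into one list of `bands` channels per date.

    Assumes date-major ordering of the stacked channels.
    """
    if len(stacked) % bands:
        raise ValueError("channel count is not a multiple of the band count")
    return [stacked[i:i + bands] for i in range(0, len(stacked), bands)]


side = patch_side_pixels(512, 0.2, 10.24)
channels = list(range(380))  # stand-in for the 380 stacked channels
per_date = split_dates(channels)
```

With an array library, the same regrouping would typically be a reshape of the channel axis from 380 to (38, 10), under the same ordering assumption.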

Also, we will release a new batch of pre-trained models on the new dataset on our HuggingFace IGNF page.

agarioud · 2024-10-03T09:29:33Z

I forgot to say that we store snow and cloud masks as separate files. Also, the acquisition dates are stored in a JSON file for each area.

adamjstewart · 2024-10-03T09:38:55Z

Do you think it would be useful to have a single dataset with a version parameter that allows users to choose which version of the dataset they want? I'm guessing this would primarily be useful for historical reasons (to compare against papers that used v1). Could also have a base class with subclasses for each version, but it sounds like the name is the same. I guess it depends on how similar the file structure is and if the only difference is simply the total number of images. Either way, from the TorchGeo side, I'm happy with multiple versions of FLAIR if it isn't too much work to support.

agarioud · 2024-10-03T09:44:48Z

The new release will include all previous areas and data but extend to other areas and other modalities. As such, I don't think versioning is necessary. Perhaps an 'area/patch' selection corresponding to the FLAIR#1 and #2 versions would be enough?

That being said, if one would like to include the super-areas of FLAIR#2, this would need some specific dataloading.

rbavery · 2024-10-03T18:03:15Z

> I find NonGeoDataset easier to use, especially if the geo information is corrupted, or if there are multiple CRSs in use. The only reason to use GeoDataset is if the images are not pre-chipped or if you need to combine them with other GeoDatasets.

The Clay model has a location encoder and could utilize the geographic information. I think a GeoDataset would be more valuable in the long run for models that can accept inputs beyond images. It also provides useful context for sampling and evaluation.

MathiasBaumgartinger · 2024-10-04T08:40:18Z

> > I find NonGeoDataset easier to use, especially if the geo information is corrupted, or if there are multiple CRSs in use. The only reason to use GeoDataset is if the images are not pre-chipped or if you need to combine them with other GeoDatasets.
>
> The Clay model has a location encoder and could utilize the geographic information. I think a GeoDataset would be more valuable in the long run for models that can accept inputs beyond images. It also provides useful context for sampling and evaluation.

That was pretty much my initial thought. Lots of research is trying to use information beyond just color channels.

So, to conclude: I will create a first pull request using a NonGeoDataset/NonGeoDataModule for the current version 2 of FLAIR. I will let the maintainers decide whether to merge it or wait for the newly released version. If it is left unmerged for now, people may still cherry-pick it.

In any case I will try my best to update the module ASAP once @agarioud and his team release the new FLAIR dataset with properly specified CRS on the masks.

adamjstewart · 2024-10-04T09:32:49Z

Note that it is possible to return lat/long coords from a NonGeoDataset. The difference (in my mind) between that and a GeoDataset is storing all bounding boxes in a spatiotemporal R-tree. This can be slower, but makes it easier to sample small patches from large tiles or to combine the dataset with other GeoDatasets.
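A minimal, library-free sketch of that idea: a map-style dataset that returns coordinates with every sample but builds no spatial index. The `Patch` record, the class name, and the sample keys are hypothetical. In TorchGeo the class would presumably subclass `NonGeoDataset` and return tensors.

```python
from dataclasses import dataclass


@dataclass
class Patch:
    """One pre-chipped sample (hypothetical record, not a torchgeo type)."""

    path: str
    lon: float
    lat: float


class PatchDatasetWithCoords:
    """Map-style dataset that returns coordinates with each sample.

    No spatial index is built: samples are addressed by integer position only.
    """

    def __init__(self, patches: list[Patch]) -> None:
        self.patches = patches

    def __len__(self) -> int:
        return len(self.patches)

    def __getitem__(self, index: int) -> dict:
        patch = self.patches[index]
        # A real implementation would load the image and mask from patch.path.
        return {"path": patch.path, "lon": patch.lon, "lat": patch.lat}


# Illustrative file name and coordinates.
dataset = PatchDatasetWithCoords([Patch("IMG_000001.tif", 2.35, 48.85)])
sample = dataset[0]
```

A location-aware model can read `sample["lon"]` and `sample["lat"]` directly. What is lost compared with a GeoDataset is the ability to query by bounding box or intersect with other datasets.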

adamjstewart · 2024-10-31T13:54:45Z

I think @nilsleh needs a FLAIR data loader for some of his work. From our side, we would love to see a version 1 data loader in the near future that can later be converted to a version 2 once the new dataset is released.

MathiasBaumgartinger · 2024-11-02T09:32:44Z

Hi! I have had a packed schedule the last few weeks. I think I can refine my first draft and create a first PR for review tomorrow.

MathiasBaumgartinger · 2024-11-04T23:28:57Z

FYI: I have been working on the FLAIR dataset yesterday and today. However, integrating the Sentinel-2 data (which I have not used before) sadly turned out to be far more complicated than I had hoped.

You can see my progress at: https://github.com/MathiasBaumgartinger/torchgeo

rbavery · 2024-11-05T02:54:43Z

What were the challenges?

MathiasBaumgartinger · 2024-11-05T21:55:19Z

Well, what took me the most time was a classic y, x instead of x, y ordering mistake 😅. EDIT: other challenges are described in the PR.

Happy to share a first draft: #2394 📨

rbavery changed the title ~~Support Datamodule and from the FLAIR project for semantic segmentation~~ Support FLAIR Datamodule for semantic segmentation Sep 18, 2024

adamjstewart added datasets Geospatial or benchmark datasets datamodules PyTorch Lightning datamodules labels Sep 20, 2024

MathiasBaumgartinger mentioned this issue Nov 5, 2024

FLAIR#2 Dataset and Datamodule Integration #2394

