Xarray Converter Fails With A List Of ZIP Files Containing A Single-file ZIP

Introduction

earthkit.data is a Python library for retrieving and converting weather and climate data. This article looks at one specific failure: calling to_xarray() on a source made up of several ZIP archives when one of those archives contains only a single file. The problem shows up when merging data from multiple years.

What happened?

The code snippet below raises a ValueError. It succeeds if the year 2025 is excluded, and it also succeeds if 2025 is the only year requested. This points to a bug in earthkit.data that is triggered when a list of ZIP archives is opened and one of them contains a single file.

import earthkit.data

collection_id = "reanalysis-oras5"
request = {
    "product_type": ["operational"],
    "vertical_resolution": "single_level",
    "variable": ["ocean_heat_content_for_the_upper_300m"],
    "year": ["2023", "2024", "2025"],
    "month": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"],
}

ds = earthkit.data.from_source("cds", collection_id, **request, split_on="year").to_xarray()

What are the steps to reproduce the bug?

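To reproduce the bug, follow these steps:

  1. Install the earthkit.data library using pip: pip install earthkit-data
  2. Use a request that, with split_on="year", produces one ZIP archive per year, where at least one archive contains a single file (here the years 2023, 2024 and 2025, with 2025 apparently being the single-file case)
  3. Run the code snippet above. The "cds" source needs valid CDS API credentials, and a different collection_id or request will only reproduce the bug if the downloaded archives have the same layout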
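
To see which archive is the single-file one, it helps to count the members of each ZIP. The following is a minimal sketch using only the standard library; the helper names `zip_member_names` and `make_demo_zip` and the demo file names are hypothetical and not part of earthkit.data. It builds two throwaway archives so that it runs without any download.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_member_names(path):
    """Return the names of the regular files stored in a ZIP archive."""
    with zipfile.ZipFile(path) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


def make_demo_zip(path, names):
    """Write a small ZIP with placeholder members (demo data only)."""
    with zipfile.ZipFile(path, "w") as archive:
        for name in names:
            archive.writestr(name, b"placeholder")


# Build two throwaway archives: one with several members, one with a single member.
demo_dir = Path(tempfile.mkdtemp())
multi_zip = demo_dir / "multi.zip"
single_zip = demo_dir / "single.zip"
make_demo_zip(multi_zip, ["a_01.nc", "a_02.nc", "a_03.nc"])
make_demo_zip(single_zip, ["b_01.nc"])

for archive_path in (multi_zip, single_zip):
    members = zip_member_names(archive_path)
    kind = "single-file" if len(members) == 1 else "multi-file"
    print(f"{archive_path.name}: {len(members)} member(s), {kind}")
```

To apply this to real downloads, point `zip_member_names` at the archives in the earthkit.data cache; where those live depends on the local cache configuration.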

Version

The version of the earthkit.data library used in this example is 0.13.1.

Platform (OS and architecture)

The platform used to run this example is a MacBook-Pro with Darwin 24.3.0, running on an arm64 architecture.

Relevant log output

The relevant log output from the error message is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 12
      3 collection_id = "reanalysis-oras5"
      4 request = {
      5     "product_type": ["operational"],
      6     "vertical_resolution": "single_level",
   (...)      9     "month": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"],
     10 }
---> 12 ds = earthkit.data.from_source("cds", collection_id, **request, split_on="year").to_xarray()

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/sources/multi.py:109, in MultiSource.to_xarray(self, **kwargs)
    108 def to_xarray(self, **kwargs):
--> 109     return make_merger(self.merger, self.sources).to_xarray(**kwargs)

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/mergers/__init__.py:109, in DefaultMerger.to_xarray(self, **kwargs)
    106 def to_xarray(self, **kwargs):
    107     from .xarray import merge
--> 109     return merge(
    110         sources=self.sources,
    111         paths=self.paths,
    112         reader_class=self.reader_class,
    113         **kwargs,
    114     )

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/mergers/xarray.py:75, in merge(sources, paths, reader_class, **kwargs)
     73 if paths is not None:
     74     if reader_class is not None and hasattr(reader_class, "to_xarray_multi_from_paths"):
---> 75         return reader_class.to_xarray_multi_from_paths(
     76             paths,
     77             **options,
     78         )
     80     LOG.debug(f"xr.open_mfdataset with options={options}")
     81     return xr.open_mfdataset(paths, **options)

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/readers/netcdf/__init__.py:73, in NetCDFReader.to_xarray_multi_from_paths(cls, paths, **kwargs)
     70 if not options:
     71     options = dict(**kwargs)
---> 73 return xr.open_mfdataset(
     74     paths,
     75     **options,
     76 )

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/xarray/backends/api.py:1634, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
   1631     open_ = open_dataset
   1632     getattr_ = getattr
-> 1634 datasets = [open_(p, **open_kwargs) for p in paths1d]
   1635 closers = [getattr_(ds, "_close") for ds in datasets]
   1636 if preprocess is not None:

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/xarray/backends/api.py:1634, in <listcomp>(.0)
   1631     open_ = open_dataset
   1632     getattr_ = getattr
-> 1634 datasets = [open_(p, **open_kwargs) for p in paths1d]
   1635 closers = [getattr_(ds, "_close") for ds in datasets]
   1636 if preprocess is not None:

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/xarray/backends/api.py:667, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    664     kwargs.update(backend_kwargs)
    666 if engine is None:
--> 667     engine = plugins.guess_engine(filename_or_obj)
    669 if from_array_kwargs is None:
    670     from_array_kwargs = {}

File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/xarray/backends/plugins.py:194, in guess_engine(store_spec)
    186 else:
    187     error_msg = (
    188         "found the following matches with the input file in xarray's IO "
    189         f"backends: {compatible_engines}. But their dependencies may not be installed, see:\n"
    190         "https://docs.xarray.dev/en/stable/user-guide/io.html \n"
    191         "https://docs.xarray.dev/en/stable/getting-started-guide/installing.html"
    192     )
--> 194 raise ValueError(error_msg)

ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'scipy', 'cfgrib', 'earthkit']. Consider explicitly selecting one of the installed engines via the ``engine`` parameter, or installing additional IO dependencies, see:
https://docs.xarray.dev/en/stable/getting-started-guide/installing.html
https://docs.xarray.dev/en/stable/user-guide/io.html
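
The final ValueError means xarray could not match one of the paths it received to any installed backend. One way to investigate, sketched here on the assumption that the offending path is a regular file on disk, is to read its first bytes and compare them with well-known format signatures. The helper `sniff_format` is hypothetical; the signatures are the standard magic numbers for ZIP, HDF5 (used by netCDF-4), classic netCDF and GRIB.

```python
import tempfile
from pathlib import Path

# Leading-byte signatures of formats relevant to this traceback.
SIGNATURES = [
    (b"PK\x03\x04", "zip"),
    (b"PK\x05\x06", "zip (empty archive)"),
    (b"\x89HDF\r\n\x1a\n", "hdf5/netcdf4"),
    (b"CDF", "netcdf classic"),
    (b"GRIB", "grib"),
]


def sniff_format(path):
    """Guess a file's format from its first bytes; 'unknown' if nothing matches."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


# Demo with throwaway files that contain only a signature plus padding.
demo_dir = Path(tempfile.mkdtemp())
fake_zip = demo_dir / "archive.bin"
fake_zip.write_bytes(b"PK\x03\x04" + b"\x00" * 16)
fake_nc = demo_dir / "data.bin"
fake_nc.write_bytes(b"\x89HDF\r\n\x1a\n" + b"\x00" * 16)

print(sniff_format(fake_zip))
print(sniff_format(fake_nc))
```

If one of the paths handed to xr.open_mfdataset reports a ZIP signature instead of a netCDF one, that would be consistent with an archive reaching xarray without being unpacked. The report itself does not confirm that this is what happens.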

Accompanying data

No accompanying data is provided for this issue.

Questions and answers

Q: What is the issue with the xarray converter?

A: to_xarray() fails when earthkit.data merges a list of ZIP archives and one of them contains a single file. In the example, the same request succeeds when 2025 is excluded or when 2025 is requested on its own.

Q: What are the steps to reproduce the bug?

A: Install earthkit.data with pip (pip install earthkit-data), then run the code snippet shown under "What happened?" above. The request must produce one ZIP archive per year via split_on="year", with at least one archive containing a single file; the years 2023, 2024 and 2025 of the "reanalysis-oras5" collection do so in the report.

Q: What is the error message?

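A: xarray raises a ValueError from guess_engine because none of its installed backends recognises one of the paths it is given: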

ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'scipy', 'cfgrib', 'earthkit']. Consider explicitly selecting one of the installed engines via the ``engine`` parameter, or installing additional IO dependencies, see:
https://docs.xarray.dev/en/stable/getting-started-guide/installing.html
https://docs.xarray.dev/en/stable/user-guide/io.html

Q: How to fix the issue?

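A: The report does not include a confirmed fix. Two things to try:

  1. Check that the netcdf4 library is installed: pip install netcdf4. The error message already lists netcdf4 among the installed backends, so a missing package is unlikely to be the cause here.
  2. Set the engine parameter to netcdf4 in the to_xarray() method. The traceback shows keyword arguments being forwarded to xr.open_mfdataset, so this skips xarray's engine guessing, but it will not help if the path that reaches xarray is not a netCDF file:
ds = earthkit.data.from_source("cds", collection_id, **request, split_on="year").to_xarray(engine="netcdf4")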
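
Because the report states that the request works when 2025 is fetched alone, another option is to convert each year separately and combine the results afterwards. The following is an untested sketch of that idea, not a confirmed fix. The helpers `per_year_requests` and `load_years_separately` are hypothetical, and the sketch assumes that every single-year request converts cleanly (the report confirms this only for 2025) and that `xr.combine_by_coords` can align the per-year datasets.

```python
def per_year_requests(request):
    """Split one request into a list of requests, one per entry in request['year']."""
    return [{**request, "year": [year]} for year in request["year"]]


def load_years_separately(collection_id, request):
    """Convert each year on its own, then combine the results with xarray.

    Untested sketch: assumes every single-year request converts cleanly and
    that the per-year datasets share coordinates xarray can align.
    """
    # Imported here so per_year_requests stays usable without these packages.
    import earthkit.data
    import xarray as xr

    datasets = [
        earthkit.data.from_source("cds", collection_id, **single).to_xarray()
        for single in per_year_requests(request)
    ]
    return xr.combine_by_coords(datasets)


request = {
    "product_type": ["operational"],
    "vertical_resolution": "single_level",
    "variable": ["ocean_heat_content_for_the_upper_300m"],
    "year": ["2023", "2024", "2025"],
    "month": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"],
}

# Show how the request is split; no download happens here.
for single in per_year_requests(request):
    print(single["year"])
```

Calling `load_years_separately("reanalysis-oras5", request)` would perform the downloads and needs CDS API credentials. Splitting avoids the code path that merges several archive paths in one call, at the cost of holding one dataset per year before combining.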

Q: What are the possible causes of the issue?

A: The possible causes of the issue are:

  1. Missing dependencies: xarray needs a backend such as netcdf4 to open netCDF files. This is unlikely here, because the error message lists netcdf4, scipy, cfgrib and earthkit as installed backends.
  2. Incorrect configuration: a misconfigured earthkit.data setup could in principle hand xarray the wrong paths, but nothing in the report points to a configuration problem.
  3. Bug in the library: the request works without 2025 and with 2025 alone, which suggests that earthkit.data mishandles a list of archives when one of them contains a single file. This is the explanation most consistent with the report.
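
The first cause can be ruled in or out quickly. A minimal sketch using only the standard library checks whether the relevant modules can be located in the current environment. The helper `is_importable` is hypothetical, and the module names are the usual import names of the packages behind the backends listed in the error message.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the import system can locate the module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return False


for name in ("xarray", "netCDF4", "scipy", "cfgrib", "earthkit.data"):
    status = "available" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If every module is reported as available and the error persists, a missing dependency can be set aside as the explanation.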

Q: How to report the issue?

A: To report the issue, you can:

  1. Search the earthkit.data issue tracker on GitHub to see if the issue is already reported.
  2. If the issue is not reported, create a new issue on the library's GitHub page.
  3. Provide as much information as possible about the issue, including the code snippet that causes the issue, the full error message, the library version, the platform, and any relevant configuration options.
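
The version and platform details for a report can be collected programmatically. The following is a minimal sketch using the standard library; the helper names `package_version` and `environment_report` are hypothetical, and the default distribution names are assumptions about what is worth listing for this bug.

```python
import platform
import sys
from importlib import metadata


def package_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def environment_report(distributions=("earthkit-data", "xarray", "netCDF4")):
    """Collect platform and package details worth pasting into a bug report."""
    report = {
        "python": sys.version.split()[0],
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }
    for name in distributions:
        report[name] = package_version(name) or "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

On the machine described in this report, the system, release and machine fields would correspond to Darwin, 24.3.0 and arm64.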