Xarray Converter Fails With A List Of ZIP Files Containing A Single-file ZIP
Introduction
The earthkit.data library is a tool for working with Earth observation data. This article describes a specific issue that arises when calling the to_xarray() method on a list of ZIP files that includes a single-file ZIP. The issue is particularly problematic when merging data from multiple years.
What happened?
The code snippet below fails to execute, but it works if the year 2025 is excluded or if only 2025 is selected. This suggests a bug in the earthkit.data library that is triggered when opening a list of ZIP files that includes a single-file ZIP.
import earthkit.data

collection_id = "reanalysis-oras5"
request = {
    "product_type": ["operational"],
    "vertical_resolution": "single_level",
    "variable": ["ocean_heat_content_for_the_upper_300m"],
    "year": ["2023", "2024", "2025"],
    "month": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"],
}

ds = earthkit.data.from_source("cds", collection_id, **request, split_on="year").to_xarray()
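Since the request succeeds without 2025 and with 2025 alone, one way to narrow the problem down is to split the request by hand and load each year on its own. The sketch below only builds the per-year requests. The helper name split_request is hypothetical and not part of earthkit.data, and the actual download is left as a comment because it needs CDS access.

```python
from copy import deepcopy


def split_request(request, key="year"):
    """Return one request dict per value of `key` (hypothetical helper)."""
    values = request[key]
    if isinstance(values, str):
        values = [values]
    parts = []
    for value in values:
        part = deepcopy(request)  # leave the original request untouched
        part[key] = [value]
        parts.append(part)
    return parts


request = {
    "product_type": ["operational"],
    "vertical_resolution": "single_level",
    "variable": ["ocean_heat_content_for_the_upper_300m"],
    "year": ["2023", "2024", "2025"],
    "month": [f"{m:02d}" for m in range(1, 13)],
}

for part in split_request(request):
    print(part["year"])
    # With CDS access configured, each part could be loaded separately:
    # ds = earthkit.data.from_source("cds", "reanalysis-oras5", **part).to_xarray()
```

Loading the years one at a time shows which single request behaves differently from the others before any merging takes place.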
What are the steps to reproduce the bug?
To reproduce the bug, follow these steps:
- Install the earthkit.data library using pip: pip install earthkit-data
- Request several years (e.g., 2023, 2024, 2025) with split_on="year", as in the code snippet above
- Run the code snippet above, replacing the collection_id and request variables with your own values if needed
Version
The version of the earthkit.data library used in this example is 0.13.1.
Platform (OS and architecture)
The platform used to run this example is a MacBook-Pro with Darwin 24.3.0, running on an arm64 architecture.
Relevant log output
The relevant log output from the error message is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[2], line 12
3 collection_id = "reanalysis-oras5"
4 request = {
5 "product_type": ["operational"],
6 "vertical_resolution": "single_level",
(...) 9 "month": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"],
10 }
---> 12 ds = earthkit.data.from_source("cds", collection_id, **request, split_on="year").to_xarray()
File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/sources/multi.py:109, in MultiSource.to_xarray(self, **kwargs)
108 def to_xarray(self, **kwargs):
--> 109 return make_merger(self.merger, self.sources).to_xarray(**kwargs)
File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/mergers/__init__.py:109, in DefaultMerger.to_xarray(self, **kwargs)
106 def to_xarray(self, **kwargs):
107 from .xarray import merge
--> 109 return merge(
110 sources=self.sources,
111 paths=self.paths,
112 reader_class=self.reader_class,
113 **kwargs,
114 )
File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/mergers/xarray.py:75, in merge(sources, paths, reader_class, **kwargs)
73 if paths is not None:
74 if reader_class is not None and hasattr(reader_class, "to_xarray_multi_from_paths"):
---> 75 return reader_class.to_xarray_multi_from_paths(
76 paths,
77 **options,
78 )
80 LOG.debug(f"xr.open_mfdataset with options={options}")
81 return xr.open_mfdataset(paths, **options)
File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/earthkit/data/readers/netcdf/__init__.py:73, in NetCDFReader.to_xarray_multi_from_paths(cls, paths, **kwargs)
70 if not options:
71 options = dict(**kwargs)
---> 73 return xr.open_mfdataset(
74 paths,
75 **options,
76 )
File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/xarray/backends/api.py:1634, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
1631 open_ = open_dataset
1632 getattr_ = getattr
-> 1634 datasets = [open_(p, **open_kwargs) for p in paths1d]
1635 closers = [getattr_(ds, "_close") for ds in datasets]
1636 if preprocess is not None:
File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/xarray/backends/api.py:1634, in <listcomp>(.0)
1631 open_ = open_dataset
1632 getattr_ = getattr
-> 1634 datasets = [open_(p, **open_kwargs) for p in paths1d]
1635 closers = [getattr_(ds, "_close") for ds in datasets]
1636 if preprocess is not None:
File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/xarray/backends/api.py:667, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
664 kwargs.update(backend_kwargs)
666 if engine is None:
--> 667 engine = plugins.guess_engine(filename_or_obj)
669 if from_array_kwargs is None:
670 from_array_kwargs = {}
File ~/miniforge3/envs/earthkit-data/lib/python3.11/site-packages/xarray/backends/plugins.py:194, in guess_engine(store_spec)
186 else:
187 error_msg = (
188 "found the following matches with the input file in xarray's IO "
189 f"backends: {compatible_engines}. But their dependencies may not be installed, see:\n"
190 "https://docs.xarray.dev/en/stable/user-guide/io.html \n"
191 "https://docs.xarray.dev/en/stable/getting-started-guide/installing.html"
192 )
--> 194 raise ValueError(error_msg)
ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'scipy', 'cfgrib', 'earthkit']. Consider explicitly selecting one of the installed engines via the ``engine`` parameter, or installing additional IO dependencies, see:
https://docs.xarray.dev/en/stable/getting-started-guide/installing.html
https://docs.xarray.dev/en/stable/user-guide/io.html
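The error is raised while xarray tries to guess an engine for one of the paths it was given, which means at least one file was not recognised by any installed backend. A quick way to check what a downloaded file actually contains is to look at its first bytes. The sketch below is a diagnostic aid only: the helper name sniff_format is hypothetical, the signature table covers just the formats relevant here, and where earthkit.data stores the downloaded files is not covered by this report.

```python
import os
import tempfile
import zipfile

# Leading bytes ("magic numbers") of the formats relevant to this report.
SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"\x89HDF\r\n\x1a\n": "netcdf4/hdf5",
    b"CDF": "netcdf3",
    b"GRIB": "grib",
}


def sniff_format(path):
    """Guess a file's format from its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"


# Demonstration with a throwaway single-file ZIP (placeholder content, not real data).
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "2025.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("2025.nc", b"placeholder")
    print(sniff_format(zip_path))
```

If one of the paths passed to xarray turns out to be a ZIP archive rather than a NetCDF file, that would be consistent with the engine guess failing even though netcdf4 is installed.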
Accompanying data
No accompanying data is provided for this issue.
Organisation
Q: What is the issue with the xarray converter?
A: The xarray converter fails when trying to open a list of ZIP files that includes a single-file ZIP. This is a known issue in the earthkit.data library.
Q: What are the steps to reproduce the bug?
A: Follow the steps listed under "What are the steps to reproduce the bug?" above, using the code snippet from "What happened?" with split_on="year".
Q: What is the error message?
A: xarray raises a ValueError while guessing the engine for one of the files:
ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'scipy', 'cfgrib', 'earthkit']. Consider explicitly selecting one of the installed engines via the ``engine`` parameter, or installing additional IO dependencies
The full traceback is given under "Relevant log output" above.
Q: How to fix the issue?
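A: No confirmed fix is given in this report. The following may be worth trying, with caveats:
- Check that the netcdf4 library is installed (pip install netcdf4). Note that the error message already lists netcdf4 among the installed backends, so a missing dependency is unlikely to be the cause here.
- Load 2025 in a separate request. As stated above, the request succeeds when 2025 is excluded or when only 2025 is selected.
- Set the engine parameter to netcdf4 in the to_xarray() method. This skips xarray's engine guessing, but it can only help if the file handed to xarray is a NetCDF file rather than a ZIP archive:
ds = earthkit.data.from_source("cds", collection_id, **request, split_on="year").to_xarray(engine="netcdf4")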
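If the failing path does turn out to be a ZIP archive, a possible manual workaround is to unpack any ZIPs first and hand the extracted files to xarray directly. The sketch below uses only the standard library. The helper name resolve_paths is hypothetical, the demo files are placeholders rather than real NetCDF data, and how to obtain the downloaded file paths from earthkit.data is not covered by this report.

```python
import os
import tempfile
import zipfile


def resolve_paths(paths, extract_dir):
    """Replace each ZIP in `paths` by the files it contains (hypothetical helper)."""
    resolved = []
    for path in paths:
        if zipfile.is_zipfile(path):
            # Unpack each archive into its own sub-directory to avoid name clashes.
            target = os.path.join(extract_dir, os.path.basename(path) + ".d")
            with zipfile.ZipFile(path) as zf:
                zf.extractall(target)
                members = [n for n in zf.namelist() if not n.endswith("/")]
            resolved.extend(os.path.join(target, n) for n in sorted(members))
        else:
            resolved.append(path)
    return resolved


# Demonstration with placeholder files standing in for real downloads.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "2023.nc")
    with open(plain, "wb") as f:
        f.write(b"placeholder")
    zipped = os.path.join(tmp, "2025.zip")
    with zipfile.ZipFile(zipped, "w") as zf:
        zf.writestr("2025.nc", b"placeholder")
    paths = resolve_paths([plain, zipped], os.path.join(tmp, "unzipped"))
    print([os.path.basename(p) for p in paths])
    # With real NetCDF files, the resolved paths could then be passed to
    # xarray.open_mfdataset(paths, engine="netcdf4").
```

This bypasses the merge step inside earthkit.data entirely, so it is a stopgap rather than a fix for the underlying behaviour.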
Q: What are the possible causes of the issue?
A: The possible causes of the issue are:
- Missing dependencies: The earthkit.data library relies on dependencies such as netcdf4 to read NetCDF files. If these are missing, the library may not work correctly. In this report, however, the error message lists netcdf4 as installed, so this is unlikely to be the cause.
- Incorrect configuration: The earthkit.data library has configuration options that need to be set correctly. The report does not mention any non-default configuration.
- Bug in the library: There may be a bug in the earthkit.data library. This is consistent with the report: the request fails only when 2025 is combined with other years, and the traceback ends with xarray being handed a path whose format it cannot recognise.
Q: How to report the issue?
A: To report the issue, you can:
- Check the earthkit.data documentation and the existing issues on the library's GitHub page to see if the issue is already reported.
- If the issue is not reported, create a new issue on the library's GitHub page.
- Provide as much information as possible about the issue, including the code snippet that causes it, the full error message, the library version, the platform, and any relevant configuration options.