Guideline for getting started with data in CSV files

Some general remarks: CSV is a common exchange format for opportunistic sensing data such as PWS and CML observations. This guideline shows how to read such files, check and clean the data, and convert them into an xarray.Dataset following the OpenSense data conventions, so that they can be stored as NetCDF.

PWS data

This notebook gives an example of how to get started using individual PWS CSV files. The following examples use a subset of the data from the AMS dataset published by de Vos et al. (2019). PWS data downloaded via the API come as individual CSV files and should also include a metadata file with PWS IDs and coordinates.

[14]:
import datetime
import glob
import warnings

import numpy as np
import pandas as pd
import xarray as xr

warnings.simplefilter(action="ignore", category=FutureWarning)

Functions for checking data

It happens quite frequently that Netatmo PWS have identical coordinates. This indicates that the users have not placed their PWS correctly via the Netatmo web interface, which leads to false location information. The following function is used later on to remove PWS with identical coordinates.

[15]:
# Small function to identify PWS with duplicate coordinates in an xarray.Dataset
def get_duplicate_coordinates(ds, id_var_name="id", keep=False, lat="lat", lon="lon"):  # noqa: D103
    # lat and lon can be set if a dataset does not follow our OS convention
    df_data = ds[id_var_name].to_dataframe()  # assumes that there is an 'id' dimension
    # return a boolean series which can easily be used for indexing in the initial `ds`
    return df_data.duplicated(subset=[lon, lat], keep=keep)
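
As a quick illustration, here is a minimal synthetic sketch (the dataset `ds_demo` and the IDs `pws_a`, `pws_b`, `pws_c` are made up and not part of the AMS data). Because `keep=False`, all members of a duplicate group are flagged, which is why both stations of a duplicate pair are removed later on; pass `keep="first"` if you prefer to retain one station per location.

# Hypothetical example: three PWS, two of which share identical coordinates
ds_demo = xr.Dataset(
    coords={
        "id": ("id", ["pws_a", "pws_b", "pws_c"]),
        "lon": ("id", [4.67, 4.68, 4.68]),
        "lat": ("id", [52.30, 52.31, 52.31]),
    }
)
# expected result: pws_a -> False, pws_b -> True, pws_c -> True
get_duplicate_coordinates(ds_demo)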

Netatmo PWS data can have a temporal resolution of up to ~5 minutes. However, these intervals are not exactly 5 minutes and can occasionally be a bit longer, which leads either to missing timestamps in the data series or to single timestamps with missing data. The following function checks whether the time series is complete; if timestamps are missing, the DataFrame is reindexed so that the missing dates are filled in with NaN.

Note that this function is tailored to the AMS PWS dataset: it only works if all timestamps fall exactly on full 5-minute marks. It needs to be adjusted accordingly for other datasets.

[16]:
def reindex_timestamps(df_data, fn=None):  # noqa: D103
    # check if the timestamps are complete
    start = df_data.index[0]
    end = df_data.index[-1]
    dates = pd.date_range(start=start, end=end, freq="5Min")
    if len(df_data.index) != len(dates):
        # `fn` is only used to label the print output
        print(f"File {fn} has missing time steps, reindexing")  # noqa: T201
        df_data = df_data.reindex(dates)
    return df_data
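
To illustrate what the reindexing does, here is a minimal synthetic sketch (`df_demo` is made up for illustration): a 5-minute series with one missing timestamp is completed, and the inserted row is filled with NaN. If your timestamps do not fall exactly on 5-minute marks, one possible adjustment is to round the index first, e.g. with `df_data.index.round("5min")`, before reindexing.

# Hypothetical example: a 5-minute series where the 00:15 timestamp is missing
idx = pd.to_datetime(["2016-05-01 00:05", "2016-05-01 00:10", "2016-05-01 00:20"])
df_demo = pd.DataFrame({"rainfall": [0.0, 0.1, 0.0]}, index=idx)

df_demo = reindex_timestamps(df_demo, fn="df_demo")
# the reindexed frame now contains 00:15 with a NaN value
df_demo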

Read CSV raw data file

Let’s get a list of all the raw data files. In the next steps we will use only one file, but later we will loop through all of them.

[17]:
fn_list = sorted(glob.glob("csv_cookbook_data/ams*.csv"))  # noqa: PTH207
fn_list
[17]:
['csv_cookbook_data/ams1.csv',
 'csv_cookbook_data/ams2.csv',
 'csv_cookbook_data/ams3.csv',
 'csv_cookbook_data/ams4.csv',
 'csv_cookbook_data/ams5.csv',
 'csv_cookbook_data/ams6.csv',
 'csv_cookbook_data/ams7.csv',
 'csv_cookbook_data/ams8.csv',
 'csv_cookbook_data/ams9.csv']
[18]:
pws_data = pd.read_csv(fn_list[0], index_col=0, parse_dates=True, sep=",")
pws_data.head()
[18]:
                     ams1
Time
2016-05-01 00:05:00   0.0
2016-05-01 00:10:00   0.0
2016-05-01 00:15:00   0.0
2016-05-01 00:20:00   0.0
2016-05-01 00:25:00   0.0

Read in CSV metadata file

[19]:
# Load metadata from the Amsterdam PWS data set (showcase subset with 9 PWS)
pws_meta = pd.read_csv("csv_cookbook_data/AMS_metadatasubset.csv", sep=",")
[20]:
pws_meta.head()
[20]:
     ID       lon        lat
0  ams1  4.670664  52.305896
1  ams2  4.674940  52.302044
2  ams3  4.677275  52.308427
3  ams4  4.677554  52.346013
4  ams5  4.678593  52.318572

Convert data from pandas.DataFrame to xarray.Dataset including metadata

The following function converts the pandas.DataFrame to an xarray.Dataset including metadata. It follows the data conventions defined in the OpenSense COST Action (link to white paper).

Note that this function uses specific metadata entries for the AMS PWS dataset. It needs to be adjusted accordingly for other datasets.

[21]:
def dataframe_to_dataset(pws_data, lon, lat, elevation=None):  # noqa: D103
    pws_data.index = pws_data.index.rename("time")

    ds = pws_data.to_xarray().to_array(dim="id")

    ds = ds.to_dataset(name="rainfall")

    if elevation is None:
        ds = ds.assign_coords({"elevation": xr.full_like(ds.id, np.nan)})
    else:
        ds = ds.assign_coords({"elevation": elevation})

    ds = ds.assign_coords({"lon": lon})
    ds = ds.assign_coords({"lat": lat})
    # Important: set the time reference (epoch); otherwise the first time step is used as reference.
    ds.time.encoding["units"] = "seconds since 1970-01-01 00:00:00"

    # variable attributes
    ds.rainfall.attrs = {
        "name": "rainfall",
        "long_name": "rainfall amount per time unit",
        "units": "mm",
        "coverage_contant_type": "physicalMeasurement",
    }

    # coordinate attributes
    ds.time.attrs = {"unit": "seconds since 1970-01-01 00:00:00"}
    ds.lon.attrs = {"units": "degrees in WGS84 projection"}
    ds.lat.attrs = {"units": "degrees in WGS84 projection"}
    ds.elevation.attrs = {"units": "meters", "long_name": "meters_above_sea"}

    # global attributes
    ds.attrs["title"] = "PWS data from Amsterdam"
    ds.attrs[
        "institution"
    ] = "Wageningen University and Research, Department of Environmental Sciences"
    ds.attrs["history"] = "Subset of the orignal PWS dataset from de Vos et al. (2019)"
    ds.attrs["references"] = "https://doi.org/10.1029/2019GL083731"
    ds.attrs["date_created"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    ds.attrs["Conventions"] = "OpenSense-0.1"
    ds.attrs[
        "license restrictions"
    ] = "CC-BY 4.0 https://creativecommons.org/licenses/by/4.0/"
    ds.attrs["location"] = "Amsterdam (NL)"
    ds.attrs["source"] = "Netatmo PWS"
    ds.attrs["comment"] = ""

    return ds
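
Before looping over all files, the function can be tried on the single PWS read above (a sketch; `row` and `ds_single` are names introduced here only for illustration):

# Sketch: convert the single PWS loaded above, using its row from the metadata table
row = pws_meta[pws_meta.ID == "ams1"].iloc[0]
ds_single = dataframe_to_dataset(pws_data, lon=row.lon, lat=row.lat)
ds_single

The resulting dataset contains a single id ("ams1") with the variable and global attributes defined above.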

Parse raw data and convert to xarray.Dataset with metadata

PWS ams1 has single timestamps with missing data; PWS ams2, ams6 and ams7 have empty timestamps (which correspond to NaN).

[22]:
ds_list = []
for _, metadata_row in pws_meta.iterrows():
    fn = f"csv_cookbook_data/{metadata_row.ID}.csv"
    pws_data = pd.read_csv(fn, index_col=0, parse_dates=True, sep=",")
    print(pws_data.index[0])  # noqa: T201

    duplicate_ix = pws_data.index.duplicated(keep=False).sum()
    if duplicate_ix > 0:
        print(str(duplicate_ix) + " duplicate indices in file " + metadata_row.ID)  # noqa: T201
        pws_data = pws_data[~pws_data.index.duplicated(keep="last")]

    # check if the timestamps are complete and reindex if necessary
    # PWS ams1 has single timestamps with missing data,
    # PWS ams2, ams6 and ams7 have empty timestamps (which correspond to NaN).
    pws_data = reindex_timestamps(pws_data, fn=fn)

    # replace values smaller than 0 with np.nan
    # This is relevant if PWS have numerical values (e.g. -9999) for NaN
    pws_data[pws_data < 0] = np.nan

    print(  # noqa: T201
        metadata_row.ID,
        "has",
        int(pws_data.isnull().sum().iloc[0] / pws_data.index.shape[0] * 100),  # noqa: PD003
        "% missing values",
    )

    # The following lines convert the data to an xarray.Dataset and add the attributes
    # according to the OpenSense naming conventions
    ds = dataframe_to_dataset(pws_data, metadata_row.lon, metadata_row.lat)
    ds_list.append(ds)
2016-05-01 00:05:00
File csv_cookbook_data/ams1.csv has missing time steps, reindexing
ams1 has 0 % missing values
2016-05-01 00:05:00
ams2 has 100 % missing values
2016-05-01 17:30:00
ams3 has 4 % missing values
2016-05-01 00:05:00
ams4 has 1 % missing values
2016-05-01 00:05:00
ams5 has 100 % missing values
2016-05-01 00:05:00
ams6 has 4 % missing values
2016-05-01 00:05:00
ams7 has 100 % missing values
2016-05-01 00:05:00
ams8 has 100 % missing values
2016-05-01 00:05:00
ams9 has 100 % missing values

Concatenate all individual xarray.Datasets into one xarray.Dataset with all stations

PWS ams8 and ams9 have identical coordinates. These stations will be removed, using the function defined above, when all stations are combined into one xarray.Dataset.

[23]:
# concatenate the list of datasets into one xarray.Dataset
ds_all = xr.concat(ds_list, dim="id")

# remove PWS with duplicate coordinates from xarray.Dataset
duplicates = get_duplicate_coordinates(ds_all, lat="lat", lon="lon")
ds_all = ds_all.sel(id=~duplicates.to_numpy())
print("Dataset has", duplicates.sum(), "PWS with identical lat/lon coordinates")  # noqa: T201
Dataset has 2 PWS with identical lat/lon coordinates
[24]:
ds_all
[24]:
<xarray.Dataset>
Dimensions:    (time: 2879, id: 7)
Coordinates:
  * time       (time) datetime64[ns] 2016-05-01T00:05:00 ... 2016-05-10T23:55:00
  * id         (id) object 'ams1' 'ams2' 'ams3' 'ams4' 'ams5' 'ams6' 'ams7'
    elevation  (id) object nan nan nan nan nan nan nan
    lon        (id) float64 4.671 4.675 4.677 4.678 4.679 4.679 4.679
    lat        (id) float64 52.31 52.3 52.31 52.35 52.32 52.3 52.3
Data variables:
    rainfall   (id, time) float64 0.0 0.0 0.0 0.0 0.0 ... nan nan nan nan nan
Attributes:
    title:                 PWS data from Amsterdam
    institution:           Wageningen University and Research, Department of ...
    history:               Subset of the original PWS dataset from de Vos et a...
    references:            https://doi.org/10.1029/2019GL083731
    date_created:          2024-06-20 17:08:03
    Conventions:           OpenSense-0.1
    license restrictions:  CC-BY 4.0 https://creativecommons.org/licenses/by/...
    location:              Amsterdam (NL)
    source:                Netatmo PWS
    comment:               
[25]:
ds_all.rainfall.cumsum(dim="time").plot.line(x="time", figsize=(14, 4));
[Figure: cumulative rainfall sums per PWS over time]

Save data to compressed NetCDF

[26]:
# add encoding to compress the file
myencoding = {
    "rainfall": {
        "dtype": "int32",
        "scale_factor": 0.001,
        "zlib": True,
        "_FillValue": -9999,
        "complevel": 3,
    }
}

ds_all.to_netcdf("AMS_PWS_subset.nc", encoding=myencoding)
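
With dtype int32, scale_factor 0.001 and _FillValue -9999, rainfall is stored with a precision of 0.001 mm, and NaN values are written as the fill value. To check that everything round-trips as expected, the file can simply be opened again (a sketch; `ds_check` is just an illustrative name):

# Sketch: re-open the compressed file; xarray applies scale_factor and
# _FillValue automatically, so rainfall is decoded back to floats with NaN
ds_check = xr.open_dataset("AMS_PWS_subset.nc")
ds_check.rainfall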

CML data

to be added…
