Guideline for getting started with data in CSV files

Some general remarks: CSV is a common exchange format for opportunistic sensing data such as PWS and CML observations. This guideline shows how to read such files, check and clean the data, and convert them into an xarray.Dataset following the OpenSense data conventions, so that they can be stored as NetCDF.

PWS data

This notebook gives an example of how to get started using individual PWS CSV files. The following examples use a subset of the data from the AMS dataset published by de Vos et al. (2019). PWS data downloaded via the API come as individual CSV files and should also include a metadata file with PWS IDs and coordinates.

[14]:
import datetime
import glob
import warnings

import numpy as np
import pandas as pd
import xarray as xr

warnings.simplefilter(action="ignore", category=FutureWarning)

Functions for checking data

It happens quite frequently that Netatmo PWS have identical coordinates. This indicates that the users have not placed their PWS correctly via the Netatmo web interface, which leads to false location information. The following function is used later on to remove PWS with identical coordinates.

[15]:
# Small function to identify PWS with duplicate coordinates in an xarray.Dataset
def get_duplicate_coordinates(ds, id_var_name="id", keep=False, lat="lat", lon="lon"):  # noqa: D103
    # lat and lon can be set if a dataset does not follow our OS convention
    df_data = ds[id_var_name].to_dataframe()  # assumes that there is an 'id' dimension
    # return a boolean series which can easily be used for indexing in the initial `ds`
    return df_data.duplicated(subset=[lon, lat], keep=keep)
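
As a quick illustration, here is a minimal synthetic sketch (the dataset `ds_demo` and the IDs `pws_a`, `pws_b`, `pws_c` are made up and not part of the AMS data). Because `keep=False`, all members of a duplicate group are flagged, which is why both stations of a duplicate pair are removed later on; pass `keep="first"` if you prefer to retain one station per location.

# Hypothetical example: three PWS, two of which share identical coordinates
ds_demo = xr.Dataset(
    coords={
        "id": ("id", ["pws_a", "pws_b", "pws_c"]),
        "lon": ("id", [4.67, 4.68, 4.68]),
        "lat": ("id", [52.30, 52.31, 52.31]),
    }
)
# expected result: pws_a -> False, pws_b -> True, pws_c -> True
get_duplicate_coordinates(ds_demo)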

Netatmo PWS data can have a temporal resolution of up to ~5 minutes. However, these intervals are not exactly 5 minutes and can occasionally be a bit longer, which leads either to missing timestamps in the data series or to single timestamps with missing data. The following function checks whether the time series is complete; if timestamps are missing, the DataFrame is reindexed so that the missing dates are filled in with NaN.

Note that this function is tailored to the AMS PWS dataset: it only works if all timestamps fall exactly on full 5-minute marks. It needs to be adjusted accordingly for other datasets.

[16]:
def reindex_timestamps(df_data, fn=None):  # noqa: D103
    # check if the timestamps are complete
    start = df_data.index[0]
    end = df_data.index[-1]
    dates = pd.date_range(start=start, end=end, freq="5Min")
    if len(df_data.index) != len(dates):
        # `fn` is only used to label the print output
        print(f"File {fn} has missing time steps, reindexing")  # noqa: T201
        df_data = df_data.reindex(dates)
    return df_data
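
To illustrate what the reindexing does, here is a minimal synthetic sketch (`df_demo` is made up for illustration): a 5-minute series with one missing timestamp is completed, and the inserted row is filled with NaN. If your timestamps do not fall exactly on 5-minute marks, one possible adjustment is to round the index first, e.g. with `df_data.index.round("5min")`, before reindexing.

# Hypothetical example: a 5-minute series where the 00:15 timestamp is missing
idx = pd.to_datetime(["2016-05-01 00:05", "2016-05-01 00:10", "2016-05-01 00:20"])
df_demo = pd.DataFrame({"rainfall": [0.0, 0.1, 0.0]}, index=idx)

df_demo = reindex_timestamps(df_demo, fn="df_demo")
# the reindexed frame now contains 00:15 with a NaN value
df_demo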

Read CSV raw data file

Let’s get a list of all the raw data files. In the next steps we will use only one file, but later we will loop through all of them.

[17]:
fn_list = sorted(glob.glob("csv_cookbook_data/ams*.csv"))  # noqa: PTH207
fn_list
[17]:
['csv_cookbook_data/ams1.csv',
 'csv_cookbook_data/ams2.csv',
 'csv_cookbook_data/ams3.csv',
 'csv_cookbook_data/ams4.csv',
 'csv_cookbook_data/ams5.csv',
 'csv_cookbook_data/ams6.csv',
 'csv_cookbook_data/ams7.csv',
 'csv_cookbook_data/ams8.csv',
 'csv_cookbook_data/ams9.csv']
[18]:
pws_data = pd.read_csv(fn_list[0], index_col=0, parse_dates=True, sep=",")
pws_data.head()
[18]:
                     ams1
Time
2016-05-01 00:05:00   0.0
2016-05-01 00:10:00   0.0
2016-05-01 00:15:00   0.0
2016-05-01 00:20:00   0.0
2016-05-01 00:25:00   0.0

Read in CSV metadata file

[19]:
# Load metadata from the Amsterdam PWS data set (showcase subset with 9 PWS)
pws_meta = pd.read_csv("csv_cookbook_data/AMS_metadatasubset.csv", sep=",")
[20]:
pws_meta.head()
[20]:
     ID       lon        lat
0  ams1  4.670664  52.305896
1  ams2  4.674940  52.302044
2  ams3  4.677275  52.308427
3  ams4  4.677554  52.346013
4  ams5  4.678593  52.318572

Convert data from pandas.DataFrame to xarray.Dataset including metadata

The following function converts the pandas.DataFrame to an xarray.Dataset including metadata. It follows the data conventions defined in the OpenSense COST Action (link to white paper).

Note that this function uses specific metadata entries for the AMS PWS dataset. It needs to be adjusted accordingly for other datasets.

[21]:
def dataframe_to_dataset(pws_data, lon, lat, elevation=None):  # noqa: D103
    pws_data.index = pws_data.index.rename("time")

    ds = pws_data.to_xarray().to_array(dim="id")

    ds = ds.to_dataset(name="rainfall")

    if elevation is None:
        ds = ds.assign_coords({"elevation": xr.full_like(ds.id, np.nan)})
    else:
        ds = ds.assign_coords({"elevation": elevation})

    ds = ds.assign_coords({"lon": lon})
    ds = ds.assign_coords({"lat": lat})
    # Important: set the time reference (epoch); otherwise the first time step is used as reference.
    ds.time.encoding["units"] = "seconds since 1970-01-01 00:00:00"

    # variable attributes
    ds.rainfall.attrs = {
        "name": "rainfall",
        "long_name": "rainfall amount per time unit",
        "units": "mm",
        "coverage_contant_type": "physicalMeasurement",
    }

    # coordinate attributes
    ds.time.attrs = {"unit": "seconds since 1970-01-01 00:00:00"}
    ds.lon.attrs = {"units": "degrees in WGS84 projection"}
    ds.lat.attrs = {"units": "degrees in WGS84 projection"}
    ds.elevation.attrs = {"units": "meters", "long_name": "meters_above_sea"}

    # global attributes
    ds.attrs["title"] = "PWS data from Amsterdam"
    ds.attrs[
        "institution"
    ] = "Wageningen University and Research, Department of Environmental Sciences"
    ds.attrs["history"] = "Subset of the orignal PWS dataset from de Vos et al. (2019)"
    ds.attrs["references"] = "https://doi.org/10.1029/2019GL083731"
    ds.attrs["date_created"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    ds.attrs["Conventions"] = "OpenSense-0.1"
    ds.attrs[
        "license restrictions"
    ] = "CC-BY 4.0 https://creativecommons.org/licenses/by/4.0/"
    ds.attrs["location"] = "Amsterdam (NL)"
    ds.attrs["source"] = "Netatmo PWS"
    ds.attrs["comment"] = ""

    return ds
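
Before looping over all files, the function can be tried on the single PWS read above (a sketch; `row` and `ds_single` are names introduced here only for illustration):

# Sketch: convert the single PWS loaded above, using its row from the metadata table
row = pws_meta[pws_meta.ID == "ams1"].iloc[0]
ds_single = dataframe_to_dataset(pws_data, lon=row.lon, lat=row.lat)
ds_single

The resulting dataset contains a single id ("ams1") with the variable and global attributes defined above.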

Parse raw data and convert to xarray.Dataset with metadata

PWS ams1 has single timestamps with missing data; PWS ams2, ams6 and ams7 have empty timestamps (which correspond to NaN).

[22]:
ds_list = []
for _, metadata_row in pws_meta.iterrows():
    fn = f"csv_cookbook_data/{metadata_row.ID}.csv"
    pws_data = pd.read_csv(fn, index_col=0, parse_dates=True, sep=",")
    print(pws_data.index[0])  # noqa: T201

    duplicate_ix = pws_data.index.duplicated(keep=False).sum()
    if duplicate_ix > 0:
        print(str(duplicate_ix) + " duplicate indices in file " + metadata_row.ID)  # noqa: T201
        pws_data = pws_data[~pws_data.index.duplicated(keep="last")]

    # check if the timestamps are complete and reindex if necessary
    # PWS ams1 has single timestamps with missing data,
    # PWS ams2, ams6 and ams7 have empty timestamps (which correspond to NaN).
    pws_data = reindex_timestamps(pws_data, fn=fn)

    # replace values smaller than 0 with np.nan
    # This is relevant if PWS have numerical values (e.g. -9999) for NaN
    pws_data[pws_data < 0] = np.nan

    print(  # noqa: T201
        metadata_row.ID,
        "has",
        int(pws_data.isnull().sum().iloc[0] / pws_data.index.shape[0] * 100),  # noqa: PD003
        "% missing values",
    )

    # The following lines convert the data to an xarray.Dataset and add the attributes
    # according to the OpenSense naming conventions
    ds = dataframe_to_dataset(pws_data, metadata_row.lon, metadata_row.lat)
    ds_list.append(ds)
2016-05-01 00:05:00
File csv_cookbook_data/ams1.csv has missing time steps, reindexing
ams1 has 0 % missing values
2016-05-01 00:05:00
ams2 has 100 % missing values
2016-05-01 17:30:00
ams3 has 4 % missing values
2016-05-01 00:05:00
ams4 has 1 % missing values
2016-05-01 00:05:00
ams5 has 100 % missing values
2016-05-01 00:05:00
ams6 has 4 % missing values
2016-05-01 00:05:00
ams7 has 100 % missing values
2016-05-01 00:05:00
ams8 has 100 % missing values
2016-05-01 00:05:00
ams9 has 100 % missing values

Concatenate all individual xarray.Datasets into one xarray.Dataset with all stations

PWS ams8 and ams9 have identical coordinates. These stations will be removed, using the function defined above, when all stations are combined into one xarray.Dataset.

[23]:
# concatenate the list of datasets into one xarray.Dataset
ds_all = xr.concat(ds_list, dim="id")

# remove PWS with duplicate coordinates from xarray.Dataset
duplicates = get_duplicate_coordinates(ds_all, lat="lat", lon="lon")
ds_all = ds_all.sel(id=~duplicates.to_numpy())
print("Dataset has", duplicates.sum(), "PWS with identical lat/lon coordinates")  # noqa: T201
Dataset has 2 PWS with identical lat/lon coordinates
[24]:
ds_all
[24]:
<xarray.Dataset>
Dimensions:    (time: 2879, id: 7)
Coordinates:
  * time       (time) datetime64[ns] 2016-05-01T00:05:00 ... 2016-05-10T23:55:00
  * id         (id) object 'ams1' 'ams2' 'ams3' 'ams4' 'ams5' 'ams6' 'ams7'
    elevation  (id) object nan nan nan nan nan nan nan
    lon        (id) float64 4.671 4.675 4.677 4.678 4.679 4.679 4.679
    lat        (id) float64 52.31 52.3 52.31 52.35 52.32 52.3 52.3
Data variables:
    rainfall   (id, time) float64 0.0 0.0 0.0 0.0 0.0 ... nan nan nan nan nan
Attributes:
    title:                 PWS data from Amsterdam
    institution:           Wageningen University and Research, Department of ...
    history:               Subset of the original PWS dataset from de Vos et a...
    references:            https://doi.org/10.1029/2019GL083731
    date_created:          2024-06-20 17:08:03
    Conventions:           OpenSense-0.1
    license restrictions:  CC-BY 4.0 https://creativecommons.org/licenses/by/...
    location:              Amsterdam (NL)
    source:                Netatmo PWS
    comment:               
[25]:
ds_all.rainfall.cumsum(dim="time").plot.line(x="time", figsize=(14, 4));
[Figure: cumulative rainfall sums per PWS over time]

Save data to compressed NetCDF

[26]:
# add encoding to compress the file
myencoding = {
    "rainfall": {
        "dtype": "int32",
        "scale_factor": 0.001,
        "zlib": True,
        "_FillValue": -9999,
        "complevel": 3,
    }
}

ds_all.to_netcdf("AMS_PWS_subset.nc", encoding=myencoding)
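
With dtype int32, scale_factor 0.001 and _FillValue -9999, rainfall is stored with a precision of 0.001 mm, and NaN values are written as the fill value. To check that everything round-trips as expected, the file can simply be opened again (a sketch; `ds_check` is just an illustrative name):

# Sketch: re-open the compressed file; xarray applies scale_factor and
# _FillValue automatically, so rainfall is decoded back to floats with NaN
ds_check = xr.open_dataset("AMS_PWS_subset.nc")
ds_check.rainfall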

CML data

to be added…
