KaFKA relies on a few things:
- The existence of an observation operator (e.g. an RT model), interfaced through a Gaussian Process emulator (not strictly necessary, but it helps)
- The provision of an observations object that reads in the EO data and provides an estimate of its uncertainty.
- A class for writing the output to disk. This can probably use information from the observations to define e.g. locations and other metadata.
KaFKA will produce inferences on parameters over a region, defined in both space and time. For time, the current implementation uses a temporal grid: a normal Python list where each element is a datetime.datetime object. The inferences are done assuming the land surface state doesn't change between time steps (list elements). The first period of inference takes all observations between the 0th element of the list and the 1st element (whether the boundary is inclusive still needs checking), the second from the 1st to the 2nd, and so on.
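As a minimal sketch, a weekly temporal grid for one year could be built like this (the start date and number of steps are made up):

```python
import datetime as dt

# Hypothetical weekly inference grid for 2017; any plain list of
# datetime.datetime objects will do.
start = dt.datetime(2017, 1, 1)
time_grid = [start + dt.timedelta(days=7 * i) for i in range(52)]
```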
The spatial mask defines both the area and the spatial resolution over which parameters will be inferred. As such, it is designed to be a True/False mask over some area: some locations will form part of the inference, and some will not (allowing the user to mask roads, rivers, lakes, urban areas, ...). A convenient format for this is a geospatial dataset (e.g. a GeoTIFF file), which stores the projection, extent and spatial resolution, while also providing the actual pixel mask. The GeoTIFF file can then be used to automatically reproject other datasets (observations, priors, ...) or to define the shape of the output geospatial files.
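A sketch of reading such a mask with GDAL is shown below; the file name is hypothetical, and the convention that non-zero pixels take part in the inference is an assumption.

```python
from osgeo import gdal

# "state_mask.tif" is a hypothetical single-band GeoTIFF; here we assume
# non-zero pixels are part of the inference area.
g = gdal.Open("state_mask.tif")
state_mask = g.ReadAsArray().astype(bool)  # 2D True/False array
geotransform = g.GetGeoTransform()         # extent and pixel size
projection = g.GetProjection()             # projection (WKT string)
```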
KaFKA uses a normal pdf to encode all information related to the state vector, which means that the state can be fully stored as a so-called mean vector and a covariance (or inverse covariance) matrix. The causal nature of the system implies that, for each time step, we'll have a vector and a matrix covering all pixels. We have a preference for using inverse covariance matrices, as they're often sparse.
The state vector for a single pixel is determined by a vector of N parameters p(1), ..., p(N). When considering many pixels, each pixel will be characterised by a set of N parameters, and hence the total state vector for P pixels will be a vector of size P*N with the following ordering: p(1,1), ..., p(N,1), p(1,2), ..., p(N,2), ..., p(1,P), ..., p(N,P), where p(i,j) is parameter i of pixel j. In other words, we store all the parameters related to one pixel in succession, and then stack the individual per-pixel vectors sequentially. The pixel ordering is given by the True elements of the state mask after flattening.
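A quick numpy sketch of this ordering, assuming the per-pixel parameters are held in a hypothetical (N, rows, cols) array:

```python
import numpy as np

N, rows, cols = 3, 4, 5                      # toy dimensions
param_cube = np.random.rand(N, rows, cols)   # hypothetical parameter maps
state_mask = np.ones((rows, cols), dtype=bool)
state_mask[0, 0] = False                     # e.g. mask out a water pixel

# Select unmasked pixels (row-major order of the True elements), then
# flatten pixel by pixel: p(1,1), ..., p(N,1), p(1,2), ..., p(N,P)
x = param_cube[:, state_mask].T.flatten()
print(x.shape)                               # (N * P,), with P = state_mask.sum()
```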
The (inverse) covariance matrix follows the same ordering along the main diagonal. This means that, in the case of no inter-pixel interactions, the matrix will be block diagonal, with P blocks of size N x N, hence the total size will be (N*P) x (N*P).
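For illustration, such a block-diagonal sparse inverse covariance matrix, with one (made-up) N x N block per pixel, could be assembled with scipy:

```python
import numpy as np
import scipy.sparse as sp

N, P = 3, 19                                  # toy sizes: parameters, pixels
# A hypothetical per-pixel covariance; its inverse is the per-pixel block
per_pixel_cov = np.diag([0.1, 0.2, 0.05])
block = np.linalg.inv(per_pixel_cov)
# No inter-pixel interactions -> block diagonal, shape (N*P, N*P)
P_analysis_inv = sp.block_diag([block] * P, format="csr")
print(P_analysis_inv.shape)                   # (57, 57)
```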
The observation class requires one method and one attribute, both compulsory (they are used by the inference engine and are expected to exist):
- `get_band_data(the_date, band_no)`, which returns the observation, its uncertainty and the relevant emulators as a `namedtuple` (or similar class) with attributes `observations`, `uncertainty`, `mask`, `metadata` and `emulator` (the actual contents still need to be defined; a possible container is sketched after the code example below).
- `dates`: a dictionary containing a set of `datetime.datetime` objects with the dates of the different observations.
A barebones class looks like this:

```python
class Observations(object):
    def __init__(self):
        self.dates = []

    def get_band_data(self, the_date, band_no):
        # [...]
        return data_chunk
```

There are examples of these kinds of classes in here.
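A possible (hypothetical) definition of the returned container, matching the attribute names listed above:

```python
from collections import namedtuple

# The exact contents are still to be defined; the field names mirror the
# attributes expected by the inference engine.
BandData = namedtuple("BandData",
                      ["observations", "uncertainty", "mask",
                       "metadata", "emulator"])

# Inside get_band_data, one would then build something like
# data_chunk = BandData(obs_array, unc_array, qa_mask, meta_dict, gp_emulator)
```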
Again, the prior class is quite flexible, and only requires a single method to exist with a defined interface:
- `process_prior(self, time, inv_cov=True)`, where `time` is a time step in `datetime.datetime` format, and `inv_cov` selects between returning the inverse covariance matrix or the covariance matrix. The method should return the state mean and either the covariance or the inverse covariance matrix (as a sparse matrix).
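A minimal sketch of such a prior, assuming a time-invariant, diagonal (inverse) covariance; all names and values below are illustrative, not part of KaFKA:

```python
import numpy as np
import scipy.sparse as sp

class SimplePrior(object):
    """Toy prior: the same mean and diagonal (inverse) covariance is
    returned for every time step."""
    def __init__(self, mean_per_pixel, sigma_per_pixel, n_pixels):
        self.mean = np.tile(mean_per_pixel, n_pixels)
        variance = np.tile(sigma_per_pixel, n_pixels) ** 2
        self.cov = sp.diags(variance, format="csr")
        self.cov_inv = sp.diags(1.0 / variance, format="csr")

    def process_prior(self, time, inv_cov=True):
        # `time` is ignored here; a real prior could vary with the season
        if inv_cov:
            return self.mean, self.cov_inv
        return self.mean, self.cov
```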
For writing results to disk, the output class also only needs a single method:
- `dump_data(timestep, x_analysis, P_analysis, P_analysis_inv, state_mask)`, which dumps the data to disk. This method doesn't return anything. Currently, the format isn't defined. Clearly, a raster file per parameter would be sensible and in line with common use (e.g. a GeoTIFF for parameter 1). The covariance matrix could be stored as a combination of the main diagonal (the variances), written as individual files of type e.g. float32, and the per-pixel upper-diagonal elements (ignoring interactions between pixels), stored as correlations quantised in steps of 0.01 (i.e. multiplied by 100 and rounded), so that each value fits in a signed byte. Two assumptions here are that the correlations can be quantised (which seems acceptable), and that between-pixel interactions are ignored (OK for the time being, might be an issue later on). At some point, it may be worth investigating matrix compression techniques.
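A sketch of a writer along those lines, producing one float32 GeoTIFF per parameter and time step; the class name, file naming and constructor arguments are all assumptions:

```python
import numpy as np
from osgeo import gdal

class GeoTIFFWriter(object):
    """Hypothetical output class: writes one float32 GeoTIFF per parameter
    and time step on the state mask grid."""
    def __init__(self, param_names, state_mask_file, output_folder="."):
        self.param_names = param_names
        self.output_folder = output_folder
        g = gdal.Open(state_mask_file)
        self.geotransform = g.GetGeoTransform()
        self.projection = g.GetProjection()

    def dump_data(self, timestep, x_analysis, P_analysis,
                  P_analysis_inv, state_mask):
        n_params = len(self.param_names)
        driver = gdal.GetDriverByName("GTiff")
        for i, name in enumerate(self.param_names):
            # Undo the pixel-major ordering: take parameter i of every pixel
            layer = np.full(state_mask.shape, np.nan, dtype=np.float32)
            layer[state_mask] = x_analysis[i::n_params]
            fname = "%s/%s_%s.tif" % (self.output_folder, name,
                                      timestep.strftime("%Y%m%d"))
            ds = driver.Create(fname, state_mask.shape[1],
                               state_mask.shape[0], 1, gdal.GDT_Float32)
            ds.SetGeoTransform(self.geotransform)
            ds.SetProjection(self.projection)
            ds.GetRasterBand(1).WriteArray(layer)
            ds = None  # close/flush the file
```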
We use the LinearKalman class. The setup takes the observations and output objects, as well as a description of a "state mask". The state mask is a 2D boolean array that flags the pixels over the region of interest where the inference is to be done.
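Wiring it up might look roughly like this; the import path and the exact constructor signature are assumptions and should be checked against the code, and `observations`, `writer` and `state_mask` refer to objects like the ones sketched above:

```python
# Assumed import path and argument order; check the actual LinearKalman
# signature before relying on this sketch.
from kafka import LinearKalman

kf = LinearKalman(observations, writer, state_mask)
```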