This is a C++ implementation of a Gibbs sampler for the Distance Dependent Chinese Restaurant Process (ddCRP), originally introduced in: Blei, D.M., Frazier, P.I.: "Distance Dependent Chinese Restaurant Process", Journal of Machine Learning Research 12 (2011):2383-2410.
This implementation was used to obtain the results presented in: Lauri, M., Frintrop, S.: "Object Proposal Generation Applying the Distance Dependent Chinese Restaurant Process", in Proc. 20th Scandinavian Conference on Image Analysis, Tromsö, Norway, June 12--14, 2017.
For now, the code supports a multivariate normal cluster likelihood model.
You need a compiler with C++11 support. This software also requires the Boost libraries and Eigen3.
mkdir build && cd build
cmake ..
make
The executable will be placed in the bin folder.
Executing ./bin/ddcrp_clustering_example will yield the help message.
The feature (or data) file contains the N data points in d-dimensional space to feed into the ddCRP, as a N-by-d matrix.
The log decay file contains a N-by-N matrix. Informally, entry (i,j) quantifies the relative likelihood that the ith and jth data points will form a link (and thus be in the same cluster). More formally, entry (i,j) is equal to log( f(d(i,j)) ), where d(i,j) is a distance measure between the ith and jth data point, f is a decay function (see the papers). An entry -Inf here corresponds to an impossible link.
The prior covariance file sets the prior cluster covariance matrix, and is a d-by-d matrix.
The prior mean file sets the prior cluster mean vector, a d-by-1 vector.
The strengths of these priors are determined by the input parameters v and k.
n specifies how many samples of clusterings to draw from the ddCRP, and b sets the number of burn-in samples before outputting the samples.
You can also draw samples from the ddCRP prior (ignoring the likelihood model) by setting the switch --p.
The output will be written to files called clustering_0000.csv with a running numbering.
The ith row in the file has a comma separated list of data point indices belonging to the ith cluster.
The number of rows in the file indicates the number of clusters.
For example, for 5 data points an output
0, 1, 2
3
4, 5
would mean that there are 3 clusters, with data points corresponding to the indices {0,1,2}, {3}, and {4,5}, respectively.
There is some test data provided in the folder data. The file data.csv contains 100 samples drawn from two bivariate Gaussian distributions.
The file log_decay.csv contains a 100-by-100 matrix of log of the decay function values obtained as follows.
We compute the Euclidean distance d(i,j) between each pair of data points.
We apply a windowed exponential decay function
exp(-d/a), if d <= d_max
f(d) =
0 otherwise.
Here we have set a=0.3, d_max=1.5.
The values on the off-diagonals of log_decay.csv are obtained by log(f(d(i,j))).
For the diagonal entries we want the self-link likelihoods of each data point. Here we set a constant value -0.8/a.
We can run the clustering by
cd data
./../bin/ddcrp-gibbs-example -f data.csv -S covar.csv -m mean.csv -l log_decay.csv
The output will be written to the current folder. The figure below compares the true clusters (left) and one of the clusterings drawn from the ddCRP (right).
