Commit f65c3f2

first commit


62 files changed, +16826 -0 lines

README.md

Lines changed: 90 additions & 0 deletions

# CDR - Interactive Visual Cluster Analysis by Contrastive Dimensionality Reduction

![teaser](teaser.png)

## Environment setup

This project is based on Python 3.6 and PyTorch 1.6.0. See `requirements.txt` for all prerequisites; you can install them with the following command.

```bash
pip install -r requirements.txt
```
## Datasets

|               | Size  | Dimensionality | Clusters |  Type   |                             Link                             |
| :-----------: | :---: | :------------: | :------: | :-----: | :----------------------------------------------------------: |
|    Animals    | 10000 |      512       |    10    |  image  | [Kaggle](https://www.kaggle.com/datasets/alessiocorrado99/animals10) |
| Anuran calls  | 7195  |       22       |    8     | tabular | [UCI](https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29) |
|   Banknote    | 1097  |       4        |    2     |  text   | [UCI](https://archive.ics.uci.edu/ml/datasets/banknote+authentication) |
|    Cifar10    | 10000 |      512       |    10    |  image  | [Alex Krizhevsky](https://www.cs.toronto.edu/~kriz/cifar.html) |
|     Cnae9     |  864  |      856       |    9     |  text   | [UCI](https://archive.ics.uci.edu/ml/datasets/cnae-9) |
| Cats-vs-Dogs  | 10000 |      512       |    2     |  image  | [Kaggle](https://www.kaggle.com/datasets/shaunthesheep/microsoft-catsvsdogs-dataset) |
|     Fish      | 9000  |      512       |    9     |  image  | [Kaggle](https://www.kaggle.com/datasets/crowww/a-large-scale-fish-dataset) |
|     Food      | 3585  |      512       |    11    |  image  | [Kaggle](https://www.kaggle.com/datasets/anshulmehtakaggl/themassiveindianfooddataset) |
|      Har      | 8240  |      561       |    6     | tabular | [UCI](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones) |
|    Isolet     | 1920  |      617       |    8     |  text   | [UCI](https://archive.ics.uci.edu/ml/datasets/isolet) |
|   ML binary   | 1000  |       10       |    2     | tabular | [Kaggle](https://www.kaggle.com/datasets/rhythmcam/ml-binary-classification-study-data) |
|     MNIST     | 10000 |      784       |    10    |  image  | [Yann LeCun](http://yann.lecun.com/exdb/mnist/) |
|   Pendigits   | 8794  |       16       |    10    | tabular | [UCI](https://archive.ics.uci.edu/ml/datasets/pen-based+recognition+of+handwritten+digits) |
|    Retina     | 10000 |       50       |    12    | tabular | [Paper](https://www.cell.com/fulltext/S0092-8674(15)00549-8) |
|   Satimage    | 5148  |       36       |    6     |  image  | [UCI](https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)) |
| Stanford Dogs | 1384  |      512       |    7     |  image  | [Stanford University](http://vision.stanford.edu/aditya86/ImageNetDogs/) |
|    Texture    | 4400  |       40       |    11    |  text   | [KEEL](https://sci2s.ugr.es/keel/dataset.php?cod=72) |
|     USPS      | 7440  |      256       |    10    |  image  | [Kaggle](https://www.kaggle.com/bistaumanga/usps-dataset) |
|   Weathers    |  900  |      512       |    4     |  image  | [Kaggle](https://www.kaggle.com/datasets/vijaygiitk/multiclass-weather-dataset) |
|     WiFi      | 1600  |       7        |    4     | tabular | [UCI](https://archive.ics.uci.edu/ml/datasets/Wireless+Indoor+Localization) |

For image datasets such as Animals, Cifar10, Cats-vs-Dogs, Fish, Food, Stanford Dogs, and Weathers, we use [SimCLR](https://github.com/sthalles/SimCLR) to obtain their 512-dimensional representations.

All datasets are stored in **H5 format** (e.g. `usps.h5`) and must be placed in **`data/H5 Data`**. For image datasets, name the images `0.jpg, 1.jpg, ..., n-1.jpg` and put them in the `static/images/(dataset name)` directory (e.g. `static/images/usps`).
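As a concrete illustration of the expected H5 layout, the sketch below writes and reads a toy dataset. The feature matrix lives under the key `"x"` (this is the key `app.py` reads); storing labels under `"y"` is an assumption here, so adjust it to the repo's actual convention.

```python
# A minimal sketch of the H5 layout, assuming features under "x" (as read by
# app.py) and labels under "y" (an assumption for illustration only).
import os
import tempfile

import h5py
import numpy as np

data_dir = tempfile.mkdtemp()                 # stands in for "data/H5 Data"
path = os.path.join(data_dir, "toy.h5")

# Write a toy dataset: 100 instances, 16 features, 4 classes.
rng = np.random.default_rng(0)
with h5py.File(path, "w") as f:
    f.create_dataset("x", data=rng.normal(size=(100, 16)).astype("float32"))
    f.create_dataset("y", data=rng.integers(0, 4, size=100))

# Read it back the way app.py's load_available_data does for the dataset list.
with h5py.File(path, "r") as f:
    n_samples, dims = np.array(f["x"]).shape

print(n_samples, dims)  # 100 16
```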
## Pre-trained model weights

The pre-trained model weights for all of the above datasets can be found on [Google Drive](https://drive.google.com/drive/folders/19WYgUcOI6cOYSUPK_w1eICSr0ceRK9Zb?usp=sharing).
## Training

To train the model on USPS with a single GPU, check the configuration file `configs/CDR.yaml` and run the following command:

```bash
python train.py --configs configs/CDR.yaml
```
55+
## Config File

The configuration files are under the `./configs` folder; we provide two config files in `.yaml` format. Guidance on several key parameters from the paper is given below.

- **n_neighbors (K):** Determines **the granularity of the local structure** preserved in the low-dimensional space. Too small a value can split one high-dimensional cluster into two low-dimensional clusters, while too large a value aggravates cluster overlap. The default is **K = 15**.
- **batch_size (B):** Determines the number of negative samples. Larger is generally better, but the right value also depends on the data size. We recommend **`B = n/10`**, where `n` is the number of instances.
- **temperature (t):** Determines how strongly the model preserves neighborhoods. The smaller the value, the more strictly neighborhoods are preserved, but more erroneous neighbors are also kept. The default is **t = 0.15**.
- **separate_upper (μ):** Determines the intensity of cluster separation. The larger the value, the higher the degree of cluster separation. The default is **μ = 0.11**.
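The `B = n/10` rule above can be sketched as a tiny helper (the repo itself sets `batch_size` by hand in the YAML configs; the function name here is ours):

```python
def recommended_batch_size(n_instances: int) -> int:
    """Suggested negative-sample batch size B = n / 10 (at least 1)."""
    return max(1, n_instances // 10)

# e.g. USPS has 7440 instances, so the rule suggests B = 744.
print(recommended_batch_size(7440))  # 744
```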
## Load pre-trained model for visualization

To use our pre-trained model, try the following command:

```bash
# python vis.py --configs 'configuration file path' --ckpt_path 'model weights path'

# Example on the USPS dataset
python vis.py --configs configs/CDR.yaml --ckpt_path model_weights/usps.pth.tar
```
75+
## Prototype interface

To use our prototype interface for interactive visual cluster analysis, run the following command.

```bash
python app.py --configs configs/ICDR.yaml
```
83+
After that, the prototype interface is available at [http://127.0.0.1:5000](http://127.0.0.1:5000).

![frontend_07](prototype.png)

[comment]: <> "## Cite"

app.py

Lines changed: 145 additions & 0 deletions

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import argparse
import ast
import os
from datetime import timedelta

import h5py
import numpy as np
from flask import Flask, render_template, request

from experiments.icdr_trainer import ICLPTrainer
from model.cdr import CDRModel
from model.icdr import ICDRModel
from utils.constant_pool import *
from utils.common_utils import get_principle_components, get_config
from utils.link_utils import LinkInfo

app = Flask(__name__)
experimenter: ICLPTrainer
app.config['SEND_FILE_MAX_AGE_DEFAULT'] = timedelta(seconds=1)


def wrap_results(embeddings, principle_comps=None, attr_names=None):
    ret_dict = {"embeddings": embeddings.tolist(), "label": experimenter.get_label()}
    if principle_comps is not None:
        ret_dict["low_data"] = principle_comps.tolist()
        ret_dict["attrs"] = attr_names
    return ret_dict


def build_link_info(embeddings, min_dist):
    # Parse the link constraints submitted by the frontend.
    # ast.literal_eval safely parses the list literals (eval would execute arbitrary code).
    links = np.array(ast.literal_eval(request.form.get("links")))
    link_spreads = np.array(ast.literal_eval(request.form.get("link_spreads")))
    finetune_epochs = request.form.get("finetune_epochs", type=int)

    if links.shape[0] == 0:
        experimenter.link_info = None
        return experimenter.link_info

    if experimenter.link_info is None:
        experimenter.link_info = LinkInfo(links, link_spreads, finetune_epochs, embeddings, min_dist)
    else:
        experimenter.link_info.process_cur_links(links, link_spreads, embeddings)

    return experimenter.link_info


def update_config():
    global configs
    configs.exp_params.dataset = request.form.get("dataset", type=str)
    configs.exp_params.n_neighbors = request.form.get("n_neighbors", type=int)
    configs.training_params.epoch_nums = request.form.get("epoch_nums", type=int)
    configs.exp_params.input_dims = request.form.get("input_dims", type=int)
    configs.exp_params.split_upper = request.form.get("split_upper", type=float)
    # Recommended rule B = n / 10 (see README).
    configs.exp_params.batch_size = request.form.get("n_samples", type=int) // 10


def load_experiment(cfg):
    method_name = CDR_METHOD if cfg.exp_params.gradient_redefine else NX_CDR_METHOD
    result_save_dir = ConfigInfo.RESULT_SAVE_DIR.format(method_name, cfg.exp_params.n_neighbors)
    # Create the ICDR model
    clr_model = ICDRModel(cfg, device=device)
    global experimenter
    experimenter = ICLPTrainer(clr_model, cfg.exp_params.dataset, cfg, result_save_dir, None, device=device)


@app.route("/")
def index():
    return render_template("index.html")


@app.route("/load_dataset_list")
def load_dataset_list():
    data = []
    for item in ConfigInfo.AVAILABLE_DATASETS:
        data_obj = {}
        for i, k in enumerate(ConfigInfo.DATASETS_META):
            data_obj[k] = item[i]
        data.append(data_obj)

    return {"data": data}


@app.route("/train_for_vis", methods=["POST"])
def train_for_vis():
    update_config()
    load_experiment(configs)

    embeddings = experimenter.train_for_visualize()
    principle_comps, attr_names = get_principle_components(experimenter.dataset.data, attr_names=None)
    ret_dict = wrap_results(embeddings, principle_comps, attr_names)
    return ret_dict


@app.route("/constraint_resume", methods=["POST"])
def constraint_resume():
    update_config()
    link_info = build_link_info(experimenter.pre_embeddings, experimenter.configs.exp_params.min_dist)
    ft_epoch = request.form.get("finetune_epochs", type=int)

    ml_strength = request.form.get("ml_strength", type=float)
    cl_strength = request.form.get("cl_strength", type=float)
    experimenter.update_link_stat(link_info, is_finetune=True, finetune_epoch=ft_epoch)

    if link_info is not None:
        experimenter.model.link_stat_update(ft_epoch, experimenter.steady_epoch, ml_strength, cl_strength)

    embeddings = experimenter.resume_train(ft_epoch)
    return wrap_results(embeddings)


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--configs", type=str, default="configs/ICDR.yaml", help="configuration file path")
    parser.add_argument("--device", type=str, default="cpu")
    return parser.parse_args()


def load_available_data():
    for item in os.listdir(ConfigInfo.DATASET_CACHE_DIR):
        ds = item.split(".")[0]
        n_samples, dims = np.array(h5py.File(os.path.join(ConfigInfo.DATASET_CACHE_DIR, item), "r")['x']).shape
        ds_type = "image" if os.path.exists(os.path.join(ConfigInfo.IMAGE_DIR, ds)) else "tabular"
        ConfigInfo.AVAILABLE_DATASETS.append([ds, n_samples, dims, ds_type])


if __name__ == '__main__':
    # Use [[ ... ]] as Jinja variable delimiters so templates don't clash
    # with the frontend framework's {{ ... }} syntax.
    app.jinja_env.variable_start_string = '[['
    app.jinja_env.variable_end_string = ']]'

    args = parse_args()
    device = args.device
    config_path = args.configs
    configs = get_config()
    configs.merge_from_file(config_path)
    load_available_data()
    load_experiment(configs)
    app.run(debug=False)
```
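The `/load_dataset_list` route above pairs the metadata field names with each row of `AVAILABLE_DATASETS`. A standalone sketch of that pairing (the field names mirror what `load_available_data` appends, `[name, n_samples, dims, type]`; the real `DATASETS_META` lives in `utils.constant_pool`, so the exact keys here are an assumption):

```python
# Hypothetical stand-ins for ConfigInfo.DATASETS_META / AVAILABLE_DATASETS.
DATASETS_META = ["name", "n_samples", "dims", "type"]
AVAILABLE_DATASETS = [
    ["usps", 7440, 256, "image"],
    ["wifi", 1600, 7, "tabular"],
]

def dataset_list(meta, rows):
    # dict(zip(...)) is an idiomatic shortcut for the index loop in app.py.
    return {"data": [dict(zip(meta, row)) for row in rows]}

print(dataset_list(DATASETS_META, AVAILABLE_DATASETS)["data"][0]["name"])  # usps
```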

configs/CDR.yaml

Lines changed: 19 additions & 0 deletions

```yaml
exp_params:
  dataset: "usps"
  input_dims: 256  # 16 x 16 x 1 grayscale images, flattened
  LR: 0.001
  batch_size: 512
  n_neighbors: 15
  optimizer: "adam"  # adam or sgd
  scheduler: "multi_step"  # cosine or multi_step or on_plateau
  temperature: 0.15
  gradient_redefine: True
  separate_upper: 0.1
  separation_begin_ratio: 0.25
  steady_begin_ratio: 0.875

training_params:
  epoch_nums: 1000
  epoch_print_inter_ratio: 0.1
  val_inter_ratio: 1
  ckp_inter_ratio: 1
```

configs/ICDR.yaml

Lines changed: 20 additions & 0 deletions

```yaml
exp_params:
  dataset: "wifi"
  input_dims: 7  # 7 WiFi signal-strength features
  LR: 0.001
  batch_size: 128
  n_neighbors: 15
  optimizer: "adam"  # adam or sgd
  scheduler: "multi_step"  # cosine or multi_step or on_plateau
  temperature: 0.15
  min_dist: 0.1
  separate_upper: 0.11
  gradient_redefine: True
  separation_begin_ratio: 0.25
  steady_begin_ratio: 0.875

training_params:
  epoch_nums: 1000
  epoch_print_inter_ratio: 0.1
  val_inter_ratio: 0.5
  ckp_inter_ratio: 1
```

data/H5 Data/texture.h5

1.72 MB (binary file not shown)

data/H5 Data/usps.h5

14 MB (binary file not shown)

data/H5 Data/wifi.h5

121 KB (binary file not shown)
