This repository contains the data and results of the baseline classifiers for the NLBSE’25 tool competition on code comment classification.
The competition participants must use the provided data to train/test their classifiers, which should outperform the baselines.
Details on how to participate in the competition are available in our Google Colab notebook.
Since you will be using our dataset (and possibly one of our notebooks) as well as the original work behind the dataset, please cite the following references in your paper:
```bibtex
@inproceedings{nlbse2025,
  author    = {Al-Kaswan, Ali and Colavito, Giuseppe and Stulova, Nataliia and Rani, Pooja},
  title     = {The NLBSE'25 Tool Competition},
  booktitle = {Proceedings of The 3rd International Workshop on Natural Language-based Software Engineering (NLBSE'25)},
  year      = {2025}
}

@article{rani2021,
  title     = {How to identify class comment types? A multi-language approach for class comment classification},
  author    = {Rani, Pooja and Panichella, Sebastiano and Leuenberger, Manuel and Di Sorbo, Andrea and Nierstrasz, Oscar},
  journal   = {Journal of Systems and Software},
  volume    = {181},
  pages     = {111047},
  year      = {2021},
  publisher = {Elsevier}
}

@inproceedings{pascarella2017,
  title        = {Classifying code comments in Java open-source software systems},
  author       = {Pascarella, Luca and Bacchelli, Alberto},
  booktitle    = {2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)},
  year         = {2017},
  organization = {IEEE}
}

@inproceedings{alkaswan2023stacc,
  title        = {STACC: Code Comment Classification using SentenceTransformers},
  author       = {Al-Kaswan, Ali and Izadi, Maliheh and van Deursen, Arie},
  booktitle    = {2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)},
  pages        = {28--31},
  year         = {2023},
  organization = {IEEE}
}
```

We provide a HuggingFace Dataset with the competition data. The dataset has six splits: a train and a test split for each of the three languages. Each row represents a sentence (a.k.a. an instance) and has five columns:
- `class` is the class name referring to the source code file the sentence comes from;
- `comment_sentence` is the actual sentence string, which is part of a (multi-line) class comment;
- `partition` is the dataset split: `0` identifies training instances and `1` identifies testing instances;
- `combo` is the class name appended to the sentence string, used to train the baselines;
- `labels` is the ground truth: a binary list indicating which categories the sample belongs to. Each sample belongs to one or more categories.
The list of labels is defined for each language as follows:

- java: `[summary, Ownership, Expand, usage, Pointer, deprecation, rational]`
- python: `[Usage, Parameters, DevelopmentNotes, Expand, Summary]`
- pharo: `[Keyimplementationpoints, Example, Responsibilities, Classreferences, Intent, Keymessages, Collaborators]`
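As an illustration, the data can be loaded with the `datasets` library. The dataset ID and split names below are assumptions, so check the HuggingFace Hub page for the exact identifiers:

```python
# Sketch: load the competition data with HuggingFace `datasets`.
# The dataset ID and split names are assumptions; check the Hub page.
from datasets import load_dataset

ds = load_dataset("NLBSE/nlbse25-code-comment-classification")  # assumed ID
java_train = ds["java_train"]  # assumed split name (one train/test pair per language)
print(java_train.column_names)  # class, comment_sentence, partition, combo, labels
print(java_train[0]["labels"])  # binary list over the seven Java categories
```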
- Preprocessing. Before splitting, the manually tagged class comments were preprocessed as follows (a regex sketch follows this list):
  - changed the sentences to lowercase, reduced multiple line endings to one, and removed special characters except for `a-z0-9,.@#&^%!? \n`, since the same symbol can carry different meanings in different languages. For example, `$`, `:`, `{}`, and `!!` are markup symbols in Pharo, while Java uses `/* */` and `<p>`, and Python uses `#`. For simplicity, we removed all such special characters.
  - replaced periods in numbers and in Latin abbreviations such as `e.g.`, `i.e.`, and `etc.`, so that comment sentences are not split incorrectly.
  - removed extra whitespace before and after comments or lines.
- Splitting sentences (a simplified splitter is sketched below):
  - Since the classification is sentence-based, we split the comments into sentences.
  - We use the NEON tool to split the text into sentences. It splits on selected characters (`\n`, `:`); this is another reason to remove some of the special characters, to avoid unnecessary splitting.
  - Note: the sentences may not be complete. Sometimes the annotators classified only a relevant phrase of a sentence into a category.
- Partition selection (a simplified sketch follows this list):
  - After splitting comments into sentences, we split the sentence dataset in an 80/20 training/testing split.
  - The partitions are determined by an algorithm that first determines the stratum of each class comment. The original paper by Rani et al. gives more details on the strata distribution.
  - Then, we follow a round-robin approach to fill the training and testing partitions from the strata: we select a stratum, select the category with the minimum number of instances in it (to achieve the best balance), and assign it to the train or test partition based on the required proportions.
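For illustration, here is a regex-based approximation of the preprocessing steps; this is not the exact competition pipeline, and the replacement chosen for the protected periods is an assumption:

```python
import re

def preprocess(comment: str) -> str:
    """Approximate the preprocessing described above (illustrative only)."""
    text = comment.lower()             # lowercase
    text = re.sub(r"\n+", "\n", text)  # collapse multiple line endings
    # Protect periods in Latin abbreviations and numbers so that sentence
    # splitting does not break on them (dropping the period is an assumption).
    for abbr, repl in (("e.g.", "eg"), ("i.e.", "ie"), ("etc.", "etc")):
        text = text.replace(abbr, repl)
    text = re.sub(r"(?<=\d)\.(?=\d)", "", text)  # periods inside numbers
    # Keep only the allowed character set: a-z0-9,.@#&^%!? plus space and newline.
    text = re.sub(r"[^a-z0-9,.@#&^%!? \n]", "", text)
    # Trim extra whitespace around each line.
    return "\n".join(line.strip() for line in text.split("\n")).strip()
```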
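A rough approximation of the NEON splitting rule follows; the real tool does substantially more NLP, and this only illustrates splitting on the named characters:

```python
import re

def split_sentences(comment: str) -> list[str]:
    """Split a comment on newlines and colons (simplified NEON-style rule)."""
    return [part.strip() for part in re.split(r"[\n:]", comment) if part.strip()]

# Example: one multi-line comment yields two sentences.
print(split_sentences("returns the parsed tree\nsee also the tokenizer"))
```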
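And a simplified sketch of the round-robin partitioning idea, assuming each sentence is already tagged with a stratum and a category; the full algorithm described by Rani et al. does more bookkeeping than this:

```python
from collections import Counter, defaultdict

def round_robin_split(instances, train_ratio=0.8):
    """instances: dicts with 'stratum' and 'category' keys (assumed shape)."""
    by_stratum = defaultdict(list)
    for inst in instances:
        by_stratum[inst["stratum"]].append(inst)

    train, test = [], []
    for stratum, pool in sorted(by_stratum.items()):  # round-robin over strata
        counts = Counter(inst["category"] for inst in pool)
        # Handle the rarest categories first for the best balance.
        pool.sort(key=lambda inst: counts[inst["category"]])
        for inst in pool:
            seen = len(train) + len(test)
            # Fill whichever partition is below its target proportion.
            if seen == 0 or len(train) / seen < train_ratio:
                train.append(inst)
            else:
                test.append(inst)
    return train, test
```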
We extracted the class comments from selected projects into a joint dataset available on Zenodo.
| Language | Project | Project Homepage |
|---|---|---|
| Java | Eclipse | github.com/eclipse |
| Java | Guava | github.com/google/guava |
| Java | Guice | github.com/google/guice |
| Java | Hadoop | github.com/apache/hadoop |
| Java | Spark | github.com/apache/spark |
| Java | Vaadin | github.com/vaadin/framework |
| Pharo | GToolkit | github.com/feenkcom/gtoolkit |
| Pharo | Moose | github.com/moosetechnology/Moose |
| Pharo | PetitParser | github.com/moosetechnology/PetitParser |
| Pharo | Pillar | github.com/pillar-markup/pillar |
| Pharo | PolyMath | github.com/PolyMathOrg/PolyMath |
| Pharo | Roassal2 | github.com/ObjectProfile/Roassal2 |
| Pharo | Seaside | github.com/SeasideSt/Seaside |
| Python | Django | github.com/django |
| Python | IPython | github.com/ipython/ipython |
| Python | Mailpile | github.com/mailpile/Mailpile |
| Python | Pandas | github.com/pandas-dev/pandas |
| Python | Pipenv | github.com/pypa/pipenv |
| Python | Pytorch | github.com/pytorch/pytorch |
| Python | Requests | github.com/psf/requests/ |
The baseline results per language and category:

| # | Language | Category | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | java | summary | 0.873385 | 0.829448 | 0.85085 |
| 1 | java | Ownership | 1 | 1 | 1 |
| 2 | java | Expand | 0.323529 | 0.444444 | 0.374468 |
| 3 | java | usage | 0.911043 | 0.818182 | 0.862119 |
| 4 | java | Pointer | 0.738255 | 0.940171 | 0.827068 |
| 5 | java | deprecation | 0.818182 | 0.6 | 0.692308 |
| 6 | java | rational | 0.162162 | 0.295082 | 0.209302 |
| 7 | python | Usage | 0.700787 | 0.735537 | 0.717742 |
| 8 | python | Parameters | 0.793893 | 0.8125 | 0.803089 |
| 9 | python | DevelopmentNotes | 0.243902 | 0.487805 | 0.325203 |
| 10 | python | Expand | 0.433628 | 0.765625 | 0.553672 |
| 11 | python | Summary | 0.648649 | 0.585366 | 0.615385 |
| 12 | pharo | Keyimplementationpoints | 0.636364 | 0.651163 | 0.643678 |
| 13 | pharo | Example | 0.872881 | 0.903509 | 0.887931 |
| 14 | pharo | Responsibilities | 0.596154 | 0.596154 | 0.596154 |
| 15 | pharo | Classreferences | 0.2 | 0.5 | 0.285714 |
| 16 | pharo | Intent | 0.71875 | 0.766667 | 0.741935 |
| 17 | pharo | Keymessages | 0.68 | 0.790698 | 0.731183 |
| 18 | pharo | Collaborators | 0.26087 | 0.6 | 0.363636 |
We trained and tested three multi-label classifiers (one per language), based on the STACC approach of Al-Kaswan et al., on the provided training and test sets. The models are available on the HuggingFace Hub.
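Assuming the baselines are published as SetFit checkpoints (STACC is built on SentenceTransformers), inference might look like the sketch below; the model ID is hypothetical:

```python
# Sketch: run a published baseline, assuming SetFit checkpoints on the Hub.
# The model ID below is hypothetical; see the Hub for the real checkpoints.
from setfit import SetFitModel

model = SetFitModel.from_pretrained("NLBSE/nlbse25-java")   # assumed ID
preds = model.predict(["parser returns the parsed tree"])   # 'combo'-style input
print(preds)  # binary vector over the seven Java categories
```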
The summary of the baseline results is provided in `baseline_results_summary.csv`.
We provide a notebook to train our baseline classifiers and to run the evaluations, both locally and on Google Colab (note that the final evaluation and score calculation must be performed on Google Colab).
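The authoritative score calculation lives in the Colab notebook; purely as a sanity check, per-category precision/recall/F1 like the table above can be computed locally with scikit-learn, assuming predictions and ground truth are binary matrices:

```python
# Sketch: per-category scoring, assuming y_true and y_pred are binary
# matrices of shape (n_sentences, n_categories). Toy data shown here;
# the official scoring is defined in the competition's Colab notebook.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

java_categories = ["summary", "Ownership", "Expand", "usage",
                   "Pointer", "deprecation", "rational"]

y_true = np.array([[1, 0, 0, 1, 0, 0, 0],   # toy ground truth
                   [0, 0, 1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0, 0],   # toy predictions
                   [0, 0, 1, 0, 0, 0, 0]])

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
for cat, pi, ri, fi in zip(java_categories, p, r, f1):
    print(f"{cat}: precision={pi:.3f} recall={ri:.3f} f1={fi:.3f}")
```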