NLBSE'25 Tool Competition: Code Comment Classification

This repository contains the data and results of the baseline classifiers for the NLBSE’25 tool competition on code comment classification.

Competition participants must use the provided data to train and test their classifiers, which should outperform the baselines.

Details on how to participate in the competition are available in our Google Colab notebook.

Citing Related Work

Since you will be using our dataset (and possibly one of our notebooks) as well as the original work behind the dataset, please cite the following references in your paper:

@inproceedings{nlbse2025,
  author={Al-Kaswan, Ali and Colavito, Giuseppe and Stulova, Nataliia and Rani, Pooja},
  title={The NLBSE'25 Tool Competition},
  booktitle={Proceedings of The 3rd International Workshop on Natural Language-based Software Engineering (NLBSE'25)},
  year={2025}
}
@article{rani2021,
  title={How to identify class comment types? A multi-language approach for class comment classification},
  author={Rani, Pooja and Panichella, Sebastiano and Leuenberger, Manuel and Di Sorbo, Andrea and Nierstrasz, Oscar},
  journal={Journal of Systems and Software},
  volume={181},
  pages={111047},
  year={2021},
  publisher={Elsevier}
}
@inproceedings{pascarella2017,
  title={Classifying code comments in Java open-source software systems},
  author={Pascarella, Luca and Bacchelli, Alberto},
  booktitle={2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)},
  year={2017},
  organization={IEEE}
}
@inproceedings{alkaswan2023stacc,
  title={STACC: Code Comment Classification using SentenceTransformers},
  author={Al-Kaswan, Ali and Izadi, Maliheh and Van Deursen, Arie},
  booktitle={2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)},
  pages={28--31},
  year={2023},
  organization={IEEE}
}

Data for Classification

We provide a Hugging Face dataset with the competition data. The dataset has six splits, two (train and test) per language. Each row represents a sentence (a.k.a. an instance) and has the following five columns:

  • class is the class name referring to the source code file where the sentence comes from;
  • comment_sentence is the actual sentence string, which is part of a (multi-line) class comment;
  • partition is the dataset split into training and testing; 0 identifies training instances and 1 identifies testing instances;
  • combo is the class name appended to the sentence string, used to train the baselines;
  • labels is the ground-truth annotation: a binary list indicating which categories the sentence belongs to. Each sentence belongs to one or more categories.
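
For orientation, here is a minimal sketch of loading the data with the Hugging Face datasets library; the dataset identifier and split names below are assumptions, so substitute the ones given in the competition notebook:

```python
# A minimal sketch of loading the competition data with the Hugging Face
# `datasets` library. The dataset identifier and split names below are
# assumptions; use the ones given in the competition notebook.
from datasets import load_dataset

ds = load_dataset("NLBSE/nlbse25-code-comment-classification")  # hypothetical ID

train = ds["java_train"]  # assumed split naming: <language>_<train|test>
row = train[0]
print(row["class"], row["combo"], row["labels"])
```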

The list of labels is defined for each language as follows:

  • java: [summary, Ownership, Expand, usage, Pointer, deprecation, rational],
  • python: [Usage, Parameters, DevelopmentNotes, Expand, Summary],
  • pharo: [Keyimplementationpoints, Example, Responsibilities, Classreferences, Intent, Keymessages, Collaborators]
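
As a small illustration of the binary label lists, the following helper (ours, not part of the dataset) maps a Java label vector back to category names:

```python
# Illustrative helper (not part of the dataset): decode a binary label
# vector into category names, using the Java label order listed above.
JAVA_LABELS = ["summary", "Ownership", "Expand", "usage",
               "Pointer", "deprecation", "rational"]

def decode(labels):
    """Return the names of the categories whose flag is 1."""
    return [name for name, flag in zip(JAVA_LABELS, labels) if flag]

print(decode([1, 0, 0, 1, 0, 0, 0]))  # ['summary', 'usage']
```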

Dataset Preparation

  • Preprocessing. Before splitting, the manually tagged class comments were preprocessed as follows:

    • changed the sentences to lowercase, reduced multiple line endings to one, and removed special characters except a-z0-9,.@#&^%!? \n, since the same symbols can carry different meanings in different languages. For example, $,:{}!! are markup symbols in Pharo, /* */ and <p> in Java, and # in Python. For simplicity, we removed all such special characters (see the sketch after this list).
    • replaced periods in numbers and in Latin abbreviations such as e.g., i.e., and etc., so that comment sentences are not split incorrectly.
    • removed extra whitespace before and after comments or lines.
  • Splitting sentences:

    • Since the classification is sentence-based, we split the comments into sentences.
    • We use the NEON tool to split the text into sentences. It splits text on selected characters (\n|:). This is another reason to remove some of the special characters: it avoids unnecessary splitting.
    • Note: the sentences may not be complete; sometimes the annotators classified only a relevant phrase of a sentence into a category.
  • Partition selection:

    • After splitting comments into sentences, we partitioned the sentence dataset into an 80/20 training/testing split.
    • The partitions are determined by an algorithm that first assigns each class comment to a stratum. The original paper by Rani et al. gives more details on the strata distribution.
    • Then, we follow a round-robin approach to fill the training and testing partitions from the strata: we select a stratum, pick the category with the minimum number of instances in it (to achieve the best balance), and assign it to the training or testing partition according to the required proportions.
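
The sketch below illustrates the preprocessing and splitting steps. The exact regular expressions are our own approximation rather than the competition's code, and the splitter only mimics NEON's behavior on the selected characters:

```python
# Illustrative approximation of the preprocessing and sentence splitting
# described above; NEON itself is a separate tool, and these regexes are
# assumptions, not the competition's exact implementation.
import re

def preprocess(text: str) -> str:
    """Lowercase, protect abbreviation/number periods, drop special chars."""
    text = text.lower()
    # Drop periods in Latin abbreviations so sentences aren't split on them.
    for abbrev in ("e.g.", "i.e.", "etc."):
        text = text.replace(abbrev, abbrev.replace(".", ""))
    # Drop periods inside numbers for the same reason.
    text = re.sub(r"(?<=\d)\.(?=\d)", "", text)
    # Keep only the allowed characters listed above.
    text = re.sub(r"[^a-z0-9,.@#&^%!? \n]", "", text)
    # Reduce multiple line endings to one and trim surrounding whitespace.
    return re.sub(r"\n+", "\n", text).strip()

def split_sentences(comment: str) -> list[str]:
    """Mimic NEON's splitting on the selected characters (\\n and :)."""
    sentences = [preprocess(p) for p in re.split(r"[\n:]", comment)]
    return [s for s in sentences if s]
```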

Software Projects

We extracted the class comments from selected projects into a joint dataset available on Zenodo.

Language  Project      Project Homepage
Java      Eclipse      github.com/eclipse
Java      Guava        github.com/google/guava
Java      Guice        github.com/google/guice
Java      Hadoop       github.com/apache/hadoop
Java      Spark        github.com/apache/spark
Java      Vaadin       github.com/vaadin/framework
Pharo     GToolkit     github.com/feenkcom/gtoolkit
Pharo     Moose        github.com/moosetechnology/Moose
Pharo     PetitParser  github.com/moosetechnology/PetitParser
Pharo     Pillar       github.com/pillar-markup/pillar
Pharo     PolyMath     github.com/PolyMathOrg/PolyMath
Pharo     Roassal2     github.com/ObjectProfile/Roassal2
Pharo     Seaside      github.com/SeasideSt/Seaside
Python    Django       github.com/django
Python    IPython      github.com/ipython/ipython
Python    Mailpile     github.com/mailpile/Mailpile
Python    Pandas       github.com/pandas-dev/pandas
Python    Pipenv       github.com/pypa/pipenv
Python    Pytorch      github.com/pytorch/pytorch
Python    Requests     github.com/psf/requests/

Baseline Results

Language  Category                 Precision  Recall    F1
java      summary                  0.873385   0.829448  0.85085
java      Ownership                1          1         1
java      Expand                   0.323529   0.444444  0.374468
java      usage                    0.911043   0.818182  0.862119
java      Pointer                  0.738255   0.940171  0.827068
java      deprecation              0.818182   0.6       0.692308
java      rational                 0.162162   0.295082  0.209302
python    Usage                    0.700787   0.735537  0.717742
python    Parameters               0.793893   0.8125    0.803089
python    DevelopmentNotes         0.243902   0.487805  0.325203
python    Expand                   0.433628   0.765625  0.553672
python    Summary                  0.648649   0.585366  0.615385
pharo     Keyimplementationpoints  0.636364   0.651163  0.643678
pharo     Example                  0.872881   0.903509  0.887931
pharo     Responsibilities         0.596154   0.596154  0.596154
pharo     Classreferences          0.2        0.5       0.285714
pharo     Intent                   0.71875    0.766667  0.741935
pharo     Keymessages              0.68       0.790698  0.731183
pharo     Collaborators            0.26087    0.6       0.363636

We trained and tested three multi-class classifiers (one for each language), based on the approach of Al-Kaswan et al., on the provided training and test sets. The models are available on the Hugging Face Hub.
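
For reference, here is a minimal sketch of SetFit-style fine-tuning in the spirit of the STACC baseline; the dataset identifier, base model, split names, and hyperparameters are illustrative assumptions (requiring setfit >= 1.0), not the competition's exact configuration:

```python
# Illustrative SetFit-style fine-tuning in the spirit of STACC
# (Al-Kaswan et al.); dataset ID, base model, and hyperparameters are
# assumptions. See the competition notebook for the real configuration.
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments

ds = load_dataset("NLBSE/nlbse25-code-comment-classification")  # hypothetical ID

model = SetFitModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",  # illustrative base model
    multi_target_strategy="one-vs-rest",       # one binary head per category
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=ds["java_train"],                       # assumed split name
    column_mapping={"combo": "text", "labels": "label"},  # baselines train on combo
)
trainer.train()
predictions = model.predict(ds["java_test"]["combo"])     # assumed split name
```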

The summary of the baseline results is provided in baseline_results_summary.csv.
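
Per-category scores like those in the table above can be reproduced with scikit-learn; a toy sketch (not the official scoring script, which runs in the Colab notebook):

```python
# Toy illustration of per-category precision/recall/F1 as in the table
# above; not the official scoring script.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

JAVA_LABELS = ["summary", "Ownership", "Expand", "usage",
               "Pointer", "deprecation", "rational"]

y_true = np.array([[1, 0, 0, 1, 0, 0, 0],   # toy ground truth
                   [0, 0, 1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0, 0],   # toy predictions
                   [0, 0, 1, 0, 0, 0, 1]])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)
for category, p, r, f in zip(JAVA_LABELS, precision, recall, f1):
    print(f"{category}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```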

We provide a notebook to train our baseline classifiers and to run the evaluations, both locally and on Google Colab (note that the final evaluation and score calculation need to be performed on Google Colab).
