RoBERTa with Sample Reweighting and Temperature Scaling for Imbalanced Toxicity Detection: A Performance–Fairness–Calibration Study
Abstract
Detecting toxic language at scale requires models that are not only accurate but also robust to demographic subgroup bias and reliable in their probability estimates; however, these objectives can conflict, especially under severe class imbalance. This study investigates the performance–fairness–calibration interplay in toxicity detection using the Jigsaw Unintended Bias dataset (124,858 comments; 5.99% toxic; identity annotations in 9.39% of samples). We aim to quantify how sample reweighting and imbalance-aware training affect global discrimination, worst-subgroup behavior, and probabilistic calibration, and to assess post-hoc temperature scaling of predicted probabilities. We compare a TF-IDF + logistic regression baseline against RoBERTa variants trained without mitigation, with sample reweighting, and with an imbalance-oriented loss, using multi-metric evaluation (AUC, worst-subgroup AUC, ECE, and NLL). RoBERTa consistently improves global AUC over the baseline (≈0.96 vs 0.9155), while worst-subgroup AUC remains substantially lower and varies only modestly across RoBERTa variants (≈0.7726–0.7813). Calibration results indicate a marked gap between models: the baseline achieves the lowest ECE (0.0052), whereas RoBERTa exhibits higher ECE (≈0.0257) that increases further under reweighting and imbalance-oriented training (≈0.0490–0.0866), with NLL not improving consistently. These findings provide empirical evidence that fairness-oriented interventions can shift error and calibration profiles, motivating holistic evaluation and methods that jointly constrain subgroup fairness and probabilistic reliability.
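The two calibration tools named in the abstract, expected calibration error (ECE) and post-hoc temperature scaling, can be sketched in a few lines. The sketch below is illustrative only and is not the authors' implementation: it uses equal-width confidence bins for ECE and a simple grid search over the temperature that minimizes binary NLL, whereas the paper does not specify its binning scheme or optimizer.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins: the bin-weighted average
    gap between mean predicted probability and empirical toxicity rate."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Post-hoc temperature scaling: pick T minimizing NLL of
    sigmoid(logits / T) on held-out data. T > 1 softens an
    overconfident model; the decision ranking (AUC) is unchanged."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        p = 1.0 / (1.0 + np.exp(-logits / t))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        nll = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

Because dividing logits by a positive constant is monotone, temperature scaling leaves AUC (global and per-subgroup) untouched; it can only move ECE and NLL, which is why the abstract reports it separately from the discrimination metrics.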
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.