RoBERTa with Sample Reweighting and Temperature Scaling for Imbalanced Toxicity Detection: A Performance–Fairness–Calibration Study
Abstract
Detecting toxic language at scale requires models that are not only accurate but also robust to demographic subgroup bias and reliable in their probability estimates; however, these objectives can conflict, especially under severe class imbalance. This study investigates the performance–fairness–calibration interplay in toxicity detection using the Jigsaw Unintended Bias dataset (124,858 comments; 5.99% toxic; identity annotations in 9.39% of samples). We aim to quantify how sample reweighting and imbalance-aware training affect global discrimination, worst-subgroup behavior, and probabilistic calibration, and to assess post-hoc temperature scaling of predicted probabilities. We compare a TF-IDF + logistic regression baseline against RoBERTa variants trained without mitigation, with sample reweighting, and with an imbalance-oriented loss, using multi-metric evaluation (AUC, worst-subgroup AUC, ECE, and NLL). RoBERTa consistently improves global AUC over the baseline (≈0.96 vs 0.9155), while worst-subgroup AUC remains substantially lower and varies only modestly across RoBERTa variants (≈0.7726–0.7813). Calibration results indicate a marked gap between models: the baseline achieves the lowest ECE (0.0052), whereas RoBERTa exhibits higher ECE (≈0.0257) that increases further under reweighting and imbalance-oriented training (≈0.0490–0.0866), with NLL not improving consistently. These findings provide empirical evidence that fairness-oriented interventions can shift error and calibration profiles, motivating holistic evaluation and methods that jointly constrain subgroup fairness and probabilistic reliability.
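The two calibration tools named in the abstract, expected calibration error (ECE) and post-hoc temperature scaling, can be sketched in a few lines. The sketch below is illustrative only and is not the authors' implementation: it uses equal-width confidence bins for ECE and a simple grid search over the temperature that minimizes binary NLL, whereas the paper does not specify its binning scheme or optimizer.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins: the bin-weighted average
    gap between mean predicted probability and empirical toxicity rate."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Post-hoc temperature scaling: pick T minimizing NLL of
    sigmoid(logits / T) on held-out data. T > 1 softens an
    overconfident model; the decision ranking (AUC) is unchanged."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        p = 1.0 / (1.0 + np.exp(-logits / t))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        nll = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

Because dividing logits by a positive constant is monotone, temperature scaling leaves AUC (global and per-subgroup) untouched; it can only move ECE and NLL, which is why the abstract reports it separately from the discrimination metrics.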
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.