Mantas Mazeika
9 papers, 4,305 total citations
Papers (9)
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (ICLR 2025; arXiv; 2,226 citations)
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (ICML 2024; arXiv; 802 citations)
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (NeurIPS 2023; arXiv; 571 citations)
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning (ICML 2024; arXiv; 333 citations)
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures (CVPR 2022; arXiv; 174 citations)
Tamper-Resistant Safeguards for Open-Weight LLMs (ICLR 2025; arXiv; 113 citations)
Forecasting Future World Events With Neural Networks (NeurIPS 2022; arXiv; 39 citations)
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (NeurIPS 2025; arXiv; 36 citations)
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios (NeurIPS 2022; arXiv; 11 citations)