Xiaoqing "Ellen" Tan

About Me

I am a Research Scientist at Meta (GenAI Llama Research; previously FAIR). My work spans data, scaling, fine-tuning, and evaluation for large language models and vision–language models. My recent research interests center on data, particularly the principles of large-scale data curation, scaling, and their evaluation.

I have had the opportunity to work as a core contributor on several major open-source projects, including Llama 2/3/4, Code Llama, MetaCLIP 1/1.2, and Chameleon. I have also published first-author papers at top conferences such as NeurIPS, ICML, EMNLP, and ACL. Our work ROBBIE was featured at FAIR’s 10-year anniversary.

Before joining Meta, I obtained my Ph.D. in Biostatistics at the University of Pittsburgh in 2022, advised by Lu Tang and Gong Tang. My doctoral research focused on developing novel statistical and machine learning methods for causal inference, data integration, and decision fairness. I was a visiting student in Computer Science at Carnegie Mellon University from 2019 to 2021. I obtained my B.S. in Pharmacy and Computer Science at Sun Yat-sen University in 2018.

Major Technical Reports

∗: core contributor

The Llama 3 Herd of Models
Llama-Team, Dubey, A., Jauhri, A., Pandey, A., ..., Tan, X.∗, ..., et al.
arXiv preprint arXiv:2407.21783 2024
[Paper] [Code]

Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon-Team, ..., Tan, X.∗, ..., et al.
arXiv preprint arXiv:2405.09818 2024
[Paper] [Code]

Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama-Team, Touvron, H., Martin, L., Stone, K., ..., Tan, X.∗, Tang, B., ..., et al.
arXiv preprint arXiv:2307.09288 2023
[Paper] [Code]

Code Llama: Open Foundation Models for Code
Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.
arXiv preprint arXiv:2308.12950 2023
[Paper] [Code]

Featured Research Papers

+: co-first author; ∗: core contributor

Towards Massive Multilingual Holistic Bias
Tan, X., Hansanti, P., Turkatenko, A., Chuang, J., Wood, C., Yu, B., Ropers, C., Costa-jussà, M. R.
Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP) at the Association for Computational Linguistics (ACL) 2025
[Paper] [Code]

Federated Learning of Robust Individualized Decision Rules with Application to Heterogeneous Multihospital Sepsis Population
Chen, X., Talisa, V. B., Tan, X., Qi, Z., Kennedy, J. N., Chang, C.-C. H., Seymour, C. W., Tang, L.
The Annals of Applied Statistics 2025
[Paper]

Double Machine Learning Methods for Estimating Average Treatment Effects: A Comparative Study
Tan, X., Yang, S., Ye, W., Faries, D. E., Lipkovich, I., Kadziola, Z.
Journal of Biopharmaceutical Statistics 2025
[Paper] [Code]

Efficient Tool Use with Chain-of-Abstraction Reasoning
Gao, S., Dwivedi-Yu, J., Yu, P., Tan, X., Pasunuru, R., Golovneva, O., Sinha, K., Celikyilmaz, A., Bosselut, A., Wang, T.
Proceedings of the 31st International Conference on Computational Linguistics (COLING) 2025
[Paper]

A General Framework for Robust Individualized Decision Learning with Sensitive Variables
Tan, X., Qi, Z., Tang, L., Cohen, M. C.
SSRN 2024
[Paper]

Altogether: Image Captioning via Re-aligning Alt-text
Xu, H., Huang, P.-Y., Tan, X., Yeh, C.-F., Kahn, J., Jou, C., Ghosh, G., Levy, O., Zettlemoyer, L., Yih, W.-t., Li, S.-W., Xie, S., Feichtenhofer, C.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2024
[Paper] [Code]

Demystifying CLIP Data (MetaCLIP)
Xu, H., Xie, S., Tan, X., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.
The Twelfth International Conference on Learning Representations (ICLR) 2024
** ICLR Spotlight (top 5%)
[Paper] [Code]

An Introduction to Vision-Language Modeling
Bordes, F., Pang, R. Y., Ajay, A., Li, A. C., ..., Tan, X., ..., et al.
arXiv preprint arXiv:2405.17247 2024
[Paper]

ROBBIE: Robust Bias Evaluation of Large Generative Language Models
Esiobu, D.+, Tan, X.+, Hosseini, S.+, Ung, M., Zhang, Y., Fernandes, J., Dwivedi-Yu, J., Presani, E., Williams, A., Smith, E. M.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023
[Paper] [Code for eval framework] [AdvPromptSet benchmark]

Shepherd: A Critic for Language Model Generation
Wang, T., Yu, P., Tan, X.∗, O’Brien, S., Pasunuru, R., Dwivedi-Yu, J., Golovneva, O., Zettlemoyer, L., Fazel-Zarandi, M., Celikyilmaz, A.
arXiv preprint arXiv:2308.04592 2023
[Paper] [Code]

RISE: Robust Individualized Decision Learning with Sensitive Variables
Tan, X., Qi, Z., Seymour, C., Tang, L.
Advances in Neural Information Processing Systems (NeurIPS) 2022
** Distinguished Student Paper Award for the ENAR 2023 Spring Meeting
[Paper] [Code] [Video]

A Tree-based Model Averaging Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources
Tan, X., Chang, C., Zhou, L., Tang, L.
International Conference on Machine Learning (ICML) 2022
** Winner of the Student Research Award at the 35th New England Statistics Symposium
** Honorable Mention Award at JSM 2021 Student Paper Competition
[Paper] [Code] [Video]

Identifying Principal Stratum Causal Effects Conditional on a Post-treatment Intermediate Response
Tan, X., Abberbock, J., Rastogi, P., Tang, G.
Causal Learning and Reasoning (CLeaR) 2022
[Paper] [Code]

Invited Talks

A Tree-based Model Averaging Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources
The 35th New England Statistics Symposium (NESS) 2022, Storrs, CT

Improving Personalized Causal Inference with Information Borrowed from Heterogeneous Data Sources
The 14th International Conference of the ERCIM WG on Computational and Methodological Statistics (CMStatistics) 2021, King's College London, UK

Selected Awards

Winner of a Distinguished Student Paper Award at the International Biometric Society Eastern North American Region (ENAR) Spring Meeting, 2023
Winner of the Student Research Award at the 35th New England Statistics Symposium (NESS), 2022
Honorable Mention Award at the ASA Joint Statistical Meetings (JSM) Statistical Learning and Data Science (SLDS) Section Student Paper Award competition, 2021
Best performance on the Ph.D. Qualifying Exam, Department of Biostatistics, University of Pittsburgh, 2020
National Scholarship, Ministry of Education of China, 2015 and 2017

Professional Services

Program Chair, the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023) (August 19-25, 2023, Macao, China)
Program Committee, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2023 Research Track (August 6-10, 2023, Long Beach, CA)
Program Committee, NeurIPS 2022 Workshop INTERPOLATE: First Workshop on Interpolation Regularizers and Beyond (November 28 - December 9, 2022, New Orleans, LA)
Session Chair, CS17 Data-driven Healthcare, 2022 Symposium on Data Science & Statistics (June 7-10, 2022, Pittsburgh, PA)
Reviewer for NeurIPS, ICML, AISTATS, Annals of Applied Statistics, IEEE Transactions on Industrial Informatics, etc.