Xiaoqing "Ellen" Tan

Email: xit31 [at] pitt [dot] edu

About Me

I am a Research Scientist at Meta (GenAI Llama Research; previously FAIR). My work spans data, scaling, fine-tuning, and evaluation for large language models and vision–language models. My recent research centers on data, particularly the principles of large-scale data curation and scaling, and how to evaluate them.

I’ve had the opportunity to work as a Core Contributor on several major open-source projects, including Llama 2/3/4, Code Llama, MetaCLIP 1/1.2, and Chameleon. I have also published first-author papers at top conferences such as NeurIPS, ICML, EMNLP, and ACL. Our work ROBBIE was featured at FAIR’s 10-year anniversary.

Before joining Meta, I obtained my Ph.D. in Biostatistics from the University of Pittsburgh in 2022, advised by Lu Tang and Gong Tang. My doctoral research focused on developing novel statistical and machine learning methods for causal inference, data integration, and decision fairness. I was a visiting student in Computer Science at Carnegie Mellon University from 2019 to 2021. I obtained my B.S. in Pharmacy and Computer Science from Sun Yat-sen University in 2018.

Major Technical Reports

∗: core contributor

The Llama 3 Herd of Models
Llama-Team, Dubey, A., Jauhri, A., Pandey, A., ..., Tan, X.∗, ..., et al.
arXiv preprint arXiv:2407.21783 2024
[Paper] [Code]

Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon-Team, ..., Tan, X.∗, ..., et al.
arXiv preprint arXiv:2405.09818 2024
[Paper] [Code]

Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama-Team, Touvron, H., Martin, L., Stone, K., ..., Tan, X.∗, Tang, B., ..., et al.
arXiv preprint arXiv:2307.09288 2023
[Paper] [Code]

Code Llama: Open Foundation Models for Code
Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X., Adi, Y., Liu, J., Remez, T., Rapin, J., et al.
arXiv preprint arXiv:2308.12950 2023
[Paper] [Code]

+: co-first author; ∗: core contributor

Towards Massive Multilingual Holistic Bias
Tan, X., Hansanti, P., Turkatenko, A., Chuang, J., Wood, C., Yu, B., Ropers, C., Costa-jussà, M. R.
Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP) at Association for Computational Linguistics (ACL) 2025
[Paper] [Code]

Federated Learning of Robust Individualized Decision Rules with Application to Heterogeneous Multihospital Sepsis Population
Chen, X., Talisa, V. B., Tan, X., Qi, Z., Kennedy, J. N., Chang, C.-C. H., Seymour, C. W., Tang, L.
The Annals of Applied Statistics 2025
[Paper]

Double Machine Learning Methods for Estimating Average Treatment Effects: A Comparative Study
Tan, X., Yang, S., Ye, W., Faries, D. E., Lipkovich, I., Kadziola, Z.
Journal of Biopharmaceutical Statistics 2025
[Paper] [Code]

Efficient Tool Use with Chain-of-Abstraction Reasoning
Gao, S., Dwivedi-Yu, J., Yu, P., Tan, X., Pasunuru, R., Golovneva, O., Sinha, K., Celikyilmaz, A., Bosselut, A., Wang, T.
Proceedings of the 31st International Conference on Computational Linguistics (COLING) 2025
[Paper]

A General Framework for Robust Individualized Decision Learning with Sensitive Variables
Tan, X., Qi, Z., Tang, L., Cohen, M. C.
SSRN 2024
[Paper]

Altogether: Image Captioning via Re-aligning Alt-text
Xu, H., Huang, P.-Y., Tan, X., Yeh, C.-F., Kahn, J., Jou, C., Ghosh, G., Levy, O., Zettlemoyer, L., Yih, W.-t., Li, S.-W., Xie, S., Feichtenhofer, C.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2024
[Paper] [Code]

Demystifying CLIP Data (MetaCLIP)
Xu, H., Xie, S., Tan, X., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.
The Twelfth International Conference on Learning Representations (ICLR) 2024
** ICLR Spotlight (top 5%)
[Paper] [Code]

An Introduction to Vision-Language Modeling
Bordes, F., Pang, R. Y., Ajay, A., Li, A. C., ..., Tan, X., ..., et al.
arXiv preprint arXiv:2405.17247 2024
[Paper]

ROBBIE: Robust Bias Evaluation of Large Generative Language Models
Esiobu, D.+, Tan, X.+, Hosseini, S.+, Ung, M., Zhang, Y., Fernandes, J., Dwivedi-Yu, J., Presani, E., Williams, A., Smith, E. M.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023
[Paper] [Code for eval framework] [AdvPromptSet benchmark]

Shepherd: A Critic for Language Model Generation
Wang, T., Yu, P., Tan, X.∗, O’Brien, S., Pasunuru, R., Dwivedi-Yu, J., Golovneva, O., Zettlemoyer, L., Fazel-Zarandi, M., Celikyilmaz, A.
arXiv preprint arXiv:2308.04592 2023
[Paper] [Code]

RISE: Robust Individualized Decision Learning with Sensitive Variables
Tan, X., Qi, Z., Seymour, C., Tang, L.
Advances in Neural Information Processing Systems (NeurIPS) 2022
** Distinguished Student Paper Award for the ENAR 2023 Spring Meeting
[Paper] [Code] [Video]

A Tree-based Model Averaging Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources
Tan, X., Chang, C., Zhou, L., Tang, L.
International Conference on Machine Learning (ICML) 2022
** Winner of the Student Research Award at the 35th New England Statistics Symposium
** Honorable Mention Award at JSM 2021 Student Paper Competition
[Paper] [Code] [Video]

Identifying Principal Stratum Causal Effects Conditional on a Post-treatment Intermediate Response
Tan, X., Abberbock, J., Rastogi, P., Tang, G.
Causal Learning and Reasoning (CLeaR) 2022
[Paper] [Code]

Invited Talks

A Tree-based Model Averaging Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources
The 35th New England Statistics Symposium (NESS) 2022, Storrs, CT

Improving Personalized Causal Inference with Information Borrowed from Heterogeneous Data Sources
The 14th International Conference of the ERCIM WG on Computational and Methodological Statistics (CMStatistics) 2021, King's College London, UK

Selected Awards

Professional Services