curriculum vitae

my work and educational history

Basics

Name AnnaElaine (Anna) Rosengart
Email arosenga@andrew.cmu.edu
URL https://aerosengart.github.io
Summary A Ph.D. candidate in the Department of Statistics & Data Science at Carnegie Mellon University. My work centers on the analysis of biomedical data; I currently focus on statistical methods for wastewater-based epidemiology.

Work

  • 2025.07 - 2025.08
    AI Validation Fellow
    Handshake Model Validation Expert Fellowship
    Developed prompts for large language model evaluation.
    • AI
  • 2023.06 - 2023.08
    Laboratory Intern
    Center for Systems Immunology, University of Pittsburgh
    Applied large language modeling tools to prediction tasks in protein binding. Led code composition and documentation for the construction, training, and use of a protein-binding Transformer.
    • LLM
    • proteins
  • 2022.02 - 2022.08
    Laboratory Intern
    Center for Systems Immunology, University of Pittsburgh
    Developed methodology for identification of putative mechanisms of disease from multi-omic biomedical data. Composed and documented code.
    • multi-omics
    • machine learning
  • 2021.06 - 2021.08
    Trauma Atlas Intern
    Pittsburgh Trauma Research Center, University of Pittsburgh
    Analyzed large-scale, multi-omics data from experimental trials in trauma patient treatments. Composed sample code scripts for data cleaning, visualization, and gene set enrichment analysis for fellow researchers. Implemented consensus-based feature selection for dimension reduction.
  • 2021.01 - 2021.05
    Undergraduate Research Trainee
    Department of Statistics, University of Michigan
    Trained in performing time series analysis and statistical modeling of stochastic processes for epidemiological study using partially observed Markov processes.
    • time series
    • epidemiology
  • 2020.09 - 2020.12
    Laboratory Intern
    Center for Biologic Imaging, University of Pittsburgh
    Worked on the development of efficient and accurate computational methodologies for analysis of large-scale experimental biomedical imaging data. Adapted preexisting software and contributed to the development of a software pipeline for cleaning, manipulating, and mapping terabyte-scale mouse brain imaging slices to the Allen Mouse Brain Atlas.
    • imaging

Education

Awards

Publications

  • 2025.07.28
    Sliding Window Interaction Grammar (SWING): a generalized interaction language model for peptide and protein interactions
    Nature Methods
    Protein language models embed protein sequences for different tasks. However, these are suboptimal at learning the language of protein interactions. We developed an interaction language model (iLM), Sliding Window Interaction Grammar (SWING), that leverages differences in amino-acid properties to generate an interaction vocabulary. SWING successfully predicted both class I and class II peptide–major histocompatibility complex interactions. Furthermore, the class I SWING model could uniquely cross-predict class II interactions, a complex prediction task not attempted by existing methods. Using human class I and II data, SWING accurately predicted murine class II peptide–major histocompatibility interactions involving risk alleles in systemic lupus erythematosus and type 1 diabetes. SWING accurately predicted how variants can disrupt specific protein–protein interactions, based on sequence information alone. SWING outperformed passive uses of protein language model embeddings, demonstrating the value of the unique iLM architecture. Overall, SWING is a generalizable zero-shot iLM that learns the language of protein–protein interactions.
  • 2024.12.16
    Spatiotemporal Variability of the Pepper Mild Mottle Virus Biomarker in Wastewater
    ACS ES&T Water
    Since the start of the coronavirus-19 pandemic, the use of wastewater-based epidemiology (WBE) for disease surveillance has increased throughout the world. Because wastewater measurements are affected by external factors, processing WBE data typically includes a normalization step in order to adjust wastewater measurements (e.g., viral ribonucleic acid (RNA) concentrations) to account for variation due to dynamic population changes, sewer travel effects, or laboratory methods. Pepper mild mottle virus (PMMoV), a plant RNA virus abundant in human feces and wastewater, has been used as a fecal contamination indicator and has been used to normalize wastewater measurements extensively. However, there has been little work to characterize the spatiotemporal variability of PMMoV in wastewater, which may influence the effectiveness of PMMoV for adjusting or normalizing WBE measurements. Here, we investigate its variability across space and time using data collected over a two-year period from sewage treatment plants across the United States. We find that most variation in PMMoV measurements can be attributed to longitude and latitude followed by site-specific variables. Further research into cross-geographical and -temporal comparability of PMMoV-normalized pathogen concentrations would strengthen the utility of PMMoV in WBE.
  • 2024.04.29
    Informing policy via dynamic models: Cholera in Haiti
    PLOS Computational Biology
    Public health decisions must be made about when and how to implement interventions to control an infectious disease epidemic. These decisions should be informed by data on the epidemic as well as current understanding about the transmission dynamics. Such decisions can be posed as statistical questions about scientifically motivated dynamic models. Thus, we encounter the methodological task of building credible, data-informed decisions based on stochastic, partially observed, nonlinear dynamic models. This necessitates addressing the tradeoff between biological fidelity and model simplicity, and the reality of misspecification for models at all levels of complexity. We assess current methodological approaches to these issues via a case study of the 2010-2019 cholera epidemic in Haiti. We consider three dynamic models developed by expert teams to advise on vaccination policies. We evaluate previous methods used for fitting these models, and we demonstrate modified data analysis strategies leading to improved statistical fit. Specifically, we present approaches for diagnosing model misspecification and the consequent development of improved models. Additionally, we demonstrate the utility of recent advances in likelihood maximization for high-dimensional nonlinear dynamic models, enabling likelihood-based inference for spatiotemporal incidence data using this class of models. Our workflow is reproducible and extendable, facilitating future investigations of this disease system.
  • 2024.02.19
    SLIDE: Significant Latent Factor Interaction Discovery and Exploration across biological domains
    Nature Methods
    Modern multiomic technologies can generate deep multiscale profiles. However, differences in data modalities, multicollinearity of the data, and large numbers of irrelevant features make analyses and integration of high-dimensional omic datasets challenging. Here we present Significant Latent Factor Interaction Discovery and Exploration (SLIDE), a first-in-class interpretable machine learning technique for identifying significant interacting latent factors underlying outcomes of interest from high-dimensional omic datasets. SLIDE makes no assumptions regarding data-generating mechanisms, comes with theoretical guarantees regarding identifiability of the latent factors/corresponding inference, and has rigorous false discovery rate control. Using SLIDE on single-cell and spatial omic datasets, we uncovered significant interacting latent factors underlying a range of molecular, cellular and organismal phenotypes. SLIDE outperforms/performs at least as well as a wide range of state-of-the-art approaches, including other latent factor approaches. More importantly, it provides biological inference beyond prediction that other methods do not afford. Thus, SLIDE is a versatile engine for biological discovery from modern multiomic datasets.
  • 2023.06
    High-dimensional proteomics identifies organ injury patterns associated with outcomes in human trauma
    The Journal of Trauma and Acute Care Surgery
    Severe traumatic injury with shock can lead to direct and indirect organ injury; however, tissue-specific biomarkers are limited in clinical panels. We used proteomic and metabolomic databases to identify organ injury patterns after severe injury in humans.
  • 2022.08.23
    Multi-Omic Admission-Based Prognostic Biomarkers Identified by Machine Learning Algorithms Predict Patient Recovery and 30-Day Survival in Trauma Patients
    MDPI
    Admission-based circulating biomarkers for the prediction of outcomes in trauma patients could be useful for clinical decision support. It is unknown which molecular classes of biomolecules can contribute biomarkers to predictive modeling. Here, we analyzed a large multi-omic database of over 8500 markers (proteomics, metabolomics, and lipidomics) to identify prognostic biomarkers in the circulating compartment for adverse outcomes, including mortality and slow recovery, in severely injured trauma patients. Admission plasma samples from patients (n = 129) enrolled in the Prehospital Air Medical Plasma (PAMPer) trial were analyzed using mass spectrometry (metabolomics and lipidomics) and aptamer-based (proteomics) assays. Biomarkers were selected via Least Absolute Shrinkage and Selection Operator (LASSO) regression modeling and machine learning analysis. A combination of five proteins from the proteomic layer was best at discriminating resolvers from non-resolvers from critical illness with an Area Under the Receiver Operating Characteristic curve (AUC) of 0.74, while 26 multi-omic features predicted 30-day survival with an AUC of 0.77. Patients with traumatic brain injury as part of their injury complex had a unique subset of features that predicted 30-day survival. Our findings indicate that multi-omic analyses can identify novel admission-based prognostic biomarkers for outcomes in trauma patients. Unique biomarker discovery also has the potential to provide biologic insights.

Skills

Statistics and Data Science
  • R
  • Python
  • Modeling
  • Data Visualization
  • Machine Learning

Music
  • Sibelius
  • Logic Pro
  • Music Theory
  • Composition
  • Arranging
  • Trombone
  • Piano