Jie He 何婕

Biostatistician building AI-powered clinical tools

I’m a biostatistician who builds AI-powered tools for clinical research. Most of my work lives at the intersection of rigorous statistical methodology and modern software: the kind of problems where the math matters and so does the engineering.

My PhD research at Boston University analyzes over a billion observations of smartwatch and wearable sensor data from the Electronic Framingham Heart Study, studying how digital biomarkers connect to cognitive aging in older adults. At Vertex Pharmaceuticals, I’m building a multi-agent system that automates the Tables, Figures, and Listings pipeline for clinical trial submissions, with statistical programmers driving LLM agents through a Shiny interface. Before the PhD, I spent three years at Boston Children’s Hospital doing applied biostatistics across pediatric cardiology, hematology, and critical care, where I built R packages, deployed survival models on registry data, and co-authored peer-reviewed publications.

Outside of work, I ski, write, and am always looking for dogs to pet.

What I’m Working On

PhD Research, Boston University Biostatistics
Extending super learner methods for survival prediction in complex sampling designs, with applications to longitudinal clinical data. Advised by Dr. Haolin (Leo) Li.


Research Extern at Vertex Pharmaceuticals (Aug 2025 to Present)
Building a multi-agent system that automates the Tables, Figures, and Listings (TFL) pipeline for clinical trial submissions. Statistical programmers drive LLM agents through a Shiny interface to handle SDTM mapping, dataset construction, and output generation.

Experience

  1. Research Assistant Extern, SP & Stats

    Vertex Pharmaceuticals

    Building a multi-agent automation system for Tables, Figures, and Listings (TFL) generation in clinical trials. Statistical programmers interact with LLM agents through a Shiny interface; the agents handle SDTM mapping, analysis dataset construction, and output generation against provided specifications.

    • Designed a human-in-the-loop architecture using R Shiny as the front-end control panel for analyst interaction
    • Built LLM agent pipelines that interpret analysis specifications and generate statistical programming code
    • Automated SDTM dataset derivation and ADaM construction to reduce manual overhead on routine TFL outputs
  2. Biostatistician

    Boston Children's Hospital

    Statistical collaborator in the Biostatistics and Research Department, supervised by Dr. Edie Weller. Contributed to published research across pediatric hematology, cardiology, critical care, and COVID-19 outcomes.

    • Simulation study: Applied and evaluated ML classification models on simulated EHR-based data structures to assess performance across varying data types and correlation patterns
    • Survival analysis: Built Cox models with frailty for clustered right-censored data; validated theoretical frameworks for time-dependent covariates
    • ECPR outcomes: Deployed random survival forests on ELSO Registry data to identify predictors of in-hospital mortality following extracorporeal CPR
    • R package & Shiny app: Developed a data management tool with custom functions and an interactive app to streamline REDCap data cleaning and automate QC of clinical notes

Education

  1. PhD Biostatistics

    Boston University
    Analyzing large-scale mobile health and wearable sensor data from older adult populations to understand associations between digital biomarkers and cognitive function. Core dataset: Electronic Framingham Heart Study (>1 billion observations). Advised by Prof. Chunyu Liu.
  2. MS Biostatistics

    University of North Carolina at Chapel Hill

    Coursework spanning statistical theory and applied methods:

    • Biostatistics: Survival Analysis, Longitudinal Data Analysis, Causal Inference, Clinical Study Design
    • Mathematics: Real Analysis, Stochastic Modeling, Optimization and Functional Analysis
    • Computer Science: Data Structures and Algorithms, Parallel Computing
  3. BSPH Biostatistics, Second Major in Computer Science, Minor in Mathematics

    University of North Carolina at Chapel Hill
    Graduated with Distinction. Honors Thesis: Weighted inference of gene expression variability in single-cell RNA-seq data, advised by Prof. Di Wu. Developed R functions to address mean-variance relationships for zero-inflated counts across 32,738 genes and 2,692 cells.
Skills
Programming
R / RStudio

Advanced · Survival analysis · Package development · Shiny

SAS

Advanced · Clinical data analysis · Regulatory reporting

Python

Proficient · Data pipelines · ML workflows · LLM tooling

SQL

Proficient · Data querying and management

Shell Scripting

Proficient · Workflow automation · HPC environments

PyTorch

Familiar · Neural networks · Deep learning

Clinical Workflow
Survival & Longitudinal Methods

KM · Cox · Competing risks · RSF · Mixed-effects models · GEE

Real World Evidence

Registry outcomes / EHRs (ELSO) · Large cohort wearables (eFHS)

Sensitivity & Reproducibility

Outlier diagnostics · Subgroup analyses · Reproducible R scripts

SAP & Specification Interpretation

Translating programming notes into automated outputs

Clinical Trial Design

Coursework in study design · Simulation-based model evaluation

CDISC Standards

Working knowledge of SDTM/ADaM structures · Exposure through TFL tooling

Languages
English (professional) · Chinese/Mandarin (native) · German (basic)