
Jie He 何婕

Biostatistician building AI-powered clinical tools

I’m a biostatistician who builds AI-powered tools for clinical research. Most of my work lives at the intersection of rigorous statistical methodology and modern software: the kind of problems where the math matters and so does the engineering.

My PhD research at Boston University analyzes over a billion observations of smartwatch and wearable sensor data from the Electronic Framingham Heart Study, studying how digital biomarkers connect to cognitive aging in older adults. At Vertex Pharmaceuticals, I’m building a multi-agent system that automates the Tables, Figures, and Listings pipeline for clinical trial submissions, with statistical programmers driving LLM agents through a Shiny interface. Before the PhD, I spent three years at Boston Children’s Hospital doing applied biostatistics across pediatric cardiology, hematology, and critical care, where I built R packages, deployed survival models on registry data, and co-authored peer-reviewed publications.

Outside of work, I ski, write, and am always looking for dogs to pet.

What I’m Working On

PhD Research, Boston University Biostatistics
Extending super learner methods for survival prediction in complex sampling designs, with applications to longitudinal clinical data. Advised by Dr. Haolin (Leo) Li.

Research Extern at Vertex Pharmaceuticals (Aug 2025 – Present)
Building a multi-agent system that automates the Tables, Figures, and Listings (TFL) pipeline for clinical trial submissions. Statistical programmers drive LLM agents through a Shiny interface to handle SDTM mapping, dataset construction, and output generation.

Selected Projects

TFL Automation at Vertex Pharmaceuticals

A multi-agent system for automating Tables, Figures, and Listings generation in clinical trials, with statistical programmers driving LLM agents through a Shiny interface.

Aug 2025

Wearables & Cognitive Function: Electronic Framingham Heart Study

PhD research: statistical analysis of large-scale smartwatch and mobile health data (>1 billion observations) from older adults to examine associations between wearable-derived measures and cognitive function.

Sep 2024

Gene Expression Variability in Single-Cell RNA-seq

Honors thesis: a weighted hypothesis testing framework for differential variability in scRNA-seq data, with R tools to address the mean-variance relationship in zero-inflated count data.

May 2019

Experience

Research Assistant Extern, SP & Stats
Vertex Pharmaceuticals Aug 2025 – Present
Building a multi-agent automation system for Tables, Figures, and Listings (TFL) generation in clinical trials. Statistical programmers interact with LLM agents through a Shiny interface; the agents handle SDTM mapping, analysis dataset construction, and output generation against provided specifications.
- Designed a human-in-the-loop architecture using R Shiny as the front-end control panel for analyst interaction
- Built LLM agent pipelines that interpret analysis specifications and generate statistical programming code
- Automated SDTM dataset derivation and ADaM construction to reduce manual overhead on routine TFL outputs
Biostatistician
Boston Children's Hospital Jun 2021 – Sep 2024
Statistical collaborator in the Biostatistics and Research Department, supervised by Dr. Edie Weller. Contributed to published research across pediatric hematology, cardiology, critical care, and COVID-19 outcomes.
- Simulation study: Evaluated ML classification models on simulated data with EHR-based structure, assessing performance across varying data types and correlation patterns
- Survival analysis: Built Cox models with frailty for clustered right-censored data; validated theoretical frameworks for time-dependent covariates
- ECPR outcomes: Deployed random survival forests on ELSO Registry data to identify predictors of in-hospital mortality following extracorporeal CPR
- R package & Shiny app: Developed a data management tool with custom functions and an interactive app to streamline REDCap data cleaning and automate QC of clinical notes

Education

PhD Biostatistics
Boston University Sep 2024 – May 2028
Analyzing large-scale mobile health and wearable sensor data from older adult populations to understand associations between digital biomarkers and cognitive function. Core dataset: Electronic Framingham Heart Study (>1 billion observations). Advised by Prof. Chunyu Liu.
MS Biostatistics
University of North Carolina at Chapel Hill Aug 2019 – Dec 2020
Coursework spanning statistical theory and applied methods:
- Biostatistics: Survival Analysis, Longitudinal Data Analysis, Causal Inference, Clinical Study Design
- Mathematics: Real Analysis, Stochastic Modeling, Optimization and Functional Analysis
- Computer Science: Data Structures and Algorithms, Parallel Computing
BSPH Biostatistics, Second Major in Computer Science, Minor in Mathematics
University of North Carolina at Chapel Hill Aug 2015 – Aug 2019
Graduated with Distinction. Honors Thesis: Weighted inference of gene expression variability in single-cell RNA-seq data, advised by Prof. Di Wu. Developed R functions to address mean-variance relationships for zero-inflated counts across 32,738 genes and 2,692 cells.

Skills

Programming

R / RStudio

Advanced · Survival analysis · Package development · Shiny

SAS

Advanced · Clinical data analysis · Regulatory reporting

Python

Proficient · Data pipelines · ML workflows · LLM tooling

SQL

Proficient · Data querying and management

Shell Scripting

Proficient · Workflow automation · HPC environments

PyTorch

Familiar · Neural networks · Deep learning

Clinical Workflow

Survival & Longitudinal Methods

KM · Cox · Competing risks · RSF · Mixed-effects models · GEE

Real-World Evidence

Registry outcomes / EHRs (ELSO) · Large-cohort wearables (eFHS)

Sensitivity & Reproducibility

Outlier diagnostics · Subgroup analyses · Reproducible R scripts

SAP & Specification Interpretation

Translating programming notes into automated outputs

Clinical Trial Design

Coursework in study design · Simulation-based model evaluation

CDISC Standards

Working knowledge of SDTM/ADaM structures · Exposure through TFL tooling

Languages

English (professional) · Chinese/Mandarin (native) · German (basic)
