Home | Jie He

TFL Automation at Vertex Pharmaceuticals

Fri, 01 Aug 2025 00:00:00 +0000

Overview

Every drug approval requires submission-ready clinical study deliverables: standardized datasets (SDTM, ADaM) and a Clinical Study Report (CSR) containing Tables, Figures, and Listings (TFL) that document trial results. Producing these deliverables is largely manual statistical programming work that has to be done correctly and consistently for every study.

This project builds a multi-agent system that automates the routine parts of that work.

Architecture

The system is organized around a human-in-the-loop agent pattern:

Shiny front-end: Statistical programmers interact with the system through an R Shiny interface. They provide analysis specifications, review agent-generated outputs, and approve or revise them before submission.
LLM agent layer: Agents interpret the provided specifications and generate statistical programming code for each requested output. The agents handle SDTM domain mapping, ADaM dataset construction logic, and TFL output generation.
Validation layer: Automated checks run against CDISC standards and expected specifications before output is surfaced to the programmer for review.

The Shiny interface keeps programmers in control while eliminating the repetitive parts of routine TFL work. Programmers direct the agents rather than writing boilerplate code by hand.
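The generate, validate, review cycle described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's implementation (which is confidential): the names `Draft`, `run_with_review`, the stub functions, and the `max_rounds` limit are all hypothetical, and the agent call is replaced by a plain function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Draft:
    """One candidate program for a requested output (hypothetical structure)."""
    spec_id: str
    code: str
    issues: List[str] = field(default_factory=list)
    approved: bool = False


def run_with_review(
    spec_id: str,
    spec_text: str,
    generate: Callable[[str, str], str],          # (spec, feedback) -> code
    validate: Callable[[str], List[str]],         # code -> list of issues
    review: Callable[[Draft], Tuple[bool, str]],  # draft -> (approved, feedback)
    max_rounds: int = 3,                          # assumed retry limit
) -> Draft:
    """Loop until a human approves a draft or the round limit is reached."""
    feedback = ""
    draft = Draft(spec_id, "")
    for _ in range(max_rounds):
        draft = Draft(spec_id, generate(spec_text, feedback))
        draft.issues = validate(draft.code)
        if draft.issues:
            # Automated checks failed: send the issues back to the generator
            # and do not surface this draft to the human reviewer.
            feedback = "; ".join(draft.issues)
            continue
        approved, feedback = review(draft)
        if approved:
            draft.approved = True
            return draft
    return draft  # not approved within max_rounds


# Stand-ins for the agent, the validation layer, and the human reviewer.
def stub_generate(spec: str, feedback: str) -> str:
    suffix = " /* revised */" if feedback else ""
    return f"/* program for: {spec} */{suffix}"


def stub_validate(code: str) -> List[str]:
    return [] if code.startswith("/*") else ["missing program header"]


def stub_review(draft: Draft) -> Tuple[bool, str]:
    # Reject the first draft so the revision path is exercised.
    if "revised" in draft.code:
        return True, ""
    return False, "add a footnote for the analysis population"


result = run_with_review(
    "T-14.1", "demographics table", stub_generate, stub_validate, stub_review
)
print(result.approved, result.code)
```

The point of the structure is that nothing is marked approved without passing both the automated checks and an explicit human decision; the reviewer's feedback becomes input to the next generation round.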

Technical Stack

R Shiny: Front-end interface and human-in-the-loop control panel
Python: LLM agent orchestration and pipeline automation
R and SAS: Statistical programming and output generation
Shell scripting: Workflow automation and environment management

Why It Matters

TFL generation is a prerequisite for every regulatory submission, and it is time-consuming to do manually. Automating routine outputs means statistical programmers can spend their effort on the judgment calls that require expertise: methodology, specification review, and interpretation of results.

This project is ongoing (Aug 2025 to Present). Details are limited due to confidentiality.

Wearables & Cognitive Function: Electronic Framingham Heart Study

Sun, 01 Sep 2024 00:00:00 +0000

Overview

Wearable devices and smartphones now generate continuous, high-resolution behavioral and physiological data at a scale that was impossible to collect in traditional epidemiological studies. The Electronic Framingham Heart Study (eFHS) is one of the first large cohort studies to integrate these data, capturing smartwatch-derived measures from thousands of participants over extended follow-up periods.

This PhD project analyzes that data to investigate whether and how digital biomarkers derived from wearables (activity patterns, heart rate variability, sleep signatures) associate with cognitive function in older adults.

Scale

The dataset spans more than 1 billion observations. At that size, standard analyses cannot simply be scaled up; the work requires statistical methods and computational pipelines designed for large-scale longitudinal data.

Methods

Longitudinal analysis: Repeated measures and mixed-effects models for continuous wearable data streams
Sampling-based models: Methods that handle data density and irregular observation timing
Digital biomarker derivation: Processing raw sensor data into interpretable summary features
Tools: R, SAS, and shell scripting for large-scale data pipelines on HPC environments
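As a rough illustration of the biomarker-derivation step, the sketch below collapses irregularly timed sensor samples into one summary row per day. It is not the study's pipeline: the input layout (timestamp, steps, heart rate), the feature names, and the `min_samples` wear-time rule are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_features(samples, min_samples=2):
    """Summarize (iso_timestamp, steps, heart_rate) samples by calendar day.

    Samples may arrive at irregular intervals; grouping by date makes the
    summary independent of the spacing between observations.
    """
    by_day = defaultdict(list)
    for ts, steps, hr in samples:
        day = datetime.fromisoformat(ts).date()
        by_day[day].append((steps, hr))

    features = {}
    for day, rows in sorted(by_day.items()):
        if len(rows) < min_samples:
            # Hypothetical rule: skip days with too few samples to summarize.
            continue
        features[day.isoformat()] = {
            "n_samples": len(rows),
            "total_steps": sum(s for s, _ in rows),
            "mean_hr": mean(h for _, h in rows),
        }
    return features


example = [
    ("2024-09-01T08:00:00", 100, 60),
    ("2024-09-01T09:17:00", 200, 80),
    ("2024-09-02T08:00:00", 50, 70),  # only one sample that day: dropped
]
print(daily_features(example))
```

Day-level rows like these are the kind of interpretable features that can then enter a repeated-measures or mixed-effects model with one record per participant-day.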

Selected Output

He, J., et al. Associations Between Smartwatch-Derived Measures and Cognitive Function: Findings from the Electronic Framingham Heart Study. In Review, 2025.
Zhang, Y., …, He, J., et al. Factors Associated with Longitudinal Digital Survey Engagement and Smartwatch Usage in the Electronic Framingham Heart Study. In Review, 2025.

Advisor

Prof. Chunyu Liu, Department of Biostatistics, Boston University School of Public Health

Completed PhD project (2024–2025).

Gene Expression Variability in Single-Cell RNA-seq

Wed, 01 May 2019 00:00:00 +0000

Motivation

Most differential expression analysis in single-cell RNA-seq focuses on differences in mean expression between cell populations. But gene expression variability (the variance of expression across cells) carries its own biological signal. Genes that are more variable in one condition than in another can indicate heterogeneous cell states, developmental plasticity, or disease-associated dysregulation.

The challenge: zero-inflated count data in scRNA-seq creates a strong mean-variance dependency that biases naive variability tests. Under count distributions such as the negative binomial, variance grows with the mean, so a gene with higher mean expression will appear more variable because of the distribution itself, not because of biology.
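The dependency can be made concrete with the moments of a zero-inflated negative binomial. The sketch below assumes one common parameterization, which may differ from the one used in the thesis: negative binomial mean `mu`, dispersion `phi` (so the negative binomial variance is `mu + phi * mu**2`), and zero-inflation probability `pi`. The function name `zinb_moments` is chosen for this example.

```python
def zinb_moments(mu, phi, pi):
    """Mean and variance of a zero-inflated negative binomial.

    With probability pi the count is a structural zero; otherwise it is drawn
    from a negative binomial with mean mu and variance mu + phi * mu**2.
    """
    mean = (1 - pi) * mu
    var = (1 - pi) * (mu + phi * mu**2) + pi * (1 - pi) * mu**2
    return mean, var


# Same dispersion and zero inflation, different mean expression levels.
for mu in (1, 10, 100):
    mean, var = zinb_moments(mu, phi=0.5, pi=0.3)
    print(f"mu={mu:>3}  mean={mean:8.2f}  variance={var:10.2f}")
```

Holding `phi` and `pi` fixed, the variance still rises steeply with `mu`, which is why comparing raw variances between conditions confounds a change in mean with a change in variability.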

Methods

Developed a weighted hypothesis testing framework that:

Accounts for mean-variance dependency in zero-inflated negative binomial count data
Tests for differential variability between cell populations using a weighted statistic that stabilizes variance estimates across the expression range
Scales to large datasets; validated on 32,738 genes across 2,692 single cells

Implementation

All methods are implemented as R functions that build on the MAST and edgeR packages. The weighting scheme was derived analytically and validated via simulation.

Recognition

Awarded Honors Thesis with Highest Distinction by the University of North Carolina at Chapel Hill, 2019.

Advisor: Prof. Di Wu, Department of Biostatistics, UNC Gillings School of Global Public Health.