<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Home | Jie He</title><link>https://saster-he.github.io/</link><atom:link href="https://saster-he.github.io/index.xml" rel="self" type="application/rss+xml"/><description>Home</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 24 Oct 2023 00:00:00 +0000</lastBuildDate><image><url>https://saster-he.github.io/media/icon_hu7729264130191091259.png</url><title>Home</title><link>https://saster-he.github.io/</link></image><item><title>TFL Automation at Vertex Pharmaceuticals</title><link>https://saster-he.github.io/project/vertex-llm-automation/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://saster-he.github.io/project/vertex-llm-automation/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Every drug approval requires submission-ready clinical study deliverables: standardized datasets (SDTM, ADaM) and a Clinical Study Report (CSR) containing Tables, Figures, and Listings (TFL) that document trial results. Producing these deliverables is largely manual statistical programming work that has to be done correctly and consistently for every study.&lt;/p>
&lt;p>This project builds a multi-agent automation system that generates those deliverables, with statistical programmers reviewing and approving each output.&lt;/p>
&lt;h2 id="architecture">Architecture&lt;/h2>
&lt;p>The system is organized around a &lt;strong>human-in-the-loop agent pattern&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Shiny front-end&lt;/strong>: Statistical programmers interact with the system through an R Shiny interface. They provide analysis specifications, review agent-generated outputs, and approve or revise them before submission.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LLM agent layer&lt;/strong>: Agents interpret the provided specifications and generate statistical programming code for each requested output. The agents handle SDTM domain mapping, ADaM dataset construction logic, and TFL output generation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Validation layer&lt;/strong>: Automated checks run against CDISC standards and expected specifications before output is surfaced to the programmer for review.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>The Shiny interface keeps programmers in control while eliminating the repetitive parts of routine TFL work. Analysts drive the agents rather than writing boilerplate code by hand.&lt;/p>
&lt;h2 id="technical-stack">Technical Stack&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>R Shiny&lt;/strong>: Front-end interface and human-in-the-loop control panel&lt;/li>
&lt;li>&lt;strong>Python&lt;/strong>: LLM agent orchestration and pipeline automation&lt;/li>
&lt;li>&lt;strong>R and SAS&lt;/strong>: Statistical programming and output generation&lt;/li>
&lt;li>&lt;strong>Shell scripting&lt;/strong>: Workflow automation and environment management&lt;/li>
&lt;/ul>
&lt;h2 id="why-it-matters">Why It Matters&lt;/h2>
&lt;p>TFL generation is a prerequisite for every regulatory submission, and it is time-consuming to do manually. Automating routine outputs means statistical programmers can spend their effort on the judgment calls that require expertise: methodology, specification review, and interpretation of results.&lt;/p>
&lt;p>&lt;em>This project is ongoing (Aug 2025 to Present). Details are limited due to confidentiality.&lt;/em>&lt;/p></description></item><item><title>Wearables &amp; Cognitive Function: Electronic Framingham Heart Study</title><link>https://saster-he.github.io/project/phd-recurrent-events/</link><pubDate>Sun, 01 Sep 2024 00:00:00 +0000</pubDate><guid>https://saster-he.github.io/project/phd-recurrent-events/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Wearable devices and smartphones now generate continuous, high-resolution behavioral and physiological data at a scale that was not feasible in traditional epidemiological studies. The Electronic Framingham Heart Study (eFHS) is one of the first large cohort studies to integrate these data streams, capturing smartwatch-derived measures from thousands of participants over extended follow-up periods.&lt;/p>
&lt;p>This PhD project analyzes that data to investigate whether and how digital biomarkers derived from wearables (activity patterns, heart rate variability, sleep signatures) are associated with cognitive function in older adults.&lt;/p>
&lt;h2 id="scale">Scale&lt;/h2>
&lt;p>The dataset spans &lt;strong>more than 1 billion observations&lt;/strong>, requiring statistical methods and computational pipelines designed for large-scale longitudinal data rather than scaled-up versions of standard analyses.&lt;/p>
&lt;h2 id="methods">Methods&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Longitudinal analysis&lt;/strong>: Repeated measures and mixed-effects models for continuous wearable data streams&lt;/li>
&lt;li>&lt;strong>Sampling-based models&lt;/strong>: Methods that handle data density and irregular observation timing&lt;/li>
&lt;li>&lt;strong>Digital biomarker derivation&lt;/strong>: Processing raw sensor data into interpretable summary features&lt;/li>
&lt;li>&lt;strong>Tools&lt;/strong>: R, SAS, and shell scripting for large-scale data pipelines in HPC environments&lt;/li>
&lt;/ul>
&lt;h2 id="selected-output">Selected Output&lt;/h2>
&lt;ul>
&lt;li>He, J., et al. &lt;em>Associations Between Smartwatch-Derived Measures and Cognitive Function: Findings from the Electronic Framingham Heart Study.&lt;/em> In Review, 2025.&lt;/li>
&lt;li>Zhang, Y., &amp;hellip;, He, J., et al. &lt;em>Factors Associated with Longitudinal Digital Survey Engagement and Smartwatch Usage in the Electronic Framingham Heart Study.&lt;/em> In Review, 2025.&lt;/li>
&lt;/ul>
&lt;h2 id="advisor">Advisor&lt;/h2>
&lt;p>&lt;a href="https://www.bu.edu/sph/profile/chunyu-liu/" target="_blank" rel="noopener">Prof. Chunyu Liu&lt;/a>, Department of Biostatistics, Boston University School of Public Health&lt;/p>
&lt;p>&lt;em>Completed PhD project (2024–2025).&lt;/em>&lt;/p></description></item><item><title>Gene Expression Variability in Single-Cell RNA-seq</title><link>https://saster-he.github.io/project/scrna-seq-variability/</link><pubDate>Wed, 01 May 2019 00:00:00 +0000</pubDate><guid>https://saster-he.github.io/project/scrna-seq-variability/</guid><description>&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Most differential expression analysis in single-cell RNA-seq focuses on differences in &lt;em>mean&lt;/em> expression between cell populations. But gene expression variability (the variance of expression across cells) carries its own biological signal. Genes that are more variable in one condition than in another can indicate heterogeneous cell states, developmental plasticity, or disease-associated dysregulation.&lt;/p>
&lt;p>The challenge: zero-inflated count data in scRNA-seq creates a strong mean-variance dependency that biases naive variability tests. A gene with higher mean expression will appear more variable simply due to distributional properties, not biology.&lt;/p>
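&lt;p>As a minimal sketch of that dependency, assuming a standard zero-inflated negative binomial parameterization (NB mean &lt;code>mu&lt;/code>, size &lt;code>theta&lt;/code>, excess-zero probability &lt;code>pi&lt;/code>), the closed-form moments show the variance rising with the mean even when dispersion and zero-inflation are held fixed. The function name, parameter names, and numeric values are illustrative assumptions and are not taken from the thesis:&lt;/p>

```python
def zinb_moments(mu, theta, pi):
    """Mean and variance of a zero-inflated negative binomial count.

    mu:    mean of the negative binomial component
    theta: NB size parameter (variance of the NB part is mu + mu**2 / theta)
    pi:    probability of an excess (structural) zero
    """
    mean = (1.0 - pi) * mu
    var = (1.0 - pi) * mu * (1.0 + mu * (pi + 1.0 / theta))
    return mean, var


# Hypothetical settings: identical dispersion and zero-inflation for every
# gene, so only the NB mean differs between them.
THETA, PI = 2.0, 0.3
moments = [zinb_moments(mu, THETA, PI) for mu in (0.5, 2.0, 8.0, 32.0)]

for mean, var in moments:
    print(f"mean={mean:7.2f}  variance={var:9.2f}  var/mean={var / mean:6.2f}")
```

&lt;p>Under these assumptions the variance-to-mean ratio equals &lt;code>1 + mu * (pi + 1 / theta)&lt;/code>, so it grows with the mean. A test that compares raw variances would therefore tend to flag highly expressed genes regardless of biology.&lt;/p>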
&lt;h2 id="methods">Methods&lt;/h2>
&lt;p>Developed a weighted hypothesis testing framework that:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Accounts for mean-variance dependency&lt;/strong> in zero-inflated negative binomial count data&lt;/li>
&lt;li>&lt;strong>Tests for differential variability&lt;/strong> between cell populations using a weighted statistic that stabilizes variance estimates across the expression range&lt;/li>
&lt;li>&lt;strong>Scales to large datasets&lt;/strong>, validated on 32,738 genes across 2,692 single cells&lt;/li>
&lt;/ul>
&lt;h2 id="implementation">Implementation&lt;/h2>
&lt;p>All methods were implemented as R functions, using the &lt;code>MAST&lt;/code> and &lt;code>edgeR&lt;/code> frameworks as a foundation. The weighting scheme was derived analytically and validated via simulation.&lt;/p>
&lt;h2 id="recognition">Recognition&lt;/h2>
&lt;p>Awarded &lt;strong>Honors Thesis with Highest Distinction&lt;/strong> by the University of North Carolina at Chapel Hill, 2019.&lt;/p>
&lt;p>&lt;em>Advisor: &lt;a href="https://sph.unc.edu/adv_profile/di-wu-phd/" target="_blank" rel="noopener">Prof. Di Wu&lt;/a>, Department of Biostatistics, UNC Gillings School of Global Public Health.&lt;/em>&lt;/p></description></item></channel></rss>