<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>A Hugo website</title>
<link>/</link>
<description>Recent content on A Hugo website</description>
<generator>Hugo</generator>
<language>en-US</language>
<lastBuildDate>Sun, 16 Feb 2025 15:19:26 -0900</lastBuildDate>
<atom:link href="/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Transcriptome-wide association studies</title>
<link>/post/twas/</link>
<pubDate>Sun, 16 Feb 2025 15:19:26 -0900</pubDate>
<guid>/post/twas/</guid>
<description><h2 id="instrument-variable--twas">Instrumental variable &amp; TWAS</h2>
<p>Transcriptome-wide association studies (TWAS) aim to identify associations between gene expression and traits of interest. In an ideal world where we have both RNA-seq and trait data for tens of thousands of individuals, performing a TWAS analysis would be straightforward: simply regress the trait on gene expression. However, GTEx, the largest collection of expression data, has only ~700 RNA-seq samples, and it does not include trait values. This limitation precludes a direct association test between expression and traits.</p></description>
</item>
<item>
<title>Polygenic Risk Score</title>
<link>/post/prs/</link>
<pubDate>Sun, 16 Feb 2025 15:19:26 -0600</pubDate>
<guid>/post/prs/</guid>
<description><h2 id="bayesian-regression-method-for-polygenic-score">Bayesian regression method for polygenic score</h2>
<p>A polygenic risk score (PRS) summarizes an individual&rsquo;s genetic liability to a disease. Given effect sizes estimated from training data, we compute the polygenic score as <code>\(PRS_i = \sum_{j = 1}^{M} \hat \beta_j G_{ij}\)</code> for each individual in the testing cohort. Most PRS methods, such as <a href="https://www.nature.com/articles/s41467-019-09718-5">PRS-CS</a> and <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4596916/">LDpred</a>, aim to recover causal effects <code>\(\lambda\)</code> from the observed marginal effect size estimates <code>\(\hat \beta_j\)</code>. Here let&rsquo;s consider an infinitesimal model (LDpred-inf).</p>
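<p>As a toy illustration of the sum (the numbers are made up, and unscaled allele counts are assumed here purely for readability), take <code>\(M = 3\)</code> variants with estimated effects <code>\(\hat \beta = (0.2, -0.1, 0.05)\)</code> and an individual with genotypes <code>\(G_i = (2, 1, 0)\)</code>:</p>
<p>$$
PRS_i = 0.2 \cdot 2 + (-0.1) \cdot 1 + 0.05 \cdot 0 = 0.3
$$</p>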
<p>We assume the causal effect size <code>\(\lambda \sim MVN(0, \frac{h^2}{M}I)\)</code> (the infinitesimal model). From <a href="/post/gwas/">this post</a>, we also have <code>\(\hat \beta | \lambda \sim MVN(R \lambda, \frac{1 - h^2}{N}R)\)</code>. Bayesian inference with a conjugate normal prior then gives us (following this <a href="https://gregorygundersen.com/blog/2020/11/18/bayesian-mvn/">document</a>):
$$
<code>\begin{split} p(\lambda \mid \hat \beta) &amp;\propto f(\hat \beta \mid \lambda) \cdot f(\lambda) \\ &amp;\propto \exp \{ - \frac{1}{2} (\hat \beta - R\lambda )^T (\frac{1 - h^2}{N}R)^{-1} (\hat \beta - R\lambda ) \} \cdot \exp \{ - \frac{1}{2}\lambda^T (\frac{h^2}{M}I)^{-1} \lambda \} \\ &amp;\propto \exp\{ - \frac{1}{2}[\frac{N}{1 - h^2}\cdot (\hat \beta - R\lambda )^T R^{-1}(\hat \beta - R\lambda ) +\frac{M}{h^2} \lambda^T \lambda ] \} \\ &amp;\propto \exp \{- \frac{1}{2} [\frac{N}{1 - h^2} \cdot (\hat \beta^T R^{-1} \hat \beta - \hat \beta^T R^{-1} R \lambda -\lambda^T R R^{-1}\hat \beta + \lambda^T RR^{-1}R\lambda) + \frac{M}{h^2} \lambda^T \lambda] \} \\ &amp;\propto \exp \{- \frac{1}{2} [\lambda^T(\frac{N}{1 - h^2}R + \frac{M}{h^2} I)\lambda - 2 \frac{N}{1 - h^2} \hat \beta^T \lambda ] \} \end{split}</code>
$$</p></description>
</item>
<item>
<title>Linkage Disequilibrium Score Regression</title>
<link>/post/ldsc/</link>
<pubDate>Sun, 16 Feb 2025 15:19:26 -0400</pubDate>
<guid>/post/ldsc/</guid>
<description><h2 id="ldsc-derivation">LDSC derivation</h2>
<p>We discussed how to perform <a href="/post/gwas/">GWAS with scaled genotypes &amp; phenotype</a>. In this blog post, I present an important result: Linkage Disequilibrium Score Regression (LDSC).</p>
<p>LDSC was proposed in <a href="https://www.nature.com/articles/ng.3211">this</a> landmark paper, which described how LD affects the probability of a variant being significant. Under the infinitesimal model, LDSC states <code>\(\mathbb{E}[\chi_j^2] = \frac{Nh^2}{M} l_j + 1\)</code>, where <code>\(l_j \equiv \sum_{k = 1}^M r_{jk}^2\)</code> is the LD score. To carry out the derivation, one must treat the effect size as random: <code>\(\lambda_j \sim N(0, \frac{h^2}{M})\)</code>.</p></description>
</item>
<item>
<title>Genome-wide association studies</title>
<link>/post/gwas/</link>
<pubDate>Sun, 16 Feb 2025 15:19:26 -0300</pubDate>
<guid>/post/gwas/</guid>
<description><h2 id="variants-trait-association">Variants-trait association</h2>
<p>The core objective of genetic studies is to identify which genetic variants contribute to disease risk. While establishing direct causation is challenging, we can detect statistical associations between genetic variants and traits by analyzing large-scale genomic data.</p>
<p>Large biobanks (such as the UK Biobank) have collected genomic data from hundreds of thousands of samples. To study variant-trait associations, one approach is to apply linear or logistic regression for each genetic variant, treating genotypes as independent variables and the trait as the dependent variable.</p></description>
</item>
<item>
<title>Hidden Markov Model (1) - Markov Chain</title>
<link>/post/hmm1/</link>
<pubDate>Mon, 29 Nov 2021 00:00:00 +0000</pubDate>
<guid>/post/hmm1/</guid>
<description><h2 id="introduction">Introduction</h2>
<p>This series of blog posts aims to explore the Hidden Markov Model (HMM) due to its broad applications across various fields, including natural language processing, population genetics, finance, and more. Beyond its practical utility, I find HMM particularly fascinating because it bridges multiple disciplines such as probability, linear algebra, machine learning, and computer science. In this post, I will introduce the Markov Chain, which serves as the foundation of HMM. As before, the concepts will be explained through a simple example, with minimal use of complex mathematical notation.</p></description>
</item>
<item>
<title>Hidden Markov Model (2) - Forward Backward Propagation</title>
<link>/post/hmm2/</link>
<pubDate>Mon, 29 Nov 2021 00:00:00 +0000</pubDate>
<guid>/post/hmm2/</guid>
<description><h2 id="introduction">Introduction</h2>
<p>This post will include a few sections:</p>
<ol>
<li>Introducing HMM and demonstrating how it differs from the Markov Chain</li>
<li>Introducing an exhaustive method to infer the hidden state</li>
<li>Introducing forward-backward propagation as an improvement</li>
</ol>
<p>The example is from <a href="https://www.youtube.com/watch?v=VBs8FYsZIN4">Dr. Xiaole Liu&rsquo;s YouTube channel</a>, and I highly recommend checking out her video if you want to develop an intuition for HMM rather than getting killed by notation. Also, you may want to review <a href="https://en.wikipedia.org/wiki/Conditional_independence">conditional independence</a> before you start reading, since it will be used frequently later in this post.</p></description>
</item>
<item>
<title>An Intuitive Explanation of Bayesian Network</title>
<link>/post/bayesian_network/</link>
<pubDate>Sat, 27 Nov 2021 15:19:26 -0600</pubDate>
<guid>/post/bayesian_network/</guid>
<description><h2 id="introduction">Introduction</h2>
<p>The Bayesian network, a probabilistic model that represents causal relationships between variables, has gained popularity in various fields. In biology, for example, people have started to use this model to infer genetic regulatory networks (GRNs) due to its nice property of being directional. The aim of this blog post is to provide a gentle and less-mathematical introduction to Bayesian networks.</p>
<p> <br>
 <br>
 <br>
 </p>
<h2 id="an-example">An example</h2>
<p>Suppose we are going to take a math exam next week. The outcome of the exam heavily depends on two factors: <strong>sleep</strong> and <strong>study</strong>. If we study and sleep well, chances are high that we will do a good job in the exam. Also, <strong>sleep</strong> can affect our attention and therefore influence our <strong>study</strong> quality. Since the world is probabilistic, we need to define the probability of each action (<strong>sleep</strong>, <strong>study</strong> and <strong>exam</strong>):</p></description>
</item>
<item>
<title>Model the Gene Expression (2): Likelihood Ratio Test</title>
<link>/post/gene_exp2/</link>
<pubDate>Tue, 23 Nov 2021 15:19:26 -0600</pubDate>
<guid>/post/gene_exp2/</guid>
<description><p>In the <a href="/post/gene_exp1/">last post</a>, we used a GLM framework to model gene expression.
$$
y \sim \text{NB}(\mu, r) \\
\log(\mu) = b_1 x + b_0
$$</p>
<p>Using <a href="/post/mle/">maximum likelihood estimation</a>, we were able to find a set of parameters <code>\(\hat b_0, \hat b_1, \hat r\)</code> that maximizes the likelihood function.</p>
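<p>Written out for <code>\(n\)</code> samples, and assuming the common mean-dispersion parameterization of the negative binomial (an assumption here; the original post may use a different but equivalent form), the log-likelihood being maximized is:</p>
<p>$$
\ell(b_0, b_1, r) = \sum_{i=1}^{n} \left[ \log \Gamma(y_i + r) - \log \Gamma(r) - \log (y_i!) + r \log \frac{r}{r + \mu_i} + y_i \log \frac{\mu_i}{r + \mu_i} \right], \qquad \mu_i = e^{b_1 x_i + b_0}
$$</p>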
<p>But if you send this model (the estimated parameters) to biologists, they wouldn&rsquo;t be happy. And we all know what is lacking: <strong>the p-value!</strong></p></description>
</item>
<item>
<title>Model the Gene Expression (1): A GLM framework</title>
<link>/post/gene_exp1/</link>
<pubDate>Tue, 23 Nov 2021 15:18:26 -0600</pubDate>
<guid>/post/gene_exp1/</guid>
<description><p>Before you start reading this post, please familiarize yourself with <a href="/post/mle/">MLE</a> and linear models.</p>
<p> <br>
 </p>
<h2 id="background">Background</h2>
<p>In transcriptomic research, we often want to determine whether genes are up-regulated or down-regulated under a particular perturbation. For example, suppose we have a medication that may cure type 2 diabetes. In our experiment, 6 patients are split into two groups, with 3 patients taking the medication and 3 patients taking the placebo. The patients&rsquo; blood samples are then collected to measure the transcriptomic profile (mRNA abundance level for each gene) using NGS technology (<a href="https://en.wikipedia.org/wiki/RNA-Seq">RNA seq</a>). The mRNA abundance levels are quantified by the number of <a href="https://en.wikipedia.org/wiki/Read_(biology)">reads</a> that were mapped to the reference genome. Finally, we use statistical tests to determine whether the level of change is big enough for a gene to be considered a DEG (differentially expressed gene).</p></description>
</item>
<item>
<title>Calculate SVD by hand (and decompose Spongebob)</title>
<link>/post/svd/</link>
<pubDate>Mon, 22 Nov 2021 21:18:26 -0600</pubDate>
<guid>/post/svd/</guid>
<description><p>In my <a href="/post/pca1/">previous post</a>, I manually implemented PCA by finding the eigenvectors and eigenvalues of a covariance matrix. In this post, let&rsquo;s try to perform PCA using a different approach called Singular Value Decomposition. Then we are going to decompose SPONGEBOB!</p>
<p>Note: you might find this <a href="/post/pca1/">post</a> to be useful, if you are new to PCA.
 <br>
 <br>
 <br>
 <br>
 </p>
<h2 id="algorithm">Algorithm</h2>
<p>Again, we are going to use the same dataset we have used before.</p>
<p>$$
\mathbf{ M} =
\begin{bmatrix}
1 &amp; 0 \\
0 &amp; 1 \\
-1 &amp; -1
\end{bmatrix}
$$</p></description>
</item>
<item>
<title>Dive into Bayesian statistics (5): Intro to PyMC3</title>
<link>/post/bayesian5/</link>
<pubDate>Mon, 22 Nov 2021 20:19:26 -0600</pubDate>
<guid>/post/bayesian5/</guid>
<description><p>In <a href="/post/bayesian3/">our previous post</a>, we manually implemented the Markov Chain Monte Carlo (MCMC) algorithm, specifically Metropolis-Hastings, to draw samples from the posterior distribution. The code isn’t particularly difficult to understand, but it’s also not very intuitive to read or write. Besides the challenges of implementation, algorithm performance (i.e., speed) is a major consideration in more realistic applications. Fortunately, well-optimized tools are available to address these obstacles, namely Stan and PyMC3.</p>
<p>Subjectively speaking, Stan is not my cup of tea. I remember spending an entire afternoon trying to install RStan, only to fail. To make matters worse, Stan has its own specialized language, adding another layer of complexity. In contrast, PyMC3 is much easier to install. The documentation and tutorials are well-written, and anyone with a basic understanding of Bayesian statistics should be able to follow them without much trouble. In this post—and future posts—I will stick with PyMC3.</p></description>
</item>
<item>
<title>Dive into Bayesian statistics (4): Posterior predictive distribution</title>
<link>/post/bayesian4/</link>
<pubDate>Mon, 22 Nov 2021 20:18:26 -0600</pubDate>
<guid>/post/bayesian4/</guid>
<description><p>In the last few posts, we tried three methods (<a href="/post/bayesian2/">Integration, Conjugate Prior</a> and <a href="/post/bayesian3/">MCMC</a>) to infer the posterior distribution <code>\(P(\lambda | \text{data})\)</code>, which gave us</p>
<p><code>$$\lambda \sim \text{Gamma}(\alpha = 20, \beta = 6)$$</code></p>
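<p>For reference, assuming the shape-rate parameterization of the Gamma distribution (the one consistent with the unnormalized posterior <code>\(\lambda^{19} e^{-6\lambda}\)</code> from the second post of this series), this posterior has the following density, mean and mode:</p>
<p>$$
p(\lambda \mid \text{data}) = \frac{6^{20}}{\Gamma(20)} \lambda^{19} e^{-6\lambda}, \qquad \mathbb{E}[\lambda \mid \text{data}] = \frac{20}{6} \approx 3.33, \qquad \text{mode} = \frac{19}{6} \approx 3.17
$$</p>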
<p>In this post, we are going to see 1) how to use a Bayesian model to make predictions; 2) the internal relationship between a Poisson distribution, a Gamma distribution and a negative binomial distribution.</p>
<p> <br>
 
 </p>
<h2 id="question">Question:</h2>
<p>Here is the data we have worked with so far:</p></description>
</item>
<item>
<title>Dive into Bayesian statistics (3): Markov Chain Monte Carlo</title>
<link>/post/bayesian3/</link>
<pubDate>Mon, 22 Nov 2021 19:18:26 -0600</pubDate>
<guid>/post/bayesian3/</guid>
<description><p>In this post, I will continue to use the same example that I used before (<a href="/post/bayesian1/">Bayesian: MAP</a> and <a href="/post/bayesian2/">Bayesian: solve denominator</a>). Also, it will be very helpful to first understand accept-reject sampling, which I discussed in <a href="/post/monte_carlo/">this post</a>.</p>
<p> <br>
 <br>
 </p>
<p>Now let&rsquo;s get started!</p>
<p>As we discussed at the end of <a href="/post/bayesian2/">this post</a>, solving the denominator is non-trivial, especially when you have many parameters to estimate. One way to overcome this obstacle is to use a method called Markov Chain Monte Carlo (MCMC).</p></description>
</item>
<item>
<title>Dive into Bayesian statistics (2): Solve the nasty denominator!</title>
<link>/post/bayesian2/</link>
<pubDate>Mon, 22 Nov 2021 18:18:26 -0600</pubDate>
<guid>/post/bayesian2/</guid>
<description><p>In the <a href="/post/bayesian1/">last post</a>, we tried to use a Bayesian framework to model the number of visitors per hour. After combining a Poisson likelihood with a Gamma prior, we got something like:
$$
P(\lambda | \text{data}) = c \cdot \lambda^{19} e^{-6\lambda}
$$</p>
<p>Since we were interested in finding the <code>\(\lambda_0\)</code> that gives the maximum value of <code>\(P(\lambda | \text{data})\)</code> (a.k.a. <strong>Maximum A Posteriori</strong>), we didn&rsquo;t need to worry too much about the constant <code>\(c\)</code>. But in this post, we are going to solve for <code>\(c\)</code> and consolidate our understanding of Bayesian inference.</p></description>
</item>
<item>
<title>Dive into Bayesian statistics (1): Maximum A Posteriori</title>
<link>/post/bayesian1/</link>
<pubDate>Mon, 22 Nov 2021 17:18:26 -0600</pubDate>
<guid>/post/bayesian1/</guid>
<description><p>Before you read this post, I assume you are already familiar with basic probability theory, maximum likelihood estimation and Bayes&rsquo; theorem. I encourage you to read my previous post that discussed <a href="/post/mle/">MLE</a>, and we are going to use the same dataset in this post.</p>
<p>Okay, let&rsquo;s get started.</p>
<p> <br>
 
 <br>
 </p>
<h2 id="1-bayes-theorem">1. Bayes theorem</h2>
<p>In inferential statistics, our goal is to <strong>infer the population parameters</strong>. That is, we observe the data, and from the data we guess the most likely population parameters. There are, in general, two ways to approach this goal.</p></description>
</item>
<item>
<title>How to draw samples from a generic distribution?</title>
<link>/post/monte_carlo/</link>
<pubDate>Mon, 22 Nov 2021 16:18:26 -0600</pubDate>
<guid>/post/monte_carlo/</guid>
<description><p>In this post, I am going to show two methods to draw samples from a generic distribution.</p>
<p>But before we get started, I should define what I mean by a generic distribution. Here is one example:</p>
<p>$$
f(x)= \begin{cases}
0 &amp; \text{if } x &lt; 0 \\
c \cdot \sqrt{x} &amp; \text{if } 0 &lt; x &lt; 1 \\
0 &amp; \text{if } x &gt; 1 \end{cases}
$$</p>
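<p>As a quick worked step (not spelled out in the definition itself), the constant <code>\(c\)</code> follows from requiring the density to integrate to one:</p>
<p>$$
\int_0^1 c \sqrt{x} \, dx = \frac{2c}{3} = 1 \quad \Rightarrow \quad c = \frac{3}{2}
$$</p>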
<p>First, let&rsquo;s take a look at the probability density function (pdf):<br>
 </p></description>
</item>
<item>
<title>Maximum likelihood estimation</title>
<link>/post/mle/</link>
<pubDate>Mon, 22 Nov 2021 00:00:00 +0000</pubDate>
<guid>/post/mle/</guid>
<description><p>In this post, I will show you <strong>THE</strong> most important technique in inferential statistics: Maximum Likelihood Estimation (MLE).</p>
<p> <br>
 </p>
<h2 id="1-some-data-to-work-with">1. Some data to work with</h2>
<p>Before we get started, let&rsquo;s see what type of problem could be solved using MLE.</p>
<p>For example, I recorded the number of visitors to this website each hour from 8:00 am - 12:00 am (<strong>p.s.</strong> of course this is fake data, and I am probably too optimistic), and I hope to have a model that can accurately describe my data, as well as make predictions. Here is my data:</p></description>
</item>
<item>
<title>Calculate PCA by hand (via eigen-decomposition)</title>
<link>/post/pca1/</link>
<pubDate>Sun, 21 Nov 2021 00:00:00 +0000</pubDate>
<guid>/post/pca1/</guid>
<description><p>In this blog post, I will calculate PCA step-by-step (via eigen-decomposition).</p>
<p>But before we dive deep into PCA, there are two prerequisite concepts we need to understand:</p>
<ul>
<li><strong>Variance/Covariance</strong></li>
<li><strong>Find eigenvectors and eigenvalues</strong></li>
</ul>
<p>If you are already familiar with those two concepts, feel free to skip those sections.</p>
<p> <br>
 <br>
 <br>
 </p>
<h2 id="prerequisite-1-variancecovariance">Prerequisite 1: Variance/Covariance</h2>
<h3 id="variance">Variance</h3>
<p>Variance measures how far a set of numbers is spread out from their average value. The sample variance is defined as:</p></description>
</item>
<item>
<title>About</title>
<link>/about/</link>
<pubDate>Thu, 05 May 2016 21:48:51 -0700</pubDate>
<guid>/about/</guid>
<description><p>My name is Taotao Tan, and I am a Ph.D. student in Computational Biology at Baylor College of Medicine. My research interests span a diverse range of topics, including statistics, deep learning, genetics, and functional genomics.</p>
<h3 id="contact">Contact:</h3>
<p>Twitter/X: <a href="https://x.com/doubleTaoTan">@doubleTaoTan</a></p>
<p>Github: <a href="https://github.com/JasonTan-code">@JasonTan-code</a></p></description>
</item>
<item>
<title>Category</title>
<link>/category/</link>
<pubDate>Thu, 05 May 2016 21:48:51 -0700</pubDate>
<guid>/category/</guid>
<description><h3 id="genetics">Genetics</h3>
<ul>
<li><a href="/post/gwas/">Genome-wide association studies</a></li>
<li><a href="/post/ldsc/">Linkage Disequilibrium Score Regression</a></li>
<li><a href="/post/prs/">Polygenic Risk Score</a></li>
<li><a href="/post/twas/">Transcriptome-wide association studies</a></li>
</ul>
<p> 
 </p>
<h3 id="linear-algebra">Linear algebra</h3>
<ul>
<li><a href="/post/pca1/">Calculate PCA by hand (via eigen-decomposition)</a></li>
<li><a href="/post/svd/">Calculate SVD by hand (and decompose Spongebob)</a></li>
</ul>
<p> 
 </p>
<h3 id="bayesian-network">Bayesian Network</h3>
<ul>
<li><a href="/post/bayesian_network/">An Intuitive Explanation of Bayesian Network</a></li>
</ul>
<p> 
 </p>
<h3 id="inferential-statistics">Inferential Statistics</h3>
<ul>
<li><a href="/post/mle/">Maximum likelihood estimation</a></li>
<li><a href="/post/gene_exp1/">Model the Gene Expression (1): A GLM framework</a></li>
<li><a href="/post/gene_exp2/">Model the Gene Expression (2): Likelihood Ratio Test</a></li>
</ul>
<p> 
 </p>
<h3 id="bayesian-statistics">Bayesian Statistics</h3>
<ul>
<li><a href="/post/bayesian1/">Dive into Bayesian statistics (1): Maximum A Posteriori</a></li>
<li><a href="/post/bayesian2/">Dive into Bayesian statistics (2): Solve the nasty denominator!</a></li>
<li><a href="/post/bayesian3/">Dive into Bayesian statistics (3): Markov Chain Monte Carlo</a></li>
<li><a href="/post/bayesian4/">Dive into Bayesian statistics (4): Posterior predictive distribution</a></li>
<li><a href="/post/bayesian5/">Dive into Bayesian statistics (5): Intro to PyMC3</a></li>
</ul>
<p> 
 </p></description>
</item>
</channel>
</rss>