WHO WE ARE
We are an interdisciplinary lab of biologists, computer scientists and engineers who want to push the boundaries of knowledge.
WHAT WE DO
We analyze data to understand complex systems and predict their behavior. We generate and test hypotheses, create algorithms, and build software systems.
HOW WE DO IT
We develop and apply machine learning, optimization and other computational methods with HPC support on data generated in our lab and elsewhere.
We serve multiple disciplines, but we are better at biological, medical and agricultural systems and models. Experimentally, we focus mostly on microbes. We are often tagged as ML/AI geeks, HPC pests, and Systems & Synthetic Biology aficionados.
WHERE WE ARE
We are part of the Computer Science department and the UC Davis Genome Center. Our lab and offices are on the 5th floor of the Genome and Biomedical Science Facility (GBSF).
WHY WE DO IT
Long hours, low pay, constant pressure over results, publications and funding: what is there not to like? The excitement of working on discovering something novel, useful and potentially ground-breaking is difficult to match.
DESIGN, PLANNING AND STRATEGY
This always starts from the science or business question that we want to answer. We identify the challenge, the opportunity and the resources we need to succeed. We go to whatever lengths are necessary to engage the right people, design the most informative experiments and remove any intrinsic bias, to ensure that success is within grasp.
RESEARCH AND DEVELOPMENT
Once the scope and success criteria have been defined, we approach R&D through an engineering lens. Usually, interdisciplinary teams of 2-5 students, postdoctoral associates and other trainees meet weekly, divide tasks and exchange ideas.
EVALUATION, REFINEMENT AND DISSEMINATION
R&D is not a monolithic feed-forward process; it involves constant feedback, evaluation and refinement, where early failures become the guideposts for future success. Once the scientific aims of a project are completed, the final step includes peer review of our methods, publication of our findings and making our products available to our collaborators and the public.
The word2vec technique has become an essential component of text models and has even been adapted to other fields, such as recommender systems. In this blog, I will introduce the basic concepts and applications of word2vec.
When building a machine learning model to understand text, the first challenge is to encode the text as numerical values. Naively, we can encode each word in the vocabulary with a one-hot vector. The length of the vector is the same as the size of the vocabulary, and the distance between any two words is the same. At this point, a question might come to your mind: is it sensible to let the distances between synonyms and those between antonyms be the same? Definitely not. Besides, one-hot encoding usually has a very high dimension. A more reasonable embedding should preserve semantic and syntactic similarity, relations with other words, and so on. For example, the words “president” and “Trump” should be close to each other in the embedding space, and the distance between “woman” and “man” should be similar to the distance between “aunt” and “uncle”.
How do we formulate this problem from a machine learning perspective? word2vec was initially implemented with a continuous bag-of-words (CBOW) model, which, in its simplest form, predicts the next word given one word using a shallow feedforward neural network (Fig. 1). When training CBOW, each word is represented by a one-hot encoding, as we don’t have the desired embedding at this point. The CBOW model takes a word, transforms the one-hot encoding of that word into a new embedding space and finally predicts the probability of each word in the vocabulary. When the learning process converges, the first part of the CBOW model is able to generate a semantic embedding for each word. In the CBOW setting, the input word is the context of the predicted word. In practice, the context consists of multiple words. The way CBOW defines the learning task for the model is similar to a fill-in-the-blank quiz. Note that a CBOW model is trained in a supervised way, but no labeled data is needed to generate word2vec embeddings because the context-word pairs can be generated without human labels.
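The point about one-hot distances can be checked with a few lines of Python. This is only a minimal sketch: the three-word vocabulary and the 2-d dense vectors are hand-picked, hypothetical values chosen for illustration, not learned embeddings.

```python
import math


def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy vocabulary: two near-synonyms and one antonym.
vocab = ["good", "great", "bad"]
one_hot_vectors = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

# Hypothetical dense embedding (values invented for illustration only).
dense_vectors = {"good": [0.9, 0.8], "great": [1.0, 0.9], "bad": [-0.9, -0.7]}

# With one-hot vectors, every pair of distinct words is equally far apart.
d_syn_onehot = euclidean(one_hot_vectors["good"], one_hot_vectors["great"])
d_ant_onehot = euclidean(one_hot_vectors["good"], one_hot_vectors["bad"])

# With a dense embedding, synonyms can sit closer together than antonyms.
d_syn_dense = euclidean(dense_vectors["good"], dense_vectors["great"])
d_ant_dense = euclidean(dense_vectors["good"], dense_vectors["bad"])

print(d_syn_onehot, d_ant_onehot)
print(d_syn_dense, d_ant_dense)
```

Both one-hot distances come out identical, while the dense vectors separate the synonym pair from the antonym pair.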
Fig. 1. A simple CBOW model with only one word in the context. In such a setting, the vocabulary size is V, and the hidden layer size is N.
Interestingly, the idea of inferring the meaning of a word from its context is also applicable when building a recommender system. In the context of modeling a user’s behavior, a sequence of the user’s activities is the counterpart of a text, and each activity is the counterpart of a word. To make this idea more concrete, suppose we are recommending music to a user. The sequence of songs a user listens to can be thought of as a sentence and each song as a word. So we can create an item2vec embedding to represent each song, where “item” refers to a song. In fact, this idea has led to big commercial success.
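The CBOW description above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the reference word2vec implementation: the toy sentence, the window size, the names `context_target_pairs` and `cbow_forward`, and the random untrained weights are all hypothetical. It shows how context-word pairs come from raw text without labels, and how one forward pass maps a context to a probability for each of the V vocabulary words through a hidden layer of size N.

```python
import math
import random


def context_target_pairs(tokens, window=1):
    """Build (context words, target word) pairs from unlabeled text."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, w_in, w_out):
    """One forward pass: w_in is V x N, w_out is N x V."""
    n = len(w_in[0])
    v = len(w_out[0])
    # Multiplying a one-hot vector by w_in selects a row, so the hidden
    # layer is the average of the rows that belong to the context words.
    hidden = [sum(w_in[c][j] for c in context_ids) / len(context_ids)
              for j in range(n)]
    scores = [sum(hidden[j] * w_out[j][k] for j in range(n)) for k in range(v)]
    return softmax(scores)


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3  # vocabulary size and hidden layer size

rng = random.Random(0)  # untrained, randomly initialised weights
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

pairs = context_target_pairs(tokens, window=1)
context, target = pairs[1]
probs = cbow_forward([word_to_id[w] for w in context], W_in, W_out)
print(context, "->", target)
print(probs)
```

After training, the rows of `W_in` would serve as the word embeddings; here the weights are random, so the probabilities carry no meaning yet.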
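To illustrate the analogy, the sketch below treats each listening session as a sentence and each song as a word, then extracts the (song, neighbouring song) pairs that an item2vec-style model could be trained on. The sessions, the song names, the window size and the helper `session_pairs` are hypothetical; this is not the procedure of any specific production system.

```python
from collections import Counter


def session_pairs(session, window=1):
    """Yield (song, neighbouring song) pairs within one listening session."""
    for i, song in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield song, session[j]


# Hypothetical listening sessions: each list plays the role of a sentence.
sessions = [
    ["song_a", "song_b", "song_c"],
    ["song_b", "song_c", "song_d"],
]

# Count how often two songs appear next to each other across sessions.
pair_counts = Counter(
    pair for session in sessions for pair in session_pairs(session, window=1)
)

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Songs that are often played close together share contexts, which is what lets an embedding model place them near each other.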
Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.
Xin Rong, “word2vec Parameter Learning Explained”. https://arxiv.org/pdf/1411.2738.pdf
Oren Barkan, Noam Koenigstein. “Item2Vec: Neural Item Embedding for Collaborative Filtering”. https://arxiv.org/vc/arxiv/papers/1603/1603.04259v2.pdf
Grbovic, Mihajlo, and Haibin Cheng. “Real-time personalization using embeddings for search ranking at Airbnb.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018.
Metabolic Flux Analysis (MFA) is a technique used to quantify metabolic fluxes (the rate of turnover of molecules). MFA has at least two important applications. First, by studying metabolic flux, it is possible to adjust the amount of substrates or ingredients in the medium or protocol of a cell culture, or to mutate genes, to improve productivity. Second, by adjusting the diet to maximize the biomass of the gut microbiome, people may achieve better health.
Here are three examples to show the applications of MFA:
- Building a model to predict the productivity of monoclonal antibody given different substrates or media.
This study built the model by curating the metabolic network (Figure 1). Once curated, the network can predict the output flux of monoclonal antibody (mAb) given the input fluxes of nutrients, including amino acids, glucose and ammonium. Figure 2 shows the experimental results (symbols) and the predictions (lines). Using different media (Biogro-CHO medium or PowerCHO-2 medium) or different culture approaches (fed-batch or batch culture) yields different mAb productivity.
Fig. 1. Metabolic network in Chinese hamster ovary (CHO) cells. By quantifying metabolic fluxes, we can optimize the productivity of the product by adjusting the substrates or medium. In this case, the product is monoclonal antibody (mAb) and the substrates are amino acids, glucose and ammonium.
Fig. 2. Time series of variable cell concentration, cell viability, product (mAb) and substrates. The top part shows the results using Biogro-CHO medium, and the lower part shows the results using PowerCHO-2 medium. The filled symbols represent batch culture and the empty symbols represent fed-batch culture.
After building the model that can predict mAb productivity, it is possible to optimize productivity by adjusting the substrates or ingredients of the medium.
- Comparing metabolic fluxes between the wild type and a mutant overexpressing the aroA and aroD genes.
This study compared the metabolic fluxes of the wild type and a mutant overexpressing the aroA and aroD genes, given that this mutant shows a significant accumulation of the product, shikimic acid (SA). By comparing the fluxes of the mutant and wild-type strains, we can learn more about the mechanism of SA production, which may help to improve SA productivity further.
Fig. 3. Shikimic acid (SA) is a commercial product. The strain that overexpresses the aroA and aroD genes (Fig. 3b) has a higher SA production rate.
- Maximizing the biomass of gut microbiome to improve health status by adjusting diet.
There are studies showing that healthier people have a higher bacterial gene count in their gut than less healthy people. We assume that gene count is correlated with the amount (biomass) of gut microbiota, so that a higher gene count may improve health status. Under these assumptions, an optimization algorithm was proposed to maximize the biomass of the microbiome, subject to the flux constraints, by adjusting the diet. Finally, the eight essential amino acids consumed by the gut microbiome, which may increase or decrease the abundance of the microbiome, can be seen in Figure 5.
Fig. 4. Increasing the biomass of the gut microbiome by adjusting the diet: the metabolic pathways between host and microbiome are obtained, and then the substrates that maximize the biomass of the microbiome are identified. Finally, foods containing these ingredients may help to maximize the biomass of the microbiome.
Fig. 5. (A) Simulated consumption of the eight essential amino acids by the gut microbiome of the LGC individuals (yellow circles, unhealthy) and the HGC individuals (green circles, healthy). (B) Positive or negative effect of different food sources on improving the phenotype of LGC subjects.
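The idea of maximizing biomass subject to flux constraints, as in the third example above, can be written as a linear program. The formulation below is the generic flux-balance form, given as a sketch; it is not necessarily the exact model used in the cited study. All symbols are assumptions introduced here: S is the stoichiometric matrix of the metabolic network, v is the vector of reaction fluxes, v_biomass is the flux of the biomass reaction, and l_j and u_j are lower and upper bounds on each flux.

```latex
\begin{aligned}
\max_{v} \quad & v_{\text{biomass}} \\
\text{subject to} \quad & S\,v = 0, \\
& l_j \le v_j \le u_j \quad \text{for every reaction } j .
\end{aligned}
```

In this form, a change of diet would enter through the bounds on the nutrient uptake reactions, while the constraint S v = 0 expresses the steady-state assumption that no internal metabolite accumulates.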
Robitaille, J., Chen, J., & Jolicoeur, M. (2015). A single dynamic metabolic model can describe mAb producing CHO cell batch and fed-batch cultures on different culture media. PloS one, 10(9), e0136815.
Liu, D. F., Ai, G. M., Zheng, Q. X., Liu, C., Jiang, C. Y., Liu, L. X., … & Liu, S. J. (2014). Metabolic flux responses to genetic modification for shikimic acid production by Bacillus subtilis strains. Microbial cell factories, 13(1), 40.
Shoaie, S., Ghaffari, P., Kovatcheva-Datchary, P., Mardinoglu, A., Sen, P., Pujos-Guillot, E., … & Hoyles, L. (2015). Quantifying diet-induced metabolic changes of the human gut microbiome. Cell metabolism, 22(2), 320-331.
Large-scale recombinant protein production is one of the most significant achievements of modern biotechnology. These proteins have wide applications in molecular biology, therapeutics, and industry. Efficient recombinant protein production using genetically manipulated organisms has saved lives by providing pure therapeutic and prophylactic proteins in accessible amounts. Today, more than 75 recombinant proteins are being utilized as pharmaceuticals, and over 360 new recombinant protein-based medicines are under development. In addition to pharmaceuticals, recombinant proteins are also being used as insecticides, in diagnostic kits, and as enzymes with numerous applications, such as detergent production and food processing.
Optimizing the expression of a recombinant protein is one of the key steps in improving the yield. Still, optimization of protein production mostly relies on a trial-and-error approach, and most research focuses only on the optimization of genetic components such as promoter activity, the ribosome binding site, and the terminator. Recent studies have indicated that, apart from these basic genetic components, the chromosomal position of the recombinant gene can also influence expression. A handful of independent investigations, using common recombinant protein producers such as Escherichia coli, Bacillus subtilis, and Streptomyces albus, have indicated a possible effect of chromosomal position on gene expression. A study by Bryant et al. (2014) demonstrated that E. coli with a chromosomally integrated reporter gene cassette, composed of the promoter Plac and the fluorescent reporter gene gfp, produced up to a 300-fold change in GFP level when the cassette was expressed from different chromosomal positions. The change in gene dosage due to replication accounted for only a 1.4-fold change out of the 300-fold change. Using an RNA polymerase-promoter occupancy assay, it was demonstrated that these variations are regulated at the transcription level. In agreement with these findings, Englaender et al. (2017) expressed the gene encoding the fluorescent reporter protein mCherry from four genomic loci on the E. coli chromosome to measure protein expression at each site. Expression levels ranged from 25% to 500% compared to the gene expressed on a high-copy plasmid. The effect of chromosomal position is not limited to E. coli: Bilyk et al. (2017) also demonstrated that expression of a reporter gene in Streptomyces albus J1074 varied up to 8-fold depending on its position on the chromosome. However, several studies refute the claim that chromosomal position influences recombinant gene expression.
These studies argue that gene dosage is primarily responsible for the level of expression of heterologous genes. Block et al. (2012), using E. coli as a host, demonstrated that a target gene placed closer to the origin of replication produces more transcripts because of the higher gene dosage resulting from replication. Expression from the target gene decreases linearly as its distance from the origin of replication increases. In agreement with this study, Sauer et al. (2016) investigated how genome location and gene orientation influence expression in B. subtilis. They observed that the expression of a lacZ reporter gene can differ up to 5-fold depending on its chromosomal location. This difference in expression correlated strongly with the location of the reporter cassette relative to the origin of replication and was not influenced by gene orientation with respect to the direction of DNA replication.
Although contradictory, these preliminary studies indicate that the chromosomal expression of heterologous genes can be influenced by chromosomal position and by distance from the origin of replication. Further system-level expression analysis, using different reporter genes, is required to make a definite claim.
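The gene dosage argument can be made concrete with the classic Cooper-Helmstetter picture of replication. That model is an assumption brought in here for illustration and is not taken from the studies cited above. In it, the average copy number of a locus relative to the terminus is 2^(C(1 - m)/τ), where m is the fractional distance from the origin (0) to the terminus (1), C is the time needed to replicate the chromosome and τ is the doubling time. The function name and the values of C and τ below are hypothetical.

```python
def relative_gene_dosage(position, c_period_min, doubling_time_min):
    """Average copy number of a locus relative to the terminus.

    position: fractional distance from the origin (0.0) to the terminus (1.0).
    c_period_min: time to replicate the chromosome, in minutes (assumed value).
    doubling_time_min: population doubling time, in minutes (assumed value).
    """
    return 2 ** (c_period_min * (1.0 - position) / doubling_time_min)


# Assumed, illustrative growth parameters (not measured values).
C_PERIOD_MIN = 40.0
DOUBLING_TIME_MIN = 60.0

origin = relative_gene_dosage(0.0, C_PERIOD_MIN, DOUBLING_TIME_MIN)
midway = relative_gene_dosage(0.5, C_PERIOD_MIN, DOUBLING_TIME_MIN)
terminus = relative_gene_dosage(1.0, C_PERIOD_MIN, DOUBLING_TIME_MIN)

print(f"origin: {origin:.2f}x, midway: {midway:.2f}x, terminus: {terminus:.2f}x")
```

Under this model the dosage falls off exponentially with distance from the origin, which looks roughly linear when C is short relative to τ, and the gap between origin and terminus widens as cells grow faster.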
A new publication from our lab with Ki-Jo, Minseung, Iannis and Prof. Tagkopoulos on curating a compendium of rheumatoid arthritis (RA) synovial genome-scale transcriptomic profiles, which covers an intersection of 11,769 genes from 11 datasets. Three clusters with different pathway signatures are observed, and a prediction model that uses gene expression levels as input to predict drug effectiveness is built.
Kim, K. J., Kim, M., Adamopoulos, I., & Tagkopoulos, I. (2019). Compendium of synovial signatures identifies pathologic characteristics for predicting treatment response in rheumatoid arthritis patients. Clinical Immunology.
A new publication from our lab with Navneet, Linh, Minseung and Prof. Tagkopoulos on a reproducible phenomenon, observed in specific E. coli strains, of an abrupt population decrease followed by a rapid increase during long-term chemostat cultivations. The genetic basis of this phenomenon is identified by genome and transcription profiling.
A new publication from our lab with Ki-Jo and Prof. Tagkopoulos reviewing different machine learning approaches for clinical data analysis in rheumatic disease research.