30 May 2023 by

This year the lab completed our second trip to the Guadalupe Mountains to collect plants - part of our project to understand how 50 years of climate change has affected the ecophysiology and genetics of plant populations in this biodiversity hotspot. Following in the footsteps of TTU botanists David Northington and Tony Burgess, twelve members of the Johnson Lab, Schwilk Lab, and Smith Lab spent five days at the Pine Springs Campground. We had two main goals for the trip:

  1. Collect plants along an elevational gradient up both sides of the Pine Springs Canyon (6300 feet) - towards Guadalupe Peak (8750 feet) on one side and towards The Bowl (7500 feet) on the other. The plants will be used to establish whether plants along the gradient change their water use efficiency and photosynthetic effort.
  2. Collect plants from several endemic and range-restricted species for conservation genetics using Angiosperms353, including Chaetopappa hersheyi (Asteraceae), Salvia summa (Lamiaceae), Penstemon cardinalis (Plantaginaceae) and Philadelphus hitchcockianus (Hydrangeaceae).

The hike to Guadalupe Peak was challenging but most in the group made it to the top and collected plants at the top of Texas! The next day some folks worked the lower elevations while others climbed to The Bowl and collected in the unique pine forest region. On the third day, two students and Dylan Schwilk hiked the Guadalupe Peak trail again, while another group collected along the trail to Smith Spring at lower elevation.

All the while, we worked to process specimens - keying plants to species, pressing them to preserve for herbarium specimens, and scanning live leaf samples for calculations of leaf mass:area ratios.
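For readers curious about the leaf mass:area calculation, here is a minimal sketch. It assumes the one-sided leaf area comes from the scans in square centimeters and the mass is an oven-dry mass in grams. The function name and the sample values are hypothetical and are not taken from the lab's actual workflow.

```python
def leaf_mass_per_area(dry_mass_g: float, leaf_area_cm2: float) -> float:
    """Return leaf mass per area (LMA) in grams per square meter.

    Assumes dry_mass_g is oven-dry leaf mass and leaf_area_cm2 is the
    one-sided area measured from a scan (both are illustrative inputs).
    """
    if leaf_area_cm2 <= 0:
        raise ValueError("leaf area must be positive")
    area_m2 = leaf_area_cm2 / 10_000.0  # 1 m^2 = 10,000 cm^2
    return dry_mass_g / area_m2


# Hypothetical leaf: 0.12 g dry mass, 15 cm^2 scanned area
print(round(leaf_mass_per_area(0.12, 15.0), 2))
```

Reporting the ratio per square meter is only one common convention; the same function works for any unit choice as long as the conversion factor is adjusted.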

The trip was funded in part by the TTU Climate Science Center, through its Ensuring Livable Futures program. Madison Bullock, a PhD student in the lab, is overseeing the data analysis for this project and is already working on processing specimens, measuring plant physiology, and extracting DNA for genetic analysis.

a. Madison finds her first specimen of Chaetopappa hersheyi
b. The GUMO 2022 field team near Guadalupe Peak, with El Capitan in the background (photo by Justin Rex)
c, d. Undergraduates Jazlyn and Garrison work to press specimens on the Guadalupe Peak trail; Salvia farinacea; the team works to press and key plants at camp by headlamp light

Post-Doctoral Research Position

The Johnson Lab at Texas Tech University (Lubbock, Texas) is looking for a postdoctoral scholar to develop novel bioinformatics methods that extend targeted DNA sequencing in plants for use on mixed and unknown samples. This two-year full-time position is funded by a Broad Agency Announcement from the United States Food and Drug Administration Center for Food Safety and Applied Nutrition (FDA-CFSAN). The successful candidate will work with Dr. Matt Johnson (Assistant Professor, TTU Department of Biological Sciences), in collaboration with Dr. Sara Handy (FDA-CFSAN), to write open-source bioinformatics software: specifically, new methods that leverage Angiosperms353 data to extend the potential of metagenomics in plants. You can find more information about the project here.

Applicants should have a PhD in biology (botany, evolution, genetics), computer science (bioinformatics, data science, software development) or a related field, with the degree completed before starting the postdoc position. Ideal candidates will have experience working with DNA sequence data in one or more programming languages (Python, C++, R, Java, etc.), demonstrated through scientific publication and/or published code (e.g. GitHub). Preferred candidates will have strong written and oral communication skills and a demonstrated ability to work both independently and as part of a research team.

The successful candidate will be expected to:

  1. generate a database of high-throughput targeted DNA sequencing data from plants
  2. develop novel computational methods for the identification of plants using targeted DNA sequences from mixed collections
  3. mentor graduate and undergraduate students in the development of bioinformatics skills
  4. publish peer-reviewed manuscripts and present research results at national scientific conferences.

The Johnson Lab is dedicated to creating a diverse, equitable, and inclusive environment for generating high quality science, and the successful candidate is expected to share this commitment. Candidates from groups historically excluded in biological and computer science research are especially encouraged to apply. Review of applications will begin October 15, 2022 and will continue until the position is filled. The start date is flexible and can be as early as January 2023.

How To Apply

All candidates should e-mail Matt Johnson (matt.johnsonobfuscate@ttu.edu) to confirm their application.

Follow this link to the TTU employment page to apply for position 30800BR: https://bit.ly/3T3lryZ

Note for current PhD students: If you state that you do not yet have a PhD, you may get an e-mail stating you are not qualified. That’s not true! All applicants who will have a PhD by the position start date will be considered.


Applications should include:

  • Curriculum Vitae
  • Names and contact information for 3 references
  • Statement of interest (no more than two pages) describing:
    • Your experience with high-throughput DNA sequencing
    • Your skills in bioinformatics and computer programming
    • Your commitment to working in a diverse, equitable, and inclusive environment


Salary for this position will follow the NIH fellowship and training guidelines for post-doctoral researchers, commensurate with candidate experience.

21 May 2022 by

It has been 50 years since the Biological Investigations in the Guadalupe Mountains were completed by a number of scientists from Texas Tech, including TTU botanists David Northington and Tony Burgess. The record of plants, made at the time Guadalupe Mountains National Park (GUMO) first opened, established the TTC herbarium as the official repository of plants collected in GUMO and stands at 1500 specimens and counting! Now, we are working to use this unique collection to study the effects of 50 years of land use and climate change on the park flora.

Our first return field trip was a successful collaboration of three labs at Texas Tech - the Schwilk Lab, Smith Lab, and Johnson Lab combined forces, bringing together five undergraduates, three graduate students, and one post-doc. We learned a ton about balancing the needs of different types of data collection - pressing specimens for the herbarium, storing leaf tissue for leaf-mass-area and carbon-nitrogen ratio measurements, and measuring plant physiology with the LiCOR 6800.

The trip was funded in part by the TTU Climate Science Center, through its Ensuring Livable Futures program. Madison Bullock, a PhD student in the lab, is overseeing the data analysis for this project and is already working on processing specimens, measuring plant physiology, and extracting DNA for genetic analysis.

a. Home base for the week at the Ship on the Desert at Guadalupe Mountains National Park
b. The GUMO 2022 field team after a hike to Pratt Cabin at McKittrick Canyon
c. Measuring photosynthetic activity using a LiCOR 6800 on Salvia farinacea
d. Two students pressing specimens collected in the Guadalupe Mountains National Park

Congratulations, Aman!

This semester we had a major milestone, as Aman Pruthi became the first graduate student to finish a degree in the Johnson Lab!

Aman’s Master’s thesis, “Development of genomic tools for the moss Bryum argenteum and its comparative analysis with other published moss genomes,” is available on the Texas Tech Library website.

Aman used two types of high-throughput DNA sequencing - Oxford Nanopore PromethION and Illumina NovaSeq - to generate a hybrid genome assembly and annotation of the moss Bryum argenteum. His work is part of our collaborative project funded by the US Golf Association to control “silvery threaded moss” on putting greens. We have observed major growth differences in B. argenteum when it is on golf courses versus in its natural habitat, and this genome will provide insights on any genetic differences associated with the invasion of putting greens.

Aman officially walked at graduation in May 2022. He accepted a job as a bioinformatics specialist at Frontage Laboratories in south Florida. Congratulations, Aman, and good luck with your next steps!

a. Matt Johnson and Aman Pruthi after Graduation, May 2022
b. Bryum argenteum gametophytes and a few young sporophytes
18 Jan 2022 by amanpruthi

Herbarium Field Trip 2022

And we are off! The first research trip of the year, with over 2,200 miles (about twice the distance from Florida to New York City) to cover within a week! Madison Bullock and Reese Price traveled to 8 different cities across the southwestern United States to collect herbarium tissue for their research projects, which focus on topics such as conservation genomics and population genetics.

The first stop on the trip was Albuquerque, NM, where we met Dr. Hannah Marx at the University of New Mexico and sampled over 60 specimens. We then headed to San Juan College in Farmington, NM, where a highlight was talking with Dr. Arnold Clifford about his exciting research and his personal herbarium collection. Finishing the first day of the research trip, we headed to Gallup, NM, where we got to see the historic El Rancho Hotel/Restaurant along Route 66, where many movie stars have stayed in the past.

The second day was just as interesting as the first and had a lot of miles ahead! Heading west towards Flagstaff, AZ, we took a short detour through the Petrified Forest, where we saw the Painted Desert, Blue Mesa trail, and the Jasper Forest. After taking the scenic route, we stopped in Flagstaff to continue collecting specimens for our research from Northern Arizona University with the help of the herbarium’s curator, Tina Ayers.

The third day was one of the busiest of the entire trip, with three herbaria to visit in three different cities. Starting out with Arizona State University in Tempe, we sampled around 30 specimens before heading to the Desert Botanical Garden in Phoenix.

We had some time to tour the Garden and were impressed with the saguaro cactus - who knew they could get so tall? We were also impressed with the glass-blown artwork from Chihuly.

To wrap up the day we drove to Tucson to sample from the University of Arizona and visited the historic downtown. Here you can see Madison sampling Philadelphus tissue that will be beneficial to her upcoming conservation genetics research.

The next day we drove back east to Las Cruces, NM to visit the New Mexico State University Herbarium. We had a good time talking about specimens and our plans for target capture sequencing with Angiosperms353 with Dr. Sara Fuentes-Soriano and Dr. Donovan Bailey. On the last day of the trip we re-entered Texas to pick up the last of our samples from the University of Texas at El Paso with the help of Vicky Zhuang and Michael Moody. On the way home, we could not pass up the opportunity to see White Sands National Park.

We’re so thankful for the help of all the herbaria and their staff! We will now be busy the first two months of the semester preparing nearly 800 target capture libraries for sequencing - hopefully with some intriguing results coming soon!

Yanni Chen Wins Native Plant Society Grant

We are pleased to announce that Ph.D. Candidate Yanni Chen has received an Ann Miller Gonzalez Graduate Research Grant from the Native Plant Society of Texas, for her project titled: “Ecoregion Variation Impacts on Gene Expression Associated with Smoke-induced Germination in Bouteloua gracilis, Shortgrass Prairie Native Species”

Bouteloua gracilis (blue grama) is a grass native to Texas and plays an important role in cattle grazing, landscape construction, and shortgrass prairie restoration. Smoke is a low-cost, efficient treatment for breaking seed dormancy and has proven effective on B. gracilis. However, variation among seed-source populations of B. gracilis may influence the effectiveness of smoke, especially if genes showing a smoke response in model organisms do not show the same response in native species. Yanni proposed using RT-qPCR to examine variation in primary smoke receptor genes across populations of the shortgrass prairie native species Bouteloua gracilis. The study will test whether seed source influences smoke-induced germination in B. gracilis, while also enhancing our broader understanding of smoke-induced seed germination.

For the project, Yanni will collect seeds of blue grama grass across the precipitation gradient in Texas and test them for variation in smoke-induced seed germination. Yanni will then test for variation in gene expression in KAI2 and other genes identified as involved in smoke sensing in model organisms, using real-time quantitative PCR (RT-qPCR).
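As an illustration of how gene expression from quantitative PCR is often compared between a treatment and a control, here is a minimal sketch of the widely used 2^-ΔΔCt calculation. The post does not say which analysis Yanni will use, so this is an assumption for illustration only. The function name, the choice of a single reference gene, and the Ct values are all hypothetical, and the method assumes roughly 100% amplification efficiency.

```python
def fold_change_ddct(
    ct_target_treated: float,
    ct_ref_treated: float,
    ct_target_control: float,
    ct_ref_control: float,
) -> float:
    """Relative expression of a target gene (treated vs. control) by 2^-ddCt.

    Each Ct is a qPCR cycle-threshold value. The reference gene is assumed
    to be stably expressed in both conditions (an assumption of the method).
    """
    delta_treated = ct_target_treated - ct_ref_treated  # normalize treated sample
    delta_control = ct_target_control - ct_ref_control  # normalize control sample
    ddct = delta_treated - delta_control
    return 2.0 ** (-ddct)


# Hypothetical Ct values: a target gene in smoke-water vs. plain-water seeds
print(fold_change_ddct(22.0, 18.0, 24.0, 18.0))
```

A value above 1 would indicate higher expression in the treated sample than in the control, and a value below 1 would indicate lower expression.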

Congratulations to Yanni!

17 Jul 2021 by mossmatters

Botany 2021

Johnson Lab Talks at Botany 2021.

The best week of the year for botany research has arrived - and the Johnson Lab is prepared for a great virtual Botany2021! There will be eight talks featuring research done by graduate students and undergraduates in the lab and the E.L. Reed Herbarium. Here are the Johnson Lab talks (times listed in EDT, GMT-4).

  • “Development of genomic tools for Bryum argenteum: Genome assembly and annotation using long and short reads” M.S. Student Aman Pruthi is giving an update on the Bryum argenteum genome assembly he’s been working on the past couple of years. This project is funded by the United States Golf Association, because of the impact “silvery-threaded moss” has had on putting greens around North America. Aman used long reads from Nanopore PromethION and short reads from Illumina NovaSeq to create a draft assembly. Abstract #299, Monday July 19 11:15 AM.
  • “Differential gene expression of smoke induced seed germination of shortgrass prairie native species” Ph.D. Candidate Yanni Chen is studying the effects of smoke on seed germination, and this talk focuses on her differential expression analysis. Using seeds from the native grass Bouteloua gracilis, Yanni used RNAseq to identify genes responding to smoke water compared to normal water prior to germination. She compares her results to those known from model organisms to see if the same genes show differential expression in this non-model species. Abstract #600 Tuesday, July 20, 1:15 PM.
  • “Testing for cryptic species in Physcomitrium pyriforme using target capture sequencing of 800 nuclear genes.” Ph.D. Student Lindsay Williams is continuing a 130-year discussion about the number of species represented by “goblet moss” in eastern North America. Collecting hundreds of samples from across the range was helped in large part by iNaturalist users through the PhyscoHunt community science project. Using HybSeq, Lindsay is able to identify distinct genetic clusters with overlapping ranges – could they be separate species? Abstract #589 Wednesday, July 21, 3:45 PM.
  • “Damage in antique DNA from herbarium specimens: harmful rust or healthy patina?” PI Matt Johnson investigates whether patterns of DNA damage long known to occur in ancient samples also affect herbarium specimens. An Artocarpus target capture data set was used to identify deamination - the spontaneous mutation of cytosine to uracil - and determine the effect this process would have on phylogenies that incorporate herbarium specimens. Abstract #625 Thursday, July 22, 11:30 AM
  • “Reconstructing a phylogeny of sand verbenas (Abronia, Tripterocalyx) using Angiosperms353” Undergraduate Sherese Price began working with Abronia and target capture data as part of a course and continued in the lab to help produce a taxon-complete phylogenetic dataset with Angiosperms353. Lightning Talk Abstract #293 Thursday, July 22, 4:15 PM.
  • “Comparison of machine learning and manual approaches for assessing morphology in herbarium specimens” Undergraduate Anukriti Dey manually measured leaves of Asclepias asperula in hundreds of herbarium specimens using ImageJ and compared this to the speed and accuracy of automated methods like LeafMachine. Find out the pros and cons at her Lightning Talk Abstract #498 Thursday, July 22, 4:30 PM.
  • “Conservation genomics of the ethnobotanically important argan tree” Undergraduate Madeline Slimp is using Angiosperms353 as a tool for conservation genetics, looking at the endangered and ethnobotanically important Sideroxylon spinosum across its range in Morocco. Using sampling similar to earlier efforts with ISSR and microsatellite markers, Madeline identifies population structure within the region, which will help inform conservation efforts. Abstract #720 Friday, July 23, 12:45 PM.

The Johnson Lab will also be represented in a talk by Jose Villeda, an undergraduate in the Smith Lab at Texas Tech. Jose’s talk, “Correlation of plant traits along a fast-slow continuum using 50 year old herbarium specimens,” features elemental analysis, stomatal density, and georeferencing work conducted by Johnson Lab undergraduates Cassidy Coker, Zach Bailey, and Madeline Slimp. Abstract #742 Monday, July 19, 11:00 AM.

Follow along with our Botany2021 experience on Twitter!

18 Jun 2021 by mossmatters

Paper wins award

Our paper “Phylogenomic delineation of Physcomitrium (Bryophyta: Funariaceae) based on targeted sequencing of nuclear exons and their flanking regions.” has won the 2019 Outstanding Papers by Young Investigators Award by the Journal of Systematics and Evolution! The paper, which uses a phylogeny of over 600 nuclear loci to identify relationships within Physcomitrium and Entosthodon, was led by Rafael Medina. One of the key findings of the paper was that single-taxon genera with reduced sporophytes, like Physcomitrella, Physcomitridium, and Aphanorrhegma, are embedded within Physcomitrium, representing parallel evolution. The rejection of the reduced-sporophyte genera results in a reclassification of the model moss, P. patens, which is now Physcomitrium patens.

The JSE Young Investigators Award is given to outstanding papers from first authors who received their Ph.D. degree within 7 years and includes a cash award. The authors have decided to donate the prize to the International Association of Bryologists to support student research awards.

You can read the full paper (open access) here.

11 Apr 2021 by

Field Botany and Natural History Collections

Course (BIOL4301-012) offered Fall 2023

Instructor: Dr. Matthew Johnson, matt.johnsonobfuscate@ttu.edu

Prerequisites: None, Contact Instructor for Permission to Enroll

Class Hours: Tuesday/Thursday 11:00 – 12:20

Class Location: Herbarium, Biology 714

The E.L. Reed Herbarium can be accessed by taking the elevator up to the 6th floor, walking across the hall to the west staircase, and walking up to the 7th floor. Please contact the instructor if you need accommodation using the service elevator.

Students are expected to attend at least one weekend day trip in September 2023 to collect specimens.

Expected Learning Outcomes

In this course, students will learn to:

  • evaluate and critique current curation practices from multiple disciplines after learning about the basic techniques used within each discipline.
  • collect and curate a specimen of a natural history collection from start to finish.
  • interpret and discuss primary scientific literature that utilized natural history collections.

Students should demonstrate proficiency in the collection of specimens, mounting, digitizing, and curation of their specimen.

Mounting specimens in the Herbarium
Herbarium Specimen from 1974

Tentative Schedule Fall 2023

Week | Topic
Aug. 23 – 27 | Introduction to Natural History Collections; Mounting Herbarium Specimens Tutorial
Aug. 30 – Sept. 3 | Making field observations using iNaturalist and field journals
Sept. 6 – 10 | Curation of Natural History Collections; Digitization: Databasing, Imaging, and Georeferencing
Sept. 13 – 17 | Ethics of Natural History Collections: Permits, Colonialism, and Stewardship; The E.L. Reed Herbarium
Sept. 20 – 24 | Plant Taxonomy: Type Specimens and Vouchers; Field Collections: Pressing plants and field journals
Sept. 27 – Oct. 1 | Use of digital resources for natural history collections (iDigBio, GBIF, iNaturalist); Participate in WeDigBio
Oct. 4 – 8 | Class time to work on Practical Assignments; Practical due October 7
Oct. 11 – 15 | Introduction to Class Discussion Project; Introduction to Final Curation Project
Oct. 18 – 22 | In-class workshops for researching class discussion projects; In-class workshops for final curation projects
Oct. 25 – 29 | Discussion Topic #1 – Colonialism and the creation of natural history collections; Final Project Workshop during class
Nov. 1 – 5 | Discussion Topic #2 – Natural History Collections in Ecology/Evolution; Final Project Workshop during class
Nov. 8 – 12 | Discussion Topic #3 – Natural History Collections in Physiology; Final Project Workshop during class
Nov. 15 – 19 | Discussion Topic #4 – Natural History Collections in Genomics; Final Project Workshop during class
Nov. 23 | Discussion Topic #5 – Natural History Collections in Ethnobiology; No Class Nov. 26 (Thanksgiving)
Nov. 30 – Dec. 1 | Curation Project Presentations
Dec. 7 | Final Curation Projects Due

Methods of Assessing Expected Learning Outcomes

Students will be assessed via:

  1. Attendance and participation in lecture and hands-on activities
  2. Participation in collection events
  3. In-class practical examination on curation of herbarium specimens
  4. Student-led discussion in person and online
  5. Final project and presentation

Participation will be evaluated by attendance and attention in class. Active discussion of class topics in the Blackboard forum is expected. Students should attend each class having read any assigned materials.

Students must attend at least one collection event (exact details to be determined, but may include a BioBlitz or trip to the TTU Rangelands). Collections made in this course will be held in the TTU E.L. Reed Herbarium, digitized in the TORCH database, and uploaded to a citizen science website such as iNaturalist.

The In-Class practical will consist of individual demonstration of proficiency in the following topics: Field Journal, Pressing Specimens, Curating Specimens, Digitizing Natural History Collections Data. Students will have in-class time to prepare their practical materials, and may also schedule time to visit the herbarium outside of class hours.

During the second half of the semester, each student will work in a group to lead discussion on the use of herbarium specimens in modern plant science research. Students will be responsible for picking reading materials and will be evaluated based on the quality of discussion generated.

Each student will conduct a final project relevant to the collection, curation, digitization, or research use of natural history collections. The project guidelines are flexible and may include work on new specimens or existing collections in the E.L. Reed Herbarium. Students will present on their project during the final week of classes, and will compile a written report due during exam week.

Angiosperms353 PopGen

Our lab has a new preprint, “On the potential of Angiosperms353 for Population Genomics.” The paper is the first for undergraduate (and first author!) Madeline Slimp and Ph.D. Student Lindsay Williams. It is also the first empirical paper entirely produced by our lab at Texas Tech!

Some highlights include:

  • Successful low-cost library preparation from 50-year-old herbarium specimens.
  • High recovery rate of Angiosperms353 genes from 24 species.
  • Adapted data workflows to call SNPs within species from HybSeq data.
  • Substantial genetic diversity within all species.

Click here to check out the preprint!

24 Jul 2020 by mossmatters

Botany 2020

Johnson Lab Talks at Botany 2020.

The highlight of the year for us is always the Botany conference, which is being held virtually this year. While we are sad to miss out on Alaska, the online setting means our lab has our highest participation yet, with eight talks!

  • “Herbaria as botanical snapshots: 50 years of land use and climate change impacts on genetics and physiology in the Guadalupe Mountains” Undergraduate Madeline Slimp will be giving an overview of work we’re doing in the E.L. Reed Herbarium using 2000+ specimens collected in Guadalupe Mountains National Park when it opened in the 1970s. Conservation genomics, stomatal density, metagenomics, and more! Abstract #353, July 27 at 10:30 AM.
  • “Implementing undergraduate research in an upper-level botany lab using target capture sequencing of herbarium specimens” Lab Manager Haley Hale will be describing our experience bringing research into our undergraduate Botany lab, where students used Angiosperms353 in the lab and on the computer to build phylogenies using herbarium specimens. Abstract #477, July 29 at 12:30 PM.
  • “On the potential of Angiosperms353 for population genomics.” PI Matt Johnson was invited to participate in a symposium on Angiosperms353, organized by Rachel Jabaily and Laura Lagomarsino, and will be describing preliminary analysis of Angiosperms353 data at and below the species level. Abstract #263, July 29 at 1:30 PM.
  • “Characterization of the Fungal Microbiome in 50-Year-Old Plant Herbarium Specimens” Undergraduate Cassidy Coker will be describing her efforts to extract fungal DNA from plant herbarium specimens, and to use metabarcoding techniques to identify how the fungal microbiome differs between leaf and root tissues. Lightning Talk Abstract #336 July 31 at 10:50 AM.
  • “Methods to Delimit Speciation and Determine Population Parameters of the Moss, Physcomitrium pyriforme Using Target Capture Sequencing” Ph.D. Student Lindsay Williams will be describing our workflow for extracting population genomics data from HybSeq using sequences from the moss Physcomitrium pyriforme. Are there cryptic species hidden within this widespread moss? Lightning Talk Abstract #400 July 31 at 11:20 AM.
  • “Development of genomic tools for Bryum argenteum: Applications in small RNA and population genetics” Master’s student Aman Pruthi will be talking about our early efforts to sequence the genome of Bryum argenteum, which he will be using to help characterize the evolution of small RNA in bryophytes. Lightning Talk Abstract #419 July 31 at 11:25 AM.
  • “Phylogenomics and Habitat Restoration: Detecting the Effects of Gene Duplication and Diversification of KAI2 on Seed Germination” Ph.D. student Yanni Chen will be speaking about applications of phylogenomics to restoration ecology. How has the gene KAI2, a germination gene that interacts with smoke, influenced the evolution of the smoke response in seeds? Lightning Talk Abstract #420 July 31 at 11:30 AM.
  • “Expanded phylotranscriptomic sampling reveals gene family expansion in pleurocarpous mosses” Master’s student Kira Buckowing will be presenting on our progress expanding the phylotranscriptomic analysis of pleurocarpous mosses– is the large expansion in gene families due to whole genome duplication? Lightning Talk Abstract #523 July 31 at 12:45 PM.

Follow along with our experience on Twitter.

Reducing Target Capture Costs

A new lab milestone! Lab Manager Haley Hale led a paper describing our protocol modifications for target capture in non-model plant species. In collaboration with Elliot Gardner (Chicago Botanic Gardens, Case Western University), Juan Viruel (Royal Botanic Gardens, Kew), and Lisa Pokorny (Royal Botanic Gardens, Kew), we collected our experience with high-throughput target capture to make recommendations for researchers.

The paper appeared online on April 15, 2020; click here to read it (open access).

Some highlights include:

  • Recommendations on when (and when not) to use fragmentase on herbarium specimens.
  • Success of “homebrewed” SPRI beads for cleanup of extractions and PCR reactions.
  • Pooling strategies for target capture that allow for 96 samples per reaction.

The paper was part of a special issue on Low Cost Methods in Plant Sciences. In addition to all the wonderful new articles, the APPS editor team put together a list of articles covering many aspects of plant science from phenology to ecology to genomics!

A new project gets some positive developments! Graduate student Aman Pruthi is hard at work extracting high molecular weight DNA from Bryum argenteum. This moss species grows worldwide on all seven continents, but is becoming a nuisance for golf course managers. The silver-threaded moss likes to take over putting greens, crowding out the turf grass.

Along with collaborators Zane Raudenbush (Ohio State) and Lloyd Stark (University of Nevada Las Vegas), this project is funded by the United States Golf Association. We’re aiming to understand the population genetics of the golf course invasions, so we’re sequencing the Bryum argenteum genome to serve as a reference for population-level sampling. Aman will also be using the genome to characterize and examine the evolution of small RNAs in mosses.

Together with our indispensable lab manager Haley Hale, Aman was able to get a high concentration of HMW DNA from our Bryum argenteum cultures. Next stop: long-read genome sequencing!

Bryum argenteum growing on agar in the Biology greenhouse
Aman grinds Bryum tissue to a powder using liquid nitrogen
Powderized Bryum argenteum
High molecular weight on the gel!

Johnson Lab at Botany 2019

The Johnson lab had a very successful trip to the Botany 2019 conference in Tucson, Arizona! Five lab members traveled to the conference and presented a variety of research from stomatal density to phylogenomics. Some highlights of the conference are below!

Setting off on our 634 mile journey from Lubbock to Tucson in the departmental van
Halfway to Tucson, a lunchtime stop at White Sands National Monument. Gypsum for miles!
Spreading our E.L. Reed Herbarium Brand with lab t-shirts
Having some fun between talks at the Star Pass Resort
Undergraduate Zach presents his poster on the effect of life history strategy on stomatal density in herbarium specimens
Undergraduate Madeline Slimp draws a crowd to her poster on the utility of Angiosperms353 for conservation genomics
Lab manager/technician Haley Hale answers questions after her talk on cost-cutting methods for Angiosperms353
Ph.D. Student Yanni Chen presents on the extent of phylogenetic signal in seed traits of shortgrass prairie species
01 Jul 2019 by mossmatters

Botany 2019

Johnson Lab Talks/Posters at Botany 2019.

It’s July, which means the annual Botany conference is just a few weeks away! It is a milestone for the Johnson Lab as we will be represented by multiple speakers in six talks this year! We will be making the short 9-hour drive from Lubbock to Tucson (moving out West changes our definition of “short drive”!). Joining us will be:

  • Ph.D. student Yanni Chen, who will be speaking about her project looking into the phylogenetic signal of seed traits in shortgrass prairies. Restoration managers like to pick seeds by size and germination rate, but will doing so impact phylogenetic diversity?
  • Lab Manager Haley Hale, who will be speaking about strategies to reduce per-sample costs in targeted sequencing studies, including the use of herbarium specimens. Where can corners be cut to make target capture datasets for 100s of samples feasible?
  • Undergraduate Zach Bailey, who will be presenting a poster on gathering stomatal density data from 50-year-old herbarium specimens. Should we be making different predictions about plants’ responses to increased carbon dioxide based on a species’ life history strategy?
  • Undergraduate Madeline Slimp, who will be presenting a poster on the success of Angiosperms353 in generating population-genomics scale data from 50-year-old herbarium specimens. What sort of genetic diversity is present in Angiosperms353 genes for 24 species from the Guadalupe Mountains?
  • PI Matt Johnson, who will be giving two talks: one on the status of the Kew PAFTOL project to generate a family-complete phylogeny of flowering plants with Angiosperms353, and another using sequence capture for phylogenetic systematics in the moss family Funariaceae.

It’s especially exciting as none of the other lab members have attended a Botany conference before! Follow along with our experience on Twitter.

20 Jun 2019 by mossmatters

Starting a new seed experiment

Yanni prepares seeds for germination experiment.

Ph.D. student Yanni Chen is starting a new experiment this summer, as part of her project to study the extent of phylogenetic signal in seed traits for species that inhabit shortgrass prairies. Yanni has commercially available seeds from 50 species of grasses, forbs and shrubs, all sourced from Texas. Yanni will be measuring seed size and mass, which along with this germination trial will be used as traits to test for phylogenetic signal on the angiosperm phylogeny.

Later this summer, Yanni will be extracting DNA from the seeds to test the ability of the Angiosperms353 kit to correctly identify plants from seed tissue, and to estimate the level of genetic diversity in commercially available seeds.

A select few species will also receive a smoke treatment, which is a technique restoration managers use to stimulate seed growth in short grass prairies. Yanni is interested in whether the molecular mechanisms of smoke response are conserved across the flowering plants, or whether different suites of genes respond to smoke in (for example) grasses vs. forbs.

04 Apr 2019 by mossmatters

Lab Presentations at TTU URC 2019

Each year, the Center for Transformative Undergraduate Experience hosts a three-day event showcasing undergraduate research. Two lab members– Madeline Slimp (TTU Honors ‘21) and Zach Bailey (TTU Honors ‘19)– presented posters this year.

Both of their projects make use of herbarium specimens from our collection of plants of Guadalupe Mountains National Park. The collection was made in the early 1970s, and provides a botanical snapshot of the community.

Madeline Slimp: Conservation genomics of plant populations in Guadalupe Mountains National Park using Herbarium Specimens.

Zachary Bailey: The effect of life-history strategies on stomatal characteristics using herbarium specimens from Guadalupe Mountains National Park.

Congrats to Zach and Madeline, and thanks for representing the lab and the herbarium!

We are hiring a Post-Doctoral Research Associate with expertise in Phylogenomics and Bioinformatics. The post-doc will work as part of a three-university NSF-funded collaboration to investigate the phylogeography and ploidal diversity of the moss Physcomitrium pyriforme using targeted sequence capture. More information about the project can be found at funariaceae.uconn.edu

Required Qualifications

  • Ph.D. in Biology, Bioinformatics, Computer Science, or Related Field at time of start date
  • 2-3 years programming experience (C++, Java, Python, R or similar)
  • Fluency in English (oral and written)

Major/Essential Functions

  • Analysis of next-generation sequence data using high-throughput methods.
  • Adapting existing bioinformatics tools for species delimitation and population genomics of non-model, polyploid organisms.
  • Designing novel phylogenomics packages to efficiently process target capture data.
  • Working with the research team to write peer-reviewed manuscripts and conference presentations.
  • Organizing a bioinformatics workshop to train researchers in target capture methods.
  • Mentoring of graduate and undergraduate students.

Preferred candidates will have experience in one or more of the following:

  • Design and deployment of open source bioinformatics packages.
  • Data analysis (Jupyter/Pandas/R) and version control (Git).
  • Design and implementation of SQL and other relational databases.
  • Web design and interface with online databases.
  • Phylogenomics analysis software (MAFFT, RAxML, ASTRAL).
  • Evolution and biology of plants, including bryophytes.

Interested individuals should submit an application, including a statement of interest, CV, and contact information for 3 professional references at: https://bit.ly/2UvN7yT or search for job 16737BR on the Texas Tech staff job page.

Review of applications will begin on March 31, 2019 and will continue until the position is filled. A tentative start date for the position is June 1, 2019, but is flexible. Women, minorities, and persons with disabilities are encouraged to apply.

For questions about the position or how to apply, contact Dr. Matt Johnson at: matt.johnson@ttu.edu

Targeted Sequencing of Herbarium Specimens

Work in the lab has begun on our quest to use targeted sequence capture on herbarium specimens from the E.L. Reed Herbarium. By using DNA from herbarium specimens, we will be able to identify changes in genetic diversity over time throughout a plant community.

As part of our project we are extracting DNA from 50 year old herbarium specimens. An undergraduate in the lab, Madeline Slimp, is working with our technician Haley Hale to sample leaves from herbarium sheets for DNA extraction.

One challenge is determining if the DNA we extract from herbarium specimens is degraded, or whether it more closely resembles DNA from fresh tissue. Madeline and Haley identified a range of tissue types and specimen ages, and things are looking good (see gel image below)!

Madeline gel
Madeline Slimp, undergraduate researcher, Texas Tech Honors College, loading a gel
Successful DNA extractions from herbarium tissue

First Lab Field Trip!

This past weekend, the lab made our first field trip together to visit the Guadalupe Mountains National Park (GUMO) in southwest Texas. In attendance were Dr. Matt Johnson, Ph.D. student Yanni Chen, lab technician Haley Hale, and undergraduates Zach Bailey and Madeline Slimp. We were joined by Dr. Nick Smith, and Ph.D. student Xiulin Gao, from the Schwilk lab.

The goal of the trip was to get some first-hand experience with the diverse botanical region of the park. Our students have been working with our large herbarium collection (2000+ specimens from 500+ species) made when the Park first opened in the 1970s.

Camping was a bit of an adventure, with winds howling at 25 mph all night. We made it through the night, and after a luxurious breakfast we made our way to Frijole Ranch and the Smith Spring Trail. It was a 1 mile hike with about 400 feet of elevation gain, and along the way we saw a diverse set of high desert plants. We were able to see quite a bit of fall color, with maples (Acer grandidentatum), dwarf mulberries (Morus microphylla), Mexican buckeyes (Ungnadia speciosa) and velvet ash (Fraxinus velutina) all in various shades of orange and red.

At the top of the trail was Smith Spring, a refreshing closed canopy with running water. I was pleased to find that mosses do exist in southwest Texas, and they can be quite happy!

Overall the trip was a success and will be a good starting point for new ideas about botanizing in the national park. Our experience will be an inspiration for our first DNA sampling from the GUMO herbarium collection which begins this week!

Zach Presents at URC!

Zach Bailey, an undergraduate working in the herbarium, presented his research in a poster at the Undergraduate Research Consortium. Zach is a junior in the Texas Tech Honors College, and worked this year to catalog our collection of herbarium specimens from the Guadalupe Mountains National Park. Zach found that the collection, made by TTU botanists Northington and Burgess, contains over 2000 specimens from more than 500 plant species. It represents a snapshot in botanical time from when the National Park first opened in the 1970s.

Zach found over 40 species for which we have 7 or more specimens, making it a candidate for future research into the conservation genetics of plant species found in the high desert environment. He was most excited to find a number of species for which there are no recorded specimens from the Guadalupe Mountains area, increasing the botanical value of the collection.

In the future, Zach will begin botanical research using the collection to learn about the Guadalupe Mountains plant ecosystem. His database of the Guadalupe Mountains specimens will be valuable for kickstarting genetics and physiology research using herbarium specimens.

30 Jan 2018 by mossmatters

Today marks a major update release for HybPiper, version 1.3, which I am calling The Herbarium Update. Many of the features added and bugs fixed have to do with processing targeted sequencing data that came from herbarium specimens.

Herbarium material starts with degraded DNA, and frequently has fragment sizes less than 100 bp. When HybSeq libraries are prepared from this material and sequenced on a platform with > 100 bp reads, the reads themselves will be truncated. The lower depth will also result in shorter contigs assembled by SPAdes within HybPiper.

The biggest issue dealing with these short contigs is when they only have a portion of one exon in a multi-exon gene. With the intron/exon boundary missing, Exonerate was proposing multiple (sometimes overlapping) alignments. As a result, HybPiper was generating incorrect coding sequence, sometimes with contigs repeated or in the wrong order. The fix was to select only one contig for each region, ignoring other possible alignments. This can result in shorter sequence recovery, but the region that is recovered will at least be correct!

Sequences recovered from fresh tissue (which normally have long contigs and sufficient depth) should not be affected by this issue.
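
To illustrate the idea of keeping only one contig per region, here is a minimal sketch. It is not HybPiper's actual code: the `Hit` record, its field names, and the greedy "best score first" rule are all assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical record for one contig-to-target alignment (not HybPiper's data model).
# `start` and `end` are positions on the target; `end` is exclusive.
Hit = namedtuple("Hit", ["contig", "start", "end", "score"])

def overlaps(a, b):
    """True if two hits cover any of the same target positions."""
    return a.start < b.end and b.start < a.end

def one_hit_per_region(hits):
    """Greedily keep the best-scoring hit for each region of the target."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # Discard any hit that overlaps one we have already accepted.
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    # Report the surviving hits in target order.
    return sorted(kept, key=lambda h: h.start)

hits = [
    Hit("contig_1", 0, 120, 95.0),
    Hit("contig_2", 100, 180, 60.0),  # overlaps contig_1 with a lower score
    Hit("contig_3", 200, 320, 88.0),
]
selected = one_hit_per_region(hits)
print([h.contig for h in selected])
```

The trade-off is the one described above: the overlapping hit is dropped entirely, so less sequence may be recovered, but no region is represented twice or out of order.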

In addition, I added a new “heatmap” script thanks to suggestions made by Paul Wolf. The new script uses ggplot and is a lot simpler to use and customize.

I am keeping the old script for now, because although the ggplot version looks great as a PNG, there are issues exporting to a “vectorized” PDF. The old script still works for this, so I am keeping it.

The new version of HybPiper is available on GitHub. I’m happy to see that HybPiper is getting some use in published papers!

The Johnson Lab at Texas Tech University is looking for Ph.D. or Masters students interested in plant phylogenomics and/or bioinformatics to start Fall 2018. Our lab is motivated by a central question in evolutionary biology: what influences the origin and maintenance of plant biodiversity? Research in the lab integrates field work (collection and field experiments), wet lab (tissue culture, high-throughput DNA/RNA sequencing), and computational analysis to test hypotheses about genome evolution in non-model organisms at both deep and narrow timescales. Topics currently being studied in the lab include:

  • Phylogenetic systematics using hundreds of nuclear genes via targeted sequence capture (HybSeq).
  • Identifying genomic events (gene/genome duplication, changes in molecular evolution) associated with key innovations in plant evolution.
  • Optimization of HybSeq using herbarium specimens.
  • Identifying the hybrid origin of polyploid species through targeted sequencing.
  • Development of novel bioinformatics tools for sequence analysis and visualization.

We are especially interested in students who would like to employ herbarium specimens in their research. The E.L. Reed Herbarium in the Biological Sciences building contains 20,000 plant specimens including an important collection of the vascular plants of West Texas. Students interested in bioinformatics, genomics, and data visualization are also encouraged to apply. More information about the Johnson lab can be found at: www.mossmatters.com

Requirements: (1) Bachelor’s degree in biological or computer sciences or related field; (2) interest in integrating wet lab, field work, and computational skills; (3) ability to work both independently and collaboratively; and (4) effective communication skills, necessary for both teaching and for sharing results through papers and presentations at scientific meetings. For Ph.D. applicants, prior research experience is preferred but not required.

The lab has financial support for multiple students through a combination of research and teaching assistantships, including summer support. Interested students should first contact Matt Johnson at matt[DOT]johnson[AT]ttu[DOT]edu .

Deadline for applications: The Texas Tech Biological Sciences Department has rolling admissions, but students who wish to be considered for scholarships or fellowships must apply by January 15, 2018 for enrollment in Fall 2018.

Texas Tech University is an Equal Opportunity Employer and I welcome applications from qualified persons regardless of nationality, race, sex, disability, religion, sexual orientation, or age. Texas Tech recently received designation as a Hispanic Serving Institution, and we are excited to support Hispanic scholars.

More information about applying for graduate school at Texas Tech can be found here: http://www.depts.ttu.edu/biology/graduate/graduatestudies.php

Matthew G. Johnson, Ph.D.
Assistant Professor, Biological Sciences
Director, E.L. Reed Herbarium
Texas Tech University
E-mail: matt[DOT]johnson[AT]ttu[DOT]edu

New Location, new website!

I am excited to join the faculty at Texas Tech University in 2017 as an Assistant Professor and Director of the E.L. Reed Herbarium. To go along with this move, I thought it was time to update the website! The previous version of mossmatters.com was created through WordPress, which I found very unintuitive and occasionally frustrating. For example, they don’t even allow GIFs!

The new site is generated by Jekyll, which is the same engine that generates github.io sites. The good news is that I can write nearly all of my content using Markdown. The whole site is generated on my laptop and then uploaded to my server as static webpages. Also, GIFs are now a breeze!

From a recent bocce outing by the Wickett Lab at Chicago Botanic Garden

Over the next few weeks I will be updating the various Pages to reflect ongoing projects and new publications. I will still maintain a blog for writing about new discoveries and adventures in plant phylogenomics. I plan on re-populating the blog with old posts from the previous website, hopefully with some new bells-and-whistles allowed by the new platform!

Designing HybSeq Probes from a large sequence alignment

One of the most important considerations when designing probes for targeted sequencing is how related the source sequences are to the potential samples that will be enriched. In phylogenetic studies of non-model organisms, there may not be prior sequences available in the target taxa, but minimizing sequence divergence is still important.

One solution is to use any existing sequence data to design probes from multiple orthologous sources per gene. This effectively increases probe tiling and should also broaden the use of the probe set to more divergent taxa. Given a sequence alignment, we can choose sequences that are representative of specific clades, but this may be biased.

Instead, we can let the data tell us what the most representative sequences should be. In this notebook we will generate pairwise distance matrices from DNA sequence alignments. The distances will be clustered using one or more multivariate statistics techniques (such as k-means clustering or discriminant analysis) to explore the optimal number of clusters for the alignment, and we will select representative sequences from each cluster.

We will use Python implementations of distance matrices and visualizations taken from the Introduction to Applied Bioinformatics.

%matplotlib inline
from skbio import TabularMSA, DNA, DistanceMatrix
from skbio.sequence.distance import hamming, kmer_distance
import pandas as pd
import matplotlib.pyplot as plt

gene = "7653"
fasta_filename = "/Users/mjohnson/Desktop/Projects/AngiospermHybSeq/genes/{}/FNA2AA-upp-masked.fasta".format(gene)
angiosperm_id_fn = "/Users/mjohnson/Desktop/Projects/AngiospermHybSeq/1kp_angio_codes.txt"
angio_1kp_ids = set([x.rstrip() for x in open(angiosperm_id_fn)])

Reading the data

The MSA has a multiple sequence alignment of one gene from 1KP. We keep only the sequences from Angiosperms, including genome sequence.

msa = TabularMSA.read(fasta_filename, constructor=DNA)
seqs_to_keep = []
for seq in msa:
    if seq.metadata["id"] in angio_1kp_ids:
        seqs_to_keep.append(seq)
angio_msa = TabularMSA(seqs_to_keep)
angio_msa.shape
Shape(sequence=603, position=4956)

Now that the alignment contains only angiosperms, remove the positions that are more than 95% gaps:

angio_msa_dict = angio_msa.to_dict()
angio_msa_df = pd.DataFrame(angio_msa_dict)

#This might throw an error if there are ever any positions without gaps. Seems unlikely for this dataset...

def gap_dectector(sequence_column):
    '''Returns the number of gap characters in a column of a sequence matrix'''
    try:
        return sequence_column.value_counts()[b"-"]
    except KeyError:
        return 0

gapped_columns = angio_msa_df.apply(gap_dectector ,axis=1)
#This could be modified to remove columns that have 90% gaps, etc.
angio_msa_df_nogaps = angio_msa_df[gapped_columns < len(angio_msa_df.columns) * 0.95]

#In skbio, DNA sequences are stored as bytecode, (b'A') so need to convert back to strings

nogap_seqs = [DNA(angio_msa_df_nogaps[i].str.decode("utf-8").str.cat(), metadata = {"id":i}) for i in angio_msa_df_nogaps]
angio_msa_nogap = TabularMSA(nogap_seqs)


Shape(sequence=603, position=1824)

We also want to remove the sequences that have > 50% gaps

seqs_to_keep = []
for seq in angio_msa_nogap:
    num_gaps = len([x for x in seq.gaps() if x])
    if num_gaps < angio_msa_nogap.shape[1] * 0.5:
        seqs_to_keep.append(seq)
angio_msa_nogap_noshort = TabularMSA(seqs_to_keep)
angio_msa_nogap_noshort.shape

Shape(sequence=307, position=1824)

Distance Matrix

We calculate the “Hamming distance” as described here: http://readiab.org/book/latest/2/4#6.3

The Hamming distance between two equal-length sequences is the proportion of differing characters.
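
As a quick illustration of that definition, here is a minimal pure-Python sketch. The function name `hamming_proportion` and the toy sequences are made up for this example; the notebook itself relies on scikit-bio's `hamming`. In this plain version a gap character is compared like any other character.

```python
def hamming_proportion(seq1, seq2):
    """Proportion of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be the same length")
    if len(seq1) == 0:
        raise ValueError("Sequences must not be empty")
    # Count mismatched positions, then divide by the alignment length.
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return mismatches / len(seq1)

# One difference in six positions.
d_plain = hamming_proportion("ACGTAC", "ACGAAC")
# A gap opposite a base also counts as a difference here.
d_gap = hamming_proportion("AC-T", "ACGT")
print(d_plain, d_gap)
```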

We make a small adjustment to only calculate the Hamming distance between sites with no gaps (equivalent to the p-distance calculated by PAUP*)

def p_distance(seq1,seq2):
    from skbio.sequence import Sequence
    from numpy import isnan
    myseq1 = str(seq1)
    myseq2 = str(seq2)
    degapped1 = []
    degapped2 = []
    # Keep only the positions where neither sequence has a gap
    for i in range(len(myseq1)):
        if myseq1[i] != "-":
            if myseq2[i] != "-":
                degapped1.append(myseq1[i])
                degapped2.append(myseq2[i])
    degapped1 = "".join(degapped1)
    degapped2 = "".join(degapped2)
    hamming_dist = hamming(Sequence(degapped1),Sequence(degapped2))
    if isnan(hamming_dist):
        #print(seq1.metadata["id"], seq2.metadata["id"])
        return 0.0
    else:
        return hamming_dist

p_dm = DistanceMatrix.from_iterable(angio_msa_nogap_noshort, metric=hamming, key='id')
print("Distance between Amborella and Rice:")
Distance between Amborella and Rice:


The square pairwise distance matrix is shown below.

_ = p_dm.plot(cmap='Blues', title='Pairwise Dissimilarity between sequences, gene {}'.format(gene))
p_dm_df = p_dm.to_data_frame()


Finding Representative Sequences – Manual Selection

Now that we have a distance matrix, the next step is to decide which “representative” sequences are best for designing target capture probes. From the figure above we can see that some sequences diverged up to 80%, which is well beyond the tolerated range of 15-25%.

One solution is to manually choose sequences. For instance, we could choose only genomic sequences that we “know” to be relatively diverged from one another, and hope that they represent the spectrum of divergences for this gene. Let’s try this by choosing: Arabidopsis, Amborella, Oryza, Vitis, Mimulus, and Populus.

manual_centroids = ["Arath_TAIR10","Ambtr_v1.0.27","Orysa_v7.0","Vitvi_Genoscope.12X","Mimgu_v2.0","Poptr_v3.0"]
manual_centroid_dist = p_dm_df[manual_centroids].apply(min,1)
plt.title("Minimum Distance to Centroid, Manual Centroids \n Gene {}".format(gene))
print("Number of Distances > 30%: {}".format(len(manual_centroid_dist[manual_centroid_dist > 0.3])))

Number of Distances > 30%: 269


There are too many sequences that are diverged more than 25% from each of our manually chosen sequences. The same is true even if we select all of the genome sequences:

all_genomes = [x for x in p_dm_df.index if len(x) > 4]
all_genomes_centroid_dist = p_dm_df[all_genomes].apply(min,1)
print("Centroids: ",all_genomes)
plt.title("Minimum Distance to Centroid, All Genome Centroids \n Gene {}".format(gene))
print("\nNumber of Distances > 30%: {}".format(len(all_genomes_centroid_dist[all_genomes_centroid_dist > 0.3])))

Centroids:  ['Ambtr_v1.0.27', 'Aquco_v1.1', 'Arath_TAIR10', 'Eucgr_v1.1', 'Manes_v4.1', 'Mimgu_v2.0', 'Orysa_v7.0', 'Phavu_v1.0', 'Poptr_v3.0', 'Prupe_v1.0', 'Solly_iTAGv2.3', 'Sorbi_v2.1', 'Theca_v1.1', 'Vitvi_Genoscope.12X']

Number of Distances > 30%: 247
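
The check used in both cells above (take each sequence's minimum distance to any chosen representative, then count how many exceed a threshold) can be wrapped in a small helper. This is a sketch under assumptions: the name `count_divergent` is hypothetical and the toy matrix holds made-up values, not distances from gene 7653.

```python
import pandas as pd

def count_divergent(dist_df, representatives, threshold=0.30):
    """Count sequences whose nearest representative is farther than `threshold`."""
    # Row-wise minimum over the representative columns only.
    nearest = dist_df[representatives].min(axis=1)
    return int((nearest > threshold).sum())

# Toy symmetric distance matrix (made-up values).
ids = ["A", "B", "C", "D"]
toy = pd.DataFrame(
    [[0.00, 0.10, 0.40, 0.50],
     [0.10, 0.00, 0.35, 0.45],
     [0.40, 0.35, 0.00, 0.20],
     [0.50, 0.45, 0.20, 0.00]],
    index=ids, columns=ids)

only_a = count_divergent(toy, ["A"])        # C and D are more than 30% from A
a_and_c = count_divergent(toy, ["A", "C"])  # every sequence is within 30% of A or C
print(only_a, a_and_c)
```

With the notebook's data, the equivalent call would take `p_dm_df` and a list of sequence IDs such as `manual_centroids`.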


K-means clustering

Instead, we could let the distances themselves tell us which sequences are best, by clustering the sequences by their pairwise dissimilarity. By pre-selecting a number of clusters, we can let the data tell us which sequences cluster together, and then choose a representative from each cluster.

Based on example from: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#example-cluster-plot-kmeans-digits-py

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

n_digits = 6 #number of clusters
pca = PCA().fit(p_dm_df)

reduced_data = PCA(n_components=2).fit_transform(p_dm_df)

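kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
# Fit on the PCA-reduced data so the model can be used for prediction below
kmeans.fit(reduced_data)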

# Step size of the mesh covering [x_min, x_max] x [y_min, y_max].
# Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the DNA sequence dataset \n(PCA-reduced distance matrix)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)


The figure plots the PCA transformation of the distance matrix– the axes correspond to PCA1 and PCA2, and each point represents a sequence in the alignment.

The polygons are drawn to estimate the cluster boundaries in two dimensions.

The white X represents the “centroid” of each cluster.

Finding representative sequences – cluster centroids

Now that we have predicted clusters, are these clusters sufficient to have all sequences within the cluster be no more than 30% divergent?

For each cluster, we figure out which of the real sequences in each cluster is closest to the centroid (Euclidean distance). Then we figure out the maximum pairwise distance between any sequence and the centroid sequences.

#Group the distance matrix by kmeans clusters
from scipy.spatial import distance

grouped = p_dm_df.groupby(kmeans.labels_)
centroids = []
for name,group in grouped:
    #print("Group number: {}".format(name))
    #Find the sample that is closest to the centroid. This is a pd Dataframe row index.
    closest_to_centroid = pd.DataFrame(reduced_data).groupby(kmeans.labels_).get_group(name).apply(
        lambda x: distance.euclidean(x,kmeans.cluster_centers_[name]), axis=1).sort_values().index[0]
    #print("Number of sequences in group: {}".format(len(group)))
    #Reduce the distance matrix to be square within the group
#    reduced_group = group[group.index]
#    print("Max distance within group: {}".format(max(reduced_group.apply(max))))
    closest_id = p_dm_df.index[closest_to_centroid]
    centroids.append(closest_id)
    #print("ID closest to centroid (Euclidean): {}".format(closest_id))
#    print("Furthest within-group P distance distances to centroids ID:")
#    print(reduced_group[closest_id].sort_values(ascending=False)[0:2])
#    print()
print("Centroids: ", centroids)
centroid_dist = p_dm_df[centroids].apply(min,1)
plt.title("Minimum Distance to Centroid, {} Clusters\n Gene {}".format(n_digits,gene))
print("\nNumber of Distances > 30%: {}".format(len(centroid_dist[centroid_dist > 0.30])))

Centroids:  ['LQJY', 'AYIY', 'PUDI', 'QZXQ', 'KJAA', 'EQDA']

Number of Distances > 30%: 240


Before, using sequences from all of the genomes left almost twice as many sequences with > 30% divergence. Ideally, we could pick the number of clusters that minimizes the number of sequences with > 30% divergence.

divergent_seqs = []
pca = PCA().fit(p_dm_df)
reduced_data = PCA(n_components=2).fit_transform(p_dm_df)

for i in range(6,20):
    n_digits = i #number of clusters

    kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
    kmeans.fit(reduced_data)
    grouped = p_dm_df.groupby(kmeans.labels_)
    centroids = []
    for name,group in grouped:
        #Find the sample that is closest to the centroid. This is a pd Dataframe row index.
        closest_to_centroid = pd.DataFrame(reduced_data).groupby(kmeans.labels_).get_group(name).apply(
            lambda x: distance.euclidean(x,kmeans.cluster_centers_[name]), axis=1).sort_values().index[0]
        closest_id = p_dm_df.index[closest_to_centroid]
        centroids.append(closest_id)
    #print("Centroids: ", centroids)
    centroid_dist = p_dm_df[centroids].apply(min,1)
    num_over_25 = len(centroid_dist[centroid_dist > 0.30])
    #Record the number of clusters and how many sequences remain too divergent
    divergent_seqs.append((n_digits, num_over_25))
    #plt.title("Minimum Distance to Centroid, {} Clusters\n Gene {}".format(n_digits,gene))
    #print("\n\n Distances > 25%:")
    #print("\nNumber of Distances > 25%: {}".format(len(centroid_dist[centroid_dist > 0.25])))

divergent_seqs_df = pd.DataFrame(divergent_seqs,columns=["NumClusters","NumDivergent"])
plt.xlabel("Number of clusters")
plt.title("Number of sequences with > 30% divergence from any centroid")


The number of sequences with > 30% divergence may fluctuate as the number of clusters is increased because the clusters (and centroids) may be chosen differently if the kmeans fit is repeated. In this case, choosing 13 or more clusters will have the best effect.

Spectral Clustering

Another option for assigning sequences to clusters is spectral clustering.

import numpy as np
import scipy as sp
from sklearn.cluster import spectral_clustering

similarity = np.exp(-2 * p_dm_df / p_dm_df.std()).values

labels = spectral_clustering(similarity,n_clusters=6,assign_labels = 'discretize')
colormap = np.array(["r","g","b","w","purple","orange","brown","lightblue"])
plt.scatter(reduced_data[:, 0], reduced_data[:, 1],c=colormap[labels])
/usr/local/lib/python3.5/site-packages/sklearn/utils/validation.py:629: UserWarning: Array is not symmetric, and will be converted to symmetric by average with its transpose.
  warnings.warn("Array is not symmetric, and will be converted "


The assignment of sequences to clusters is less important than reliably defining a “representative,” so we may need to explore alternative ways of “reducing” the distance data besides PCA.

K Medoids

Another way to choose representative sequences is to use the k-medoid approach: https://en.wikipedia.org/wiki/K-medoids

The principle is similar to k-means clustering, in that clusters are made by minimizing within-group distances. However, instead of centroids (which represent the “mean” of a cluster), the clusters are keyed around a specific point within the cluster (analogous to a median). As a result, there is no need to calculate which point is closest to the centroid; instead, one specific sequence is chosen as the medoid of each cluster.
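
To make the idea of a medoid concrete, here is a minimal NumPy sketch that picks the medoid of a single cluster: the member with the smallest mean distance to the other members. The helper name `medoid_index` and the toy distances are assumptions for illustration only.

```python
import numpy as np

def medoid_index(D, members):
    """Return the index (into D) of the member with the smallest mean
    distance to all members of the cluster."""
    members = np.asarray(members)
    # Square sub-matrix of distances among the cluster members only.
    sub = D[np.ix_(members, members)]
    return int(members[np.argmin(sub.mean(axis=1))])

# Toy symmetric distance matrix for four sequences (made-up values).
D = np.array([
    [0.00, 0.10, 0.40, 0.50],
    [0.10, 0.00, 0.35, 0.45],
    [0.40, 0.35, 0.00, 0.20],
    [0.50, 0.45, 0.20, 0.00],
])

whole = medoid_index(D, [0, 1, 2, 3])  # sequence 1 has the lowest mean distance
pair = medoid_index(D, [2, 3])         # a tie; argmin keeps the first member
print(whole, pair)
```

Because the medoid is always one of the input sequences, it can be used directly as a representative without a separate "closest to the centroid" step.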

Python medoid code is taken from here: https://github.com/letiantian/kmedoids The implementation of this method of calculating k-medoids in python is discussed here: https://www.researchgate.net/publication/272351873_NumPy_SciPy_Recipes_for_Data_Science_k-Medoids_Clustering

import numpy as np
import random

def kMedoids(D, k, tmax=1000):
    # determine dimensions of distance matrix D
    m, n = D.shape

    # randomly initialize an array of k medoid indices
    M = np.sort(np.random.choice(n, k))

    # create a copy of the array of medoid indices
    Mnew = np.copy(M)

    # initialize a dictionary to represent clusters
    C = {}
    for t in range(tmax):
        # determine clusters, i. e. arrays of data indices
        J = np.argmin(D[:,M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J==kappa)[0]
        # update cluster medoids
        for kappa in range(k):
            J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)
            j = np.argmin(J)
            Mnew[kappa] = C[kappa][j]
        # check for convergence
        if np.array_equal(M, Mnew):
            break
        M = np.copy(Mnew)
    else:
        # final update of cluster memberships
        J = np.argmin(D[:,M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J==kappa)[0]

    # return results
    return M, C
medoids, membership =  kMedoids(p_dm,8)
medoid_dist = p_dm_df[p_dm_df.iloc[medoids].index].apply(min,1)
print("\nNumber of Distances > 30%: {}".format(len(medoid_dist[medoid_dist > 0.30])))
Number of Distances > 30%: 148



divergent_seqs_medoids = []

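for k in range(6,20):
    try:
        medoids,membership = kMedoids(p_dm,k)
        medoid_dist = p_dm_df[p_dm_df.iloc[medoids].index].apply(min,1)
        num_over_25 = len(medoid_dist[medoid_dist > 0.30])
        divergent_seqs_medoids.append((k,num_over_25))
    except ValueError:
        #Skip values of k for which k-medoids fails (for example, an empty cluster)
        continue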

divergent_seqs_medoids_df = pd.DataFrame(divergent_seqs_medoids,columns=["NumClusters","NumDivergent"])
plt.xlabel("Number of clusters")
plt.title("Number of sequences with > 30% divergence from any medoid")    
/usr/local/lib/python3.5/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)



Some iterations return an error that I’m not quite sure how to fix…

It appears that clusters can vary greatly based on individual runs of the k-medoids (or k-means) clustering. This problem is best illustrated with this YouTube video: https://www.youtube.com/watch?v=9nKfViAfajY

It really doesn’t matter for our purposes which cluster each sequence belongs to. Our task is a minimization exercise, so we should repeat each value of K a number of times.

divergent_seqs_medoids = []

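for k in range(6,50):
    for i in range(10):
        try:
            medoids,membership = kMedoids(p_dm,k)
            medoid_dist = p_dm_df[p_dm_df.iloc[medoids].index].apply(min,1)
            num_over_25 = len(medoid_dist[medoid_dist > 0.30])
            divergent_seqs_medoids.append((k,num_over_25))
        except ValueError:
            #Skip runs for which k-medoids fails (for example, an empty cluster)
            continue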

divergent_seqs_medoids_df = pd.DataFrame(divergent_seqs_medoids,columns=["NumClusters","NumDivergent"])
plt.xlabel("Number of clusters")
plt.title("Number of sequences with > 30% divergence from any medoid")    
/usr/local/lib/python3.5/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)



This suggests that for this gene there does exist a set of just ten taxa that could represent 98% of all sequences!

best_run = len(p_dm_df)
runs = {}
for i in range(100):
    try:
        medoids,membership = kMedoids(p_dm,k)
        medoid_dist = p_dm_df[p_dm_df.iloc[medoids].index].apply(min,1)
        num_over_25 = len(medoid_dist[medoid_dist > 0.25])
    except ValueError:
        num_over_25 = np.nan
    runs[i] = (medoids,membership,medoid_dist)
    #A NaN never compares as smaller, so failed runs are never chosen as best
    if num_over_25 < best_run:
        best_run = num_over_25
        best_run_idx = i
medoids,membership,medoid_dist = runs[best_run_idx]
print("Medoids: ", p_dm_df.iloc[medoids].index)
print("\nNumber of Distances > 30%: {}".format(len(medoid_dist[medoid_dist > 0.30])))

Medoids:  Index(['WZFE', 'XHHU', 'XVRU', 'GNPX', 'BERS', 'DDRL', 'BYQM', 'CKDK', 'CWZU',
       'EDIT', 'IHPC', 'Eucgr_v1.1', 'HQRJ', 'FFFY', 'GDKK', 'EYRD', 'DUQG',
       'PEZP', 'HUSX', 'PPPZ', 'JNKW', 'EJBY', 'LAPO', 'CPKP', 'HOKG', 'NMGG',
       'WOHL', 'FZQN', 'OSMU', 'MFIN', 'AXNH', 'BVOF', 'KEGA', 'VXKB', 'QOXT',
       'TEZA', 'Ambtr_v1.0.27', 'UZXL', 'VGHH', 'HAEU', 'LELS', 'WBOD', 'NBMW',
       'ZENX', 'XZME', 'MRKX', 'TIUZ', 'TJQY', 'ZCUA', 'DZLN'],

Number of Distances > 30%: 57

/usr/local/lib/python3.5/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)



The next step will be to write a dedicated script to systematically check K clusters until a minimum number of sequences is found to represent a maximum number of species.
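One possible shape for such a script, as a rough sketch only: `k_medoids` and `smallest_k` below are hypothetical names, the clustering is a basic alternating k-medoids rather than the `kMedoids` function used above, and the toy points stand in for the real distance matrix. The 30% distance threshold and the 98% coverage target are taken from the numbers discussed in this post, but treating them as a stopping rule is an assumption.

```python
import numpy as np

def k_medoids(dist, k, rng, max_iter=50):
    """Basic alternating k-medoids; returns the medoid indices only."""
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        updated = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                totals = dist[np.ix_(members, members)].sum(axis=1)
                updated[c] = members[np.argmin(totals)]
        if np.array_equal(updated, medoids):
            break
        medoids = updated
    return medoids

def smallest_k(dist, threshold=0.30, target=0.98, repeats=10, seed=0):
    """Smallest k whose best run covers at least `target` of all sequences.

    A sequence counts as covered when it lies within `threshold` of a medoid.
    Returns (k, medoid indices, fraction covered).
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    for k in range(1, n + 1):
        best_cov, best_medoids = -1.0, None
        # Repeat each k, since single runs depend on the random start.
        for _ in range(repeats):
            medoids = k_medoids(dist, k, rng)
            cov = float((dist[:, medoids].min(axis=1) <= threshold).mean())
            if cov > best_cov:
                best_cov, best_medoids = cov, medoids
        if best_cov >= target:
            return k, best_medoids, best_cov
    # Not reached: at k == n every sequence is its own medoid (coverage 1.0).

# Toy data: five tight groups of 20 points, distances scaled to [0, 1].
rng = np.random.default_rng(1)
centers = rng.random((5, 2))
points = np.vstack([c + 0.02 * rng.standard_normal((20, 2)) for c in centers])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
dist /= dist.max()

result = smallest_k(dist)
print("k = {}, coverage = {:.2f}".format(result[0], result[2]))
```

Stopping at the first k that reaches the target keeps the representative set as small as possible, at the cost of rerunning the clustering `repeats` times for every k tried.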

HybPiper Logo

This week marks the “official” release of HybPiper, the bioinformatics pipeline I’ve been working on for most of my post-doc. We published a paper on the method, and demonstrated how it can be used for phylogenetics, in this month’s Applications in Plant Sciences (open access).

This work was a collaboration between the Pleurocarpous Moss Tree of Life team and the Zerega lab at the Chicago Botanic Garden. We used draft genomic data generated from Artocarpus camansi, the wild progenitor of breadfruit, to design a set of genetic markers to be used for phylogenetics and gene family evolution analysis (for more details, see the companion paper in APPS). Using a technique known as HybSeq, high-throughput sequencing libraries are enriched for genes of interest via hybridization with short RNA probe sequences. Although a few other methods exist to tackle target enrichment data, our method is more streamlined for HybSeq data: extracting coding sequences and the flanking intron sequences for phylogenetics. In the paper we describe HybPiper and demonstrate its use on 458 genes enriched for 28 species of Artocarpus and its relatives.

Main Features

To conduct phylogenetic analysis, high-throughput DNA sequencing reads need to be re-assembled into continuous sequences. HybPiper uses several pre-existing bioinformatics tools to automate the process while maintaining an organized set of intermediate files that can aid in more detailed analysis. The main script of HybPiper has three phases:

  1. Sorts reads by mapping them to target sequences, using BLASTx (protein targets) or BWA (nucleotide targets).
  2. Assembles contigs for each gene separately.
  3. Aligns contigs to target sequences and extracts exon (coding) sequence.

HybPiper also includes a number of scripts that can be used to extract more information from the sequencing data, including:

  - Coverage depth and target enrichment efficiency data, including a script for plotting a heat map of gene recovery.
  - Retrieval of non-coding flanking sequences (i.e. introns) either separately or together with the coding sequences (supercontigs).
  - Identification of putative paralogous sequences, and methods to help distinguish ancient from recent paralogs.
  - Processing of HybPiper results from many samples; for example, generation of separate FASTA files for each gene, ready for phylogenetics pipelines.

For more information about HybPiper, including complete tutorials on installation, usage, and an example toy dataset, check out the GitHub page and the HybPiper wiki.

Developing HybPiper

Coming up with a consistent pipeline for processing target enrichment data was one of the primary tasks for my post-doc as part of the NSF Tree of Life grant that our team received in 2013. As part of that grant, we’ve developed over 800 markers to reconstruct the phylogeny of nearly 400 pleurocarpous mosses, so I knew that the pipeline would have to be very efficient. Along the way I discovered tools such as GNU Parallel, which greatly improved the speed of mapping, sequence assembly and exon extraction. Early in the process we made the decision to develop a pipeline that could itself be a product of the grant, a tool with broad applications to address the growing use of high-throughput sequencing and target enrichment in phylogenetic systematics of non-model organisms.

Many of the features in HybPiper were suggested by several of the co-authors on the APPS paper, particularly Yang Liu, Rafael Medina, and Elliot Gardner. Twitter also played an important role: when I discovered that the assembler I was using (Velvet) was generating quite questionable assemblies, a couple people suggested SPAdes as a replacement, which has worked out great!

Naming HybPiper

Just before submitting the manuscript, my advisor called me into his office and with a very serious tone said “You need a new name for the pipeline.” One week later, we had one of the most entertaining lab meetings ever. Here’s a brief look into the creative process.

HybPiper Whiteboard

After 46 minutes and nearly calling the pipeline Skunk Trapper, we had an epiphany with HybPiper. Elliot, who was not at the meeting, provided the logo the very next day.

Development of HybPiper is ongoing, and I hope to maintain and extend the code to other applications in the future. For example, I’ve already received a few requests from users who would like HybPiper to handle non-coding regions, such as plastid intergenic markers.

For someone passionate about Sphagnum, the thought of visiting a place where peat moss is harvested for commercial use might seem a little like an ornithologist visiting a sport-hunting facility, or a mollusk researcher watching a big diesel-spewing ship dredge a river channel. But when I saw that a tour of SunGro’s peat harvesting and restoration facility in Seba Beach was on the menu at the 2015 Botany conference, I signed up instantly.

I must admit that I was initially skeptical; although it is sometimes classified as a renewable resource, peat forms very slowly. Bog growth rate has been measured at about 1 cm per year, and the typical depth of usable peat in Canada is 3-5 meters. So even if conditions instantly returned to peat accumulation, it would take hundreds of years to regenerate. This makes peat less comparable to other renewables like switchgrass or even loblolly pine, and closer on the renewable scale to coal.

I was put at ease by Dr. Line Rochefort, of Laval University in Quebec, who conducted the tour. She has been working with peat harvesting companies for 25 years, advising them on best practices. She helped organize the Canadian Sphagnum Peat Moss Association, a collection of researchers and companies around Canada with a stated interest in responsibly restoring peatland habitat after harvesting. This is not mandated by the Canadian government, which only requires that the site be returned to a wetland. But something as simple as filling in the ditches would never return the peatland to its original state.

Peatland Restoration Tour
Line Rochefort speaks to the group about peatland restoration

Essentially, the restoration of peatlands comes down to one argument: peatlands are going to get harvested, as long as there is demand for peat in the horticultural industry (the association estimates peat harvesting is worth $337 million annually). It would be much better if we (as humans) did this as responsibly as possible, rather than by taking all we could and running away. Peat harvesting has the potential to impact the environment in the short term, through wetland destruction, and in the long term, because removing a peatland causes a net increase in carbon emissions beyond just the removal of living plant material.

Peat Harvesters
Peat harvesters are essentially giant vacuums

After harvesting, even if the drainage ditches are plugged, the peatland cannot recover. Dr. Rochefort mentioned visiting peatlands in Colorado that had been harvested with no attempt to restore them; after 140 years, they still look like barren wastelands. This is because Sphagnum peat moss, the most important plant for northern boreal bog peatlands, cannot recolonize the peatland without some help. Dr. Rochefort told us that when she searches for a “donor site” (a peatland that can have its top layer removed and distributed atop harvested peat for reclamation), the most important plant is not Sphagnum, but another moss, Polytrichum.

Sphagnum and Polytrichum
Polytrichum (left) and Sphagnum (right). Stubby bryologist fingers for scale.

Polytrichum is an upright moss with hardy “stems” that form a matrix onto which Sphagnum can grab hold and begin to build hummocks and accumulate peat. In the first phase of peatland restoration, material from the donor site is spread onto a harvested peatland in a 1:10 ratio (one hectare of donor site can be spread across 10 hectares of harvested peatland). Sphagnum will regenerate from broken fragments, but has a very hard time establishing on its own. Polytrichum, meanwhile, will easily germinate from its very abundant spore bank in the newly spread donor site material. Dr. Rochefort said that Polytrichum is so important that they make sure to add rock phosphate to the site to help Polytrichum spores germinate.

During the tour, we visited three sites: one that had been reclaimed using their procedure in 2009, another that had been reclaimed in a less “rigorous” way in 1994, and a donor site. The more recent reclamation site had an abundance of all the types of mosses you would see in a mature bog: Sphagnum magellanicum, Sphagnum fuscum, Sphagnum angustifolium, Aulacomnium palustre, and a few “brown mosses” such as Tomenthypnum and Drepanocladus. The vegetation had recovered nicely, but did have some species that did not belong in a bog: cotton grass and small birch trees were the most obvious. Dr. Rochefort said that based on earlier restorations, these plants were likely to die out as the Sphagnum took hold of the habitat.

The older reclamation site was more interesting from a botanizing perspective. Near the road there was a part where no moss mixture had been spread, and it looked like the wasteland Dr. Rochefort described from Colorado. Just a few feet away, there was a lush and healthy lawn of Sphagnum, perhaps ten species living happily. Although it wasn’t a true raised bog (it might take hundreds of years for the peat to regrow), I would not have been able to tell it was bare peat just 20 years earlier.

The peat extraction process is pretty interesting itself. They actually harvest in the dead of winter, once everything is frozen and covered with snow. Then they use a harrowing device to remove the top 20 cm of plant material: everything including mosses, grasses, and even black spruce trees! This material is kept in a pile for the winter and spring; the water from the snow and the mulch from the woodchips help maintain the plants while they await spreading on a new peatland. At this point someone asked about the impact of cutting down all the spruce trees. “Compared to the peat moss,” Dr. Rochefort said, “the carbon produced by the spruce is negligible.” What a difference from other restoration efforts, where mosses are barely considered; here is perhaps the ultimate situation in which moss matters.

In the spring, the peat is turned and dried until late summer. Then, extremely large driving vacuums come along and suck up the peat. They turn over the peat behind themselves, and once this layer dries, the process can be repeated. They may get 20-30 years of peat harvest from a single plot, and after this is done, plant material from a donor site will be spread to encourage restoration.

Overall, I left with a positive opinion of the efforts being made at SunGro. I won’t say too much about the political and business implications of peat harvesting, as I will leave that to those who are more informed on those issues. From a botanist’s perspective, and one who has some experience with natural peatlands, their efforts are commendable. I thoroughly enjoyed the tour, not least because I finally got the opportunity to get away from the computer and look at some plants!