Post-Docs Wanted

Collaborative project: Diversity of the moss Physcomitrium pyriforme: significance of autopolyploidy within a phylogenomic and experimental framework.

PIs: Dr. Bernard Goffinet (University of Connecticut, Storrs;) & Dr. Matt Johnson (Texas Tech University, Lubbock).

Additional PI: Dr. Rafael Medina (Augustana College, Illinois).

We seek candidates with interest and experience in phylogenomics, bryophyte systematics and evolution of plant genomes (see project description below) to fill two research associate positions:

  1. University of Connecticut, Storrs: candidate would lead or supervise population sampling in the field and genomic library preparation, genome size, karyotype, and morphological characterization, and testing of reproductive isolation. Candidate is expected to mentor undergraduates seeking research opportunities.

  2. Texas Tech University, Lubbock: candidate would lead development of novel bioinformatics, data visualization, and phylogenetic methods for analysis of targeted sequencing data in polypoid species. Candidate will organize and lead a bioinformatics methods workshop and is expected to mentor undergraduates.

Salary: ± $48,000 plus health insurance

Duration: 1 year, renewable pending satisfactory progress.

Start dates: September 1, 2018 at UCONN and January 1, 2019 at Texas Tech.

Interested candidates should send a single pdf composed of their CV, personal statement highlighting their qualifications and research interests and contact information for three references to either Bernard Goffinet ( or Matt Johnson ( Review of applicants begins immediately.

Project abstract: Whole genome duplication or autopolyploidy occurred repeatedly during the evolution of land plants and likely act as a major driver of evolutionary change. When genome duplications first occur within species they potentially result in immediate reproductive isolation of autopolyploids within populations. Genome duplications may also trigger significant genomic restructuring, preventing meiotic pairing and hence interbreeding between two independent autopolyploids. Genome duplications are thus effective mechanisms of microevolutionary change and are expected to be frequent enough within species for some to give rise to new evolutionary lineages. Our project seeks to test whether shifts in ploidy are phylogenetically structured within a complex of cryptic moss species, the Physcomitrium pyriforme complex. This complex is widespread in North America and Europe. It harbors seven karyotypes worldwide and exhibits much morphological variation, as reflected by the 29 synonyms. These annual, bisexual and selfing mosses are easily grown, and genome doubling is readily induced in vitro from sporophytic tissue, enabling tests of reproductive isolation among wild and artificial autopolyploids. Our project addresses four inter-related objectives: (1) Reconstruct the phylogenomic relationships of 400 populations of P. pyriforme complex using targeted sequencing of 800 low-copy nuclear genes; (2) Characterize the karyotype and genome size of 400 populations of the P. pyriforme-complex across Europe, and infer frequencies of ploidal shifts within a phylogenomic hypothesis; (3) identify morphological signatures of artificial genome duplication and through comparison with wild populations test whether these erode through time; and (4) complement these inferences with experiments testing for reproductive isolation among wild and artificial polyploids and thereby for the evolutionary significance of autopolyploidy.

30 Jan 2018 by mossmatters

Today marks a major update release for HybPiper, version 1.3, which I am calling The Herbarium Update. Many of the features added and bugs fixed have to do with processing targeted sequencing data that came from herbarium specimens.

Herbarium material starts with degraded DNA, and frequently has fragment sizes less than 100 bp. When HybSeq libraries are prepared from this material and sequenced on a platform with > 100 bp reads, the reads themselves will be truncated. The lower depth will also result in shorter contigs assembled by SPAdes within HybPiper.

The biggest issue dealing with these short contigs is when they only have a portion of one exon in a multi-exon gene. With the intron/exon boundary missing, Exonerate was proposing multiple (sometimes overlapping) alignments. As a result, HybPiper was generating incorrect coding sequence, sometimes with contigs repeated or in the wrong order. The fix was to select only one contig for each region, ignoring other possible alignments. This can result in shorter sequence recovery, but the region that is recovered will at least be correct!

Sequences recovered from fresh tissue (which normally have long contigs and sufficient depth) should not be affected by this issue.

In addition, I added a new “heatmap” script thanks to suggestions made by Paul Wolf. The new script uses ggplot and is a lot simpler to use and customize.

I am keeping the old script for now, because although the ggplot version looks great as a PNG, there are issues exporting to a “vectorized” PDF. The old script still works for this, so I am keeping it.

The new version of HybPiper is available on GitHub. I’m happy to see that HybPiper is getting some use in published papers!

The Johnson Lab at Texas Tech University is looking for Ph.D. or Masters students interested in plant phylogenomics and/or bioinformatics to start Fall 2018. Our lab is motivated by a central question in evolutionary biology: what influences the origin and maintenance of plant biodiversity? Research in the lab integrates field work (collection and field experiments), wet lab (tissue culture, high-throughput DNA/RNA sequencing), and computational analysis to test hypotheses about genome evolution in non-model organisms at both deep and narrow timescales. Topics currently being studied in the lab include:

  • Phylogenetic systematics using hundreds of nuclear genes via targeted sequence capture (HybSeq).
  • Identifying genomic events (gene/genome duplication, changes in molecular evolution) associated with key innovations in plant evolution.
  • Optimization of HybSeq using herbarium specimens.
  • Identifying the hybrid origin of polyploid species through targeted sequencing.
  • Development of novel bioinformatics tools for sequence analysis and visualization.

We are especially interested in students who would like to employ herbarium specimens in their research. The E.L. Reed Herbarium in the Biological Sciences building contains 20,000 plant specimens including an important collection of the vascular plants of West Texas. Students interested in bioinformatics, genomics, and data visualization are also encouraged to apply. More information about the Johnson lab can be found at:

Requirements: (1) Bachelor’s degree in biological or computer sciences or related field; (2) interest in integrating wet lab, field work, and computational skills; (3) ability to work both independently and collaboratively; and (4) effective communication skills, necessary for both teaching and for sharing results through papers and presentations at scientific meetings. For Ph.D. applicants, prior research experience is preferred but not required.

The lab has financial support for multiple students through a combination of research and teaching assistantships, including summer support. Interested students should first contact Matt Johnson at matt[DOT]johnson[AT]ttu[DOT]edu .

Deadline for applications The Texas Tech Biological Sciences Department has rolling admissions, but students who wish to be considered for scholarships or fellowships must apply by January 15, 2018 for enrollment in Fall 2018.

Texas Tech University is an Equal Opportunity Employer and I welcome applications from qualified persons regardless of nationality, race, sex, disability, religion, sexual orientation, or age. Texas Tech recently received designation as a Hispanic Serving Institution, and we are excited to support Hispanic scholars.

More information about applying for graduate school at Texas Tech can be found here:

Matthew G. Johnson, Ph.D. Assistant Professor, Biological Sciences Director, E.L. Reed Herbarium Texas Tech University E-mail: matt[DOT]johnson[AT]ttu[DOT]edu

New Location, new website! I am excited to join the faculty at Texas Tech University in 2017 as an Assistant Professor and Director of the E.L. Reed Herbarium. To go along with this move, I thought it was time to update the website! The previous version of was created through WordPress, which I found very unintuitive and occasionally frustrating. For example, they don’t even allow GIFs!

The new site is generated by Jekyll, which is the same engine that generates sites. The good news is that I can write nearly all of my content using Markdown. The whole site is generated on my laptop and then uploaded to my server as static webpages. Also, GIFs are now a breeze!

From a recent bocce outing by the Wickett Lab at Chicago Botanic Garden

Over the next few weeks I will be updating the various Pages to reflect ongoing projects and new publications. I will still maintain a blog for writing about new discoveries and adventures in plant phylogenomics. I plan on re-populating the blog with old posts from the previous website, hopefully with some new bells-and-whistles allowed by the new platform!

Designing HybSeq Probes from a large sequence alignment

One of the most important considerations when designing probes for targeted sequencing is how related the the source sequences are to the potential samples that will be enriched. In phylogenetic studies of non-model organisms, there may not be prior sequences available in the target taxa, but minimizing sequence divergence is still important.

One solution is to use any existing sequence data to design probes from multiple ortholgous sources per gene. This effectively increases probe tiling and should also broaden the use of the probe set to more divergent taxa. Given a sequence alignment, we can choose sequences that are representative of specific clades, but this may be biased.

Instead, we can let the data tell us what the most representative sequences should be. In this notebook we will generate pairwise distance matrices from DNA sequence alignments. The distances will be clustered using one or more multivariate statistics techniques (such as k-means clustering or discrimant analysis) to explore the optimal number of clusters for the alignment, and we will select representative sequences from each cluster.

We will use Python implementations of distance matrices and visualizations taken from the Introduction to Applied Bioinformatics.

%matplotlib inline
from skbio import TabularMSA, DNA, DistanceMatrix
from skbio.sequence.distance import hamming, kmer_distance
import pandas as pd
import matplotlib.pyplot as plt

gene = "7653"
fasta_filename = "/Users/mjohnson/Desktop/Projects/AngiospermHybSeq/genes/{}/FNA2AA-upp-masked.fasta".format(gene)
angiosperm_id_fn = "/Users/mjohnson/Desktop/Projects/AngiospermHybSeq/1kp_angio_codes.txt"
angio_1kp_ids = set([x.rstrip() for x in open(angiosperm_id_fn)])

Reading the data

The MSA has a multiple sequence alignemnt of one gene from 1KP. We keep only the sequences from Angiosperms, including genome sequence.

msa =, constructor=DNA)
seqs_to_keep = []
for seq in msa:
    if seq.metadata["id"] in angio_1kp_ids:
angio_msa = TabularMSA(seqs_to_keep)        
Shape(sequence=603, position=4956)

Now that the alignment contains only angiosperms, remove the positions that are more than 95% gaps:

angio_msa_dict = angio_msa.to_dict()
angio_msa_df = pd.DataFrame(angio_msa_dict)

#This might throw an error if there are ever any positions without gaps. Seems unlikely for this dataset...

def gap_dectector(sequence_column):
    '''Returns the number of gap characters in a column of a sequence matrix'''
        return sequence_column.value_counts()[b"-"]
    except KeyError:
        return 0

gapped_columns = angio_msa_df.apply(gap_dectector ,axis=1)
#This could be modified to remove columns that have 90% gaps, etc.
angio_msa_df_nogaps = angio_msa_df[gapped_columns < len(angio_msa_df.columns) * 0.95]

#In skbio, DNA sequences are stored as bytecode, (b'A') so need to convert back to strings

nogap_seqs = [DNA(angio_msa_df_nogaps[i].str.decode("utf-8"), metadata = {"id":i}) for i in angio_msa_df_nogaps]
angio_msa_nogap = TabularMSA(nogap_seqs)


Shape(sequence=603, position=1824)

We also want to remove the sequences that have > 50% gaps

seqs_to_keep = []
for seq in angio_msa_nogap:
    num_gaps = len([x for x in seq.gaps() if x])
    if num_gaps < angio_msa_nogap.shape[1] * 0.5:
angio_msa_nogap_noshort = TabularMSA(seqs_to_keep)

Shape(sequence=307, position=1824)

Distance Matrix

We calculate the “Hamming distance” as described here:

The Hamming distance between two equal-length sequences is the proportion of differing characters.

We make a small adjustment to only calculate the Hamming distance between sites with no gaps (equivalent to the p-distance calculated by PAUP*)

def p_distance(seq1,seq2):
    from skbio.sequence import Sequence
    from numpy import isnan
    myseq1 = str(seq1)
    myseq2 = str(seq2)
    degapped1 = []
    degapped2 = []
    for i in range(len(myseq1)):
        if myseq1[i] != "-":
            if myseq2[i] != "-":
    degapped1 = "".join(degapped1)
    degapped2 = "".join(degapped2)
    hamming_dist = hamming(Sequence(degapped1),Sequence(degapped2))
    if isnan(hamming_dist):
        #print(seq1.metadata["id"], seq2.metadata["id"])
        return 0.0
        return hamming_dist

p_dm = DistanceMatrix.from_iterable(angio_msa_nogap_noshort, metric=hamming, key='id')
print("Distance between Amborella and Rice:")
Distance between Amborella and Rice:


The square pairwise distance matrix is shown below.

_ = p_dm.plot(cmap='Blues', title='Pairwise Dissimilarity between sequences, gene {}'.format(gene))
p_dm_df = p_dm.to_data_frame()


Finding Representative Sequences – Manual Selection

Now that we have a distance matrix, the next step is to decide which “representative” sequences are best for designing target capture probes. From the figure above we can see that some sequences diverged up to 80%, which is well beyond the tolerated range of 15-25%.

One solution is to manually choose sequences. For instance, we could choose only genomic sequences that we “know” to be relatively diverged from one another, and hope that they represent the spectrum of divergences for this gene. Let’s try this by choosing: Arabidopsis, Amborella, Oryza, Vitis, Mimulus, and Populus.

manual_centroids = ["Arath_TAIR10","Ambtr_v1.0.27","Orysa_v7.0","Vitvi_Genoscope.12X","Mimgu_v2.0","Poptr_v3.0"]
manual_centroid_dist = p_dm_df[manual_centroids].apply(min,1)
plt.title("Minimum Distance to Centroid, Manual Centroids \n Gene {}".format(gene))
print("Number of Distances > 30%: {}".format(len(manual_centroid_dist[manual_centroid_dist > 0.3])))

Number of Distances > 30%: 269


There are too many sequences that are diverged more than 25% from each of our manually chosen sequences. The same is true even if we select all of the genome sequences:

all_genomes = [x for x in p_dm_df.index if len(x) > 4]
all_genomes_centroid_dist = p_dm_df[all_genomes].apply(min,1)
print("Centroids: ",all_genomes)
plt.title("Minimum Distance to Centroid, All Genome Centroids \n Gene {}".format(gene))
print("\nNumber of Distances > 30%: {}".format(len(all_genomes_centroid_dist[all_genomes_centroid_dist > 0.3])))

Centroids:  ['Ambtr_v1.0.27', 'Aquco_v1.1', 'Arath_TAIR10', 'Eucgr_v1.1', 'Manes_v4.1', 'Mimgu_v2.0', 'Orysa_v7.0', 'Phavu_v1.0', 'Poptr_v3.0', 'Prupe_v1.0', 'Solly_iTAGv2.3', 'Sorbi_v2.1', 'Theca_v1.1', 'Vitvi_Genoscope.12X']

Number of Distances > 30%: 247


K-means clustering

Instead, we could let the distances themselves tell us which sequences are best, by clustering the sequences by their pairwise dissimilarity. By pre-selecting a number of clusters, we can let the data tell us which sequences cluster together, and then choose a representative from each cluster.

Based on example from:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

n_digits = 6 #number of clusters
pca = PCA().fit(p_dm_df)

reduced_data = PCA(n_components=2).fit_transform(p_dm_df)

kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, m_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the DNA sequence dataset \n(PCA-reduced distance matrix)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)


The figure plots the PCA transformation of the distance matrix– the axes correspond to PCA1 and PCA2, and each point represents a sequence in the alignment.

The polygons are drawn to estimate the cluster boundaries in two dimensions.

The white X represents the “centroid” of each cluster.

Finding representative sequences – cluster centroids

Now that we have predicted clusters, are these clusters sufficient to have all sequences within the cluster be no more than 30% divergent?

For each cluster, we figure out which of the real sequences in each cluster is closest to the centroid (Euclidean distance). Then we figure out the maximum pairwise distance any sequence and the centroid sequences.

#Group the distance matrix by kmeans clusters
from scipy.spatial import distance

grouped = p_dm_df.groupby(kmeans.labels_)
centroids = []
for name,group in grouped:
    #print("Group number: {}".format(name))
    #Find the sample that is closest to the centroid. This is a pd Dataframe row index.
    closest_to_centroid = pd.DataFrame(reduced_data).groupby(kmeans.labels_).get_group(name).apply(
        lambda x: distance.euclidean(x,kmeans.cluster_centers_[name]), axis=1).sort_values().index[0]
    #print("Number of sequences in group: {}".format(len(group)))
    #Reduce the distance matrix to be square within the group
#    reduced_group = group[group.index]
#    print("Max distance within group: {}".format(max(reduced_group.apply(max))))
    closest_id = p_dm_df.index[closest_to_centroid]
    #print("ID closest to centroid (Euclidean): {}".format(closest_id))
#    print("Furthest within-group P distance distances to centroids ID:")
#    print(reduced_group[closest_id].sort_values(ascending=False)[0:2])
#    print()
print("Centroids: ", centroids)
centroid_dist = p_dm_df[centroids].apply(min,1)
plt.title("Minimum Distance to Centroid, {} Clusters\n Gene {}".format(n_digits,gene))
print("\nNumber of Distances > 30%: {}".format(len(centroid_dist[centroid_dist > 0.30])))

Centroids:  ['LQJY', 'AYIY', 'PUDI', 'QZXQ', 'KJAA', 'EQDA']

Number of Distances > 30%: 240


Before, using sequences from all of the genomes left almost twice as many sequences with > 30% divergence. Ideally, we could pick the number of clusters that minimizes the number of sequences with > 30% divergence.

divergent_seqs = []
pca = PCA().fit(p_dm_df)
reduced_data = PCA(n_components=2).fit_transform(p_dm_df)

for i in range(6,20):
    n_digits = i #number of clusters

    kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
    grouped = p_dm_df.groupby(kmeans.labels_)
    centroids = []
    for name,group in grouped:    
        #Find the sample that is closest to the centroid. This is a pd Dataframe row index.
        closest_to_centroid = pd.DataFrame(reduced_data).groupby(kmeans.labels_).get_group(name).apply(
            lambda x: distance.euclidean(x,kmeans.cluster_centers_[name]), axis=1).sort_values().index[0]
        closest_id = p_dm_df.index[closest_to_centroid]
    #print("Centroids: ", centroids)
    centroid_dist = p_dm_df[centroids].apply(min,1)
    num_over_25 = len(centroid_dist[centroid_dist > 0.30])
    #plt.title("Minimum Distance to Centroid, {} Clusters\n Gene {}".format(n_digits,gene))
    #print("\n\n Distances > 25%:")
    #print("\nNumber of Distances > 25%: {}".format(len(centroid_dist[centroid_dist > 0.25])))

divergent_seqs_df = pd.DataFrame(divergent_seqs,columns=["NumClusters","NumDivergent"])
plt.xlabel("Number of clusters")
plt.title("Number of sequences with > 30% divergence from any centroid")
<matplotlib.text.Text at 0x10df71278>


The number of sequences with > 30% divergence may fluctuate as the number of clusters is increased because the clusters (and centroids) may be chosen differently if the kmeans fit is repeated. In this case, choosing 13 or more clusters will have the best effect.

Spectral Clustering

Another option for assigning sequences to clusters is spectral clustering.

import numpy as np
import scipy as sp
from sklearn.cluster import spectral_clustering

similarity = np.exp(-2 * p_dm_df / p_dm_df.std()).as_matrix()

labels = spectral_clustering(similarity,n_clusters=6,assign_labels = 'discretize')
colormap = np.array(["r","g","b","w","purple","orange","brown","lightblue"])
plt.scatter(reduced_data[:, 0], reduced_data[:, 1],c=colormap[labels])
/usr/local/lib/python3.5/site-packages/sklearn/utils/ UserWarning: Array is not symmetric, and will be converted to symmetric by average with its transpose.
  warnings.warn("Array is not symmetric, and will be converted "


The assignment of sequences to clusters is less important than reliably definining a “representative,” so we may need to explore alternative ways of “reducing” the distance data besides PCA.

K Medoids

Another way to choose representative sequences is to use the k-medoid approach:

The principle is similar to k-means clustering, in that clusters are made by minimizing within-group distances. However, instead of centroids (which represent the “mean” of a cluster), the clusters are keyed around a specific point within the cluster (analagous to a median). As a result, there will be no need to calculate which point is closest to the centroid, instead one specific sequence will be chosen as the medoid of each cluster.

Python medoid code is taken from here: The implementation of this method of calculating k-medoids in python is discussed here:

import numpy as np
import random

def kMedoids(D, k, tmax=1000):
    # determine dimensions of distance matrix D
    m, n = D.shape

    # randomly initialize an array of k medoid indices
    M = np.sort(np.random.choice(n, k))

    # create a copy of the array of medoid indices
    Mnew = np.copy(M)

    # initialize a dictionary to represent clusters
    C = {}
    for t in range(tmax):
        # determine clusters, i. e. arrays of data indices
        J = np.argmin(D[:,M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J==kappa)[0]
        # update cluster medoids
        for kappa in range(k):
            J = np.mean(D[np.ix_(C[kappa],C[kappa])],axis=1)
            j = np.argmin(J)
            Mnew[kappa] = C[kappa][j]
        # check for convergence
        if np.array_equal(M, Mnew):
        M = np.copy(Mnew)
        # final update of cluster memberships
        J = np.argmin(D[:,M], axis=1)
        for kappa in range(k):
            C[kappa] = np.where(J==kappa)[0]

    # return results
    return M, C
medoids, membership =  kMedoids(p_dm,8)
medoid_dist = p_dm_df[p_dm_df.ix[medoids].index].apply(min,1)
print("\nNumber of Distances > 30%: {}".format(len(medoid_dist[medoid_dist > 0.30])))
Number of Distances > 30%: 148

<matplotlib.axes._subplots.AxesSubplot at 0x1101845f8>


divergent_seqs_medoids = []

for k in range(6,20):
        medoids,membership = kMedoids(p_dm,k)
        medoid_dist = p_dm_df[p_dm_df.ix[medoids].index].apply(min,1)
        num_over_25 = len(medoid_dist[medoid_dist > 0.30])
    except ValueError:

divergent_seqs_medoids_df = pd.DataFrame(divergent_seqs_medoids,columns=["NumClusters","NumDivergent"])
plt.xlabel("Number of clusters")
plt.title("Number of sequences with > 30% divergence from any medoid")    
/usr/local/lib/python3.5/site-packages/numpy/core/ RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)

<matplotlib.text.Text at 0x10e50e438>


Some iterations return an error that I’m not quite sure how to fix…

It appears that clusters can vary greatly based on individual runs of the k-medoids (or k-means) clustering. This problem is best illustrated with this YouTube video:

It really doesn’t matter for our purposes which cluster each sequence belongs to. Our task is a minimizaiton exercise, so we should repeat each value of K a number of times.

divergent_seqs_medoids = []

for k in range(6,50):
    for i in range(10):
            medoids,membership = kMedoids(p_dm,k)
            medoid_dist = p_dm_df[p_dm_df.ix[medoids].index].apply(min,1)
            num_over_25 = len(medoid_dist[medoid_dist > 0.30])
        except ValueError:

divergent_seqs_medoids_df = pd.DataFrame(divergent_seqs_medoids,columns=["NumClusters","NumDivergent"])
plt.xlabel("Number of clusters")
plt.title("Number of sequences with > 30% divergence from any medoid")    
/usr/local/lib/python3.5/site-packages/numpy/core/ RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)

<matplotlib.text.Text at 0x10f7ebf60>


This suggests that for this gene there does exist a set of just ten taxa that could represent 98% of all seqeunces!

best_run = len(p_dm_df)
runs = {}
for i in range(100):
        medoids,membership = kMedoids(p_dm,k)
        medoid_dist = p_dm_df[p_dm_df.ix[medoids].index].apply(min,1)
        num_over_25 = len(medoid_dist[medoid_dist > 0.25])
    except ValueError:
        num_over_25 = np.nan
    runs[i] = (medoids,membership,medoid_dist)
    if num_over_25 < best_run:
        best_run = num_over_25
        best_run_idx = i
medoids,membership,medoid_dist = runs[best_run_idx]         
print("Medoids: ", p_dm_df.ix[medoids].index)
print("\nNumber of Distances > 30%: {}".format(len(medoid_dist[medoid_dist > 0.30])))

Medoids:  Index(['WZFE', 'XHHU', 'XVRU', 'GNPX', 'BERS', 'DDRL', 'BYQM', 'CKDK', 'CWZU',
       'EDIT', 'IHPC', 'Eucgr_v1.1', 'HQRJ', 'FFFY', 'GDKK', 'EYRD', 'DUQG',
       'PEZP', 'HUSX', 'PPPZ', 'JNKW', 'EJBY', 'LAPO', 'CPKP', 'HOKG', 'NMGG',
       'WOHL', 'FZQN', 'OSMU', 'MFIN', 'AXNH', 'BVOF', 'KEGA', 'VXKB', 'QOXT',
       'TEZA', 'Ambtr_v1.0.27', 'UZXL', 'VGHH', 'HAEU', 'LELS', 'WBOD', 'NBMW',
       'ZENX', 'XZME', 'MRKX', 'TIUZ', 'TJQY', 'ZCUA', 'DZLN'],

Number of Distances > 30%: 57

/usr/local/lib/python3.5/site-packages/numpy/core/ RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)

<matplotlib.axes._subplots.AxesSubplot at 0x10fba0048>


The next step will be to write a dedicated script to systematically check K clusters until a minimum number of sequences is found to represent a maximum number of species.

HybPiper Logo

This week marks the “official” release of HybPiper, the bioinformatics pipeline I’ve been working on for most of my post-doc. We published a paper on the method, and demonstrated how it can be used for phylogenetics, in this month’s Applications in Plant Sciences (open access).

This work was a collaboration between the Pleurocarpous Moss Tree of Life team and the Zerega lab at the Chicago Botanic Garden. We used draft genomic data generated from Artocarpus camansi, the wild progenitor of breadfruit, to design a set of genetic markers to be used for phylogenetics and gene family evolution analysis (for more details, see the companion paper in APPS). Using a technique known as HybSeq, high-throughput sequencing libraries are enriched for genes of interest via hybridization with short RNA probe sequences. Although a few other methods exist to tackle target enrichment data, our method is more streamlined for HybSeq data: extracting coding sequences and the flanking intron sequences for phylogenetics. In the paper we describe HybPiper and demonstrate its use on 458 genes enriched for 28 species of Artocarpus and its relatives.

Main Features

To conduct phylogenetic analysis, high-throughput DNA sequencing reads need to be re-assembled into continuous sequences. HybPiper uses several pre-existing bioinformatics tools to automate the process while maintaining an organized set of intermediate files that can aid in more detailed analysis. The main script of HybPiper has three phases:HybPiper_Infographic

Sorts reads by mapping them to target sequences, using BLASTx (protein targets) or BWA (nucleotide targets). Assembles contigs for each gene separately. Aligns contigs to target sequences and extracts exon (coding) sequence. HybPiper also includes a number of scripts that can be used to extract more information from the sequencing data, including:

Coverage depth and target enrichment efficiency data, including a script for plotting a heat map of gene recovery. Retrieval of non-coding flanking sequences (i.e. introns) either separately or together with the coding sequences (supercontigs). Identification of putative paralogous sequences, and methods to help distinguish ancient from recent paralogs. Process HybPiper results from many samples; for example, generation of separate FASTA files for each gene, ready for phylogenetics pipelines For more information about HybPiper, including complete tutorials on installation, usage, and an example toy dataset, check out the GitHub page and the HybPiper wiki.

Developing HybPiper

Coming up with a consistent pipeline for processing target enrichment data was one of the primary tasks for my post-doc as part of the NSF Tree of Life grant that our team received in 2013. As part of that grant, we’ve developed over 800 markers to reconstruct the phylogeny of nearly 400 pleurocarpous mosses, so I new that the pipeline would have to be very efficient. Along the way I discovered tools such as GNU Parallel, which greatly improved the speed of mapping, sequence assembly and exon extraction. Early in the process we made the decision to develop a pipeline that could itself be a product of the grant, a tool with broad applications to address the growing use of high-throughput sequencing and target enrichment in phylogenetic systematics of non-model organisms.

Many of the features in HybPiper were suggested by several of the co-authors on the APPS paper, particularly Yang Liu, Rafael Medina, and Elliot Gardner. Twitter also played an important role: when I discovered that the assembler I was using (Velvet) was generating quite questionable assemblies, a couple people suggested SPAdes as a replacement, which has worked out great!

Naming HybPiper

Just before submitting the manuscript, my advisor called me into his office and with a very serious tone said “You need a new name for the pipeline.” One week later, we had one of the most entertaining lab meetings ever– here’s a brief look into the creative process. HybPiper brainstorm

HybPiper Whiteboard

After 46 minutes and nearly calling the pipeline Skunk Trapper, we had an epiphany with HybPiper. Elliot, who was not at the meeting, provided the logo the very next day.

Development of HybPiper is ongoing, and I hope to maintain and extend the code to other applications in the future. For example, I’ve already received a few requests from users who would like to see an expansion of the handling of HybPiper for non-coding regions, such as plastid intergenic markers.

For someone passionate about Sphagnum, the thought of visiting a place where peat moss is harvested for commercial use might seem a little like an ornithologist visiting a sport-hunting facility, or a mollusk researcher watching a big diesel spewing ship dredge a river channel. But when I saw that a tour of SunGrow’s peat harvesting and restoration facility in Seba Beach was on the menu at the 2015 Botany conference, I signed up instantly.

I must admit that I was initially skeptical; although it is sometimes classified as a renewable resource, peat forms very slowly. Bog growth rate has been measured at about 1 cm per year, and the typical depth of usable peat in Canada is 3-5 meters. So even if conditions instantly returned to peat accumulation, it would take hundreds of years to regenerate. This makes it more comparable not to other renewables like switchgrass or even loblolly pine, but rather closer on the renewable scale to coal.

I was put at ease by Dr. Line Rochefort, of Laval University in Quebec, who conducted the tour. She has been working with peat harvesting companies for 25 years, advising them on best practices. She helped organize the Canadian Sphagnum Peat Moss Association, a collection of researchers and companies around Canada with a stated interest in responsibly restoring peatland habitat after harvesting. This is not mandated by the Canadian government, which only requires that the site be returned to a wetland. But something as simple as filling in the ditches would never return the peatland to its original state.

Peatland Restoration Tour
Line Rochefort speaks to the group about peatland restoration

Essentially, the restoration of peatlands comes down to one argument: peatlands are going to get harvested, as long as there is demand for peat in the horticultural industry (the association estimates peat harvesting is worth $337 million annually). It would be much better if we (as humans) did this as responsibly as possible, rather than by taking all we could and running away. Peat harvesting has the potential to impact the environment in the short term, through wetland destruction; and long term– removing a peatland causes a net increase of carbon emissions beyond just the removal of living plant material.

Peat Harvesters
Peat harvesters are essentially giant vacuums

After harvesting, even if the drainage ditches are plugged, the peatland cannot recover. Dr. Rochefort mentioned visiting peatlands in Colorado that had been harvested with no attempt to restore them; after 140 years, they still look like barren wastelands. This is because Sphagnum peat moss, the most important plant for northern boreal bog peatlands, cannot recolonize the peatland without some help. Dr. Rochefort told us that when she searches for a “donor site”– a peatland that can have its top layer removed and distributed atop harvested peat for reclamation– the most important plant is not Sphagnum, but another moss, Polytrichum.

Sphagnum and Polytrichum
Polytrichum (left) and Sphagnum (right). Stubby bryologist fingers for scale.

Polytrichum is an upright moss with hardy “stems” that form a matrix onto which Sphagnum can grab hold and begin to build hummocks and accumulate peat. In the first phase of peatland restoration, material from the donor site is spread onto a harvested peatland in a 1:10 ratio (one hectare of donor site can be spread across 10 hectares of harvested peatland). Sphagnum will regenerate from broken fragments, but have a very hard time establishing on its own. Polytrichum, meanwhile, will easily germinate from its very abundant spore bank in the newly spread donor site material. Dr. Rochefort said that Polytrichum is so important that they make sure to add rock phosphate to the site to help Polytrichum spores germinate.

During the tour, we visited three sites: one that had been reclaimed using their procedure in 2009, another that had been reclaimed in a less “rigorous” way in 1994, and a donor site. The more recent reclamation site had an abundance of all the types of mosses you would see in a mature bog: Sphagnum magellanicum, Sphagnum fuscum, Sphagnum angustifolium, Aulacomnium palustre, and a few “brown mosses” such as Tomenthypnum and Drepanocladus. The vegetation had recovered nicely, but did have some species that did not belong in a bog: cotton grass and small birch trees were the most obvious. Dr. Rochefort said that based on earlier restorations, these plants were likely to die out as the Sphagnum took ahold of the habitat.

The elder reclamation site was more interesting from a botanizing perspective. Near the road there was a part where no moss mixture had been spread, and it looked like the wasteland Dr. Rochefort described from Colorado. Just a few feet away, there was a lush and healthy lawn of Sphagnum, perhaps ten species living happily. Although it wasn’t a true raised bog– that might take hundreds of years for the peat to regrow– I would not have been able to tell it was bare peat just 20 years earlier.

The peat extraction process is pretty interesting itself. They actually harvest in the dead of winter, once everything is frozen and covered with snow. Then they use a harrowing device to remove the top 20 cm of plant material– everything including mosses, grasses, and even black spruce trees! This material is kept in a pile for the winter and spring; the water from the snow and the mulch from the woodchips helps maintain the plants while they await spreading on a new peatland. At this point someone asked about the impact of cutting down all the spruce trees. “Compared to the peat moss,” Dr. Rochefort said, “the carbon produced by the spruce is negligible.” Such a difference compared to other restoration efforts, where mosses are barely considered, here is perhaps the ultimate situation in which moss matters.

In the spring, the peat is turned and dried until late summer. Then, extremely large driving vacuums come along and suck up the peat. They turn over the peat behind themselves, and once this layer dries, the process can be repeated. They may get 20-30 years of peat harvest from a single plot, and after this is done, plant material from a donor site will be spread to encourage restoration.

Overall, I left with a positive opinion of the efforts being done at SunGro. I won’t say too much about the political and business implications of peat harvesting, as I will leave that to those who are more informed on those issues. From a botanist’s perspective, and one who has some experience with natural peatlands, their efforts are commendable. I thoroughly enjoyed the tour, not least because I finally got the opportunity to get away from the computer and look at some plants!