Zika tutorial

Following on from the zika install

Get the NextStrain zika tutorial | https://github.com/nextstrain/zika-tutorial

git clone https://github.com/nextstrain/zika-tutorial.git
cd zika-tutorial

Prepare the Sequences

A Nextstrain build typically starts with a collection of pathogen sequences in a single FASTA file and a corresponding table of metadata describing those sequences in a tab-delimited text file. For this tutorial, we will use an example data set with a subset of 34 viruses.

Each example virus sequence record looks like the following, with the virus’s strain id as the sequence name in the header line followed by the virus sequence.

https://github.com/nextstrain/zika-tutorial/blob/master/data/sequences.fasta

>PAN/CDC_259359_V1_V3/2015
gaatttgaagcgaatgctaacaacagtatcaacaggttttattttggatttggaaacgag
agtttctggtcatgaaaaacccaaaaaagaaatccggaggattccggattgtcaatatgc
taaaacgcggagtagcccgtgtgagcccctttgggggcttgaagaggctgccagccggac
ttctgctgggtcatgggcccatcaggatggtcttggcgattctagcctttttgagattca

Each sequence record’s virus strain id links to the tab-delimited metadata file by the latter’s strain field. The metadata file contains a header of column names followed by one row per virus strain id in the sequences file. An example metadata file looks like the following.

https://github.com/nextstrain/zika-tutorial/blob/master/data/metadata.tsv

strain	virus	accession	date	region	country	division	city	db	segment	authors	url	title	journal	paper_url
1_0087_PF	zika	KX447509	2013-12-XX	oceania	french_polynesia	french_polynesia	french_polynesia	genbank	genome	Pettersson et al	https://www.ncbi.nlm.nih.gov/nuccore/KX447509	How Did Zika Virus Emerge in the Pacific Islands and Latin America?	MBio 7 (5), e01239-16 (2016)	https://www.ncbi.nlm.nih.gov/pubmed/27729507
1_0181_PF	zika	KX447512	2013-12-XX	oceania	french_polynesia	french_polynesia	french_polynesia	genbank	genome	Pettersson et al	https://www.ncbi.nlm.nih.gov/nuccore/KX447512	How Did Zika Virus Emerge in the Pacific Islands and Latin America?	MBio 7 (5), e01239-16 (2016)	https://www.ncbi.nlm.nih.gov/pubmed/27729507
1_0199_PF	zika	KX447519	2013-11-XX	oceania	french_polynesia	french_polynesia	french_polynesia	genbank	genome	Pettersson et al	https://www.ncbi.nlm.nih.gov/nuccore/KX447519	How Did Zika Virus Emerge in the Pacific Islands and Latin America?	MBio 7 (5), e01239-16 (2016)	https://www.ncbi.nlm.nih.gov/pubmed/27729507
Aedes_aegypti/USA/2016/FL05	zika	KY075937	2016-09-09	north_america	usa	usa	usa	genbank	genome	Grubaugh et al	https://www.ncbi.nlm.nih.gov/nuccore/KY075937	Genomic epidemiology reveals multiple introductions of Zika virus into the United States	Nature (2017) In press	https://www.ncbi.nlm.nih.gov/pubmed/28538723

A metadata file must have the following columns:

  • strain
  • virus
  • date

Builds using published data should include the following additional columns, as shown in the example above:

  • accession (e.g., NCBI GenBank, EMBL EBI, etc.)
  • authors
  • url
  • title
  • journal
  • paper_url

Start Nextstrain shell

nextstrain shell .

Filter the Sequences

Filter the parsed sequences and metadata to exclude strains from subsequent analysis and subsample the remaining strains to a fixed number of samples per group.

Make results directory

mkdir -p results/

Use augur to filter sequences

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --exclude config/dropped_strains.txt \
  --output results/filtered.fasta \
  --group-by country year month \
  --sequences-per-group 20 \
  --min-date 2012

* Paste as is and execute.

Align the Sequences

Create a multiple alignment of the sequences using a custom reference. After this alignment, columns with gaps in the reference are removed. Additionally, the –fill-gaps flag fills gaps in non-reference sequences with “N” characters. These modifications force all sequences into the same coordinate space as the reference sequence.

augur align \
  --sequences results/filtered.fasta \
  --reference-sequence config/zika_outgroup.gb \
  --output results/aligned.fasta \
  --fill-gaps

Get a Time-Resolved Tree

Augur can also adjust branch lengths in this tree to position tips by their sample date and infer the most likely time of their ancestors, using TreeTime. Run the refine command to apply TreeTime to the original phylogenetic tree and produce a “time tree”.

augur refine \
  --tree results/tree_raw.nwk \
  --alignment results/aligned.fasta \
  --metadata data/metadata.tsv \
  --output-tree results/tree.nwk \
  --output-node-data results/branch_lengths.json \
  --timetree \
  --coalescent opt \
  --date-confidence \
  --date-inference marginal \
  --clock-filter-iqd 4

In addition to assigning times to internal nodes, the refine command filters tips that are likely outliers and assigns confidence intervals to inferred dates. Branch lengths in the resulting Newick tree measure adjusted nucleotide divergence. All other data inferred by TreeTime is stored by strain or internal node name in the corresponding JSON file.

Annotate the Phylogeny

Reconstruct Ancestral Traits

TreeTime can also infer ancestral traits from an existing phylogenetic tree and metadata annotating each tip of the tree. The following command infers the region and country of all internal nodes from the time tree and original strain metadata. As with the refine command, the resulting JSON output is indexed by strain or internal node name.

augur traits \
  --tree results/tree.nwk \
  --metadata data/metadata.tsv \
  --output results/traits.json \
  --columns region country \
  --confidence

Infer Ancestral Sequences

Infer the ancestral sequence of each internal node and identify any nucleotide mutations on the branches leading to any node in the tree.

augur ancestral \
  --tree results/tree.nwk \
  --alignment results/aligned.fasta \
  --output results/nt_muts.json \
  --inference joint

Identify Amino-Acid Mutations

Identify amino acid mutations from the nucleotide mutations and a reference sequence with gene coordinate annotations. The resulting JSON file contains amino acid mutations indexed by strain or internal node name and by gene name. To export a FASTA file with the complete amino acid translations for each gene from each node’s sequence, specify the –alignment-output parameter in the form of results/aligned_aa_%GENE.fasta.

augur translate \
  --tree results/tree.nwk \
  --ancestral-sequences results/nt_muts.json \
  --reference-sequence config/zika_outgroup.gb \
  --output results/aa_muts.json

Export the Results

Finally, collect all node annotations and metadata and export it all in auspice’s JSON format. The resulting tree and metadata JSON files are the inputs to the auspice visualization tool.

augur export \
  --tree results/tree.nwk \
  --metadata data/metadata.tsv \
  --node-data results/branch_lengths.json \
              results/traits.json \
              results/nt_muts.json \
              results/aa_muts.json \
  --colors config/colors.tsv \
  --auspice-config config/auspice_config.json \
  --output-tree auspice/zika_tree.json \
  --output-meta auspice/zika_meta.json

Visualize the Results

If you entered the Nextstrain build environment using nextstrain shell at the beginning of this tutorial, leave it now using the exit command and then use nextstrain view to visualize the Zika build output in auspice/*.json.

# Leave the shell you entered earlier.

exit

# View results in your auspice/ directory.

nextstrain view auspice/

If you’re not using the Nextstrain CLI shell, then copy the auspice/*.json files into the data directory of your local auspice installation and start auspice from there. You can use the commands below (adjusted if necessary), or copy them using a graphical file explorer.

# Copy files into auspice data directory. Adjust
# paths if auspice isn’t installed in ~/src/auspice/.

mkdir ~/src/auspice/data/
cp auspice/*.json ~/src/auspice/data/

Then enter your auspice directry and start auspice.

# Navigate into auspice.

cd ~/src/auspice/data/

# Start auspice.

npm run dev

When auspice is running, navigate to http://localhost:4000/local/zika in your browser to view the results.

To stop auspice and return to the command line when you are done viewing your data, press CTRL+C.

Automate the Build with Snakemake

While it is instructive to run all of the above commands manually, it is more practical to automate their execution with a single script. Nextstrain implements these automated pathogen builds with Snakemake by defining a Snakefile like the one in the Zika repository you downloaded.

First delete the output from the manual steps above. (Be sure to navigate into the zika-tutorial/ directory first.)

rm -rf results/ auspice/

Then, if you’re using the Nextstrain CLI tool, run:

nextstrain build .

to run the automated pathogen build.

If you’re not using the Nextstrain CLI tool, run:

snakemake

The automated build runs all of the manual steps above up through the auspice export. View the results the same way you did before to confirm it produced the same Zika build you made manually.

Note that automated builds will only re-run steps when the data changes. This means builds will pick up where they left off if they are restarted after being interrupted. If you want to force a re-run of the whole build, first remove any previous output with nextstrain build . clean or snakemake clean.

Snakemake file

rule all:
    input:
        auspice_tree = "auspice/zika_tree.json",
        auspice_meta = "auspice/zika_meta.json"

rule files:
    params:
        input_fasta = "data/sequences.fasta",
        input_metadata = "data/metadata.tsv",
        dropped_strains = "config/dropped_strains.txt",
        reference = "config/zika_outgroup.gb",
        colors = "config/colors.tsv",
        auspice_config = "config/auspice_config.json"

files = rules.files.params

rule filter:
    message:
        """
        Filtering to
          - {params.sequences_per_group} sequence(s) per {params.group_by!s}
          - from {params.min_date} onwards
          - excluding strains in {input.exclude}
        """
    input:
        sequences = files.input_fasta,
        metadata = files.input_metadata,
        exclude = files.dropped_strains
    output:
        sequences = "results/filtered.fasta"
    params:
        group_by = "country year month",
        sequences_per_group = 20,
        min_date = 2012
    shell:
        """
        augur filter \
            --sequences {input.sequences} \
            --metadata {input.metadata} \
            --exclude {input.exclude} \
            --output {output.sequences} \
            --group-by {params.group_by} \
            --sequences-per-group {params.sequences_per_group} \
            --min-date {params.min_date}
        """

rule align:
    message:
        """
        Aligning sequences to {input.reference}
          - filling gaps with N
        """
    input:
        sequences = rules.filter.output.sequences,
        reference = files.reference
    output:
        alignment = "results/aligned.fasta"
    shell:
        """
        augur align \
            --sequences {input.sequences} \
            --reference-sequence {input.reference} \
            --output {output.alignment} \
            --fill-gaps
        """

rule tree:
    message: "Building tree"
    input:
        alignment = rules.align.output.alignment
    output:
        tree = "results/tree_raw.nwk"
    shell:
        """
        augur tree \
            --alignment {input.alignment} \
            --output {output.tree}
        """

rule refine:
    message:
        """
        Refining tree
          - estimate timetree
          - use {params.coalescent} coalescent timescale
          - estimate {params.date_inference} node dates
          - filter tips more than {params.clock_filter_iqd} IQDs from clock expectation
        """
    input:
        tree = rules.tree.output.tree,
        alignment = rules.align.output,
        metadata = files.input_metadata
    output:
        tree = "results/tree.nwk",
        node_data = "results/branch_lengths.json"
    params:
        coalescent = "opt",
        date_inference = "marginal",
        clock_filter_iqd = 4
    shell:
        """
        augur refine \
            --tree {input.tree} \
            --alignment {input.alignment} \
            --metadata {input.metadata} \
            --output-tree {output.tree} \
            --output-node-data {output.node_data} \
            --timetree \
            --coalescent {params.coalescent} \
            --date-confidence \
            --date-inference {params.date_inference} \
            --clock-filter-iqd {params.clock_filter_iqd}
        """

rule ancestral:
    message: "Reconstructing ancestral sequences and mutations"
    input:
        tree = rules.refine.output.tree,
        alignment = rules.align.output
    output:
        node_data = "results/nt_muts.json"
    params:
        inference = "joint"
    shell:
        """
        augur ancestral \
            --tree {input.tree} \
            --alignment {input.alignment} \
            --output {output.node_data} \
            --inference {params.inference}
        """

rule translate:
    message: "Translating amino acid sequences"
    input:
        tree = rules.refine.output.tree,
        node_data = rules.ancestral.output.node_data,
        reference = files.reference
    output:
        node_data = "results/aa_muts.json"
    shell:
        """
        augur translate \
            --tree {input.tree} \
            --ancestral-sequences {input.node_data} \
            --reference-sequence {input.reference} \
            --output {output.node_data} \
        """

rule traits:
    message: "Inferring ancestral traits for {params.columns!s}"
    input:
        tree = rules.refine.output.tree,
        metadata = files.input_metadata
    output:
        node_data = "results/traits.json",
    params:
        columns = "region country"
    shell:
        """
        augur traits \
            --tree {input.tree} \
            --metadata {input.metadata} \
            --output {output.node_data} \
            --columns {params.columns} \
            --confidence
        """

rule export:
    message: "Exporting data files for for auspice"
    input:
        tree = rules.refine.output.tree,
        metadata = files.input_metadata,
        branch_lengths = rules.refine.output.node_data,
        traits = rules.traits.output.node_data,
        nt_muts = rules.ancestral.output.node_data,
        aa_muts = rules.translate.output.node_data,
        colors = files.colors,
        auspice_config = files.auspice_config
    output:
        auspice_tree = rules.all.input.auspice_tree,
        auspice_meta = rules.all.input.auspice_meta
    shell:
        """
        augur export \
            --tree {input.tree} \
            --metadata {input.metadata} \
            --node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts} \
            --colors {input.colors} \
            --auspice-config {input.auspice_config} \
            --output-tree {output.auspice_tree} \
            --output-meta {output.auspice_meta}
        """

rule clean:
    message: "Removing directories: {params}"
    params:
        "results ",
        "auspice"
    shell:
        "rm -rfv {params}"