src.embed module#

Command line interface#

pathogen-embed#

Reduced dimension embeddings for pathogen sequences

usage: pathogen-embed [-h] --alignment ALIGNMENT [ALIGNMENT ...] [--distance-matrix DISTANCE_MATRIX [DISTANCE_MATRIX ...]] [--separator SEPARATOR] [--indel-distance]
                      [--random-seed RANDOM_SEED] [--output-dataframe OUTPUT_DATAFRAME] [--output-figure OUTPUT_FIGURE] [--embedding-parameters EMBEDDING_PARAMETERS]
                      [--output-pairwise-distance-figure OUTPUT_PAIRWISE_DISTANCE_FIGURE]
                      {pca,t-sne,umap,mds} ...

Positional Arguments#

command

Possible choices: pca, t-sne, umap, mds

Named Arguments#

--alignment

an aligned FASTA file (or files) to create a distance matrix with. Make sure the strain order in this file matches the order in the distance matrix. If adding more than one alignment, make sure the strain names and their order are the same across all the files.

--distance-matrix

a distance matrix (or matrices) that can be read in by pandas, index column as row 0. If adding more than one distance matrix, make sure the strain names in the header and their order are the same across all the files.

--separator

separator between columns in the given distance matrix

Default: “,”

--indel-distance

include insertions/deletions in genetic distance calculations

Default: False

--random-seed

an integer used for reproducible results.

Default: 314159

--output-dataframe

a csv file outputting the embedding with the strain name and its components.

--output-figure

outputs a plot of the embedding

--embedding-parameters

The file containing the parameters by which to tune the embedding. The values from the first record of this file will override default values or values provided by the command line arguments.

--output-pairwise-distance-figure

a scatterplot correlating the genetic vs Euclidean distances
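The ordering requirement on --alignment and --distance-matrix can be checked before running the command. The following is a minimal sketch, not part of pathogen-embed: the helper names are hypothetical, and it assumes the distance matrix has a header row of strain names with the index in the first column.

```python
import csv
import io


def fasta_names(handle):
    """Return sequence names in file order (text after '>' up to the first whitespace)."""
    return [
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    ]


def matrix_names(handle, separator=","):
    """Return strain names from the header row, skipping the index column (assumed first)."""
    header = next(csv.reader(handle, delimiter=separator))
    return header[1:]


def same_order(alignment_handle, matrix_handle, separator=","):
    """True when both inputs list the same strains in the same order."""
    return fasta_names(alignment_handle) == matrix_names(matrix_handle, separator)


# Small in-memory example; real use would pass open file handles instead.
alignment = io.StringIO(">A\nACGT\n>B\nACGA\n")
matrix = io.StringIO(",A,B\nA,0,1\nB,1,0\n")
print(same_order(alignment, matrix))
```

The same comparison can be repeated pairwise when several alignments or several matrices are supplied.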

Sub-commands#

pca#

Principal Component Analysis

pathogen-embed pca [-h] [--components COMPONENTS] [--explained-variance EXPLAINED_VARIANCE]

Named Arguments#

--components

the number of components for PCA

Default: 10

--explained-variance

the path to the CSV file in which to write the explained variance for each component

t-sne#

t-distributed Stochastic Neighbor Embedding

pathogen-embed t-sne [-h] [--components COMPONENTS] [--perplexity PERPLEXITY] [--learning-rate LEARNING_RATE]

Named Arguments#

--components

the number of components for t-SNE

Default: 2

--perplexity

The perplexity is related to the number of nearest neighbors. Because of this, the best perplexity value tends to grow with the size of the dataset (large dataset -> large perplexity). Values between 5 and 50 work best. The default is the value that performed best consistently for pathogen analyses in an exhaustive grid search.

Default: 30.0

--learning-rate

The learning rate for t-SNE is usually between 10.0 and 1000.0. Values outside these bounds may create inaccurate results. The default is the value that performed best consistently for pathogen analyses in an exhaustive grid search.

Default: auto

umap#

Uniform Manifold Approximation and Projection

pathogen-embed umap [-h] [--components COMPONENTS] [--nearest-neighbors NEAREST_NEIGHBORS] [--min-dist MIN_DIST]

Named Arguments#

--components

the number of components for UMAP

Default: 2

--nearest-neighbors

Nearest neighbors controls how UMAP balances local versus global structure in the data (finer detail patterns versus global structure). The best value tends to grow with the size of the data (large dataset -> large nearest neighbors). The default is the value that performed best consistently for pathogen analyses in an exhaustive grid search.

Default: 200

--min-dist

Minimum distance controls how tightly packed the UMAP embedding is. While it does not change the structure of the data, it does change the embedding’s shape. The default is the value that performed best consistently for pathogen analyses in an exhaustive grid search.

Default: 0.5

mds#

Multidimensional Scaling

pathogen-embed mds [-h] [--components COMPONENTS] [--stress STRESS]

Named Arguments#

--components

the number of components for MDS

Default: 10

--stress

the path to the CSV file in which to write the stress for the embedding
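This page does not say how the stress value is computed. As a rough illustration of the concept only, the sketch below computes raw stress: the sum, over each pair of samples, of the squared difference between the input distance and the Euclidean distance in the embedding. The function names are hypothetical and the exact definition used by pathogen-embed may differ.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def raw_stress(distances, embedding):
    """Sum of squared (input distance - embedded distance) over pairs i < j."""
    n = len(embedding)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff = distances[i][j] - euclidean(embedding[i], embedding[j])
            total += diff ** 2
    return total


# Two samples whose input distance is 3, embedded 4 units apart.
distances = [[0, 3], [3, 0]]
embedding = [[0.0, 0.0], [0.0, 4.0]]
print(raw_stress(distances, embedding))
```

A stress of zero means the embedding reproduces the input distances exactly; larger values mean more distortion.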

pathogen-distance#

Hamming distance (optionally indel sensitive) similarity matrix for pathogen sequences

usage: pathogen-distance [-h] --alignment ALIGNMENT [--indel-distance] --output OUTPUT

Named Arguments#

--alignment

an aligned FASTA file to create a distance matrix with. Make sure the strain order in this file matches the order in the distance matrix.

--indel-distance

include insertions/deletions in genetic distance calculations

Default: False

--output

a csv file outputting the distance matrix annotated with strain names as the columns
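To make the distance calculation concrete, here is a minimal sketch of a Hamming distance between aligned sequences with an optional indel term. It is not the package's implementation: the function names are hypothetical, positions with ambiguous characters such as N are skipped as a simplifying assumption, and each gap position counts as one difference, whereas the real tool may count indels differently (for example, per contiguous gap).

```python
def hamming_distance(seq_a, seq_b, count_indels=False):
    """Count mismatched positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    distance = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            # Ordinary substitution between two unambiguous bases.
            if a != b:
                distance += 1
        elif count_indels and (a == "-") != (b == "-"):
            # Exactly one of the two sequences has a gap at this position.
            distance += 1
    return distance


def distance_matrix(sequences, count_indels=False):
    """Return (names, matrix) for a dict mapping strain name to aligned sequence."""
    names = list(sequences)
    matrix = [
        [hamming_distance(sequences[a], sequences[b], count_indels) for b in names]
        for a in names
    ]
    return names, matrix


names, matrix = distance_matrix({"A": "ACGT", "B": "ACGA", "C": "A-GA"})
print(names)
print(matrix)
```

Rows and columns follow the insertion order of the input dictionary, which keeps the strain order consistent with the alignment.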

pathogen-cluster#

HDBSCAN clustering for reduced dimension embeddings

usage: pathogen-cluster [-h] --embedding EMBEDDING [--label-attribute LABEL_ATTRIBUTE] [--random-seed RANDOM_SEED] [--min-size MIN_SIZE] [--min-samples MIN_SAMPLES]
                        [--distance-threshold DISTANCE_THRESHOLD] --output-dataframe OUTPUT_DATAFRAME [--output-figure OUTPUT_FIGURE]

Named Arguments#

--embedding

The embedding to assign clustering labels to via HDBSCAN (https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)

--label-attribute

the name used to label the cluster column in the resulting dataframe

--random-seed

an integer used for reproducible results.

Default: 314159

--min-size

minimum cluster size for HDBSCAN

Default: 5

--min-samples

minimum number of samples to seed a cluster for HDBSCAN. Lowering this value reduces the number of samples that do not get clustered.

Default: 5

--distance-threshold

The float value for the distance threshold by which to cluster data in the embedding and assign labels via HDBSCAN. If no value is given for --distance-threshold, the default distance threshold of 0.0 will be used.

--output-dataframe

a csv file outputting the embedding with the strain name and its components.

--output-figure

outputs a PDF with a plot of the embedding colored by cluster
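When tuning --min-size and --min-samples, it can help to summarize the labels in the output dataframe. The sketch below is an illustration under assumptions: the column names "strain" and "cluster_label" are hypothetical (the label column is whatever --label-attribute was set to), and unclustered samples are assumed to carry the HDBSCAN noise label -1.

```python
import csv
import io
from collections import Counter


def cluster_sizes(handle, label_column):
    """Count how many rows carry each cluster label (labels are kept as strings)."""
    return Counter(row[label_column] for row in csv.DictReader(handle))


def unclustered_fraction(sizes, noise_label="-1"):
    """Fraction of samples with the assumed noise label; 0.0 for an empty table."""
    total = sum(sizes.values())
    return sizes.get(noise_label, 0) / total if total else 0.0


# In-memory stand-in for the CSV written via --output-dataframe.
table = io.StringIO("strain,cluster_label\nA,0\nB,0\nC,1\nD,-1\n")
sizes = cluster_sizes(table, "cluster_label")
print(dict(sizes))
print(unclustered_fraction(sizes))
```

Comparing the unclustered fraction across runs gives a quick view of how a lower --min-samples changes the number of samples left without a cluster.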

API#

pathogen-embed.

Reduced dimension embeddings for pathogen sequences.