src.embed module#
Command line interface#
pathogen-embed#
Reduced dimension embeddings for pathogen sequences
usage: pathogen-embed [-h] --alignment ALIGNMENT [ALIGNMENT ...] [--distance-matrix DISTANCE_MATRIX [DISTANCE_MATRIX ...]] [--separator SEPARATOR] [--indel-distance]
[--random-seed RANDOM_SEED] [--output-dataframe OUTPUT_DATAFRAME] [--output-figure OUTPUT_FIGURE] [--embedding-parameters EMBEDDING_PARAMETERS]
[--output-pairwise-distance-figure OUTPUT_PAIRWISE_DISTANCE_FIGURE]
{pca,t-sne,umap,mds} ...
Positional Arguments#
- command
Possible choices: pca, t-sne, umap, mds
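The usage line above puts the shared options before the sub-command and the sub-command's own options after it. The sketch below illustrates that ordering with a small stand-in parser built on Python's argparse. The parser is hypothetical: it mirrors only a few of the documented options and is not the one shipped with pathogen-embed, and the file names are placeholders. Because --alignment accepts several values, the example puts another option between it and the sub-command name so the name is not read as an extra alignment file.

```python
import argparse


def build_parser():
    """Hypothetical stand-in mirroring a subset of the documented usage."""
    parser = argparse.ArgumentParser(prog="pathogen-embed")
    # Shared options belong before the sub-command name.
    parser.add_argument("--alignment", nargs="+", required=True)
    parser.add_argument("--random-seed", type=int, default=314159)
    parser.add_argument("--output-dataframe")

    subparsers = parser.add_subparsers(dest="command", required=True)
    # Sub-command options belong after the sub-command name.
    tsne = subparsers.add_parser("t-sne")
    tsne.add_argument("--components", type=int, default=2)
    tsne.add_argument("--perplexity", type=float, default=30.0)
    return parser


# Placeholder file names; only the ordering of the arguments matters here.
args = build_parser().parse_args([
    "--alignment", "aligned.fasta",
    "--output-dataframe", "tsne.csv",
    "t-sne",
    "--perplexity", "45.0",
])
print(args.command, args.alignment, args.perplexity, args.components)
```

Options that are not given fall back to their documented defaults, here 2 components and the random seed 314159.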
Named Arguments#
- --alignment
an aligned FASTA file (or files) to create a distance matrix with. Make sure the strain order in this file matches the order in the distance matrix. If adding more than one alignment, make sure the order of the strains and strain names are the same between all the files.
- --distance-matrix
a distance matrix (or matrices) that can be read in by pandas, index column as row 0. If adding more than one distance matrix, make sure the order of the strains and strain names in the header are the same between all the files.
- --separator
separator between columns in the given distance matrix
Default: “,”
- --indel-distance
include insertions/deletions in genetic distance calculations
Default: False
- --random-seed
an integer used for reproducible results.
Default: 314159
- --output-dataframe
a csv file outputting the embedding with the strain name and its components.
- --output-figure
outputs a plot of the embedding
- --embedding-parameters
The file containing the parameters by which to tune the embedding. The values from the first record of this file will override default values or values provided by the command line arguments.
- --output-pairwise-distance-figure
a scatterplot correlating the genetic vs Euclidean distances
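The precedence described for --embedding-parameters (values from the first record of the file win over defaults and command line values) can be sketched as follows. This is only an illustration under stated assumptions: the documentation above does not specify the file format, so the CSV layout and the column names perplexity and learning_rate are hypothetical, and the code is not taken from pathogen-embed.

```python
import csv
import io

# Hypothetical parameter file: CSV layout and column names are assumptions.
parameter_file = io.StringIO(
    "perplexity,learning_rate\n"
    "45.0,200.0\n"
    "15.0,50.0\n"
)

# Values that would otherwise come from defaults or the command line.
settings = {"components": 2, "perplexity": 30.0, "learning_rate": "auto"}

# Only the first record is used; any later records are ignored.
first_record = next(csv.DictReader(parameter_file))
settings.update({name: float(value) for name, value in first_record.items()})

print(settings)
```

Settings that the file does not mention, such as components here, keep their earlier values.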
Sub-commands#
pca#
Principal Component Analysis
pathogen-embed pca [-h] [--components COMPONENTS] [--explained-variance EXPLAINED_VARIANCE]
Named Arguments#
- --components
the number of components for PCA
Default: 10
- --explained-variance
the path for the CSV explained variance for each component
t-sne#
t-distributed Stochastic Neighbor Embedding
pathogen-embed t-sne [-h] [--components COMPONENTS] [--perplexity PERPLEXITY] [--learning-rate LEARNING_RATE]
Named Arguments#
- --components
the number of components for t-SNE
Default: 2
- --perplexity
Perplexity is related to the number of nearest neighbors, so the best value tends to grow with the size of the dataset (large dataset -> large perplexity). Values between 5 and 50 work best. The default was consistently the best value for pathogen analyses in an exhaustive grid search.
Default: 30.0
- --learning-rate
The learning rate for t-SNE is usually between 10.0 and 1000.0. Values outside these bounds may produce inaccurate results. The default was consistently the best value for pathogen analyses in an exhaustive grid search.
Default: auto
umap#
Uniform Manifold Approximation and Projection
pathogen-embed umap [-h] [--components COMPONENTS] [--nearest-neighbors NEAREST_NEIGHBORS] [--min-dist MIN_DIST]
Named Arguments#
- --components
the number of components for UMAP
Default: 2
- --nearest-neighbors
Nearest neighbors controls how UMAP balances local versus global structure in the data (fine-detail patterns versus the overall shape). The best value tends to grow with the size of the dataset (large dataset -> large nearest neighbors). The default was consistently the best value for pathogen analyses in an exhaustive grid search.
Default: 200
- --min-dist
Minimum distance controls how tightly packed the UMAP embedding is. It does not change the structure of the data, but it does change the embedding’s shape. The default was consistently the best value for pathogen analyses in an exhaustive grid search.
Default: 0.5
mds#
Multidimensional Scaling
pathogen-embed mds [-h] [--components COMPONENTS] [--stress STRESS]
Named Arguments#
- --components
the number of components for MDS
Default: 10
- --stress
the path for the CSV stress for the embedding
pathogen-distance#
Hamming distance (optionally indel sensitive) similarity matrix for pathogen sequences
usage: pathogen-distance [-h] --alignment ALIGNMENT [--indel-distance] --output OUTPUT
Named Arguments#
- --alignment
an aligned FASTA file to create a distance matrix with. Make sure the strain order in this file matches the order in the distance matrix.
- --indel-distance
include insertions/deletions in genetic distance calculations
Default: False
- --output
a csv file outputting the distance matrix annotated with strain names as the columns
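As a rough illustration of what a Hamming distance matrix over an alignment looks like, here is a minimal sketch in plain Python. It is not the implementation used by pathogen-distance. In particular, skipping positions that contain N or a gap, and counting each contiguous run of gap-versus-base positions as a single event when indel_distance is enabled, are assumptions made for this example. The strain names and sequences are made up.

```python
import csv
import io


def hamming_distance(seq_a, seq_b, indel_distance=False, ignored="N-"):
    """Count mismatches between two aligned sequences of equal length.

    Assumption for this sketch: positions holding a character from `ignored`
    are skipped, and with indel_distance=True each contiguous run of
    gap-versus-base positions adds one to the distance.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    distance = 0
    in_gap_run = False
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if (a == "-") != (b == "-"):
            # Exactly one sequence has a gap at this position.
            if indel_distance and not in_gap_run:
                distance += 1
            in_gap_run = True
            continue
        in_gap_run = False
        if a in ignored or b in ignored:
            continue
        if a != b:
            distance += 1
    return distance


def distance_matrix(sequences, indel_distance=False):
    """Return strain names and a square matrix in the same strain order."""
    names = list(sequences)
    rows = [
        [hamming_distance(sequences[r], sequences[c], indel_distance) for c in names]
        for r in names
    ]
    return names, rows


# Made-up aligned sequences for illustration only.
sequences = {"strain_a": "ACGTAC", "strain_b": "ACGTTC", "strain_c": "AC--TC"}
names, matrix = distance_matrix(sequences)
_, indel_matrix = distance_matrix(sequences, indel_distance=True)

# Write the matrix as CSV with the strain names as the column header.
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(names)
writer.writerows(matrix)
print(output.getvalue())
```

Rows and columns follow the order of the strains in the input, which is why the strain order in an alignment has to match the order in any distance matrix used alongside it.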
pathogen-cluster#
HDBSCAN clustering for reduced dimension embeddings
usage: pathogen-cluster [-h] --embedding EMBEDDING [--label-attribute LABEL_ATTRIBUTE] [--random-seed RANDOM_SEED] [--min-size MIN_SIZE] [--min-samples MIN_SAMPLES]
[--distance-threshold DISTANCE_THRESHOLD] --output-dataframe OUTPUT_DATAFRAME [--output-figure OUTPUT_FIGURE]
Named Arguments#
- --embedding
The embedding to assign clustering labels to via HDBSCAN (https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)
- --label-attribute
the name of the cluster used to label the column in the resulting dataframe
- --random-seed
an integer used for reproducible results.
Default: 314159
- --min-size
minimum cluster size for HDBSCAN
Default: 5
- --min-samples
minimum number of samples to seed a cluster for HDBSCAN. Lowering this value reduces the number of samples that do not get clustered.
Default: 5
- --distance-threshold
The float value for the distance threshold by which to cluster data in the embedding and assign labels via HDBSCAN. If no value is given, the default distance threshold of 0.0 is used.
- --output-dataframe
a csv file outputting the embedding with the strain name and its components.
- --output-figure
outputs a PDF with a plot of the embedding colored by cluster
API#
pathogen-embed.
Reduced dimension embeddings for pathogen sequences.