Quickstart
To try SigSeekr right after installing it, you can run it on a toy dataset.
This dataset is hosted on figshare. To download and unpack it, run the following command:
wget https://ndownloader.figshare.com/files/9885379 && tar xf 9885379
You should now have a folder called example-data in your present working directory. To run SigSeekr, enter the following command:
sigseekr.py -i example-data/inclusion/ -e example-data/exclusion/ -o sigseekr_output
The directory specified with the -o flag can have any name; it is the directory where the output files will be created.
After entering the command, you should see output similar to this:
[Elapsed Time: 0.00 seconds] Creating inclusion kmer set...
[Elapsed Time: 0.41 seconds] Creating exclusion kmer set...
[Elapsed Time: 0.83 seconds] Subtracting exclusion kmers from inclusion kmers with cutoff 1...
[Elapsed Time: 0.97 seconds] Found kmers unique to inclusion...
[Elapsed Time: 0.97 seconds] Generating contiguous sequences from inclusion kmers...
[Elapsed Time: 4.10 seconds] Removing unnecessary output files...
[Elapsed Time: 4.10 seconds] SigSeekr run complete!
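The first steps in the log above (building the inclusion and exclusion kmer sets, then subtracting one from the other) can be illustrated with a toy sketch. This is not SigSeekr's implementation: it uses plain Python sets on short strings, ignores reverse complements and any count cutoff, and the function names `kmers` and `unique_inclusion_kmers` are made up for this example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_inclusion_kmers(inclusion, exclusion, k=31):
    """Toy version of the subtraction step (hypothetical helper).

    Keeps kmers present in every inclusion sequence and absent from
    all exclusion sequences.
    """
    inclusion_sets = [kmers(genome, k) for genome in inclusion]
    # Kmers shared by all inclusion genomes
    shared = set.intersection(*inclusion_sets) if inclusion_sets else set()
    # Every kmer seen in any exclusion genome
    excluded = set().union(*(kmers(genome, k) for genome in exclusion))
    return shared - excluded


# Tiny example with k=4 so the result is easy to check by hand
result = unique_inclusion_kmers(
    inclusion=["AACAGGCGAC", "TTAACAGGCG"],
    exclusion=["AACAGTTTT"],
    k=4,
)
print(sorted(result))
```

On real genomes the kmer sets are far too large for this approach to be practical, so treat it only as a picture of the set logic.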
The sigseekr_output folder should contain two files: inclusion_kmers.fasta, which lists all the kmers that are unique to the inclusion set, and sigseekr_result.fasta, which contains the regions that the unique kmers span. In this case, sigseekr_result.fasta should contain one unique region. To take a look at it, use the cat command:
cat sigseekr_output/sigseekr_result.fasta
The expected output is:
>contig1_sequence1
AACAGGCGACAGGCAGCATCACTAGCTACTA
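For runs that produce more than one region, reading the file in a script may be more convenient than cat. The following is a minimal sketch of a FASTA reader; `read_fasta` is a helper written for this example, not part of SigSeekr, and the demo parses the quickstart output from an in-memory string so that it runs without the output folder being present.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Demo on the quickstart result; for a real run, replace the StringIO with
# open("sigseekr_output/sigseekr_result.fasta")
example = io.StringIO(">contig1_sequence1\nAACAGGCGACAGGCAGCATCACTAGCTACTA\n")
records = list(read_fasta(example))
for name, seq in records:
    print(f"{name}\t{len(seq)} bp")
```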
Detailed Usage
Detailed usage options can be found by typing sigseekr.py --help, which gives the following output. Further details on each option can be found below.
usage: sigseekr.py [-h] -i INCLUSION -e EXCLUSION -o OUTPUT_FOLDER
[-s KMER_SIZE] [-t THREADS] [-pcr] [-k]
[-p PLASMID_FILTERING] [-l]
[-a AMPLICON_SIZE [AMPLICON_SIZE ...]]
optional arguments:
-h, --help show this help message and exit
-i INCLUSION, --inclusion INCLUSION
Path to folder containing genome(s) you want signature
sequences for. Genomes can be in FASTA or FASTQ
format. FASTA-formatted files should be uncompressed,
FASTQ-formatted files can be gzip-compressed or
uncompressed.
-e EXCLUSION, --exclusion EXCLUSION
Path to folder containing exclusion genome(s) - those
you do not want signature sequences for. Genomes can
be in FASTA or FASTQ format. FASTA-formatted files
should be uncompressed, FASTQ-formatted files can be
gzip-compressed or uncompressed.
-o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
Path to folder where you want to store output files.
Folder will be created if it does not exist.
-s KMER_SIZE, --kmer_size KMER_SIZE
Kmer size used to search for sequences unique to
inclusion. Default 31. No idea how changing this
affects results. TO BE INVESTIGATED.
-t THREADS, --threads THREADS
Number of threads to run analysis on. Defaults to
number of cores on your machine.
-pcr, --pcr Enable to filter out inclusion kmers that have close
relatives in exclusion kmers.
-k, --keep_tmpfiles If enabled, will not clean up a bunch of (fairly)
useless files at the end of a run.
-p PLASMID_FILTERING, --plasmid_filtering PLASMID_FILTERING
To ensure unique sequences are not plasmid-borne, a
FASTA-formatted database can be provided with this
argument. Any unique kmers that are in the plasmid
database will be filtered out.
-l, --low_memory Activate this flag to cause plasmid filtering to use
substantially less RAM (and go faster), at the cost of
some sensitivity.
-a AMPLICON_SIZE [AMPLICON_SIZE ...], --amplicon_size AMPLICON_SIZE [AMPLICON_SIZE ...]
Desired size for PCR amplicons. Default 200. If you
want to find more than one amplicon size, enter
multiple, separated by spaces.
Additional info:
-i, --inclusion
: Not too much to say about this: it's the folder where you place any genomes that you want signature sequences for. If you place more than one genome here, SigSeekr will look for kmers common to all input genomes and develop a signature sequence based on those. These genomes can be FASTA-formatted assemblies (recommended), in which case they must be uncompressed, or raw FASTQ reads, in which case they can be either uncompressed or gzip-compressed.
-e, --exclusion
: The collection of genomes you do not want your signature sequences to match. Same file format rules as the inclusion folder.
-o, --output_folder
: The folder where output files will be stored. Created if it doesn't exist. Creating a new folder for each run is recommended, as outputs from previous runs will be overwritten.
-t, --threads
: Number of threads to run SigSeekr with. Leaving this at the default setting of all cores on your machine is recommended, as most programs in the SigSeekr pipeline scale very well with additional threads.
-k, --keep_tmpfiles
: By default, a number of fairly boring (but sometimes quite large) files are deleted at the end of a run to save disk space. Specifying this option keeps them around if you want to inspect them more closely. Files kept with this option include the KMC inclusion, exclusion, and unique-to-inclusion databases (inclusion_db, exclusion_db, and unique_to_inclusion_db), FASTA files of all inclusion kmers (inclusion_kmers.fasta), and a BED file showing coverage of inclusion kmers across one of the inclusion genomes specified (regions_to_mask.bed).
-p, --plasmid_filtering
: If you're looking for sequences unique to a genome, you probably don't want them on mobile elements that might not be there the next time you look. To help alleviate this potential problem, you can specify the path to a FASTA-formatted database with this option. Any inclusion kmers found in the database will be excluded from further analysis. A relatively extensive plasmid database (~9000 RefSeq plasmids spanning all of Bacteria) can be downloaded with the following command:
wget https://ndownloader.figshare.com/files/9827323 && tar xf 9827323
This will create a folder called databases in your present working directory. Within the folder, plasmid_db.fasta is the plasmid database.
-l, --low_memory
: Using the above-mentioned plasmid database can be memory-intensive. To help alleviate that, add this flag, which will use less memory and go faster, at the cost of some sensitivity.
-pcr, --pcr
: Will attempt to find amplicons in your inclusion genomes that would be acceptable PCR products. By default, searches for 200 base pair amplicons, but this can be changed to any size with the -a option. The output file for this is confirmed_amplicons_X.fasta, where X is your specified amplicon size.
-a, --amplicon_size
: Changes the size of the PCR amplicons that SigSeekr attempts to find. If given multiple arguments separated by spaces (e.g. -a 200 300 400), SigSeekr will attempt to find amplicons of all sizes specified.
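
When several of these options are combined, it can help to assemble the command in a script. The sketch below only builds the argument list from flags documented above and does not run anything. `build_sigseekr_command` and `expected_amplicon_files` are hypothetical helpers written for this example, and two choices in them are assumptions rather than documented behaviour: -l is only added when a plasmid database is given, and -a is only added together with -pcr.

```python
import shlex


def build_sigseekr_command(inclusion, exclusion, output_folder,
                           amplicon_sizes=None, plasmid_db=None,
                           low_memory=False, keep_tmpfiles=False):
    """Assemble an argv list for sigseekr.py (hypothetical helper)."""
    cmd = ["sigseekr.py", "-i", inclusion, "-e", exclusion, "-o", output_folder]
    if plasmid_db:
        cmd += ["-p", plasmid_db]
        # Assumption: -l only matters when plasmid filtering is active
        if low_memory:
            cmd.append("-l")
    if keep_tmpfiles:
        cmd.append("-k")
    if amplicon_sizes:
        # Assumption: amplicon sizes are only useful in PCR mode.
        # -a accepts one or more sizes, so it is placed last.
        cmd += ["-pcr", "-a"] + [str(size) for size in amplicon_sizes]
    return cmd


def expected_amplicon_files(amplicon_sizes):
    """File names following the confirmed_amplicons_X.fasta pattern."""
    return [f"confirmed_amplicons_{size}.fasta" for size in amplicon_sizes]


cmd = build_sigseekr_command(
    "example-data/inclusion/", "example-data/exclusion/", "sigseekr_output",
    amplicon_sizes=[200, 300],
    plasmid_db="databases/plasmid_db.fasta",
    low_memory=True,
)
# Print a shell-safe version of the command for copy and paste
print(" ".join(shlex.quote(arg) for arg in cmd))
print(expected_amplicon_files([200, 300]))
```

The resulting list could be passed to subprocess.run(cmd, check=True) once SigSeekr is installed and on your PATH.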