Quickstart

To try SigSeekr right after installing it, you can run it on a small toy dataset.

This dataset is hosted on figshare - to get it, run the following command:

You should now have a folder called example-data in your present working directory. To run SigSeekr, enter the following command:

The -o flag names the directory where the output files will be created; any name works, and the folder is created if it does not already exist. After entering the command, you should see output similar to this:

 [Elapsed Time: 0.00 seconds] Creating inclusion kmer set... 
 [Elapsed Time: 0.41 seconds] Creating exclusion kmer set... 
 [Elapsed Time: 0.83 seconds] Subtracting exclusion kmers from inclusion kmers with cutoff 1... 
 [Elapsed Time: 0.97 seconds] Found kmers unique to inclusion... 
 [Elapsed Time: 0.97 seconds] Generating contiguous sequences from inclusion kmers... 
 [Elapsed Time: 4.10 seconds] Removing unnecessary output files... 
 [Elapsed Time: 4.10 seconds] SigSeekr run complete! 
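
The subtraction step in the log above can be pictured with a small set-based sketch. This is only an illustration of the idea, not SigSeekr's actual implementation: the function names are hypothetical, the sequences are made up, and the sketch ignores reverse complements and the cutoff value mentioned in the log.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings (kmers) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_inclusion_kmers(inclusion_seqs, exclusion_seqs, k):
    """Return kmers found in the inclusion sequences but in no exclusion sequence.

    Assumption for this sketch: kmers from all inclusion sequences are pooled
    into one set, and likewise for the exclusion sequences.
    """
    inclusion_set = set()
    for seq in inclusion_seqs:
        inclusion_set |= kmers(seq, k)
    exclusion_set = set()
    for seq in exclusion_seqs:
        exclusion_set |= kmers(seq, k)
    # Set difference: keep only kmers never seen in the exclusion set.
    return inclusion_set - exclusion_set


# Made-up toy sequences with a tiny kmer size so the result is easy to check.
result = unique_inclusion_kmers(["ACGTACGT"], ["ACGTTTTT"], k=4)
print(sorted(result))
```

Here ACGT occurs in both sequences, so it is dropped, and only the kmers seen exclusively in the inclusion sequence remain.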

The sigseekr_output folder should contain two files: inclusion_kmers.fasta, which lists all the kmers that are unique to the inclusion set, and sigseekr_result.fasta, which contains the contiguous regions spanned by those unique kmers. In this case, sigseekr_result.fasta should have one unique region. To take a look at it, use the cat command:

The output should be:

>contig1_sequence1
AACAGGCGACAGGCAGCATCACTAGCTACTA
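
If you would rather inspect the result programmatically than with cat, a minimal FASTA reader along these lines should be enough. It is a sketch using only the Python standard library; read_fasta is a hypothetical helper written for this example, not part of SigSeekr.

```python
def read_fasta(path):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A header line starts a new record.
                header = line[1:]
                records[header] = ""
            elif header is not None:
                # Sequence lines are appended to the current record.
                records[header] += line
    return records
```

After the quickstart run, calling read_fasta("sigseekr_output/sigseekr_result.fasta") should return a dict with a single entry, keyed by contig1_sequence1.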

Detailed Usage

To see all usage options, run sigseekr.py --help, which prints the following output. Further details on each option can be found below.

usage: sigseekr.py [-h] -i INCLUSION -e EXCLUSION -o OUTPUT_FOLDER
                   [-s KMER_SIZE] [-t THREADS] [-pcr] [-k]
                   [-p PLASMID_FILTERING] [-l]
                   [-a AMPLICON_SIZE [AMPLICON_SIZE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -i INCLUSION, --inclusion INCLUSION
                        Path to folder containing genome(s) you want signature
                        sequences for. Genomes can be in FASTA or FASTQ
                        format. FASTA-formatted files should be uncompressed,
                        FASTQ-formatted files can be gzip-compressed or
                        uncompressed.
  -e EXCLUSION, --exclusion EXCLUSION
                        Path to folder containing exclusion genome(s) - those
                        you do not want signature sequences for. Genomes can
                        be in FASTA or FASTQ format. FASTA-formatted files
                        should be uncompressed, FASTQ-formatted files can be
                        gzip-compressed or uncompressed.
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Path to folder where you want to store output files.
                        Folder will be created if it does not exist.
  -s KMER_SIZE, --kmer_size KMER_SIZE
                        Kmer size used to search for sequences unique to
                        inclusion. Default 31. No idea how changing this
                        affects results. TO BE INVESTIGATED.
  -t THREADS, --threads THREADS
                        Number of threads to run analysis on. Defaults to
                        number of cores on your machine.
  -pcr, --pcr           Enable to filter out inclusion kmers that have close
                        relatives in exclusion kmers.
  -k, --keep_tmpfiles   If enabled, will not clean up a bunch of (fairly)
                        useless files at the end of a run.
  -p PLASMID_FILTERING, --plasmid_filtering PLASMID_FILTERING
                        To ensure unique sequences are not plasmid-borne, a
                        FASTA-formatted database can be provided with this
                        argument. Any unique kmers that are in the plasmid
                        database will be filtered out.
  -l, --low_memory      Activate this flag to cause plasmid filtering to use
                        substantially less RAM (and go faster), at the cost of
                        some sensitivity.
  -a AMPLICON_SIZE [AMPLICON_SIZE ...], --amplicon_size AMPLICON_SIZE [AMPLICON_SIZE ...]
                        Desired size for PCR amplicons. Default 200. If you
                        want to find more than one amplicon size, enter
                        multiple, separated by spaces.
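
When SigSeekr is driven from a script, the flags above can be assembled into an argument list before launching the run. The sketch below uses only flags shown in the help text; build_sigseekr_command and the folder names are hypothetical, chosen for illustration, and the actual launch is left commented out because it requires SigSeekr to be installed and on your PATH.

```python
import shlex
import subprocess


def build_sigseekr_command(inclusion, exclusion, output_folder,
                           kmer_size=None, threads=None, pcr=False,
                           amplicon_sizes=None):
    """Assemble a sigseekr.py argument list from the flags in the help text."""
    command = ["sigseekr.py", "-i", inclusion, "-e", exclusion,
               "-o", output_folder]
    if kmer_size is not None:
        command += ["-s", str(kmer_size)]
    if threads is not None:
        command += ["-t", str(threads)]
    if pcr:
        command.append("-pcr")
    if amplicon_sizes:
        # -a accepts one or more sizes separated by spaces.
        command += ["-a"] + [str(size) for size in amplicon_sizes]
    return command


# Placeholder folder names; substitute your own inclusion/exclusion folders.
command = build_sigseekr_command("inclusion_genomes", "exclusion_genomes",
                                 "sigseekr_output", threads=4, pcr=True,
                                 amplicon_sizes=[200, 300])
print(shlex.join(command))

# To actually launch the run:
# subprocess.run(command, check=True)
```

Options left at None are omitted from the command, so SigSeekr falls back to its own defaults (for example, a kmer size of 31).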

Additional info: