手元のPCで ntデータベース を使用する機会があり、その際に時短に貢献してくれたツールについて備忘録を残しておきます。

第2世代シーケンサー、キャピラリーシーケンサー、Long-read シーケンサーなどから得られたDNA配列の分類群を推定する際、BLASTn による相同性検索 は代表的な方法のひとつです。

BLASTn には、誰でもWeb上で実行できるweb BLASTと、自身のPCに解析環境を用意して行う Local BLAST (BLAST+)があります。

Web BLAST は NCBI の配列データベースをそのまま使える利点がある一方で、カスタム化や自動化には向きにくい印象があります。また、メンテナンスやサーバーダウンなど、自身の管理下にない要因で止まった場合は待つしかない、というデメリットもあります。

Local BLAST は Web BLAST ほどの手軽さはないものの、比較的導入のハードルが低く、バッチ処理や自動化に向いた相同性検索手段です。

塩基配列に対して BLAST 検索(BLASTn)を行う場合、必要となるのは以下の2点です。

  1. 系統推定したい DNA 配列 .1 DNA 配列データベース(BLAST DB)

BLAST を行うのであれば、(1) の配列はすでに手元にあるはずなので、あとは BLAST+ の導入 と データベースの準備 になります。

私の手元の環境は Linux PC と Windows PC(WSL)なので、macOS で同様に動作するかは未確認です。

  1. Linux PC
  2. OS Ubuntu LTS 20.04
  3. Windows PC
  4. OS Windows 10
  5. WSL (Ubuntu LTS 20.04をインストール) 表現があっているかは不明

BLAST+の導入

Windows でも WSL の実行環境が整っていれば、.exe を用意せずに Linux と同様に導入できます。

以下でインストールします。

sudo apt update -y
sudo apt isntall ncbi-blast+

確認

blastn -h

ここまでは、すでに導入済みの方も多いと思います。

pigzの導入

pigzはマルチコアで.gz圧縮・解凍を行うツールです。.xz.bz2 については pxz/pixzpbzip2 などがあります。

マルチコアで.gz圧縮・解凍を行ってくれるツール。.xzや.bz2についてはpxz/pixzやpbzip2があります。

# aptでのインストール
sudo apt install -y pigz
# condaベースのインストール(必要なら)
mamba install -y pigz

確認

pigz -h

aria2の導入

aria2curlwgetと比べて複数コネクションによる並列ダウンロードに対応しており、大きなファイルを短時間でダウンロードしたい場合に強力です。

sudo apt install aria2

確認

aria2c -h

これで必要なツールの導入は完了

NBCI v5 blastdbの確認

NCBI のBLAST DB にはいくつかのカテゴリがあります。対象分類群が決まっている場合は、それに合ったデータベースを選ぶほうが効率的です。

v5 の FTP サイトはこちら

塩基配列データベースとして最も大きい区分は nt のようです。 (※分割数や総容量は時期によって変わります)

ここでは、当時(2022年3月時点)に nt が 0〜58 の計59ファイルに分割されていた前提で進めます。

aria2でntデータベースのblastdbをダウンロードしてtarとpigzで解凍する

aria2での並列ダウンロード

同時接続数を上げると高速化できますが、サーバーへの負荷にもなるため、コネクション数は控えめにします。

以下は例として nt.00 を取得する場合

aria2c --continue=true --max-connection-per-server=4 --split=4 --min-split-size=1M \
ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.00.tar.gz

これを全ファイルに対して手作業で繰り返すのは大変なので、forで回します。また、-Zを使って同時に.md5もダウンロードします。

for i in $(seq -f %02g 0 58); do
  aria2c --continue=true --max-connection-per-server=4 --split=4 --min-split-size=1M -Z \
    ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.$i.tar.gz \
    ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.$i.tar.gz.md5
done

これでカレントディレクトリにnt.??.tar.gznt.??.tar.gz.mdが揃います。

次に、ダウンロードしたファイルのMD5を検証します。

cat *.md5 > md5sum_v && md5sum -c md5sum_v

以下のようにOKが並べば問題ありません。

nt.00.tar.gz: OK
nt.01.tar.gz: OK
nt.02.tar.gz: OK

もし、FALEDがでた場合は、そのファイルを再ダウンロードします。


tar + pigz による並列解凍

tarはアーカイブ(複数ファイルをまとめたもの)を展開できます。pigz.gz の圧縮・解凍をマルチコアで高速化できます。

ここでは tar の -I オプションで pigz を指定して展開します。 また、元の.tar.gzを残す場合はpigz-kを付けます(必要なら)。

# 解凍先フォルダを作成
mkdir -p nt-database
# 解凍
for s in `seq -f %02g 0 58` ; do tar -xvf nt.$s.tar.gz -C nt-database/ -I "pigz -d -k";  done

これでnt-database/に BLAST DB が展開されます。あとは好みの設定でblastnを実行して系統推定を進めてください。

感想

コア数の多いPCでは、ダウンロードや解凍、処理全体を 並列化・並行化 することで待ち時間を大きく短縮できる可能性があります。 xargsGNU parallel も有効なので、用途に応じて活用すると良いと思います。

Reference

各ツールのUSAGE

blastn

blastn -h
USAGE
  blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
    [-negative_taxidlist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-perc_identity float_value] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-penalty penalty]
    [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value]
    [-template_type type] [-template_length int_value] [-dust DUST_options]
    [-filtering_db filtering_database]
    [-window_masker_taxid window_masker_taxid]
    [-window_masker_db window_masker_db] [-soft_masking soft_masking]
    [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
    [-best_hit_score_edge float_value] [-subject_besthit]
    [-window_size int_value] [-off_diagonal_range int_value]
    [-use_index boolean] [-index_name string] [-lcase_masking]
    [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format]
    [-show_gis] [-num_descriptions int_value] [-num_alignments int_value]
    [-line_length line_length] [-html] [-sorthits sort_hits]
    [-sorthsps sort_hsps] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-version]

DESCRIPTION
   Nucleotide-Nucleotide BLAST 2.9.0+

Use '-help' to print detailed descriptions of command line arguments

pigz

pgiz -h
Usage: pigz [options] [files ...]
  will compress files in place, adding the suffix '.gz'. If no files are
  specified, stdin will be compressed to stdout. pigz does what gzip does,
  but spreads the work over multiple processors and cores when compressing.

Options:
  -0 to -9, -11        Compression level (level 11, zopfli, is much slower)
  --fast, --best       Compression levels 1 and 9 respectively
  -b, --blocksize mmm  Set compression block size to mmmK (default 128K)
  -c, --stdout         Write all processed output to stdout (won't delete)
  -d, --decompress     Decompress the compressed input
  -f, --force          Force overwrite, compress .gz, links, and to terminal
  -F  --first          Do iterations first, before block split for -11
  -h, --help           Display a help screen and quit
  -i, --independent    Compress blocks independently for damage recovery
  -I, --iterations n   Number of iterations for -11 optimization
  -J, --maxsplits n    Maximum number of split blocks for -11
  -k, --keep           Do not delete original file after processing
  -K, --zip            Compress to PKWare zip (.zip) single entry format
  -l, --list           List the contents of the compressed input
  -L, --license        Display the pigz license and quit
  -m, --no-time        Do not store or restore mod time
  -M, --time           Store or restore mod time
  -n, --no-name        Do not store or restore file name or mod time
  -N, --name           Store or restore file name and mod time
  -O  --oneblock       Do not split into smaller blocks for -11
  -p, --processes n    Allow up to n compression threads (default is the
                       number of online processors, or 8 if unknown)
  -q, --quiet          Print no messages, even on error
  -r, --recursive      Process the contents of all subdirectories
  -R, --rsyncable      Input-determined block locations for rsync
  -S, --suffix .sss    Use suffix .sss instead of .gz (for compression)
  -t, --test           Test the integrity of the compressed input
  -v, --verbose        Provide more verbose output
  -V  --version        Show the version of pigz
  -Y  --synchronous    Force output file write to permanent storage
  -z, --zlib           Compress to zlib (.zz) instead of gzip format
  --                   All arguments after "--" are treated as files

aria2

aria2c -h
Usage: aria2c [OPTIONS] [URI | MAGNET | TORRENT_FILE | METALINK_FILE]...
Printing options tagged with '#basic'.
See 'aria2c -h#help' to know all available tags.
Options:
 -v, --version                Print the version number and exit.

                              Tags: #basic

 -h, --help[=TAG|KEYWORD]     Print usage and exit.
                              The help messages are classified with tags. A tag
                              starts with "#". For example, type "--help=#http"
                              to get the usage for the options tagged with
                              "#http". If non-tag word is given, print the usage
                              for the options whose name includes that word.

                              Possible Values: #basic, #advanced, #http, #https, #ftp, #metalink, #bittorrent, #cookie, #hook, #file, #rpc, #checksum, #experimental, #deprecated, #help, #all
                              Default: #basic
                              Tags: #basic, #help

 -l, --log=LOG                The file name of the log file. If '-' is
                              specified, log is written to stdout.

                              Possible Values: /path/to/file, -
                              Tags: #basic

 -d, --dir=DIR                The directory to store the downloaded file.

                              Possible Values: /path/to/directory
                              Default: /home/naoki/Desktop
                              Tags: #basic, #file

 -o, --out=FILE               The file name of the downloaded file. It is
                              always relative to the directory given in -d
                              option. When the -Z option is used, this option
                              will be ignored.

                              Possible Values: /path/to/file
                              Tags: #basic, #http, #ftp, #file

 -s, --split=N                Download a file using N connections. If more
                              than N URIs are given, first N URIs are used and
                              remaining URLs are used for backup. If less than
                              N URIs are given, those URLs are used more than
                              once so that N connections total are made
                              simultaneously. The number of connections to the
                              same host is restricted by the
                              --max-connection-per-server option. See also the
                              --min-split-size option.

                              Possible Values: 1-*
                              Default: 5
                              Tags: #basic, #http, #ftp

 --file-allocation=METHOD     Specify file allocation method.
                              'none' doesn't pre-allocate file space. 'prealloc'
                              pre-allocates file space before download begins.
                              This may take some time depending on the size of
                              the file.
                              If you are using newer file systems such as ext4
                              (with extents support), btrfs, xfs or NTFS
                              (MinGW build only), 'falloc' is your best
                              choice. It allocates large(few GiB) files
                              almost instantly. Don't use 'falloc' with legacy
                              file systems such as ext3 and FAT32 because it
                              takes almost same time as 'prealloc' and it
                              blocks aria2 entirely until allocation finishes.
                              'falloc' may not be available if your system
                              doesn't have posix_fallocate() function.
                              'trunc' uses ftruncate() system call or
                              platform-specific counterpart to truncate a file
                              to a specified length.

                              Possible Values: none, prealloc, trunc, falloc
                              Default: prealloc
                              Tags: #basic, #file

 -V, --check-integrity[=true|false] Check file integrity by validating piece
                              hashes or a hash of entire file. This option has
                              effect only in BitTorrent, Metalink downloads
                              with checksums or HTTP(S)/FTP downloads with
                              --checksum option. If piece hashes are provided,
                              this option can detect damaged portions of a file
                              and re-download them. If a hash of entire file is
                              provided, hash check is only done when file has
                              been already download. This is determined by file
                              length. If hash check fails, file is
                              re-downloaded from scratch. If both piece hashes
                              and a hash of entire file are provided, only
                              piece hashes are used.

                              Possible Values: true, false
                              Default: false
                              Tags: #basic, #metalink, #bittorrent, #file, #checksum

 -c, --continue[=true|false]  Continue downloading a partially downloaded
                              file. Use this option to resume a download
                              started by a web browser or another program
                              which downloads files sequentially from the
                              beginning. Currently this option is only
                              applicable to http(s)/ftp downloads.

                              Possible Values: true, false
                              Default: false
                              Tags: #basic, #http, #ftp

 -i, --input-file=FILE        Downloads URIs found in FILE. You can specify
                              multiple URIs for a single entity: separate
                              URIs on a single line using the TAB character.
                              Reads input from stdin when '-' is specified.
                              Additionally, options can be specified after each
                              line of URI. This optional line must start with
                              one or more white spaces and have one option per
                              single line. See INPUT FILE section of man page
                              for details. See also --deferred-input option.

                              Possible Values: /path/to/file, -
                              Tags: #basic

 -j, --max-concurrent-downloads=N Set maximum number of parallel downloads for
                              every static (HTTP/FTP) URL, torrent and metalink.
                              See also --split and --optimize-concurrent-downloads options.

                              Possible Values: 1-*
                              Default: 5
                              Tags: #basic

 -Z, --force-sequential[=true|false] Fetch URIs in the command-line sequentially
                              and download each URI in a separate session, like
                              the usual command-line download utilities.

                              Possible Values: true, false
                              Default: false
                              Tags: #basic

 -x, --max-connection-per-server=NUM The maximum number of connections to one
                              server for each download.

                              Possible Values: 1-16
                              Default: 1
                              Tags: #basic, #http, #ftp

 -k, --min-split-size=SIZE    aria2 does not split less than 2*SIZE byte range.
                              For example, let's consider downloading 20MiB
                              file. If SIZE is 10M, aria2 can split file into 2
                              range [0-10MiB) and [10MiB-20MiB) and download it
                              using 2 sources(if --split >= 2, of course).
                              If SIZE is 15M, since 2*15M > 20MiB, aria2 does
                              not split file and download it using 1 source.
                              You can append K or M(1K = 1024, 1M = 1024K).

                              Possible Values: 1048576-1073741824
                              Default: 20M
                              Tags: #basic, #http, #ftp

 --ftp-user=USER              Set FTP user. This affects all URLs.

                              Tags: #basic, #ftp

 --ftp-passwd=PASSWD          Set FTP password. This affects all URLs.

                              Tags: #basic, #ftp

 --http-user=USER             Set HTTP user. This affects all URLs.

                              Tags: #basic, #http

 --http-passwd=PASSWD         Set HTTP password. This affects all URLs.

                              Tags: #basic, #http

 --load-cookies=FILE          Load Cookies from FILE using the Firefox3 format
                              and Mozilla/Firefox(1.x/2.x)/Netscape format.

                              Possible Values: /path/to/file
                              Tags: #basic, #http, #cookie

 -S, --show-files[=true|false] Print file listing of .torrent, .meta4 and
                              .metalink file and exit. More detailed
                              information will be listed in case of torrent
                              file.

                              Possible Values: true, false
                              Default: false
                              Tags: #basic, #metalink, #bittorrent

 --max-overall-upload-limit=SPEED Set max overall upload speed in bytes/sec.
                              0 means unrestricted.
                              You can append K or M(1K = 1024, 1M = 1024K).
                              To limit the upload speed per torrent, use
                              --max-upload-limit option.

                              Possible Values: 0-*
                              Default: 0
                              Tags: #basic, #bittorrent

 -u, --max-upload-limit=SPEED Set max upload speed per each torrent in
                              bytes/sec. 0 means unrestricted.
                              You can append K or M(1K = 1024, 1M = 1024K).
                              To limit the overall upload speed, use
                              --max-overall-upload-limit option.

                              Possible Values: 0-*
                              Default: 0
                              Tags: #basic, #bittorrent

 -T, --torrent-file=TORRENT_FILE  The path to the .torrent file.

                              Possible Values: /path/to/file
                              Tags: #basic, #bittorrent

 --listen-port=PORT...        Set TCP port number for BitTorrent downloads.
                              Multiple ports can be specified by using ',',
                              for example: "6881,6885". You can also use '-'
                              to specify a range: "6881-6999". ',' and '-' can
                              be used together.

                              Possible Values: 1024-65535
                              Default: 6881-6999
                              Tags: #basic, #bittorrent

 --enable-dht[=true|false]    Enable IPv4 DHT functionality. It also enables
                              UDP tracker support. If a private flag is set
                              in a torrent, aria2 doesn't use DHT for that
                              download even if ``true`` is given.

                              Possible Values: true, false
                              Default: true
                              Tags: #basic, #bittorrent

 --dht-listen-port=PORT...    Set UDP listening port used by DHT(IPv4, IPv6)
                              and UDP tracker. Multiple ports can be specified
                              by using ',', for example: "6881,6885". You can
                              also use '-' to specify a range: "6881-6999".
                              ',' and '-' can be used together.

                              Possible Values: 1024-65535
                              Default: 6881-6999
                              Tags: #basic, #bittorrent

 --enable-dht6[=true|false]   Enable IPv6 DHT functionality.
                              Use --dht-listen-port option to specify port
                              number to listen on. See also --dht-listen-addr6
                              option.

                              Possible Values: true, false
                              Default: false
                              Tags: #basic, #bittorrent

 --dht-listen-addr6=ADDR      Specify address to bind socket for IPv6 DHT.
                              It should be a global unicast IPv6 address of the
                              host.

                              Tags: #basic, #bittorrent

 -M, --metalink-file=METALINK_FILE The file path to the .meta4 and .metalink
                              file. Reads input from stdin when '-' is
                              specified.

                              Possible Values: /path/to/file, -
                              Tags: #basic, #metalink

URI, MAGNET, TORRENT_FILE, METALINK_FILE:
 You can specify multiple HTTP(S)/FTP URIs. Unless you specify -Z option, all
 URIs must point to the same file or downloading will fail.
 You can also specify arbitrary number of BitTorrent Magnet URIs, torrent/
 metalink files stored in a local drive. Please note that they are always
 treated as a separate download.

 You can specify both torrent file with -T option and URIs. By doing this,
 download a file from both torrent swarm and HTTP/FTP server at the same time,
 while the data from HTTP/FTP are uploaded to the torrent swarm. For single file
 torrents, URI can be a complete URI pointing to the resource or if URI ends
 with '/', 'name' in torrent file is added. For multi-file torrents, 'name' and
 'path' in torrent are added to form a URI for each file.

 Make sure that URI is quoted with single(') or double(") quotation if it
 contains "&" or any characters that have special meaning in shell.

About the number of connections
 Since 1.10.0 release, aria2 uses 1 connection per host by default and has 20MiB
 segment size restriction. So whatever value you specify using -s option, it
 uses 1 connection per host. To make it behave like 1.9.x, use
 --max-connection-per-server=4 --min-split-size=1M.

Refer to man page for more information.