I recently had occasion to use the nt database on my local PC, so here is a memo about the tools that saved me time along the way.
When inferring the taxonomy of DNA sequences obtained from second-generation sequencers, capillary sequencers, long-read sequencers, and so on, a homology search with BLASTn is one of the standard approaches.
BLASTn comes in two flavors: web BLAST, which anyone can run in a browser, and local BLAST (BLAST+), which you run in an analysis environment on your own PC.
Web BLAST has the advantage of using NCBI's sequence databases directly, but my impression is that it is not well suited to customization or automation. It also has the drawback that when it stops for reasons outside your control, such as maintenance or server outages, all you can do is wait.
Local BLAST is not as convenient as web BLAST, but the barrier to entry is relatively low, and it is a homology search method well suited to batch processing and automation.
To run a BLAST search on nucleotide sequences (BLASTn), you need the following two things:
- (1) the DNA sequences whose taxonomy you want to infer
- (2) a DNA sequence database (BLAST DB)
If you are going to run BLAST, you presumably already have the sequences in (1) at hand, so what remains is installing BLAST+ and preparing the database.
My local machines are a Linux PC and a Windows PC (WSL), so I have not confirmed whether this works the same way on macOS.
- Linux PC
  - OS: Ubuntu 20.04 LTS
- Windows PC
  - OS: Windows 10
  - WSL (with Ubuntu 20.04 LTS installed; not sure this is the right way to phrase it)
Installing BLAST+
Even on Windows, as long as a WSL environment is set up, you can install it just as on Linux, with no need for a .exe.
Install as follows:
sudo apt update -y
sudo apt install -y ncbi-blast+
Check:
blastn -h
Many readers will likely have this installed already.
Installing pigz
pigz is a tool that compresses and decompresses .gz files using multiple cores. For .xz and .bz2 there are pxz/pixz and pbzip2, respectively.
# Install via apt
sudo apt install -y pigz
# conda-based install (if needed)
mamba install -y pigz
Check:
pigz -h
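A minimal usage sketch (sample.fasta is a hypothetical file; adjust -p to your core count):
# Compress with 8 threads, keeping the original file
pigz -p 8 -k sample.fasta
# Decompress, again keeping the .gz
pigz -d -k sample.fasta.gz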
Installing aria2
Unlike curl or wget, aria2 supports parallel downloads over multiple connections, which makes it powerful when you want to download large files quickly.
sudo apt install aria2
Check:
aria2c -h
That completes the installation of the required tools.
Checking the NCBI v5 BLAST DB
NCBI's BLAST DBs come in several categories. If your target taxa are already decided, it is more efficient to choose a database that matches them.
The v5 FTP site is here: ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/
The largest of the nucleotide databases appears to be nt. (The number of parts and the total size change over time.)
Here I proceed under the assumption, current at the time (March 2022), that nt was split into 59 files numbered 00 through 58.
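Because the split count changes over time, it may be worth listing the current parts before downloading. A quick sketch, assuming curl is available (-l requests a name-only FTP listing):
curl -l ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/ | grep '^nt\.' | sort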
Downloading the nt BLAST DB with aria2 and extracting it with tar and pigz
Parallel download with aria2
Raising the number of simultaneous connections speeds downloads up, but it also puts load on the server, so keep the connection count modest.
The following example fetches nt.00:
aria2c --continue=true --max-connection-per-server=4 --split=4 --min-split-size=1M \
ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.00.tar.gz
Repeating this by hand for every file would be tedious, so loop over it with for. I also use -Z to download the .md5 files at the same time (without -Z, multiple URIs are treated as mirrors of a single file).
for i in $(seq -f %02g 0 58); do
aria2c --continue=true --max-connection-per-server=4 --split=4 --min-split-size=1M -Z \
ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.$i.tar.gz \
ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.$i.tar.gz.md5
done
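As an alternative to the shell loop, aria2 can read URLs from a file with -i and fetch several downloads concurrently with -j (urls.txt is a hypothetical file name; both options appear in the aria2 USAGE below):
# Build the URL list: each tarball plus its .md5
for i in $(seq -f %02g 0 58); do
  echo "ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.$i.tar.gz"
  echo "ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/nt.$i.tar.gz.md5"
done > urls.txt
# Download a few files at a time
aria2c --continue=true --max-connection-per-server=4 --min-split-size=1M -j 2 -i urls.txt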
This leaves nt.??.tar.gz and nt.??.tar.gz.md5 in the current directory.
Next, verify the MD5 checksums of the downloaded files.
cat *.md5 > md5sum_v && md5sum -c md5sum_v
If OKs line up as below, everything is fine:
nt.00.tar.gz: OK
nt.01.tar.gz: OK
nt.02.tar.gz: OK
If any file reports FAILED, re-download that file.
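If only a few parts fail, one way to re-fetch just those is the following sketch (md5sum -c prints "name: FAILED" for mismatches, and cut extracts the name):
md5sum -c md5sum_v 2>/dev/null | grep FAILED | cut -d: -f1 | while read f; do
  # Remove the corrupt copy, then fetch it again
  rm -f "$f"
  aria2c --max-connection-per-server=4 --split=4 --min-split-size=1M \
  ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/"$f"
done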
Parallel decompression with tar + pigz
tar can unpack archives (multiple files bundled into one), and pigz speeds up .gz compression and decompression across multiple cores.
Here, pigz is passed to tar via the -I option for extraction.
To keep the original .tar.gz files, add -k to pigz (if needed).
# Create the destination folder
mkdir -p nt-database
# Extract
for s in $(seq -f %02g 0 58); do
  tar -xvf nt.$s.tar.gz -C nt-database/ -I "pigz -d -k"
done
This extracts the BLAST DB into nt-database/. From here, run blastn with your preferred settings and proceed with the taxonomic assignment.
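For reference, a minimal blastn sketch against the extracted DB (query.fasta, result.tsv, and the thread count are placeholders; this assumes the alias file nt.nal was extracted alongside the volumes, so the DB can be referred to as nt-database/nt):
blastn -query query.fasta -db nt-database/nt \
  -outfmt 6 -evalue 1e-10 -max_target_seqs 5 \
  -num_threads 8 -out result.tsv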
Closing thoughts
On a PC with many cores, parallelizing the download, the decompression, and the overall processing can potentially cut waiting time substantially.
xargs and GNU parallel are also useful here; apply them as your use case demands.
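For example, the extraction loop above could be driven by xargs instead; a sketch that runs two tar processes at once (whether this helps depends on disk speed):
# The outer -I{} belongs to xargs; the inner -I 'pigz -d' belongs to tar
printf '%s\n' nt.*.tar.gz | xargs -P 2 -I{} tar -xf {} -C nt-database/ -I 'pigz -d'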
Reference
USAGE of each tool
blastn
blastn -h
USAGE
blastn [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
[-negative_taxidlist filename] [-entrez_query entrez_query]
[-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
[-subject subject_input_file] [-subject_loc range] [-query input_file]
[-out output_file] [-evalue evalue] [-word_size int_value]
[-gapopen open_penalty] [-gapextend extend_penalty]
[-perc_identity float_value] [-qcov_hsp_perc float_value]
[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value] [-penalty penalty]
[-reward reward] [-no_greedy] [-min_raw_gapped_score int_value]
[-template_type type] [-template_length int_value] [-dust DUST_options]
[-filtering_db filtering_database]
[-window_masker_taxid window_masker_taxid]
[-window_masker_db window_masker_db] [-soft_masking soft_masking]
[-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
[-best_hit_score_edge float_value] [-subject_besthit]
[-window_size int_value] [-off_diagonal_range int_value]
[-use_index boolean] [-index_name string] [-lcase_masking]
[-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format]
[-show_gis] [-num_descriptions int_value] [-num_alignments int_value]
[-line_length line_length] [-html] [-sorthits sort_hits]
[-sorthsps sort_hsps] [-max_target_seqs num_sequences]
[-num_threads int_value] [-remote] [-version]
DESCRIPTION
Nucleotide-Nucleotide BLAST 2.9.0+
Use '-help' to print detailed descriptions of command line arguments
pigz
pigz -h
Usage: pigz [options] [files ...]
will compress files in place, adding the suffix '.gz'. If no files are
specified, stdin will be compressed to stdout. pigz does what gzip does,
but spreads the work over multiple processors and cores when compressing.
Options:
-0 to -9, -11 Compression level (level 11, zopfli, is much slower)
--fast, --best Compression levels 1 and 9 respectively
-b, --blocksize mmm Set compression block size to mmmK (default 128K)
-c, --stdout Write all processed output to stdout (won't delete)
-d, --decompress Decompress the compressed input
-f, --force Force overwrite, compress .gz, links, and to terminal
-F --first Do iterations first, before block split for -11
-h, --help Display a help screen and quit
-i, --independent Compress blocks independently for damage recovery
-I, --iterations n Number of iterations for -11 optimization
-J, --maxsplits n Maximum number of split blocks for -11
-k, --keep Do not delete original file after processing
-K, --zip Compress to PKWare zip (.zip) single entry format
-l, --list List the contents of the compressed input
-L, --license Display the pigz license and quit
-m, --no-time Do not store or restore mod time
-M, --time Store or restore mod time
-n, --no-name Do not store or restore file name or mod time
-N, --name Store or restore file name and mod time
-O --oneblock Do not split into smaller blocks for -11
-p, --processes n Allow up to n compression threads (default is the
number of online processors, or 8 if unknown)
-q, --quiet Print no messages, even on error
-r, --recursive Process the contents of all subdirectories
-R, --rsyncable Input-determined block locations for rsync
-S, --suffix .sss Use suffix .sss instead of .gz (for compression)
-t, --test Test the integrity of the compressed input
-v, --verbose Provide more verbose output
-V --version Show the version of pigz
-Y --synchronous Force output file write to permanent storage
-z, --zlib Compress to zlib (.zz) instead of gzip format
-- All arguments after "--" are treated as files
aria2
aria2c -h
Usage: aria2c [OPTIONS] [URI | MAGNET | TORRENT_FILE | METALINK_FILE]...
Printing options tagged with '#basic'.
See 'aria2c -h#help' to know all available tags.
Options:
-v, --version Print the version number and exit.
Tags: #basic
-h, --help[=TAG|KEYWORD] Print usage and exit.
The help messages are classified with tags. A tag
starts with "#". For example, type "--help=#http"
to get the usage for the options tagged with
"#http". If non-tag word is given, print the usage
for the options whose name includes that word.
Possible Values: #basic, #advanced, #http, #https, #ftp, #metalink, #bittorrent, #cookie, #hook, #file, #rpc, #checksum, #experimental, #deprecated, #help, #all
Default: #basic
Tags: #basic, #help
-l, --log=LOG The file name of the log file. If '-' is
specified, log is written to stdout.
Possible Values: /path/to/file, -
Tags: #basic
-d, --dir=DIR The directory to store the downloaded file.
Possible Values: /path/to/directory
Default: /home/naoki/Desktop
Tags: #basic, #file
-o, --out=FILE The file name of the downloaded file. It is
always relative to the directory given in -d
option. When the -Z option is used, this option
will be ignored.
Possible Values: /path/to/file
Tags: #basic, #http, #ftp, #file
-s, --split=N Download a file using N connections. If more
than N URIs are given, first N URIs are used and
remaining URLs are used for backup. If less than
N URIs are given, those URLs are used more than
once so that N connections total are made
simultaneously. The number of connections to the
same host is restricted by the
--max-connection-per-server option. See also the
--min-split-size option.
Possible Values: 1-*
Default: 5
Tags: #basic, #http, #ftp
--file-allocation=METHOD Specify file allocation method.
'none' doesn't pre-allocate file space. 'prealloc'
pre-allocates file space before download begins.
This may take some time depending on the size of
the file.
If you are using newer file systems such as ext4
(with extents support), btrfs, xfs or NTFS
(MinGW build only), 'falloc' is your best
choice. It allocates large(few GiB) files
almost instantly. Don't use 'falloc' with legacy
file systems such as ext3 and FAT32 because it
takes almost same time as 'prealloc' and it
blocks aria2 entirely until allocation finishes.
'falloc' may not be available if your system
doesn't have posix_fallocate() function.
'trunc' uses ftruncate() system call or
platform-specific counterpart to truncate a file
to a specified length.
Possible Values: none, prealloc, trunc, falloc
Default: prealloc
Tags: #basic, #file
-V, --check-integrity[=true|false] Check file integrity by validating piece
hashes or a hash of entire file. This option has
effect only in BitTorrent, Metalink downloads
with checksums or HTTP(S)/FTP downloads with
--checksum option. If piece hashes are provided,
this option can detect damaged portions of a file
and re-download them. If a hash of entire file is
provided, hash check is only done when file has
been already download. This is determined by file
length. If hash check fails, file is
re-downloaded from scratch. If both piece hashes
and a hash of entire file are provided, only
piece hashes are used.
Possible Values: true, false
Default: false
Tags: #basic, #metalink, #bittorrent, #file, #checksum
-c, --continue[=true|false] Continue downloading a partially downloaded
file. Use this option to resume a download
started by a web browser or another program
which downloads files sequentially from the
beginning. Currently this option is only
applicable to http(s)/ftp downloads.
Possible Values: true, false
Default: false
Tags: #basic, #http, #ftp
-i, --input-file=FILE Downloads URIs found in FILE. You can specify
multiple URIs for a single entity: separate
URIs on a single line using the TAB character.
Reads input from stdin when '-' is specified.
Additionally, options can be specified after each
line of URI. This optional line must start with
one or more white spaces and have one option per
single line. See INPUT FILE section of man page
for details. See also --deferred-input option.
Possible Values: /path/to/file, -
Tags: #basic
-j, --max-concurrent-downloads=N Set maximum number of parallel downloads for
every static (HTTP/FTP) URL, torrent and metalink.
See also --split and --optimize-concurrent-downloads options.
Possible Values: 1-*
Default: 5
Tags: #basic
-Z, --force-sequential[=true|false] Fetch URIs in the command-line sequentially
and download each URI in a separate session, like
the usual command-line download utilities.
Possible Values: true, false
Default: false
Tags: #basic
-x, --max-connection-per-server=NUM The maximum number of connections to one
server for each download.
Possible Values: 1-16
Default: 1
Tags: #basic, #http, #ftp
-k, --min-split-size=SIZE aria2 does not split less than 2*SIZE byte range.
For example, let's consider downloading 20MiB
file. If SIZE is 10M, aria2 can split file into 2
range [0-10MiB) and [10MiB-20MiB) and download it
using 2 sources(if --split >= 2, of course).
If SIZE is 15M, since 2*15M > 20MiB, aria2 does
not split file and download it using 1 source.
You can append K or M(1K = 1024, 1M = 1024K).
Possible Values: 1048576-1073741824
Default: 20M
Tags: #basic, #http, #ftp
--ftp-user=USER Set FTP user. This affects all URLs.
Tags: #basic, #ftp
--ftp-passwd=PASSWD Set FTP password. This affects all URLs.
Tags: #basic, #ftp
--http-user=USER Set HTTP user. This affects all URLs.
Tags: #basic, #http
--http-passwd=PASSWD Set HTTP password. This affects all URLs.
Tags: #basic, #http
--load-cookies=FILE Load Cookies from FILE using the Firefox3 format
and Mozilla/Firefox(1.x/2.x)/Netscape format.
Possible Values: /path/to/file
Tags: #basic, #http, #cookie
-S, --show-files[=true|false] Print file listing of .torrent, .meta4 and
.metalink file and exit. More detailed
information will be listed in case of torrent
file.
Possible Values: true, false
Default: false
Tags: #basic, #metalink, #bittorrent
--max-overall-upload-limit=SPEED Set max overall upload speed in bytes/sec.
0 means unrestricted.
You can append K or M(1K = 1024, 1M = 1024K).
To limit the upload speed per torrent, use
--max-upload-limit option.
Possible Values: 0-*
Default: 0
Tags: #basic, #bittorrent
-u, --max-upload-limit=SPEED Set max upload speed per each torrent in
bytes/sec. 0 means unrestricted.
You can append K or M(1K = 1024, 1M = 1024K).
To limit the overall upload speed, use
--max-overall-upload-limit option.
Possible Values: 0-*
Default: 0
Tags: #basic, #bittorrent
-T, --torrent-file=TORRENT_FILE The path to the .torrent file.
Possible Values: /path/to/file
Tags: #basic, #bittorrent
--listen-port=PORT... Set TCP port number for BitTorrent downloads.
Multiple ports can be specified by using ',',
for example: "6881,6885". You can also use '-'
to specify a range: "6881-6999". ',' and '-' can
be used together.
Possible Values: 1024-65535
Default: 6881-6999
Tags: #basic, #bittorrent
--enable-dht[=true|false] Enable IPv4 DHT functionality. It also enables
UDP tracker support. If a private flag is set
in a torrent, aria2 doesn't use DHT for that
download even if ``true`` is given.
Possible Values: true, false
Default: true
Tags: #basic, #bittorrent
--dht-listen-port=PORT... Set UDP listening port used by DHT(IPv4, IPv6)
and UDP tracker. Multiple ports can be specified
by using ',', for example: "6881,6885". You can
also use '-' to specify a range: "6881-6999".
',' and '-' can be used together.
Possible Values: 1024-65535
Default: 6881-6999
Tags: #basic, #bittorrent
--enable-dht6[=true|false] Enable IPv6 DHT functionality.
Use --dht-listen-port option to specify port
number to listen on. See also --dht-listen-addr6
option.
Possible Values: true, false
Default: false
Tags: #basic, #bittorrent
--dht-listen-addr6=ADDR Specify address to bind socket for IPv6 DHT.
It should be a global unicast IPv6 address of the
host.
Tags: #basic, #bittorrent
-M, --metalink-file=METALINK_FILE The file path to the .meta4 and .metalink
file. Reads input from stdin when '-' is
specified.
Possible Values: /path/to/file, -
Tags: #basic, #metalink
URI, MAGNET, TORRENT_FILE, METALINK_FILE:
You can specify multiple HTTP(S)/FTP URIs. Unless you specify -Z option, all
URIs must point to the same file or downloading will fail.
You can also specify arbitrary number of BitTorrent Magnet URIs, torrent/
metalink files stored in a local drive. Please note that they are always
treated as a separate download.
You can specify both torrent file with -T option and URIs. By doing this,
download a file from both torrent swarm and HTTP/FTP server at the same time,
while the data from HTTP/FTP are uploaded to the torrent swarm. For single file
torrents, URI can be a complete URI pointing to the resource or if URI ends
with '/', 'name' in torrent file is added. For multi-file torrents, 'name' and
'path' in torrent are added to form a URI for each file.
Make sure that URI is quoted with single(') or double(") quotation if it
contains "&" or any characters that have special meaning in shell.
About the number of connections
Since 1.10.0 release, aria2 uses 1 connection per host by default and has 20MiB
segment size restriction. So whatever value you specify using -s option, it
uses 1 connection per host. To make it behave like 1.9.x, use
--max-connection-per-server=4 --min-split-size=1M.
Refer to man page for more information.