VFDB contains few genes that are not part of any cluster #331

@PovilasMat

Hi,

ARIBA ran into a strange error while running against the VirulenceFinder database:
[E::hts_idx_push] Unsorted positions on sequence # 1: 109 followed by 11
OSError: building of index for /scratch/shadow/tmpr7wt7j_c/ariba_virulencefinder/ariba_virulencefinder/read_store.gz failed

I figured out that this happens because read_store.gz is sorted incorrectly when one of the genes doesn't have cluster information. I changed read_store.py to sort correctly even when cluster information is missing, but then it failed in a later step:
_init_and_run_clusters    reference_names=self.cluster_ids[cluster_name],
KeyError: ''

Obviously, because cluster name was missing. :)
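
For illustration, here is a minimal sketch of a guard that would turn the bare KeyError into a clearer message. It assumes the cluster map behaves like a plain dict keyed by cluster name, as the traceback suggests. `reference_names_for` is a hypothetical helper, not part of ARIBA, and the cluster and gene names are made up.

```python
def reference_names_for(cluster_ids, cluster_name):
    """Look up the reference names of a cluster, failing with a descriptive error.

    Hypothetical helper: cluster_ids is assumed to be a dict that maps
    cluster name -> collection of reference sequence names.
    """
    if not cluster_name:
        # An empty name means the read was stored without cluster information.
        raise ValueError(
            "read has no cluster name; its reference sequence "
            "is probably not part of any cluster"
        )
    if cluster_name not in cluster_ids:
        raise KeyError(f"unknown cluster name: {cluster_name!r}")
    return cluster_ids[cluster_name]


# Made-up example data, for illustration only.
cluster_ids = {"cluster_1": ["geneA", "geneB"]}

print(reference_names_for(cluster_ids, "cluster_1"))

try:
    reference_names_for(cluster_ids, "")
except ValueError as error:
    print(f"caught: {error}")
```

This does not fix the underlying problem (the gene still has no cluster), it only makes the cause visible at the point of failure.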

Then I started digging around and made this small test:

mkdir vftest
cd vftest
ariba getref virulencefinder out.virulencefinder
ariba prepareref -f out.virulencefinder.fa -m out.virulencefinder.tsv ./test
cd test
cat 02.cdhit.clusters.tsv | awk '{$1="";print}' | tr " " "\n" | sort | uniq > cluster_file
grep ">" 02.cdhit.all.fa | sed 's/>//g' | sort > all_file
wc -l all_file
wc -l cluster_file
diff cluster_file all_file

Output of the last three commands:

5558 all_file
5554 cluster_file //cluster file contains one empty line at the beginning
1d0 //this is the empty line
< //this is the empty line
718a718
> csnA_4_KJ922517
973a974
> eltIIAB_c8_1_AASRQF010000005
4943a4945
> stx2_122_CP022279_122
5082a5085
> stx2b_O128_24196_97_95_AJ567995_95
5157a5161
> stx2h_O102_STEC299_122_CP022279_122

So the issue is that one or more of those 5 genes (in my case stx2h_O102_STEC299_122_CP022279_122) can be found in my sequencing reads but are not part of any cluster. When read_store is built, the entries for these genes carry no cluster name, which makes the script fail.
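
The same check as the shell test above can be sketched in Python. This is only a sketch under assumptions: it treats 02.cdhit.clusters.tsv as whitespace-separated with the cluster name in the first column (which is what the awk command above relies on), and it takes the full text after ">" as the sequence name. The function names are my own, not ARIBA's, and the demo data is made up.

```python
def clustered_names(clusters_tsv_lines):
    """Collect every sequence name listed after the cluster-name column."""
    names = set()
    for line in clusters_tsv_lines:
        fields = line.split()
        names.update(fields[1:])  # fields[0] is assumed to be the cluster name
    return names


def fasta_names(fasta_lines):
    """Collect sequence names from FASTA header lines (text after '>')."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].strip()
            if name:
                names.add(name)
    return names


def unclustered(clusters_tsv_lines, fasta_lines):
    """Return sorted sequence names present in the FASTA but in no cluster."""
    return sorted(fasta_names(fasta_lines) - clustered_names(clusters_tsv_lines))


# In-memory demo with made-up names; real input would be the lines of
# 02.cdhit.clusters.tsv and 02.cdhit.all.fa.
demo_clusters = ["cluster_1\tgeneA\tgeneB\n"]
demo_fasta = [">geneA\n", "ACGT\n", ">geneB\n", "ACGT\n", ">geneC\n", "ACGT\n"]
print(unclustered(demo_clusters, demo_fasta))
```

Both functions accept any iterable of lines, so open file handles for the two files can be passed in directly.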

ariba version
ARIBA version: 2.14.6
External dependencies:
bowtie2 2.2.5 /srv/data/tools/anaconda3/envs/env_cge_update/bin/bowtie2
cdhit 4.8.1 /srv/data/tools/anaconda3/envs/env_cge_update/bin/cd-hit-est
nucmer 3.1 /srv/data/tools/anaconda3/envs/env_cge_update/bin/nucmer
spades 3.15.5 /srv/data/tools/anaconda3/envs/env_cge_update/bin/spades.py
External dependencies OK: True
Python version:
3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:45:29)
[GCC 10.4.0]
Python packages:
ariba 2.14.6 /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/ariba/__init__.py
bs4 4.11.1 /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/bs4/__init__.py
dendropy 4.5.2 /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/dendropy/__init__.py
pyfastaq 3.17.0 /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/pyfastaq/__init__.py
pymummer 0.11.0 /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/pymummer/__init__.py
pysam 0.18.0 /srv/data/tools/anaconda3/envs/env_cge_update/lib/python3.9/site-packages/pysam/__init__.py
Python packages OK: True
Everything looks OK: True
