10 changes: 10 additions & 0 deletions data_managers/data_manager_deacon/.shed.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
categories:
- Data Managers
- Metagenomics
homepage_url: https://github.com/bede/deacon
description: Data manager for Deacon index files
long_description: Data manager for Deacon index files
name: deacon_build_database
owner: iuc
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/main/data_managers/data_manager_deacon
type: unrestricted
149 changes: 149 additions & 0 deletions data_managers/data_manager_deacon/data_manager/deacon_datamanager.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
<tool id="deacon_build_database" name="Deacon" tool_type="manage_data" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@">
<description>database builder</description>
<macros>
<!-- on update, run a local test with `test` set to something other than "true" -->
<token name="@TOOL_VERSION@">0.12.0</token>
<token name="@VERSION_SUFFIX@">0</token>
<token name="@PROFILE@">24.1</token>
</macros>
<requirements>
<requirement type="package" version="@TOOL_VERSION@">deacon</requirement>
</requirements>
<command detect_errors="exit_code"><![CDATA[
mkdir -p '$out_file.extra_files_path' &&
#if $test != "true"
#if $input.is_select == "prebuild"
#if $download == "human"
wget -P '$out_file.extra_files_path' 'https://zenodo.org/records/17288185/files/panhuman-1.k31w15.idx' &&
#else
wget -P '$out_file.extra_files_path' 'https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o/deacon/3/panmouse-1.k31w15.idx' &&
Contributor

Do they provide md5 hashes for checking? I'm worried about data corruption.

Also, it is strange that they do not gzip the files.

Contributor Author

I don't know, but I can ask upstream whether they have checksums or would be willing to upload the file to Zenodo. And no, they don't use gz since the index files are small: the human index file is only around 4 GB.

Contributor

Asking does not hurt.

@bede, Nov 21, 2025

Hi there, Deacon author here.

  • Index compression. Deacon indexes are dense and high entropy, compressing very poorly with old fashioned compression approaches like gzip. It is not worthwhile.
  • Checksums. I do not provide checksums at this time but can consider implementing them if valuable for Galaxy integration. I wouldn't want to bake checksums into the codebase, but could put them in a remote json file in the S3 bucket to which I have write access, hosted by ModMedMicro at the University of Oxford. As you are aware, currently the panhuman-1 index is deposited on Zenodo, but panmouse-1 is not. I am happy to put panmouse-1 onto Zenodo in a manner consistent with panhuman-1 if desired. Please note that Zenodo downloads are much slower (at least in the UK) than the S3 bucket downloads, which is why Deacon defaults to using object storage with deacon index fetch <name>, added in 0.13.0

Contributor

Checksums

Maybe just .md5 files? The intention is only to verify the integrity of the download.

Contributor

Zenodo

The advantage for us is: guaranteed stable URLs.

@bede, Nov 21, 2025

checksums

If we care about Zenodo: to add md5 files I would need to make new versions of those records. They seem to recommend using the Zenodo REST API, which provides checksums for all files like so:

curl -s "https://zenodo.org/api/records/17288185" | jq -r '.files[] | select(.key=="panhuman-1.k31w15.idx") | .checksum'

The advantage for us is: guaranteed stable URLs.

Makes sense, I will put panmouse-1 (and all public indexes) on Zenodo

Contributor Author

Which other indexes are there? Since we are starting with the newest version of the tool, public index files with format version 1 or 2 cannot be used here. If you are only doing this for Galaxy, you can save yourself some work, since we only want the format version 3 index files :)

Contributor

checksums

If we care about Zenodo: to add md5 files I would need to make new versions of those records. They seem to recommend using the Zenodo REST API, which provides checksums for all files like so:

curl -s "https://zenodo.org/api/records/17288185" | jq -r '.files[] | select(.key=="panhuman-1.k31w15.idx") | .checksum'

Cool trick.
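
A minimal sketch of the integrity check discussed in this thread, assuming the checksum string returned by the Zenodo API has the form `md5:<hex digest>` (an assumption worth confirming against the output of the curl call above). The helper names `parse_checksum`, `file_digest`, and `verify_download` are hypothetical and not part of Deacon or Galaxy.

```python
import hashlib
from pathlib import Path


def parse_checksum(field: str) -> tuple[str, str]:
    """Split a checksum field such as 'md5:<hex>' into (algorithm, hex digest)."""
    algorithm, _, digest = field.partition(":")
    if not digest:
        raise ValueError(f"unexpected checksum field: {field!r}")
    return algorithm.lower(), digest.lower()


def file_digest(path: Path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-GB index is never held in memory at once."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_download(path: Path, checksum_field: str) -> bool:
    """Return True when the file on disk matches the published checksum."""
    algorithm, expected = parse_checksum(checksum_field)
    return file_digest(path, algorithm) == expected
```

In a data manager this kind of check would run right after the download and fail the job on a mismatch, which addresses the data corruption concern without requiring separate .md5 files.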

#end if
#else
wget -P '$out_file.extra_files_path' '$link' &&
#end if
#else
touch '$out_file.extra_files_path'/test.idx &&
#end if
cp '$dmjson' '$out_file'
]]></command>
<configfiles>
<configfile name="dmjson"><![CDATA[
#from datetime import datetime
#set time=datetime.now().strftime("%Y-%m-%d")
{
"data_tables":{
"deacon":[
{
#if $input.is_select == "prebuild"
#if $download == "human"
"path":"panhuman-1.k31w15.idx",
#else
"path":"panmouse-1.k31w15.idx",
#end if
#else
#set $index_name = str($link).rstrip('/').split('/')[-1]
"path":"$index_name",
#end if
#if $input.is_select == "prebuild"
"dbkey":"@TOOL_VERSION@",
Contributor

Why abuse dbkey to store tool version?

Do you know which genome versions have been used to generate these human / mouse genomes? Then we could set the dbkey correctly.

They seem to include GRCh38.p14 and GRCm39, respectively, but if I understand correctly other data is included as well. So maybe the dbkey is not useful.

Contributor

Why not also implement deacon index build to build indices for data from all_fasta

Contributor Author

Why abuse dbkey to store tool version?

Do you know which genome versions have been used to generate these human / mouse genomes? Then we could set the dbkey correctly.

They seem to include GRCh38.p14 and GRCm39, respectively, but if I understand correctly other data is included as well. So maybe the dbkey is not useful.

Multiple data sources were used for both prebuild index files, so using a genome build as dbkey is not a good fit in my eyes.

Why not also implement deacon index build to build indices for data from all_fasta

I can implement this as well, yes. I hadn't thought about it, since so far I only had downloading an index file from a URL in mind. Good idea!

Contributor

Still: Why abuse dbkey to store tool version?

Contributor Author

I have seen some tools do this, so I thought I would do the same, but I can change it for the prebuild files?

Contributor

Storing the tool version is a really good idea. It just should be stored in a version column.

Either we use the dbkey column to store the information linking to the dbkeys provided by Galaxy, i.e. genome builds, or we remove the column (or leave it empty).

Contributor Author

I will leave them empty then; I think this is easier. But if you prefer, I could change them to something informative.
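
As an illustration of the column layout this thread converges on (tool version in `version`, `dbkey` left empty), here is a rough sketch that builds the data manager JSON with `json.dumps` instead of string templating. The function names `build_entry` and `data_manager_json` are hypothetical; the column names and the `pre-build`/`custom` value prefixes are taken from the diff, and the actual tool generates this JSON from a Cheetah configfile rather than Python.

```python
import json
from datetime import date
from typing import Optional


def build_entry(name: str, path: str, tool_version: str, format_version: str,
                note: str, prebuilt: bool, today: Optional[date] = None) -> dict:
    """Assemble one row for the `deacon` data table."""
    stamp = (today or date.today()).isoformat()
    prefix = "pre-build" if prebuilt else "custom"
    return {
        "value": f"{prefix} {stamp}",
        "dbkey": "",  # left empty: the indexes mix several data sources
        "name": name,
        "version": tool_version,  # tool version lives in its own column
        "path": path,
        "format_version": format_version,
        "note": note,
    }


def data_manager_json(entry: dict) -> str:
    """Serialize the entry in the structure Galaxy expects from a data manager."""
    return json.dumps({"data_tables": {"deacon": [entry]}}, indent=2)
```

One side effect of serializing with `json.dumps` is that quotes or backslashes in a user-supplied name or note are escaped, whereas plain string interpolation could produce invalid JSON for such input.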

#else
"dbkey":"custom $time",
#end if
#if $input.is_select == "prebuild"
#if $download == "human"
"name":"panhuman-1 (k=31, w=15)",
#else
"name":"panmouse-1 (k=31, w=15, e=0.5)",
#end if
#else
"name":"$name",
#end if
#if $input.is_select == "prebuild"
"version":"@TOOL_VERSION@",
#else
"version":"$version",
#end if
#if $input.is_select == "prebuild"
"value":"pre-build $time",
#else
"value":"custom $time",
#end if
#if $input.is_select == "prebuild"
"format_version":"3",
#else
"format_version":"$format_version",
#end if
#if $input.is_select == "prebuild"
"note":"Pre-build index files from the devs of deacon"
#else
"note":"$note"
#end if
}
]
}
}]]>
</configfile>
</configfiles>
<inputs>
<conditional name="input">
<param name="is_select" type="select" label="Choose how to add data to the DM">
<option value="url">Copy an index file from a URL</option>
Contributor

What is the intention of the url option?

Contributor Author

So that admins can add other index files, created by other sources, and make them available to all the other tools and workflows.

Contributor

The advantage of not having this option is that one will need to update the DM and the result of this extra effort can be used by all admins.

Contributor Author

What do you mean?

Contributor

My hope would be that in case a new reference becomes available then people update the DM. Then also admins of other Galaxy servers can download the new reference using the DM.

Otherwise I'm afraid that admins will be "lazy" and just copy paste the new URL here.

Up to you, we can also leave it if you prefer to keep it.

Contributor Author

Ah okay, now I see what you mean. To be fair, I hadn't thought about this. The idea behind the URL option was to add index files from other sources, such as papers, to the DM. That is why I added the note column: for example, the release date can be written there so admins can check whether this data is already in the DM.
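
For the URL option, the `path` column has to hold the name of the downloaded file, so the file name must be derived from the URL. A minimal sketch of one way to do that with the standard library follows; `index_filename` is a hypothetical helper, not something the data manager currently contains.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def index_filename(url: str) -> str:
    """Return the last path component of a URL, ignoring query and fragment."""
    name = PurePosixPath(urlsplit(url.strip()).path).name
    if not name:
        raise ValueError(f"URL has no file name component: {url!r}")
    return name


print(index_filename("https://zenodo.org/records/17288185/files/panhuman-1.k31w15.idx"))
```

If a URL carries a query string, the name chosen by `wget -P` may differ from this result, so passing the derived name explicitly (for example via `wget -O`) would likely be needed to keep the download and the table entry in sync.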

<option value="prebuild" selected="true">Download a pre-build file</option>
@bede, Nov 21, 2025

I'd suggest referring to "prebuilt" indexes throughout rather than "prebuild" or "pre-build"

Contributor Author (@SantaMcCloud), Nov 21, 2025

I will change this then, no worries!

</param>
<when value="prebuild">
<param name="download" type="select" label="Select which pre-build should be downloaded" help="See help section for more information">
<option value="human">panhuman-1 (k=31, w=15)</option>
<option value="mouse">panmouse-1 (k=31, w=15, e=0.5)</option>
</param>
</when>
<when value="url">
<param name="link" type="text" label="URL of the index file to download"/>
<param name="name" type="text" label="Set the name for the entry in the DM" help="For an example, see the names of the pre-build files. Also add, in brackets, the values used for building the file!"/>
<param name="version" type="text" label="Deacon version used to build the index file" help="State the tool version used to build the index file"/>
<param name="format_version" type="text" label="Set the index format version" help="The current index format version for Deacon v.@TOOL_VERSION@ is 3"/>
<param name="note" type="text" label="Add a note" help="Free-text notes, for example where the data comes from and who created it"/>
</when>
</conditional>
<param name="test" type="hidden"/>
</inputs>
<outputs>
<data name="out_file" format="data_manager_json" />
</outputs>
<tests>
<test expect_num_outputs="1">
<conditional name="input">
<param name="is_select" value="prebuild"/>
<param name="download" value="human"/>
</conditional>
<param name="test" value="true"/>
<output name="out_file">
<assert_contents>
<has_text text='"format_version":"3"'/>
<has_text text='"name":"panhuman-1 (k=31, w=15)"'/>
</assert_contents>
</output>
</test>
<test expect_num_outputs="1">
<conditional name="input">
<param name="is_select" value="url"/>
<param name="link" value="https://zenodo.org/records/17288185/files/panhuman-1.k31w15.idx"/>
<param name="name" value="panhuman-1 (k=31, w=15)"/>
<param name="version" value="0.12.0"/>
<param name="format_version" value="3"/>
<param name="note" value="test"/>
</conditional>
<param name="test" value="true"/>
<output name="out_file">
<assert_contents>
<has_text text='"format_version":"3"'/>
<has_text text='"name":"panhuman-1 (k=31, w=15)"'/>
</assert_contents>
</output>
</test>
</tests>
<help><![CDATA[
Download pre-build index files for Deacon, or download other Deacon index files from a URL.
]]></help>
<citations>
<citation type="doi">10.1101/2025.06.09.658732</citation>
</citations>
</tool>
22 changes: 22 additions & 0 deletions data_managers/data_manager_deacon/data_manager_conf.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
<data_managers>
<data_manager tool_file="data_manager/deacon_datamanager.xml" id="deacon_build_database">
<data_table name="deacon">
<output>
<column name="value"/>
<column name="dbkey"/>
<column name="name"/>
<column name="version"/>
<column name="path" output_ref="out_file">
<move type="file">
<source>${path}</source>
<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">deacon/${value}/${path}</target>
</move>
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/deacon/${value}/${path}</value_translation>
<value_translation type="function">abspath</value_translation>
</column>
<column name="format_version"/>
<column name="note"/>
</output>
</data_table>
</data_manager>
</data_managers>
4 changes: 4 additions & 0 deletions data_managers/data_manager_deacon/test-data/deacon.loc
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@

db_download_xxxx-xx-xx 0.12.0 deacon human index db 0.12.0 /tmp/tmpf_hplx2a/galaxy-dev/tool-data/deacon/0.12.0/test.idx 3 Testing
pre-build 2025-11-17 0.12.0 panhuman-1 (k=31, w=15) 0.12.0 /home/sf373/sf373/galaxy/tool-data/deacon/pre-build 2025-11-17/panhuman-1.k31w15.idx 3 Pre-build index files from the devs of deacon
custom 2025-11-17 custom 2025-11-17 panhuman-1 (k=31, w=15) 0.12.0 /home/sf373/sf373/galaxy/tool-data/deacon/custom 2025-11-17/https:/zenodo.org/records/17288185/files/panhuman-1.k31w15.idx[-1] 3 test
9 changes: 9 additions & 0 deletions data_managers/data_manager_deacon/tool-data/deacon.loc.sample
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
#This is a sample file distributed with Galaxy that enables tools
#to use the deacon database.
#
#<unique_build_id> <dbkey> <display_name> <version> <file_base_path> <index_format_version> <note_like_who_did_create_the_db>

#The <version> column indicates the deacon version that generated the database

#
#deacon_db 0.12.0 Deacon_database 0.12.0 /mnt/galaxyIndices/deacon_database/test.idx 3 just for the test
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
<tables>
<table name="deacon" comment_char="#" allow_duplicate_entries="False">
<columns>value, dbkey, name, version, path, format_version, note</columns>
<file path="tool-data/deacon.loc" />
</table>
</tables>
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
<tables>
<!-- Location of deacon indexes for testing -->
<table name="deacon" comment_char="#" allow_duplicate_entries="False">
<columns>value, dbkey, name, version, path, format_version, note</columns>
<file path="${__HERE__}/test-data/deacon.loc" />
</table>
</tables>
13 changes: 13 additions & 0 deletions tools/deacon/.shed.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
name: deacon
owner: iuc
description: filters DNA sequences in FASTA/Q files and streams using accelerated minimizer comparison
homepage_url: https://github.com/bede/deacon
long_description: |
Filter sequences using accelerated minimizer comparison with query sequence(s),
emitting either matching sequences (search mode), or sequences without matches (deplete mode).
Sequences match when they share enough distinct minimizers with the indexed query to exceed chosen
absolute and relative thresholds.
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/main/tools/deacon
type: unrestricted
categories:
- Metagenomics