Description
New feature
Hi, thank you for developing Nextflow!
I build many of my workflow steps with Nextflow's built-in operators. One feature that would help is a way to keep track of which file was the input to an operator such as splitCsv or countLines.
Usage scenario
More precisely, the metadata I have in mind is mostly the source file. Keeping it would make it possible to compute something on each file of a channel while preserving provenance, so that the results can easily be merged back.
For example, suppose we need the number of lines in each file (for instance, to use as the size in groupKey). Each item of the desired output channel would then look like [ source_file_path, number_of_lines ].
For splitCsv outputs, having the source file automatically added means we can process lines by source file, and/or use collectFile to gather back lines into files corresponding to their original source.
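To make the splitCsv case concrete, here is a rough sketch of the kind of pipeline I mean, written against what exists today. It assumes that splitCsv can be called directly on a file object (as countLines and splitText can) and that the inputs are tab-separated; the by_source_ output prefix is just a made-up name, and I have not run this exactly as written.

```groovy
// Sketch only: pair every row with the name of the file it came from,
// then write the rows back into one output file per source file.
// Assumption: splitCsv is available as a method on file objects.
Channel.fromPath('{a,b,c}.tsv')
    | flatMap { f -> f.splitCsv(sep: '\t').collect { row -> [f.name, row] } }
    | collectFile { item -> ["by_source_${item[0]}", item[1].join('\t') + '\n'] }
    | view
```

With a source-file option on the splitCsv operator itself, the flatMap step would no longer be needed.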
Suggested implementation
The most basic idea would be to add a specific boolean option, such as withSourceFile:
Channel.fromPath('{a,b,c}.txt')
    | countLines(withSourceFile: true)
    | view

which would print:

[a.txt, 5]
[b.txt, 12]
[c.txt, 23]

and similarly for e.g. splitCsv, splitText and others.
Alternatively, one rather transparent but more flexible implementation would be to allow all tuple elements to be passed along, instead of discarding them, so that we can do:
Channel.fromPath('{a,b,c}.txt')
    | map { [it.name, it] }
    | countLines(elem: 1, passAllElems: true)
    | view

which would print:

["a.txt", 5]
["b.txt", 12]
["c.txt", 23]

Current workaround
Currently, to count lines and keep the source file name, I have to either:
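- read the file with Groovy code inside a map operator:

  // Count the lines of a file. The explicit return makes the function
  // yield n rather than whatever eachLine itself returns.
  int countlines(fpath) {
      int n = 0
      fpath.eachLine { n++ }
      return n
  }

  Channel.fromPath('{a,b,c}.txt')
      | map { [ it.name, countlines(it) ] }

- define a process doing the same.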
For split* operators such as splitCsv, the best approach seems to be to first use a process that inserts the file name into the file itself, for example:
process insert_sourcefilename {
    input:
    path(tsv)

    output:
    path('new.tsv')

    """
    sed 's/^/${tsv.name}\\t/' $tsv > new.tsv
    """
}

I hope that's a reasonable suggestion... I am happy to explain the use cases in more detail.