Include metadata in the output of file processing operators

## New feature

Hi, thank you for developing Nextflow!

I build many of my workflow steps with Nextflow's built-in operators. It would be helpful to be able to keep track of which file was the input to an operator such as `splitCsv` or `countLines`.

## Usage scenario

More precisely, by metadata I mostly mean the _source file_. Keeping it would make it possible to compute something on each file of a channel while preserving provenance, so that the results can easily be merged back.

For example, let's say we need to know the number of lines _of_ each file (for use as size in `groupKey` for example):

Each item of the desired output channel should be like `[ source_file_path, number_of_lines ]`.
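To illustrate the `groupKey` use case, here is a minimal sketch that uses only existing features. It assumes the files are small enough to read into memory with `readLines()`, and the per-line processing step is left as a placeholder:

```groovy
workflow {
    Channel.fromPath('{a,b,c}.txt')
        .flatMap { f ->
            // assumption: each file fits in memory
            def lines = f.readLines()
            // groupKey(key, size) tells groupTuple how many items to expect for this key
            def key = groupKey(f.name, lines.size())
            lines.collect { line -> [key, line] }
        }
        // per-line processing would go here
        .groupTuple()
        .view()
}
```

With the line count attached to the key, `groupTuple` can emit each group as soon as all of its lines have arrived, instead of waiting for the whole channel to complete.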

For `splitCsv` outputs, having the source file automatically added means we can process lines _by source file_, and/or use `collectFile` to gather back lines into files corresponding to their original source.
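As a rough sketch of the item shape I have in mind, something similar can be approximated today by calling the method form of `splitCsv` on each file inside `flatMap`. This assumes each file is small enough to be split in one go, and the way rows are re-joined with commas is only illustrative:

```groovy
workflow {
    Channel.fromPath('{a,b,c}.csv')
        // method form of splitCsv: returns all rows of one file as a list
        .flatMap { f -> f.splitCsv().collect { row -> [f.name, row] } }
        // each item is now [source_file_name, row];
        // gather rows back into one file per original source
        .collectFile { item -> [item[0], item[1].join(',') + '\n'] }
        .view()
}
```

This loses the streaming behaviour of the `splitCsv` operator, which is part of why built-in support would be nicer.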

## Suggested implementation 

The most basic idea would be to add a specific boolean option, such as `withSourceFile`:

```groovy
Channel.fromPath('{a,b,c}.txt')
| countLines(withSourceFile: true)
| view
```

```stdout
[a.txt, 5]
[b.txt, 12]
[c.txt, 23]
```

and similarly for e.g. `splitCsv`, `splitText` and others.

Alternatively, one rather transparent but more flexible implementation would be to allow all tuple elements to be passed along, instead of discarding them, so that we can do:

```groovy
Channel.fromPath('{a,b,c}.txt')
| map { [it.name, it] }
| countLines(elem: 1, passAllElems: true)
| view
```

```stdout
["a.txt", 5]
["b.txt", 12]
["c.txt", 23]
```

## Current workaround

Currently, to count lines and keep the source file name, I have to either:

* read the file with groovy code inside a `map` operator:

    ```groovy
    int countlines(fpath) {
        int n = 0
        fpath.eachLine { n++ }
        // return the counter explicitly; otherwise the result of eachLine is returned
        return n
    }
    
    Channel.fromPath('{a,b,c}.txt')
    | map { [ it.name, countlines(it) ] }
    ```

* define a process doing the same.

For `split*` operators such as `splitCsv`, the best option seems to be to first use a process that inserts the file name into the file itself, for example:

```groovy
process insert_sourcefilename {
    input:
    path(tsv)

    output:
    path('new.tsv')

    """
    sed 's/^/${tsv.name}\\t/' $tsv > new.tsv
    """
}
```

I hope this is a reasonable suggestion. I am happy to explain the use cases in more detail.



