Transient Errors
Transient errors are errors caused by bad luck. Resolving them is almost always simple: rerun the pipeline exactly as it was when it failed, and it will work, as if by magic.
The real issue is that it is often not obvious from an error message whether the error is transient, and if the underlying cause is a serious system outage, you're going to keep failing until the service comes back up.
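The "rerun it, but don't rerun forever" idea can be sketched in a few lines of Python. This is a minimal illustration, not something myco or Tree Nine ship: `retry_with_backoff`, its parameters, and the default delay are all hypothetical names and values, and `run` stands in for whatever you use to resubmit your workflow.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    run: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call run() until it succeeds or max_attempts is reached.

    All names and defaults here are assumptions made for illustration.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run()
        except Exception:
            if attempt == max_attempts:
                # Still failing after every attempt: either the error is not
                # transient, or the underlying service is still down.
                raise
            # Wait longer after each failure: base, 2 * base, 4 * base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Capping the number of attempts matters here: if the failure comes from an outage, the last exception is raised to you instead of the loop burning cloud credits indefinitely.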
In my experience, Terra in particular is relatively prone to transient errors. This is because a workflow on Terra requires all of these systems to work perfectly:
- Google Cloud Storage (gs://)
- GCP (the VMs your workflow runs on)
- Docker Hub, or whatever other container registry you rely upon for hosting your WDL's Docker images
- Cromwell
If your workflow is imported from Dockstore, launching a workflow additionally queries:
- GitHub
- Dockstore
In my experience, the most common transient failures on Terra are, in order:
- GCP fumbling the VM causing it to exit early (this is not the same as preemption)
- Issues with Terra's workflow scheduler, resulting in some instances of a task "getting stuck"
- Docker Hub rejecting Cromwell's attempts to pull the Docker image specified in the workflow task's runtime section (I think this has something to do with the sheer number of Docker images Terra has to pull from Docker Hub at any given time, since they don't appear to be cached, but I can only speculate)
The clustering script has an option to call the Microreact API to generate Microreact projects, one per cluster. If Microreact is down, then clusters will of course fail to upload.
myco_sra is something of a special case: it calls NCBI SRA to download FASTQs, but has built-in error handling that might allow it to continue even if NCBI SRA is temporarily down (please note this is theoretical, as an NCBI outage has never happened while running myco_sra, at least not to my knowledge). This is because there are dozens of reasons why FASTQs may not download from NCBI SRA, almost all of which are problems with the data itself, so myco_sra's download task treats essentially any failure as "we'll just skip that one" and blithely continues with any accessions that did download successfully.
Tip
For this reason, you shouldn't assume everything that failed to download from NCBI SRA is necessarily a bad/incompatible/corrupt sample. myco_sra concatenates information about every queried sample, pass or fail, in a "download report" file. Consider checking it for samples that are reporting an "unknown error," as this may indicate an NCBI SRA outage (but most often, it's actually an issue with the data itself).
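Checking the download report by eye gets tedious with many samples, so here is a hedged Python sketch of the same check. The report's actual layout isn't described here, so this only assumes it is plain text with one sample per line and that affected lines contain the phrase "unknown error". The function name `find_unknown_errors` and the file name in the usage note are hypothetical.

```python
from typing import Iterable, List


def find_unknown_errors(
    report_lines: Iterable[str], marker: str = "unknown error"
) -> List[str]:
    """Return report lines that mention the marker, ignoring case.

    Assumes one sample per line; adjust if your report is laid out differently.
    """
    needle = marker.lower()
    return [
        line.rstrip("\n")
        for line in report_lines
        if needle in line.lower()
    ]
```

With a report saved locally as, say, `download_report.txt` (a made-up name), you could run `find_unknown_errors(open("download_report.txt"))`. If many samples come back at once, an outage becomes more plausible and those accessions may be worth retrying later.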
Transient errors shouldn't be confused with preempted VMs, which Google clearly advertises as "this can be taken away from you at any time" in exchange for being much cheaper. By default, myco and Tree Nine use preemptible instances for some of their quicker tasks to help keep cloud costs down. When running on GCP, including Terra, Cromwell attempts to detect preemption and retry automatically as described in its spec. However, if this retry fails to "fire" properly, the preemption could turn into an actual transient error.