feat(backend, website): download of original data for revisions #5856

Open
chaoran-chen wants to merge 2 commits into main from download-original-data

Conversation

@chaoran-chen
Member

@chaoran-chen chaoran-chen commented Jan 19, 2026

resolves #3167

This PR adds the ability to download the original submission data (metadata and sequences) of released sequences, making it easier to prepare revisions in bulk. Previously, users could only download processed data from the website, but revisions require the original submitted data format.

A new backend endpoint POST /{organism}/get-original-data returns a zip file containing metadata.tsv and sequences.fasta in the same format expected for submissions. The download is limited to 500 entries as it has not been optimized for memory efficiency (no streaming or out-of-memory storage). On the website, a new button has been added to the released sequences page for downloading the file.
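
For illustration, a request against the new endpoint could look roughly like the sketch below. The server URL, organism slug, bearer token, and accession are placeholders; only the groupId and accessionsFilter body fields follow the endpoint's request schema.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.nio.file.Path

// All concrete values here are hypothetical and for illustration only.
const val BACKEND_URL = "https://backend.loculus.example"
const val ORGANISM = "cchf"
const val TOKEN = "<bearer token>"

fun main() {
    // Request body: the submitting group's ID plus the accessions to include (at most 500 entries).
    val body = """{"groupId": 1, "accessionsFilter": ["LOC_0001ABC"]}"""

    val request = HttpRequest.newBuilder(URI.create("$BACKEND_URL/$ORGANISM/get-original-data"))
        .header("Authorization", "Bearer $TOKEN")
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    // Save the returned zip (metadata.tsv and sequences.fasta) to disk.
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofFile(Path.of("original_data.zip")))
    println("HTTP ${response.statusCode()} -> saved ${response.body()}")
}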

The design of providing a zip download would be compatible with @theosanderson's proposal in #5821 (comment) to "provide a zip download which has the whole directory structure to upload in order to put in the existing sequences -- except that instead of each file actually containing the contents it just contains the text md5:da23ef or whatever, and that's used as a pointer to the existing file".

Screenshot

image

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)
    • Workflow of downloading original data (test organism and CCHF), changing something, and using the files for revision

🚀 Preview: https://download-original-data.loculus.org

@claude claude bot added the backend (related to the loculus backend component) and website (Tasks related to the web application) labels Jan 19, 2026
@chaoran-chen chaoran-chen force-pushed the download-original-data branch 3 times, most recently from 8d11338 to b2d1706 on January 21, 2026 at 14:30
@chaoran-chen chaoran-chen added the preview (Triggers a deployment to argocd) label Jan 21, 2026
@chaoran-chen chaoran-chen force-pushed the download-original-data branch from 7fa72bd to cff80b1 on January 21, 2026 at 23:37
@chaoran-chen chaoran-chen force-pushed the download-original-data branch from 2e442b1 to 29c0ce4 on January 23, 2026 at 14:58
@chaoran-chen chaoran-chen marked this pull request as ready for review January 23, 2026 15:15
@claude
Contributor

claude bot commented Jan 23, 2026

Claude finished @chaoran-chen's task in 4m 10s


Code Review Complete

I've completed a comprehensive review of PR #5856. This is a well-implemented feature that adds the ability to download original submission data for revisions. The implementation is solid with good test coverage and documentation.

Summary

Strengths:

  • Well-designed API with clear separation of concerns
  • Comprehensive test coverage including unit tests, endpoint tests, and integration tests
  • Good documentation updates explaining the feature to users
  • Proper authentication and authorization checks (group membership validation)
  • Handles edge cases like multi-segmented organisms and organisms without consensus sequences
  • Includes a workflow test demonstrating the full revision cycle

Areas for improvement:
I've added inline comments for specific issues found during the review. The main concerns are:

  1. Resource Management: TsvWriter and FastaWriter don't properly close their underlying writers (only flush them)
  2. Memory Efficiency: The acknowledged limitation of loading all data into memory - the 500-entry limit mitigates this
  3. Error Handling: Minor improvements needed in frontend error handling and backend null sequence handling
  4. Performance: Small optimization opportunities in metadata key collection
  5. Accessibility: Missing aria-label on error dismiss button

Key Observations

Security: ✅ Good

  • Proper authentication via bearer token
  • Group membership validation prevents unauthorized access
  • Uses parameterized queries (Exposed ORM) to prevent SQL injection

Performance: ⚠️ Acceptable with limitations

  • The .toList() on line 525 loads all data into memory, but the 500-entry limit makes this manageable
  • Two separate transactions for count + data retrieval could be optimized

Testing: ✅ Excellent

  • Comprehensive endpoint tests covering auth, edge cases, and multi-segmented organisms
  • Full workflow test demonstrating download → modify → revise cycle
  • Integration test with Playwright validating the UI flow

Documentation: ✅ Good

  • Clear user-facing documentation explaining the feature
  • API documentation with proper Swagger annotations

Testing & Linting

I cannot run the full test suite or linter from this environment due to Docker requirements, but I recommend running:

cd backend && ./gradlew test ktlintFormat
cd website && CI=1 npm run test && npm run check-types && npm run format

The inline comments I've provided are actionable suggestions. Most are minor improvements and don't block merging, but the resource management issues in TsvWriter/FastaWriter should be addressed to prevent potential data loss.

Overall, this is a well-executed feature that solves a real user need. Nice work!

organism,
body.groupId,
body.accessionsFilter,
).toList()

Memory efficiency concern: Calling .toList() on the sequence loads all data into memory at once, which defeats the purpose of using a Sequence for streaming. The PR description mentions this is "not optimized for memory efficiency" and limited to 500 entries because of this.

While the 500-entry limit mitigates the risk, consider refactoring in the future to properly stream data:

  • Write TSV headers first
  • Stream and write rows one at a time
  • Write FASTA entries as you go

This would allow handling larger datasets without the hard limit.
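
As a rough sketch of that streaming shape (the entry type and writer APIs here are simplified stand-ins, not the PR's actual classes):

import java.io.Writer

// Simplified stand-in types: the point is the loop structure, which writes each entry
// as it is consumed from the lazy Sequence instead of materializing a list first.
data class OriginalEntry(val id: String, val metadata: Map<String, String>, val sequence: String)

fun streamOriginalData(entries: Sequence<OriginalEntry>, keys: List<String>, tsv: Writer, fasta: Writer) {
    tsv.appendLine((listOf("id") + keys).joinToString("\t"))            // TSV headers first
    for (entry in entries) {                                            // one entry at a time
        tsv.appendLine((listOf(entry.id) + keys.map { entry.metadata[it].orEmpty() }).joinToString("\t"))
        fasta.appendLine(">${entry.id}")                                // FASTA written as we go
        fasta.appendLine(entry.sequence)
    }
    tsv.flush()
    fasta.flush()
}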

}

override fun close() {
printer.flush()

Resource management issue: The close() method only flushes the printer but doesn't close the underlying writer. This could lead to incomplete writes if the buffered writer isn't properly flushed and closed.

Suggested change:

override fun close() {
    printer.flush()
    writer.close()
}

Since TsvWriter implements AutoCloseable and is used with .use {}, it's responsible for properly closing all resources it manages.
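
As a minimal illustration of why this matters (a simplified stand-in, not the PR's actual TsvWriter): .use {} guarantees only a call to close(), so anything close() skips never happens.

import java.io.BufferedWriter
import java.io.Closeable
import java.io.FileWriter

// Simplified stand-in class: close() must release the underlying writer, because
// .use {} calls close() and nothing else when the block finishes or throws.
class TinyTsvWriter(path: String) : Closeable {
    private val writer = BufferedWriter(FileWriter(path))

    fun writeRow(cells: List<String>) {
        writer.appendLine(cells.joinToString("\t"))
    }

    override fun close() {
        writer.flush()
        writer.close()   // without this, buffered bytes may be lost and the file handle leaks
    }
}

fun main() {
    TinyTsvWriter("example.tsv").use { it.writeRow(listOf("accession", "version")) }
}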

}

override fun close() {
writer.flush()

Resource management issue: Same as TsvWriter - the close() method only flushes but doesn't close the underlying writer.

Suggested change:

override fun close() {
    writer.flush()
    writer.close()
}

outputStream: java.io.OutputStream,
isMultiSegmented: Boolean,
) {
val metadataKeys = data.flatMap { it.originalData.metadata.keys }.toSet().sorted()

Potential performance issue: For large datasets (up to 500 entries), flatMap followed by toSet() could be expensive if entries have many metadata fields. Consider using flatMapTo(mutableSetOf()) for better performance:

val metadataKeys = data.flatMapTo(mutableSetOf()) { it.originalData.metadata.keys }.sorted()

Comment on lines +52 to +104
const handleDownload = useCallback(async () => {
    setIsDownloading(true);
    setError(null);

    try {
        let accessionVersions: string[];

        if (sequenceCount !== undefined) {
            const apiParams = sequenceFilter.toApiParams();
            const selectedVersions = apiParams.accessionVersion;
            if (!Array.isArray(selectedVersions)) {
                throw new Error('No accessions selected');
            }
            accessionVersions = selectedVersions;
        } else {
            accessionVersions = await fetchAccessions();
            if (accessionVersions.length > MAX_DOWNLOAD_ENTRIES) {
                throw new Error(
                    `Too many sequences (${accessionVersions.length}). Please filter to ${MAX_DOWNLOAD_ENTRIES} or fewer.`,
                );
            }
        }

        const accessions = extractAccessions(accessionVersions);
        if (accessions.length === 0) {
            throw new Error('No sequences to download');
        }

        const result = await getOriginalData(backendUrl, organism, accessToken, {
            groupId,
            accessionsFilter: accessions,
        });

        if (!result.ok) {
            throw new Error(
                `Download failed: ${result.error.status} ${result.error.statusText}. ${result.error.detail}`,
            );
        }

        if (result.blob.size === 0) {
            throw new Error('No data to download');
        }

        const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19);
        const filename = `${organism}_original_data_${timestamp}.zip`;
        downloadBlob(result.blob, filename);
    } catch (err) {
        const message = err instanceof Error ? err.message : 'Download failed';
        setError(message);
    } finally {
        setIsDownloading(false);
    }
}, [sequenceFilter, backendUrl, accessToken, organism, groupId, sequenceCount, fetchAccessions]);

Missing error handling for fetchAccessions: The fetchAccessions async call on line 67 could fail (network error, API error), but there's no try-catch around it. If it throws before reaching the main try block's catch, the error state won't be updated and isDownloading will remain true, leaving the button in a stuck "Downloading..." state.

Consider wrapping the entire async logic in the try block or adding specific error handling for the fetchAccessions call.

accessionVersions = selectedVersions;
} else {
accessionVersions = await fetchAccessions();
if (accessionVersions.length > MAX_DOWNLOAD_ENTRIES) {

Client-side validation inconsistency: This check validates the number of accessions after fetching them, but the backend has already validated this (limited to 500). However, the check here uses accessionVersions.length while the backend counts unique accessions.

If some accessions have multiple versions, this could incorrectly reject downloads that would be accepted by the backend. Consider either:

  1. Extracting accessions first, then checking the count
  2. Removing this check and relying on the backend's validation (which provides a better error message anyway)

@HiddenParam authenticatedUser: AuthenticatedUser,
@RequestBody body: GetOriginalDataRequest,
): ResponseEntity<StreamingResponseBody> {
val entryCount = transaction {

Transaction wrapping concern: The entryCount query is wrapped in a transaction but the main data retrieval on line 519 is also wrapped in a separate transaction. This creates two separate database transactions for a single request.

Consider combining these into a single transaction to ensure data consistency (the count matches what will be downloaded) and reduce database overhead. The count could be calculated from the sequence itself if needed.
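
A toy illustration of that idea (no real database; singleTransaction stands in for Exposed's transaction {} and the list stands in for the query result): the count is derived from the single result set instead of a second query.

// Everything here is a stand-in for illustration; only the shape matters.
fun <T> singleTransaction(block: () -> T): T = block()

fun main() {
    val entries = singleTransaction {
        listOf("LOC_0001ABC.1", "LOC_0002DEF.1")   // stand-in for the fetched original data
    }
    val entryCount = entries.size                  // count comes from the same data, not a second query
    require(entryCount <= 500) { "At most 500 entries can be downloaded at once." }
    println("Downloading $entryCount entries")
}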

for (entry in data) {
    val id = "${entry.accession}.${entry.version}"
    for ((originalFastaId, sequence) in entry.originalData.unalignedNucleotideSequences) {
        if (sequence != null) {

Null sequence handling: The code checks if (sequence != null) but silently skips null sequences. This could lead to a mismatch between metadata and FASTA files if some sequences are null.

Consider:

  1. Adding logging when sequences are skipped
  2. Documenting why sequences might be null
  3. Validating that at least some sequences exist before creating the file

});

if (!response.ok) {
const errorText = await response.text();

Error response parsing: The error detail is read as text, but the backend might return structured JSON error messages. Consider trying to parse the response as JSON first, falling back to text if that fails:

let detail = errorText;
try {
    const errorJson = JSON.parse(errorText);
    detail = errorJson.message || errorJson.detail || errorText;
} catch {
    // Keep as text if not JSON
}

This would provide better error messages to users.

description = "The group ID to download data for.",
required = true,
)
val groupId: Int,

Schema annotation: The groupId is marked as required = true in the @Schema annotation, but this is redundant since Kotlin's type system already makes it non-nullable. This is fine but could be simplified by removing the required = true parameter (it's the default for non-nullable types).

Comment on lines +118 to +124
{error !== null && (
    <div className='absolute top-full left-0 mt-1 text-sm text-red-600 bg-white p-2 rounded shadow-md z-10 max-w-xs'>
        {error}
        <Button className='ml-2 text-gray-500 hover:text-gray-700' onClick={() => setError(null)}>
            x
        </Button>
    </div>

Accessibility issue: The error dismiss button only contains "x" text without proper accessibility attributes. Add an aria-label for screen readers, for example:

<Button
    className='ml-2 text-gray-500 hover:text-gray-700'
    onClick={() => setError(null)}
    aria-label='Dismiss error'
>
    x
</Button>

@theosanderson
Member

@codex review pls


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e72fdb0a3f


Comment on lines +538 to +542
} catch (e: Exception) {
    val duration = System.currentTimeMillis() - startTime
    log.error(e) { "[get-original-data] Error after ${duration}ms: $e" }
    throw e
}


P3: Clear MDC on stream errors in get-original-data

If ZipOutputStream/DB streaming throws, this catch block rethrows the exception, so the MDC cleanup below never runs. Because StreamingResponseBody executes on a separate thread (outside the request filter that clears MDC), the request/organism values can leak into later log lines on that worker thread. Consider moving the MDC.remove calls into a finally or clearing them in the catch before rethrowing so errors don’t leave stale MDC entries.
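
A sketch of the suggested shape (the MDC key and the zip-writing lambda are placeholders): the finally block runs on the streaming worker thread whether or not writing throws, so stale MDC values cannot leak into later log lines on that thread.

import org.slf4j.MDC

fun writeZipWithMdcCleanup(writeZip: () -> Unit) {
    MDC.put("organism", "cchf")        // illustrative key/value set for the streaming work
    try {
        writeZip()
    } finally {
        MDC.remove("organism")         // always runs, even when writeZip() throws
    }
}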



Labels

backend (related to the loculus backend component), preview (Triggers a deployment to argocd), website (Tasks related to the web application)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Revision in bulk

2 participants