feat(backend, website): download of original data for revisions #5856

Open
chaoran-chen wants to merge 2 commits into main from download-original-data

Conversation

@chaoran-chen
Member

@chaoran-chen chaoran-chen commented Jan 19, 2026

resolves #3167

This PR adds the ability to download the original submission data (metadata and sequences) of released sequences, making it easier to prepare revisions in bulk. Previously, users could only download processed data from the website, but revisions require the original submitted data format.

A new backend endpoint POST /{organism}/get-original-data returns a zip file containing metadata.tsv and sequences.fasta in the same format expected for submissions. The download is limited to 500 entries as it has not been optimized for memory efficiency (no streaming or out-of-memory storage). On the website, a new button has been added to the released sequences page for downloading the file.
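
For illustration, a request against the new endpoint could look roughly like the sketch below. The server URL, organism slug, bearer token, and accession are placeholders; only the groupId and accessionsFilter body fields follow the endpoint's request schema.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.nio.file.Path

// All concrete values here are hypothetical and for illustration only.
const val BACKEND_URL = "https://backend.loculus.example"
const val ORGANISM = "cchf"
const val TOKEN = "<bearer token>"

fun main() {
    // Request body: the submitting group's ID plus the accessions to include (at most 500 entries).
    val body = """{"groupId": 1, "accessionsFilter": ["LOC_0001ABC"]}"""

    val request = HttpRequest.newBuilder(URI.create("$BACKEND_URL/$ORGANISM/get-original-data"))
        .header("Authorization", "Bearer $TOKEN")
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    // Save the returned zip (metadata.tsv and sequences.fasta) to disk.
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofFile(Path.of("original_data.zip")))
    println("HTTP ${response.statusCode()} -> saved ${response.body()}")
}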

The design of providing a zip download would be compatible with @theosanderson's proposal in #5821 (comment) to "provide a zip download which has the whole directory structure to upload in order to put in the existing sequences -- except that instead of each file actually containing the contents it just contains the text md5:da23ef or whatever, and that's used as a pointer to the existing file".

Screenshot

image

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)
    • Workflow of downloading original data (test organism and CCHF), changing something, and using the files for revision

🚀 Preview: https://download-original-data.loculus.org

@claude claude bot added the backend (related to the loculus backend component) and website (Tasks related to the web application) labels Jan 19, 2026
@chaoran-chen chaoran-chen force-pushed the download-original-data branch 3 times, most recently from 8d11338 to b2d1706 on January 21, 2026 at 14:30
@chaoran-chen chaoran-chen added the preview (Triggers a deployment to argocd) label Jan 21, 2026
@chaoran-chen chaoran-chen force-pushed the download-original-data branch from 7fa72bd to cff80b1 on January 21, 2026 at 23:37
@chaoran-chen chaoran-chen force-pushed the download-original-data branch from 2e442b1 to 29c0ce4 on January 23, 2026 at 14:58
@chaoran-chen chaoran-chen marked this pull request as ready for review January 23, 2026 15:15
@claude
Contributor

claude bot commented Jan 23, 2026

Claude finished @chaoran-chen's task in 4m 10s


Code Review Complete

I've completed a comprehensive review of PR #5856. This is a well-implemented feature that adds the ability to download original submission data for revisions. The implementation is solid with good test coverage and documentation.

Summary

Strengths:

  • Well-designed API with clear separation of concerns
  • Comprehensive test coverage including unit tests, endpoint tests, and integration tests
  • Good documentation updates explaining the feature to users
  • Proper authentication and authorization checks (group membership validation)
  • Handles edge cases like multi-segmented organisms and organisms without consensus sequences
  • Includes a workflow test demonstrating the full revision cycle

Areas for improvement:
I've added inline comments for specific issues found during the review. The main concerns are:

  1. Resource Management: TsvWriter and FastaWriter don't properly close their underlying writers (only flush them)
  2. Memory Efficiency: The acknowledged limitation of loading all data into memory - the 500-entry limit mitigates this
  3. Error Handling: Minor improvements needed in frontend error handling and backend null sequence handling
  4. Performance: Small optimization opportunities in metadata key collection
  5. Accessibility: Missing aria-label on error dismiss button

Key Observations

Security: ✅ Good

  • Proper authentication via bearer token
  • Group membership validation prevents unauthorized access
  • Uses parameterized queries (Exposed ORM) to prevent SQL injection

Performance: ⚠️ Acceptable with limitations

  • The .toList() on line 525 loads all data into memory, but the 500-entry limit makes this manageable
  • Two separate transactions for count + data retrieval could be optimized

Testing: ✅ Excellent

  • Comprehensive endpoint tests covering auth, edge cases, and multi-segmented organisms
  • Full workflow test demonstrating download → modify → revise cycle
  • Integration test with Playwright validating the UI flow

Documentation: ✅ Good

  • Clear user-facing documentation explaining the feature
  • API documentation with proper Swagger annotations

Testing & Linting

I cannot run the full test suite or linter from this environment due to Docker requirements, but I recommend running:

cd backend && ./gradlew test ktlintFormat
cd website && CI=1 npm run test && npm run check-types && npm run format

The inline comments I've provided are actionable suggestions. Most are minor improvements and don't block merging, but the resource management issues in TsvWriter/FastaWriter should be addressed to prevent potential data loss.

Overall, this is a well-executed feature that solves a real user need. Nice work!

organism,
body.groupId,
body.accessionsFilter,
).toList()

Memory efficiency concern: Calling .toList() on the sequence loads all data into memory at once, which defeats the purpose of using a Sequence for streaming. The PR description mentions this is "not optimized for memory efficiency" and limited to 500 entries because of this.

While the 500-entry limit mitigates the risk, consider refactoring in the future to properly stream data:

  • Write TSV headers first
  • Stream and write rows one at a time
  • Write FASTA entries as you go

This would allow handling larger datasets without the hard limit.
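
As a rough sketch of that streaming shape (the entry type and writer APIs here are simplified stand-ins, not the PR's actual classes):

import java.io.Writer

// Simplified stand-in types: the point is the loop structure, which writes each entry
// as it is consumed from the lazy Sequence instead of materializing a list first.
data class OriginalEntry(val id: String, val metadata: Map<String, String>, val sequence: String)

fun streamOriginalData(entries: Sequence<OriginalEntry>, keys: List<String>, tsv: Writer, fasta: Writer) {
    tsv.appendLine((listOf("id") + keys).joinToString("\t"))            // TSV headers first
    for (entry in entries) {                                            // one entry at a time
        tsv.appendLine((listOf(entry.id) + keys.map { entry.metadata[it].orEmpty() }).joinToString("\t"))
        fasta.appendLine(">${entry.id}")                                // FASTA written as we go
        fasta.appendLine(entry.sequence)
    }
    tsv.flush()
    fasta.flush()
}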

}

override fun close() {
printer.flush()

Resource management issue: The close() method only flushes the printer but doesn't close the underlying writer. This could lead to incomplete writes if the buffered writer isn't properly flushed and closed.

Suggested change:

override fun close() {
    printer.flush()
    writer.close()
}

Since TsvWriter implements AutoCloseable and is used with .use {}, it's responsible for properly closing all resources it manages.
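
As a minimal illustration of why this matters (a simplified stand-in, not the PR's actual TsvWriter): .use {} guarantees only a call to close(), so anything close() skips never happens.

import java.io.BufferedWriter
import java.io.Closeable
import java.io.FileWriter

// Simplified stand-in class: close() must release the underlying writer, because
// .use {} calls close() and nothing else when the block finishes or throws.
class TinyTsvWriter(path: String) : Closeable {
    private val writer = BufferedWriter(FileWriter(path))

    fun writeRow(cells: List<String>) {
        writer.appendLine(cells.joinToString("\t"))
    }

    override fun close() {
        writer.flush()
        writer.close()   // without this, buffered bytes may be lost and the file handle leaks
    }
}

fun main() {
    TinyTsvWriter("example.tsv").use { it.writeRow(listOf("accession", "version")) }
}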

}

override fun close() {
writer.flush()

Resource management issue: Same as TsvWriter - the close() method only flushes but doesn't close the underlying writer.

Suggested change:

override fun close() {
    writer.flush()
    writer.close()
}

outputStream: java.io.OutputStream,
isMultiSegmented: Boolean,
) {
val metadataKeys = data.flatMap { it.originalData.metadata.keys }.toSet().sorted()

Potential performance issue: For large datasets (up to 500 entries), flatMap followed by toSet() could be expensive if entries have many metadata fields. Consider using flatMapTo(mutableSetOf()) for better performance:

val metadataKeys = data.flatMapTo(mutableSetOf()) { it.originalData.metadata.keys }.sorted()

Comment on lines +52 to +104
const handleDownload = useCallback(async () => {
    setIsDownloading(true);
    setError(null);

    try {
        let accessionVersions: string[];

        if (sequenceCount !== undefined) {
            const apiParams = sequenceFilter.toApiParams();
            const selectedVersions = apiParams.accessionVersion;
            if (!Array.isArray(selectedVersions)) {
                throw new Error('No accessions selected');
            }
            accessionVersions = selectedVersions;
        } else {
            accessionVersions = await fetchAccessions();
            if (accessionVersions.length > MAX_DOWNLOAD_ENTRIES) {
                throw new Error(
                    `Too many sequences (${accessionVersions.length}). Please filter to ${MAX_DOWNLOAD_ENTRIES} or fewer.`,
                );
            }
        }

        const accessions = extractAccessions(accessionVersions);
        if (accessions.length === 0) {
            throw new Error('No sequences to download');
        }

        const result = await getOriginalData(backendUrl, organism, accessToken, {
            groupId,
            accessionsFilter: accessions,
        });

        if (!result.ok) {
            throw new Error(
                `Download failed: ${result.error.status} ${result.error.statusText}. ${result.error.detail}`,
            );
        }

        if (result.blob.size === 0) {
            throw new Error('No data to download');
        }

        const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19);
        const filename = `${organism}_original_data_${timestamp}.zip`;
        downloadBlob(result.blob, filename);
    } catch (err) {
        const message = err instanceof Error ? err.message : 'Download failed';
        setError(message);
    } finally {
        setIsDownloading(false);
    }
}, [sequenceFilter, backendUrl, accessToken, organism, groupId, sequenceCount, fetchAccessions]);

Missing error handling for fetchAccessions: The fetchAccessions async call on line 67 could fail (network error, API error), but there's no try-catch around it. If it throws before reaching the main try block's catch, the error state won't be updated and isDownloading will remain true, leaving the button in a stuck "Downloading..." state.

Consider wrapping the entire async logic in the try block or adding specific error handling for the fetchAccessions call.

accessionVersions = selectedVersions;
} else {
accessionVersions = await fetchAccessions();
if (accessionVersions.length > MAX_DOWNLOAD_ENTRIES) {

Client-side validation inconsistency: This check validates the number of accessions after fetching them, but the backend has already validated this (limited to 500). However, the check here uses accessionVersions.length while the backend counts unique accessions.

If some accessions have multiple versions, this could incorrectly reject downloads that would be accepted by the backend. Consider either:

  1. Extracting accessions first, then checking the count
  2. Removing this check and relying on the backend's validation (which provides a better error message anyway)

@HiddenParam authenticatedUser: AuthenticatedUser,
@RequestBody body: GetOriginalDataRequest,
): ResponseEntity<StreamingResponseBody> {
val entryCount = transaction {

Transaction wrapping concern: The entryCount query is wrapped in a transaction but the main data retrieval on line 519 is also wrapped in a separate transaction. This creates two separate database transactions for a single request.

Consider combining these into a single transaction to ensure data consistency (the count matches what will be downloaded) and reduce database overhead. The count could be calculated from the sequence itself if needed.
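
A toy illustration of that idea (no real database; singleTransaction stands in for Exposed's transaction {} and the list stands in for the query result): the count is derived from the single result set instead of a second query.

// Everything here is a stand-in for illustration; only the shape matters.
fun <T> singleTransaction(block: () -> T): T = block()

fun main() {
    val entries = singleTransaction {
        listOf("LOC_0001ABC.1", "LOC_0002DEF.1")   // stand-in for the fetched original data
    }
    val entryCount = entries.size                  // count comes from the same data, not a second query
    require(entryCount <= 500) { "At most 500 entries can be downloaded at once." }
    println("Downloading $entryCount entries")
}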

for (entry in data) {
    val id = "${entry.accession}.${entry.version}"
    for ((originalFastaId, sequence) in entry.originalData.unalignedNucleotideSequences) {
        if (sequence != null) {

Null sequence handling: The code checks if (sequence != null) but silently skips null sequences. This could lead to a mismatch between metadata and FASTA files if some sequences are null.

Consider:

  1. Adding logging when sequences are skipped
  2. Documenting why sequences might be null
  3. Validating that at least some sequences exist before creating the file

});

if (!response.ok) {
const errorText = await response.text();

Error response parsing: The error detail is read as text, but the backend might return structured JSON error messages. Consider trying to parse the response as JSON first, falling back to text if that fails:

let detail = errorText;
try {
    const errorJson = JSON.parse(errorText);
    detail = errorJson.message || errorJson.detail || errorText;
} catch {
    // Keep as text if not JSON
}

This would provide better error messages to users.

description = "The group ID to download data for.",
required = true,
)
val groupId: Int,

Schema annotation: The groupId is marked as required = true in the @Schema annotation, but this is redundant since Kotlin's type system already makes it non-nullable. This is fine but could be simplified by removing the required = true parameter (it's the default for non-nullable types).

Comment on lines +118 to +124
{error !== null && (
    <div className='absolute top-full left-0 mt-1 text-sm text-red-600 bg-white p-2 rounded shadow-md z-10 max-w-xs'>
        {error}
        <Button className='ml-2 text-gray-500 hover:text-gray-700' onClick={() => setError(null)}>
            x
        </Button>
    </div>

Accessibility issue: The error dismiss button only contains "x" text without proper accessibility attributes. Add an aria-label for screen readers, for example:

<Button
    className='ml-2 text-gray-500 hover:text-gray-700'
    onClick={() => setError(null)}
    aria-label='Dismiss error'
>
    x
</Button>

@theosanderson
Member

@codex review pls


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e72fdb0a3f


Comment on lines +538 to +542
} catch (e: Exception) {
    val duration = System.currentTimeMillis() - startTime
    log.error(e) { "[get-original-data] Error after ${duration}ms: $e" }
    throw e
}


P3: Clear MDC on stream errors in get-original-data

If ZipOutputStream/DB streaming throws, this catch block rethrows the exception, so the MDC cleanup below never runs. Because StreamingResponseBody executes on a separate thread (outside the request filter that clears MDC), the request/organism values can leak into later log lines on that worker thread. Consider moving the MDC.remove calls into a finally or clearing them in the catch before rethrowing so errors don’t leave stale MDC entries.
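
A sketch of the suggested shape (the MDC key and the zip-writing lambda are placeholders): the finally block runs on the streaming worker thread whether or not writing throws, so stale MDC values cannot leak into later log lines on that thread.

import org.slf4j.MDC

fun writeZipWithMdcCleanup(writeZip: () -> Unit) {
    MDC.put("organism", "cchf")        // illustrative key/value set for the streaming work
    try {
        writeZip()
    } finally {
        MDC.remove("organism")         // always runs, even when writeZip() throws
    }
}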



Labels

backend (related to the loculus backend component), preview (Triggers a deployment to argocd), website (Tasks related to the web application)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Revision in bulk

2 participants