feat(backend, website): download of original data for revisions #5856
chaoran-chen wants to merge 2 commits into main
Conversation
Claude finished @chaoran-chen's task in 4m 10s (View job)

Code Review Complete

I've completed a comprehensive review of PR #5856. This is a well-implemented feature that adds the ability to download original submission data for revisions. The implementation is solid with good test coverage and documentation.

Summary

Strengths:
Areas for improvement:

Key Observations

Security: ✅ Good
Performance:
Testing: ✅ Excellent
Documentation: ✅ Good

Testing & Linting

I cannot run the full test suite or linter from this environment due to Docker requirements, but I recommend running:

    cd backend && ./gradlew test ktlintFormat
    cd website && CI=1 npm run test && npm run check-types && npm run format

The inline comments I've provided are actionable suggestions. Most are minor improvements and don't block merging, but the resource management issues in TsvWriter/FastaWriter should be addressed to prevent potential data loss. Overall, this is a well-executed feature that solves a real user need. Nice work!
        organism,
        body.groupId,
        body.accessionsFilter,
    ).toList()
Memory efficiency concern: Calling .toList() on the sequence loads all data into memory at once, which defeats the purpose of using a Sequence for streaming. The PR description mentions this is "not optimized for memory efficiency" and limited to 500 entries because of this.
While the 500-entry limit mitigates the risk, consider refactoring in the future to properly stream data:
- Write TSV headers first
- Stream and write rows one at a time
- Write FASTA entries as you go
This would allow handling larger datasets without the hard limit.
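A rough sketch of that shape, assuming entries arrive as a lazily evaluated Sequence backed by a streaming database cursor (the entry type, function names, and schema-derived header columns below are placeholders, not the actual classes in this PR):

    import java.io.OutputStream
    import java.util.zip.ZipEntry
    import java.util.zip.ZipOutputStream

    // Placeholder entry type; the real type in this PR is shaped differently.
    data class Entry(
        val accession: String,
        val version: Long,
        val metadata: Map<String, String>,
        val unalignedNucleotideSequences: Map<String, String?>,
    )

    // Streams entries straight into the zip instead of materializing them with .toList().
    // fetchEntries is called twice because zip entries must be written one after another
    // (metadata.tsv first, then sequences.fasta); alternatively the FASTA part could be
    // spooled to a temporary file during the first pass.
    fun writeOriginalDataZip(
        metadataKeys: List<String>,
        fetchEntries: () -> Sequence<Entry>,
        out: OutputStream,
    ) {
        ZipOutputStream(out).use { zip ->
            val writer = zip.bufferedWriter()

            // Header columns must be known up front (e.g. from the organism's schema),
            // since rows are written before all entries have been seen.
            // Escaping of tabs/newlines is omitted for brevity.
            zip.putNextEntry(ZipEntry("metadata.tsv"))
            writer.appendLine((listOf("id") + metadataKeys).joinToString("\t"))
            for (entry in fetchEntries()) {
                val id = "${entry.accession}.${entry.version}"
                val row = listOf(id) + metadataKeys.map { entry.metadata[it].orEmpty() }
                writer.appendLine(row.joinToString("\t"))
            }
            writer.flush()
            zip.closeEntry()

            zip.putNextEntry(ZipEntry("sequences.fasta"))
            for (entry in fetchEntries()) {
                for ((fastaId, sequence) in entry.unalignedNucleotideSequences) {
                    if (sequence != null) {
                        // Header naming (id vs. id_segment) should follow the submission format's convention.
                        writer.appendLine(">$fastaId")
                        writer.appendLine(sequence)
                    }
                }
            }
            writer.flush()
            zip.closeEntry()
        }
    }

Note that with streaming, the metadata.tsv header has to come from the organism's schema rather than from the fetched entries, since rows are written before the full data set has been seen.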
    }

    override fun close() {
        printer.flush()
Resource management issue: The close() method only flushes the printer but doesn't close the underlying writer. This could lead to incomplete writes if the buffered writer isn't properly flushed and closed.
Suggested change:

    override fun close() {
        printer.flush()
        writer.close()
    }
Since TsvWriter implements AutoCloseable and is used with .use {}, it's responsible for properly closing all resources it manages.
    outputStream: java.io.OutputStream,
    isMultiSegmented: Boolean,
) {
    val metadataKeys = data.flatMap { it.originalData.metadata.keys }.toSet().sorted()
Potential performance issue: For large datasets (up to 500 entries), flatMap followed by toSet() could be expensive if entries have many metadata fields. Consider using flatMapTo(mutableSetOf()) for better performance:
    val metadataKeys = data.flatMapTo(mutableSetOf()) { it.originalData.metadata.keys }.sorted()

    const handleDownload = useCallback(async () => {
        setIsDownloading(true);
        setError(null);

        try {
            let accessionVersions: string[];

            if (sequenceCount !== undefined) {
                const apiParams = sequenceFilter.toApiParams();
                const selectedVersions = apiParams.accessionVersion;
                if (!Array.isArray(selectedVersions)) {
                    throw new Error('No accessions selected');
                }
                accessionVersions = selectedVersions;
            } else {
                accessionVersions = await fetchAccessions();
                if (accessionVersions.length > MAX_DOWNLOAD_ENTRIES) {
                    throw new Error(
                        `Too many sequences (${accessionVersions.length}). Please filter to ${MAX_DOWNLOAD_ENTRIES} or fewer.`,
                    );
                }
            }

            const accessions = extractAccessions(accessionVersions);
            if (accessions.length === 0) {
                throw new Error('No sequences to download');
            }

            const result = await getOriginalData(backendUrl, organism, accessToken, {
                groupId,
                accessionsFilter: accessions,
            });

            if (!result.ok) {
                throw new Error(
                    `Download failed: ${result.error.status} ${result.error.statusText}. ${result.error.detail}`,
                );
            }

            if (result.blob.size === 0) {
                throw new Error('No data to download');
            }

            const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, 19);
            const filename = `${organism}_original_data_${timestamp}.zip`;
            downloadBlob(result.blob, filename);
        } catch (err) {
            const message = err instanceof Error ? err.message : 'Download failed';
            setError(message);
        } finally {
            setIsDownloading(false);
        }
    }, [sequenceFilter, backendUrl, accessToken, organism, groupId, sequenceCount, fetchAccessions]);
Missing error handling for fetchAccessions: The fetchAccessions async call on line 67 could fail (network error, API error), but there's no try-catch around it. If it throws before reaching the main try block's catch, the error state won't be updated and isDownloading will remain true, leaving the button in a stuck "Downloading..." state.
Consider wrapping the entire async logic in the try block or adding specific error handling for the fetchAccessions call.
        accessionVersions = selectedVersions;
    } else {
        accessionVersions = await fetchAccessions();
        if (accessionVersions.length > MAX_DOWNLOAD_ENTRIES) {
Client-side validation inconsistency: This check validates the number of accessions after fetching them, but the backend has already validated this (limited to 500). However, the check here uses accessionVersions.length while the backend counts unique accessions.
If some accessions have multiple versions, this could incorrectly reject downloads that would be accepted by the backend. Consider either:
- Extracting accessions first, then checking the count
- Removing this check and relying on the backend's validation (which provides a better error message anyway)
    @HiddenParam authenticatedUser: AuthenticatedUser,
    @RequestBody body: GetOriginalDataRequest,
): ResponseEntity<StreamingResponseBody> {
    val entryCount = transaction {
Transaction wrapping concern: The entryCount query is wrapped in a transaction but the main data retrieval on line 519 is also wrapped in a separate transaction. This creates two separate database transactions for a single request.
Consider combining these into a single transaction to ensure data consistency (the count matches what will be downloaded) and reduce database overhead. The count could be calculated from the sequence itself if needed.
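For illustration, a hedged sketch of fetching and validating inside one transaction (the limit constant, generic fetch parameter, and error handling are placeholders, not the PR's actual names):

    import org.jetbrains.exposed.sql.transactions.transaction

    // Placeholder limit; the PR uses its own constant and a proper 4xx exception.
    const val MAX_ORIGINAL_DATA_ENTRIES = 500

    fun <T> fetchAndValidateInOneTransaction(fetch: () -> Sequence<T>): List<T> = transaction {
        // Fetching and counting in the same transaction keeps the validated count
        // consistent with the data that ends up in the download and avoids a second query.
        val entries = fetch().toList()
        require(entries.size <= MAX_ORIGINAL_DATA_ENTRIES) {
            "Too many entries (${entries.size}); the maximum is $MAX_ORIGINAL_DATA_ENTRIES."
        }
        entries
    }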
    for (entry in data) {
        val id = "${entry.accession}.${entry.version}"
        for ((originalFastaId, sequence) in entry.originalData.unalignedNucleotideSequences) {
            if (sequence != null) {
Null sequence handling: The code checks if (sequence != null) but silently skips null sequences. This could lead to a mismatch between metadata and FASTA files if some sequences are null.
Consider:
- Adding logging when sequences are skipped (sketched below)
- Documenting why sequences might be null
- Validating that at least some sequences exist before creating the file
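For example, the quoted loop could log the skip like this (a sketch only; the message wording is invented, and log is assumed to be the logger already used in this service):

    for ((originalFastaId, sequence) in entry.originalData.unalignedNucleotideSequences) {
        if (sequence == null) {
            // A null sequence means no data exists for this segment; log it so a mismatch
            // between metadata.tsv and sequences.fasta can be traced back to the source.
            log.info { "Skipping null sequence for $id (segment $originalFastaId)" }
            continue
        }
        // ... write the FASTA header and sequence as before
    }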
    });

    if (!response.ok) {
        const errorText = await response.text();
Error response parsing: The error detail is read as text, but the backend might return structured JSON error messages. Consider trying to parse the response as JSON first, falling back to text if that fails:
    let detail = errorText;
    try {
        const errorJson = JSON.parse(errorText);
        detail = errorJson.message || errorJson.detail || errorText;
    } catch {
        // Keep as text if not JSON
    }

This would provide better error messages to users.
        description = "The group ID to download data for.",
        required = true,
    )
    val groupId: Int,
Schema annotation: The groupId is marked as required = true in the @Schema annotation, but this is redundant since Kotlin's type system already makes it non-nullable. This is fine but could be simplified by removing the required = true parameter (it's the default for non-nullable types).
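That is, the quoted declaration could shrink to:

    @Schema(description = "The group ID to download data for.")
    val groupId: Int,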
    {error !== null && (
        <div className='absolute top-full left-0 mt-1 text-sm text-red-600 bg-white p-2 rounded shadow-md z-10 max-w-xs'>
            {error}
            <Button className='ml-2 text-gray-500 hover:text-gray-700' onClick={() => setError(null)}>
                x
            </Button>
        </div>
Accessibility issue: The error dismiss button only contains "x" text without proper accessibility attributes. Add an aria-label for screen readers:
    <Button
        aria-label='Dismiss error'
        className='ml-2 text-gray-500 hover:text-gray-700'
        onClick={() => setError(null)}
    >
        x
    </Button>
@codex review pls
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e72fdb0a3f
    } catch (e: Exception) {
        val duration = System.currentTimeMillis() - startTime
        log.error(e) { "[get-original-data] Error after ${duration}ms: $e" }
        throw e
    }
Clear MDC on stream errors in get-original-data
If ZipOutputStream/DB streaming throws, this catch block rethrows the exception, so the MDC cleanup below never runs. Because StreamingResponseBody executes on a separate thread (outside the request filter that clears MDC), the request/organism values can leak into later log lines on that worker thread. Consider moving the MDC.remove calls into a finally or clearing them in the catch before rethrowing so errors don’t leave stale MDC entries.
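A minimal sketch of that shape (the MDC keys and helper names are placeholders, not the ones used in this PR):

    import org.slf4j.MDC
    import org.springframework.http.ResponseEntity
    import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody
    import java.io.OutputStream

    // writeZip stands in for whatever actually streams the zip in this endpoint.
    fun streamingBodyWithMdc(
        requestId: String,
        organism: String,
        writeZip: (OutputStream) -> Unit,
    ): ResponseEntity<StreamingResponseBody> = ResponseEntity.ok().body(
        StreamingResponseBody { outputStream ->
            // StreamingResponseBody runs on a separate worker thread, outside the request
            // filter that normally clears MDC, so clean up in finally regardless of outcome.
            MDC.put("requestId", requestId)
            MDC.put("organism", organism)
            try {
                writeZip(outputStream)
            } finally {
                MDC.remove("requestId")
                MDC.remove("organism")
            }
        },
    )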
resolves #3167
This PR adds the feature to download original submission data (metadata and sequences) for released sequences, making it easier to prepare revisions in bulk. Previously, users could only download processed data from the website, but revisions require the original submitted data format.
A new backend endpoint POST /{organism}/get-original-data returns a zip file containing metadata.tsv and sequences.fasta in the same format expected for submissions. The download is limited to 500 entries as it has not been optimized for memory efficiency (no streaming or out-of-memory storage). On the website, a new button has been added to the released sequences page for downloading the file.

The design of providing a zip download would be compatible with @theosanderson's proposal in #5821 (comment) to "provide a zip download which has the whole directory structure to upload in order to put in the existing sequences -- except that instead of each file actually containing the contents it just contains the text md5:da23ef or whatever, and that's used as a pointer to the existing file".

Screenshot
PR Checklist
🚀 Preview: https://download-original-data.loculus.org