Fixed 1M cells's error in cell_velocity #737

Merged
Starlitnightly merged 5 commits into aristoteleo:master from Starlitnightly:master on Dec 5, 2025

@Starlitnightly Starlitnightly commented Dec 4, 2025

This pull request introduces two new external dependencies, gtfparse and pyensembl, into the dynamo/external directory, so that mygene no longer needs to be installed. It adds a complete implementation for parsing GTF (Gene Transfer Format) files, including attribute expansion, missing feature construction, robust error handling, and support for both Polars and Pandas DataFrames. The changes are grouped into new modules for GTF parsing and the integration of these modules via an updated __init__.py.

New GTF parsing functionality:

  • Added read_gtf.py, which implements the main GTF parsing logic, including attribute expansion, flexible column handling, support for both Polars and Pandas DataFrames, and biotype inference. It also defines the required columns and default data types for GTF files.
  • Added attribute_parsing.py, providing the expand_attribute_strings function for parsing and expanding the GTF attribute column into separate columns.
  • Added create_missing_features.py, which allows for the construction of missing features (e.g., genes or transcripts) from available annotations in cases where they are absent in the GTF file.
  • Added parsing_error.py, defining a custom ParsingError exception for robust error handling during parsing.
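
To make the attribute expansion step concrete, here is a minimal pure-Python sketch of the idea. It is not the code added in attribute_parsing.py: the function name `expand_attributes_sketch`, the comma-joining of repeated keys, and the exact set of stripped quote characters are assumptions made for illustration. It also splits on every semicolon, so it would mishandle values that themselves contain `;`.

```python
from collections import OrderedDict


def expand_attributes_sketch(attribute_strings):
    """Hypothetical helper: expand GTF attribute strings into per-key columns.

    Returns an OrderedDict mapping each attribute key to a list with one
    entry per input row ("" where the key is absent on that row).
    """
    n = len(attribute_strings)
    columns = OrderedDict()
    for i, line in enumerate(attribute_strings):
        for field in line.split(";"):
            field = field.strip()
            if not field:
                continue
            # Key and value are separated by the first run of whitespace.
            parts = field.split(None, 1)
            if len(parts) != 2:
                continue  # this sketch silently skips malformed pairs
            key, value = parts
            # Strip straight and curly quotes, single or double (assumed set).
            value = value.strip().strip("\"'\u201c\u201d\u2018\u2019")
            column = columns.setdefault(key, [""] * n)
            # Repeated keys on one row (e.g. several "tag" entries) are
            # joined with commas in this sketch.
            column[i] = value if column[i] == "" else column[i] + "," + value
    return columns


rows = [
    'gene_id "ENSG01"; gene_name "ABC"; tag "a"; tag "b";',
    "gene_id 'ENSG02';",
]
cols = expand_attributes_sketch(rows)
```

Each key becomes one column with an entry per row, which is the shape a DataFrame constructor (Pandas or Polars) expects.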

Integration and module setup:

  • Updated __init__.py to expose all major functions and classes from the new modules, establish the module version, and define the public API for gtfparse.

Documentation update:

  • Updated the docs/tutorials/notebooks subproject commit, likely to reflect the new or updated tutorials related to GTF parsing.

This pull request includes updates across several files to improve functionality, fix potential issues, and prepare for a new release. The most significant changes are an improved neighbor index calculation, a version bump for the upcoming release candidate, and minor formatting and submodule updates.

Core functionality improvements:

  • Improved handling of neighbor indices in get_neighbor_indices within dynamo/tools/utils.py: Now uses NumPy arrays for index management and more robustly handles NaN values when appending new neighbors, reducing the risk of errors during neighbor calculations.
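
The NaN handling described above can be illustrated with a short sketch. This is not the body of get_neighbor_indices: the function name `collect_neighbors_sketch` and the assumed input layout (a float array of neighbor indices padded with NaN) are hypothetical and chosen only to show why NaN entries need to be dropped before the values are reused as indices.

```python
import numpy as np


def collect_neighbors_sketch(adjacency, source_idx, n_order=2):
    """Hypothetical helper: gather neighbors up to `n_order` hops away.

    `adjacency` is assumed to be an (n_cells, k) float array of neighbor
    indices, padded with NaN where a cell has fewer than k neighbors.
    """
    indices = np.array([source_idx], dtype=np.int64)
    for _ in range(n_order):
        # Look up the neighbors of every index collected so far.
        candidates = np.asarray(adjacency[indices], dtype=float).ravel()
        # Drop NaN padding first: NaN has no valid integer value, and
        # float arrays cannot be used for indexing.
        candidates = candidates[~np.isnan(candidates)].astype(np.int64)
        indices = np.unique(np.concatenate([indices, candidates]))
    return indices


adjacency = np.array(
    [
        [1, 2],
        [3, np.nan],
        [np.nan, np.nan],
        [0, np.nan],
    ]
)
first_order = collect_neighbors_sketch(adjacency, 0, n_order=1)
second_order = collect_neighbors_sketch(adjacency, 0, n_order=2)
```

Keeping the running index set as an integer NumPy array means every intermediate result can be fed straight back into `adjacency[...]` on the next hop.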

Release and dependency updates:

  • Updated package version in setup.py from v1.4.3 to v1.4.4rc1 to mark a new release candidate.
  • Updated the submodule reference in docs/tutorials/notebooks to a newer commit, ensuring documentation is up to date.

Code style and formatting:

  • Reformatted the convert2gene_symbol function signature in dynamo/preprocessing/utils.py for improved readability and consistency.

codecov bot commented Dec 4, 2025

Codecov Report

❌ Patch coverage is 0.39331% with 2026 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.08%. Comparing base (4b9a620) to head (d1f5ee6).
⚠️ Report is 14 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dynamo/external/pyensembl/genome.py | 0.00% | 354 Missing ⚠️ |
| dynamo/external/pyensembl/database.py | 0.00% | 215 Missing ⚠️ |
| dynamo/external/pyensembl/serializable.py | 0.00% | 191 Missing ⚠️ |
| dynamo/external/pyensembl/transcript.py | 0.00% | 186 Missing ⚠️ |
| dynamo/external/pyensembl/download_cache.py | 0.00% | 125 Missing ⚠️ |
| dynamo/external/pyensembl/locus.py | 0.00% | 93 Missing ⚠️ |
| dynamo/external/pyensembl/species.py | 0.00% | 93 Missing ⚠️ |
| dynamo/external/gtfparse/read_gtf.py | 0.00% | 90 Missing ⚠️ |
| dynamo/external/pyensembl/shell.py | 0.00% | 83 Missing ⚠️ |
| dynamo/external/pyensembl/sequence_data.py | 0.00% | 77 Missing ⚠️ |

... and 20 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #737      +/-   ##
==========================================
- Coverage   28.24%   27.08%   -1.17%     
==========================================
  Files         297      324      +27     
  Lines       47431    49452    +2021     
==========================================
- Hits        13397    13392       -5     
- Misses      34034    36060    +2026     

- Implemented a validation step to ensure the number of PCA components matches the count of genes marked for PCA usage in adata.var.
- Added a descriptive error message to guide users in resolving dimension mismatches, enhancing robustness of the perturbation function.
- Improved smart quote removal in `expand_attribute_strings` to handle both single and double quotes for better compatibility with various GTF sources.
- Added checks in `read_gtf` to only process existing columns in the DataFrame, with warnings for missing columns, enhancing robustness.
- Converted categorical columns to object dtype before applying converters to prevent issues with shared categories in Polars.
- Updated `convert2gene_symbol` to utilize `pyensembl` for gene ID conversion, supporting auto-detection of species and release selection.
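
The PCA dimension check mentioned in the first commit note above might look roughly like the following. This is a sketch under stated assumptions, not the code added to the perturbation function: the name `check_pca_dimensions_sketch` is hypothetical, the check is interpreted here as comparing the number of rows in a PCA loading matrix against the number of flagged genes, and a plain boolean array stands in for the flag column in adata.var.

```python
import numpy as np


def check_pca_dimensions_sketch(use_for_pca, loadings):
    """Hypothetical check: loading-matrix rows must match the flagged genes.

    `use_for_pca` is assumed to be a boolean array with one entry per gene;
    `loadings` is assumed to be a (n_genes_used, n_components) array.
    """
    n_flagged = int(np.sum(use_for_pca))
    n_rows = loadings.shape[0]
    if n_rows != n_flagged:
        # Tell the user which two numbers disagree and how to reconcile them.
        raise ValueError(
            f"PCA loading matrix has {n_rows} rows but {n_flagged} genes are "
            "flagged for PCA; recompute PCA or update the flag so they agree."
        )
    return n_flagged


mask = np.array([True, False, True])
n_ok = check_pca_dimensions_sketch(mask, np.zeros((2, 5)))

try:
    check_pca_dimensions_sketch(mask, np.zeros((3, 5)))
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
```

Failing early with both counts in the message tends to be easier to act on than a shape error raised later inside a matrix product.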
@Starlitnightly Starlitnightly merged commit 641f7f3 into aristoteleo:master Dec 5, 2025
7 of 10 checks passed