Description
Context
In PR #347 (Houge-Janssens Syndrome), the maintainer (@cmungall) requested replacing a GO Biological Process (BP) term with a GO Molecular Function (MF) term:
Line 195 - "better to use MF terms like dephosphatase as the BPs may be obsoleted soon"
The specific case was:
- Current: GO:0006470 (protein dephosphorylation), a BP term
- Requested: GO:0004722 (protein serine/threonine phosphatase activity), an MF term
This raises the broader question: How should the dismech schema represent molecular-level aberrations (enzyme activities, binding functions, etc.)?
Current State
Schema Structure
The BiologicalProcessDescriptor class (src/dismech/schema/dismech.yaml:1884-1889) is currently used for the biological_processes field in pathophysiology entries. It has minimal constraints:
```yaml
BiologicalProcessDescriptor:
  is_a: Descriptor
  description: A descriptor for biological processes, bindable to Gene Ontology (GO)
  slot_usage:
    term:
      description: Optional GO biological process term reference
```

No explicit restriction to BP terms only: the description says "biological processes" but doesn't enforce the BP term namespace.
Current Usage Patterns
Analysis of the knowledge base shows extensive use of MF terms in the biological_processes field:
59 unique GO MF terms are currently used across disorder files, including:
| GO ID | Label | Example Disorders |
|---|---|---|
| GO:0004672 | protein kinase activity | ALK_Rearranged_NSCLC, BRAF_V600E_Mutant_NSCLC, RET_Rearranged_NSCLC |
| GO:0004725 | protein tyrosine phosphatase activity | Noonan_Syndrome |
| GO:0004713 | protein tyrosine kinase activity | Chronic_Myeloid_Leukemia, FLT3_Mutant_AML, Ph_Positive_ALL |
| GO:0005088 | Ras guanyl-nucleotide exchange factor activity | Noonan_Syndrome |
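| GO:0008543 | fibroblast growth factor receptor signaling pathway | Achondroplasia, Apert_Syndrome, Crouzon_Syndrome |

Note: GO:0008543 is a signaling pathway term in the BP namespace, so it is not itself an example of MF usage.
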
Example from Noonan_Syndrome.yaml (lines 43-47):
```yaml
biological_processes:
  - preferred_term: protein tyrosine phosphatase activity
    term:
      id: GO:0004725
      label: protein tyrosine phosphatase activity
```

This pattern is widespread and established across the knowledge base, particularly for:
- Kinase/phosphatase activities (signaling pathology)
- GEF/GAP activities (RAS pathway disorders)
- Receptor signaling activities (growth factor pathway disorders)
The Problem
Namespace ambiguity: The biological_processes field name suggests BP terms only, but:
- Current practice: MF terms are actively and appropriately used for enzyme activities
- Biological accuracy: Many pathophysiology mechanisms are best represented as aberrant molecular activities (kinase activity, phosphatase activity) rather than processes
- GO term stability: The maintainer notes some BP terms may be deprecated, favoring MF terms for molecular activities
- No validation: The schema doesn't enforce BP-only usage, so MF terms validate successfully
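
To make the validation gap concrete, here is a minimal sketch of a namespace check over `biological_processes` entries. The `GO_NAMESPACE` table and the `find_namespace_mismatches` helper are hypothetical names for illustration; a real check would presumably load namespaces from a GO release rather than hard-code them.

```python
# Hypothetical lookup table (assumption): a real check would read each term's
# namespace from the ontology instead of hard-coding it.
GO_NAMESPACE = {
    "GO:0006470": "biological_process",  # protein dephosphorylation
    "GO:0004722": "molecular_function",  # protein serine/threonine phosphatase activity
    "GO:0004725": "molecular_function",  # protein tyrosine phosphatase activity
}


def find_namespace_mismatches(descriptors, expected="biological_process"):
    """Return (term_id, namespace) pairs whose namespace differs from `expected`."""
    mismatches = []
    for descriptor in descriptors:
        term_id = (descriptor.get("term") or {}).get("id")
        if term_id is None:
            continue  # the term slot is optional, so there is nothing to check
        namespace = GO_NAMESPACE.get(term_id, "unknown")
        if namespace != expected:
            mismatches.append((term_id, namespace))
    return mismatches


# The Noonan_Syndrome entry shown earlier, as it would look after YAML parsing.
noonan = [
    {
        "preferred_term": "protein tyrosine phosphatase activity",
        "term": {"id": "GO:0004725", "label": "protein tyrosine phosphatase activity"},
    }
]
print(find_namespace_mismatches(noonan))
# [('GO:0004725', 'molecular_function')]
```

Terms missing from the table are reported as `unknown` rather than silently passing, which seems the safer default for curation review.
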
Example conflict in Houge-Janssens Syndrome:
- GO:0006470 (protein dephosphorylation) is a BP term describing a process
- GO:0004722 (protein serine/threonine phosphatase activity) is an MF term describing an enzymatic function
- The notes mention GO:0004722 but use GO:0006470 in the structured annotation
Options for Resolution
Option 1: Rename biological_processes → biological_terms or go_terms (Minimal disruption)
Description: Allow both BP and MF terms in a single field with a namespace-agnostic name.
Pros:
- ✅ Minimal schema changes (rename slot, update description)
- ✅ No disruption to existing 59 MF term annotations
- ✅ Aligns with current practice
- ✅ Simple for curators (one field for GO terms)
Cons:
- ❌ Loses semantic distinction between processes and functions
- ❌ May mix conceptually different levels ("MAPK cascade" vs "kinase activity")
- ❌ Doesn't guide curators on which term type to prefer
Implementation:
```yaml
# In Pathophysiology class
go_terms:  # or biological_terms
  multivalued: true
  range: BiologicalTermDescriptor  # rename class
  description: GO terms (biological processes or molecular functions) relevant to this pathophysiological mechanism
```

Option 2: Add separate molecular_functions field (Explicit separation)
Description: Keep biological_processes for BP terms, add new molecular_functions for MF terms.
Pros:
- ✅ Clear semantic distinction
- ✅ Curators can represent both levels explicitly
- ✅ Preserves existing biological_processes field
- ✅ Enables validation rules (BP-only in one field, MF-only in the other)
Cons:
- ❌ Requires migrating 59 existing MF term annotations to new field
- ❌ More complex schema (two fields for GO terms)
- ❌ Potential confusion about which field to use
Implementation:
```yaml
# In Pathophysiology class
biological_processes:
  multivalued: true
  range: BiologicalProcessDescriptor
  description: GO biological process terms (BP namespace)
molecular_functions:
  multivalued: true
  range: MolecularFunctionDescriptor
  description: GO molecular function terms (MF namespace)
```

Migration needed: Move MF-namespace terms from biological_processes to molecular_functions in ~20 disorder files. Terms should be selected by their GO namespace rather than by ID range: prefixes such as GO:0004xxx, GO:0005xxx, and GO:0008xxx are not exclusive to MF (GO:0008543, for example, is a BP term).
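
The Option 2 migration could be sketched roughly as follows. GO IDs do not encode their namespace, so this sketch selects terms through a lookup rather than by ID prefix. `GO_NAMESPACE` and `migrate_entry` are hypothetical names, and the entry is assumed to be a pathophysiology item already parsed from YAML into a dict.

```python
import copy

# Hypothetical lookup table (assumption): in practice the namespace would come
# from the ontology itself, not from a hard-coded dict.
GO_NAMESPACE = {
    "GO:0004725": "molecular_function",  # protein tyrosine phosphatase activity
    "GO:0004722": "molecular_function",  # protein serine/threonine phosphatase activity
    "GO:0008543": "biological_process",  # fibroblast growth factor receptor signaling pathway
    "GO:0006470": "biological_process",  # protein dephosphorylation
}


def migrate_entry(entry):
    """Return a copy of `entry` with MF descriptors moved to `molecular_functions`."""
    migrated = copy.deepcopy(entry)
    if "biological_processes" not in migrated:
        return migrated

    kept, moved = [], []
    for descriptor in migrated["biological_processes"] or []:
        term_id = (descriptor.get("term") or {}).get("id")
        if GO_NAMESPACE.get(term_id) == "molecular_function":
            moved.append(descriptor)
        else:
            # BP terms, unknown terms, and descriptors without a term stay
            # in place so a curator can review them.
            kept.append(descriptor)

    migrated["biological_processes"] = kept
    if moved:
        migrated["molecular_functions"] = (migrated.get("molecular_functions") or []) + moved
    return migrated
```

Returning a copy instead of mutating in place keeps the original entry available for a before/after diff, which may help when reviewing the migration across files.
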
Option 3: Use biological_processes for both but add term type annotation (Explicit but flexible)
Description: Keep the field name, but add a term_type slot to BiologicalProcessDescriptor indicating whether it's a BP, MF, or CC term.
Pros:
- ✅ Minimal migration (add optional term_type metadata)
- ✅ Preserves namespace information for downstream tools
- ✅ Enables validation without breaking existing data
- ✅ Flexible for curators
Cons:
- ❌ Field name biological_processes still misleading
- ❌ Requires tooling updates to respect term_type
- ❌ Adds complexity to descriptor class
Implementation:
```yaml
BiologicalProcessDescriptor:
  slots:
    - preferred_term
    - term
    - modifier
    - term_type  # NEW: enum with BP, MF, CC values
  slot_usage:
    term_type:
      range: GOTermTypeEnum
      description: The GO namespace (Biological Process, Molecular Function, or Cellular Component)
```

Option 4: Keep current practice but update documentation (Status quo)
Description: Accept that biological_processes can include MF terms. Update schema description and guidelines to clarify this is intentional.
Pros:
- ✅ Zero implementation work
- ✅ No data migration
- ✅ Aligns with established practice
Cons:
- ❌ Misleading field name persists
- ❌ No structured way to distinguish BP vs MF
- ❌ Doesn't address maintainer's preference for MF terms in certain contexts
Implementation:
Update description:
```yaml
biological_processes:
  description: >
    GO terms (biological processes, molecular functions, or cellular components)
    relevant to this pathophysiological mechanism. Prefer molecular function terms
    (e.g., GO:0004722 phosphatase activity) over biological process terms
    (e.g., GO:0006470 dephosphorylation) when representing enzyme activities or
    binding functions that may be subject to GO term deprecation.
```

Related Considerations
GO Term Stability
- Some BP terms for molecular activities (e.g., "protein dephosphorylation") may be deprecated in favor of MF terms
- MF terms for enzymatic activities (kinase, phosphatase, GEF) are more stable and semantically precise
- Current GO best practices favor MF terms for catalytic activities
ClinGen Evidence Mapping
The codebase includes infrastructure for mapping ClinGen functional evidence to GO terms (src/dismech/clingen/go_mapper.py):
```python
# ClinGen functional evidence categories mapped to GO terms:
# - Biochemical Function → GO:0003674 (molecular_function) and descendants
```

This suggests the project already recognizes MF terms as valid for representing biochemical functions.
Rendering and Display
- HTML rendering (src/dismech/render.py) links GO terms to external browsers (OLS, AmiGO)
- No current distinction in rendering between BP and MF terms
- Adding separate fields would enable differential rendering (e.g., "Disrupted Processes" vs "Aberrant Molecular Functions")
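
As a rough illustration of the differential rendering idea, the sketch below groups descriptors under the two headings suggested above. It assumes the Option 2 layout (separate `biological_processes` and `molecular_functions` fields) and emits plain text; `SECTION_TITLES` and `render_go_sections` are hypothetical names and do not reflect what src/dismech/render.py currently does.

```python
# Field-to-heading mapping (assumption: the Option 2 field names).
SECTION_TITLES = {
    "biological_processes": "Disrupted Processes",
    "molecular_functions": "Aberrant Molecular Functions",
}


def render_go_sections(pathophysiology):
    """Render GO descriptors as plain text, one section per populated field."""
    lines = []
    for field, title in SECTION_TITLES.items():
        descriptors = pathophysiology.get(field) or []
        if not descriptors:
            continue  # skip empty sections entirely
        lines.append(title)
        for descriptor in descriptors:
            term = descriptor.get("term") or {}
            label = term.get("label") or descriptor.get("preferred_term", "")
            term_id = term.get("id")
            lines.append(f"  - {label} ({term_id})" if term_id else f"  - {label}")
    return "\n".join(lines)
```

Under Option 3 the same grouping could instead key on `term_type`, at the cost of every renderer needing to know about that slot.
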
Recommendation Request
@cmungall @caufieldjh - Which option do you prefer? Considerations:
- How important is the semantic distinction between processes ("MAPK cascade") and molecular functions ("kinase activity") for downstream applications?
- Should the schema enforce namespace separation (BP-only, MF-only fields) or allow flexible mixing?
- What is the migration tolerance for updating existing entries (59 MF term uses currently in biological_processes)?
- Are there plans for GO term validation that would benefit from explicit namespace declaration?
I'm available to implement whichever approach is selected, including data migration scripts if needed.
References
- PR #347, "Add Houge-Janssens Syndrome" (triggering discussion)
- Schema file: src/dismech/schema/dismech.yaml:1884-1889
- Current MF usage: 59 unique MF terms across ~20 disorder files
- GO term mapping: src/dismech/clingen/go_mapper.py