Use of GO Molecular Function terms in biological_processes field - schema design options #361

@github-actions

Context

In PR #347 (Houge-Janssens Syndrome), the maintainer (@cmungall) requested replacing a GO Biological Process (BP) term with a GO Molecular Function (MF) term:

Line 195 - "better to use MF terms like dephosphatase as the BPs may be obsoleted soon"

The specific case was:

  • Current: GO:0006470 (protein dephosphorylation) - a BP term
  • Requested: GO:0004722 (protein serine/threonine phosphatase activity) - an MF term

This raises the broader question: How should the dismech schema represent molecular-level aberrations (enzyme activities, binding functions, etc.)?


Current State

Schema Structure

The BiologicalProcessDescriptor class (src/dismech/schema/dismech.yaml:1884-1889) is currently used for the biological_processes field in pathophysiology entries. It has minimal constraints:

BiologicalProcessDescriptor:
  is_a: Descriptor
  description: A descriptor for biological processes, bindable to Gene Ontology (GO)
  slot_usage:
    term:
      description: Optional GO biological process term reference

There is no explicit restriction to BP terms: the description says "biological processes", but nothing in the schema enforces the BP namespace.

Current Usage Patterns

Analysis of the knowledge base shows extensive use of MF terms in the biological_processes field:

59 unique GO MF terms are currently used across disorder files, including:

GO ID      | Label                                               | Example Disorders
GO:0004672 | protein kinase activity                             | ALK_Rearranged_NSCLC, BRAF_V600E_Mutant_NSCLC, RET_Rearranged_NSCLC
GO:0004725 | protein tyrosine phosphatase activity               | Noonan_Syndrome
GO:0004713 | protein tyrosine kinase activity                    | Chronic_Myeloid_Leukemia, FLT3_Mutant_AML, Ph_Positive_ALL
GO:0005088 | Ras guanyl-nucleotide exchange factor activity      | Noonan_Syndrome
GO:0008543 | fibroblast growth factor receptor signaling pathway | Achondroplasia, Apert_Syndrome, Crouzon_Syndrome

Note: GO:0008543 is a signaling pathway term, which GO places in the BP namespace rather than MF.

Example from Noonan_Syndrome.yaml (lines 43-47):

biological_processes:
- preferred_term: protein tyrosine phosphatase activity
  term:
    id: GO:0004725
    label: protein tyrosine phosphatase activity

This pattern is widespread and established across the knowledge base, particularly for:

  • Kinase/phosphatase activities (signaling pathology)
  • GEF/GAP activities (RAS pathway disorders)
  • Receptor signaling activities (growth factor pathway disorders)
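
A usage survey like the one above could be reproduced with a short script. The following is a minimal sketch, assuming each disorder file has already been parsed into a Python dict (for example with a YAML loader) and that biological_processes entries sit under a pathophysiology list, as the Noonan_Syndrome.yaml excerpt suggests. The function name collect_go_ids and the exact nesting are assumptions for illustration, not the project's actual tooling.

```python
from collections import defaultdict


def collect_go_ids(disorders):
    """Map each GO ID found in biological_processes to the disorders using it.

    `disorders` maps a disorder name to its parsed YAML content (a dict).
    The nesting pathophysiology -> biological_processes -> term -> id is an
    assumption based on the excerpt in this issue.
    """
    usage = defaultdict(set)
    for name, doc in disorders.items():
        for mechanism in doc.get("pathophysiology", []):
            for entry in mechanism.get("biological_processes", []):
                term = entry.get("term") or {}  # the term slot is optional
                go_id = term.get("id")
                if go_id:
                    usage[go_id].add(name)
    return dict(usage)


# Toy input mirroring the Noonan_Syndrome.yaml excerpt above.
example = {
    "Noonan_Syndrome": {
        "pathophysiology": [
            {
                "biological_processes": [
                    {
                        "preferred_term": "protein tyrosine phosphatase activity",
                        "term": {
                            "id": "GO:0004725",
                            "label": "protein tyrosine phosphatase activity",
                        },
                    }
                ]
            }
        ]
    }
}

usage = collect_go_ids(example)
```

The result maps each GO ID to the set of disorders that use it, so counting unique terms is just `len(usage)`. Deciding which of those IDs are MF terms still requires namespace information from the ontology.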

The Problem

Namespace ambiguity: The biological_processes field name suggests BP terms only, but:

  1. Current practice: MF terms are actively and appropriately used for enzyme activities
  2. Biological accuracy: Many pathophysiology mechanisms are best represented as aberrant molecular activities (kinase activity, phosphatase activity) rather than processes
  3. GO term stability: The maintainer notes some BP terms may be deprecated, favoring MF terms for molecular activities
  4. No validation: The schema doesn't enforce BP-only usage, so MF terms validate successfully

Example conflict in Houge-Janssens Syndrome:

  • GO:0006470 (protein dephosphorylation) is a BP term describing a process
  • GO:0004722 (protein serine/threonine phosphatase activity) is an MF term describing an enzymatic function
  • The notes mention GO:0004722, but the structured annotation uses GO:0006470
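
To illustrate what the missing validation could look like, here is a minimal sketch of a namespace check. The lookup table GO_NAMESPACE is hypothetical and hand-built from terms named in this issue; a real check would read namespaces from the GO ontology. The function name non_bp_entries is likewise an assumption.

```python
# Hypothetical, hand-built lookup for terms named in this issue.
# A real check would read namespaces from the GO ontology itself.
GO_NAMESPACE = {
    "GO:0006470": "biological_process",  # protein dephosphorylation
    "GO:0004722": "molecular_function",  # protein serine/threonine phosphatase activity
    "GO:0004725": "molecular_function",  # protein tyrosine phosphatase activity
    "GO:0008543": "biological_process",  # fibroblast growth factor receptor signaling pathway
}


def non_bp_entries(entries, namespaces=GO_NAMESPACE):
    """Return (go_id, namespace) pairs for entries whose term is not a BP term.

    Terms missing from the lookup are reported with namespace None so that
    they are surfaced for review rather than silently accepted.
    """
    flagged = []
    for entry in entries:
        go_id = (entry.get("term") or {}).get("id")
        if go_id is None:
            continue  # the term slot is optional in the schema
        namespace = namespaces.get(go_id)
        if namespace != "biological_process":
            flagged.append((go_id, namespace))
    return flagged
```

Run against the Houge-Janssens case, such a check would accept GO:0006470 and flag GO:0004722, which is the opposite of what the maintainer asked for. That mismatch is the core of the naming problem described above.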

Options for Resolution

Option 1: Rename biological_processes to biological_terms or go_terms (Minimal disruption)

Description: Allow both BP and MF terms in a single field with a namespace-agnostic name.

Pros:

  • ✅ Minimal schema changes (rename slot, update description)
  • ✅ No disruption to existing annotations that use the 59 MF terms
  • ✅ Aligns with current practice
  • ✅ Simple for curators (one field for GO terms)

Cons:

  • ❌ Loses semantic distinction between processes and functions
  • ❌ May mix conceptually different levels ("MAPK cascade" vs "kinase activity")
  • ❌ Doesn't guide curators on which term type to prefer

Implementation:

# In Pathophysiology class
go_terms:  # or biological_terms
  multivalued: true
  range: BiologicalTermDescriptor  # rename class
  description: GO terms (biological processes or molecular functions) relevant to this pathophysiological mechanism

Option 2: Add separate molecular_functions field (Explicit separation)

Description: Keep biological_processes for BP terms, add new molecular_functions for MF terms.

Pros:

  • ✅ Clear semantic distinction
  • ✅ Curators can represent both levels explicitly
  • ✅ Preserves existing biological_processes field
  • ✅ Enables validation rules (BP-only in one field, MF-only in the other)

Cons:

  • ❌ Requires migrating existing annotations for the 59 MF terms to the new field
  • ❌ More complex schema (two fields for GO terms)
  • ❌ Potential confusion about which field to use

Implementation:

# In Pathophysiology class
biological_processes:
  multivalued: true
  range: BiologicalProcessDescriptor
  description: GO biological process terms (BP namespace)
  
molecular_functions:
  multivalued: true
  range: MolecularFunctionDescriptor
  description: GO molecular function terms (MF namespace)

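Migration needed: Move MF-namespace terms (e.g., GO:0004672, GO:0004725, GO:0005088) from biological_processes to molecular_functions in ~20 disorder files. The namespace must be looked up in the ontology, because GO ID numbers do not encode it (GO:0008543, for example, is a BP term).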
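
The migration step could be sketched as follows, assuming each pathophysiology entry is available as a parsed dict and that a GO ID to namespace mapping has been built from the ontology. The function name split_by_namespace and the `namespaces` argument are hypothetical.

```python
def split_by_namespace(mechanism, namespaces):
    """Return a copy of `mechanism` with MF entries moved to molecular_functions.

    `namespaces` maps GO IDs to namespace names. Entries whose namespace is
    unknown stay in biological_processes so nothing is misfiled without review.
    The input dict is not modified.
    """
    kept, moved = [], []
    for entry in mechanism.get("biological_processes", []):
        go_id = (entry.get("term") or {}).get("id")
        if namespaces.get(go_id) == "molecular_function":
            moved.append(entry)
        else:
            kept.append(entry)

    # Copy every other key unchanged, then write back the two GO fields.
    migrated = {k: v for k, v in mechanism.items() if k != "biological_processes"}
    if kept:
        migrated["biological_processes"] = kept
    if moved:
        existing = list(mechanism.get("molecular_functions", []))
        migrated["molecular_functions"] = existing + moved
    return migrated
```

Leaving unresolved terms in place is a deliberate choice in this sketch: it keeps the migration lossless and lets a curator handle obsolete or unrecognized IDs by hand.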


Option 3: Use biological_processes for both but add term type annotation (Explicit but flexible)

Description: Keep the field name, but add a term_type slot to BiologicalProcessDescriptor indicating whether it's a BP, MF, or CC term.

Pros:

  • ✅ Minimal migration (add optional term_type metadata)
  • ✅ Preserves namespace information for downstream tools
  • ✅ Enables validation without breaking existing data
  • ✅ Flexible for curators

Cons:

  • ❌ Field name biological_processes still misleading
  • ❌ Requires tooling updates to respect term_type
  • ❌ Adds complexity to descriptor class

Implementation:

BiologicalProcessDescriptor:
  slots:
    - preferred_term
    - term
    - modifier
    - term_type  # NEW: enum with BP, MF, CC values
    
  slot_usage:
    term_type:
      range: GOTermTypeEnum
      description: The GO namespace (Biological Process, Molecular Function, or Cellular Component)
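
For Option 3, the term_type values could be backfilled automatically rather than entered by hand. The sketch below is one way to do that, assuming a GO ID to namespace mapping is available. GOTermType is a hypothetical Python mirror of the proposed GOTermTypeEnum, and with_term_type is an illustrative helper, not existing project code.

```python
from enum import Enum


class GOTermType(Enum):
    """Hypothetical Python mirror of the proposed GOTermTypeEnum."""

    BP = "biological_process"
    MF = "molecular_function"
    CC = "cellular_component"


def with_term_type(entry, namespaces):
    """Return a copy of `entry` with term_type set when the namespace is known.

    `namespaces` maps GO IDs to namespace names. If the term is absent or
    not in the mapping, the optional term_type slot is left unset.
    """
    annotated = dict(entry)
    go_id = (entry.get("term") or {}).get("id")
    namespace = namespaces.get(go_id)
    if namespace is not None:
        # Enum lookup by value, e.g. "molecular_function" -> GOTermType.MF
        annotated["term_type"] = GOTermType(namespace).name
    return annotated
```

Because term_type is derived from the term ID, it could also be checked for consistency at validation time instead of being trusted as entered.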

Option 4: Keep current practice but update documentation (Status quo)

Description: Accept that biological_processes can include MF terms. Update schema description and guidelines to clarify this is intentional.

Pros:

  • ✅ Zero implementation work
  • ✅ No data migration
  • ✅ Aligns with established practice

Cons:

  • ❌ Misleading field name persists
  • ❌ No structured way to distinguish BP vs MF
  • ❌ Doesn't address maintainer's preference for MF terms in certain contexts

Implementation: update the slot description:

biological_processes:
  description: >
    GO terms (biological processes, molecular functions, or cellular components)
    relevant to this pathophysiological mechanism. When representing enzyme
    activities or binding functions, prefer molecular function terms
    (e.g., GO:0004722 protein serine/threonine phosphatase activity) over
    biological process terms (e.g., GO:0006470 protein dephosphorylation),
    because such BP terms may be subject to GO term deprecation.

Related Considerations

GO Term Stability

  • Some BP terms for molecular activities (e.g., "protein dephosphorylation") may be deprecated in favor of MF terms
  • MF terms for enzymatic activities (kinase, phosphatase, GEF) are more stable and semantically precise
  • Current GO best practices favor MF terms for catalytic activities

ClinGen Evidence Mapping

The codebase includes infrastructure for mapping ClinGen functional evidence to GO terms (src/dismech/clingen/go_mapper.py):

# ClinGen functional evidence categories mapped to GO terms:
# - Biochemical Function → GO:0003674 (molecular_function) and descendants

This suggests the project already recognizes MF terms as valid for representing biochemical functions.

Rendering and Display

  • HTML rendering (src/dismech/render.py) links GO terms to external browsers (OLS, AmiGO)
  • No current distinction in rendering between BP and MF terms
  • Adding separate fields would enable differential rendering (e.g., "Disrupted Processes" vs "Aberrant Molecular Functions")
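
Differential rendering with separate fields (as in Option 2) might look like the sketch below. It is not based on the actual contents of src/dismech/render.py; the function display_sections, the SECTION_TITLES table, and the molecular_functions field are assumptions taken from the options discussed in this issue.

```python
# Field name -> section heading, using the headings suggested above.
SECTION_TITLES = (
    ("biological_processes", "Disrupted Processes"),
    ("molecular_functions", "Aberrant Molecular Functions"),
)


def display_sections(mechanism):
    """Return [(heading, ["label (GO ID)", ...]), ...], skipping empty fields."""
    sections = []
    for field, heading in SECTION_TITLES:
        items = []
        for entry in mechanism.get(field, []):
            term = entry.get("term") or {}
            # Fall back to the free-text preferred_term when no label is given.
            label = term.get("label") or entry.get("preferred_term", "")
            go_id = term.get("id")
            items.append(f"{label} ({go_id})" if go_id else label)
        if items:
            sections.append((heading, items))
    return sections
```

The function returns plain data rather than HTML, so it could sit in front of whatever templating the renderer already uses.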

Recommendation Request

@cmungall @caufieldjh - Which option do you prefer? Considerations:

  1. How important is the semantic distinction between processes ("MAPK cascade") and molecular functions ("kinase activity") for downstream applications?
  2. Should the schema enforce namespace separation (BP-only, MF-only fields) or allow flexible mixing?
  3. What is the migration tolerance for updating existing entries (59 unique MF terms are currently used in biological_processes)?
  4. Are there plans for GO term validation that would benefit from explicit namespace declaration?

I'm available to implement whichever approach is selected, including data migration scripts if needed.


References

  • PR #347: Add Houge-Janssens Syndrome (triggering discussion)
  • Schema file: src/dismech/schema/dismech.yaml:1884-1889
  • Current MF usage: 59 unique MF terms across ~20 disorder files
  • GO term mapping: src/dismech/clingen/go_mapper.py

Labels: enhancement, question, schema
