
feat: add reactome_searcher #176

Open
ChenZiHong-Gavin wants to merge 1 commit into main from feat/reactome_searcher

Conversation

@ChenZiHong-Gavin
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ChenZiHong-Gavin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a dedicated Reactome searcher for querying the Reactome database for biological pathways, reactions, and entities. It also supports gene list enrichment analysis and returns results in a standardized output format, which makes Reactome data easier to integrate into broader analytical workflows. With this integration, Reactome can be used as a data source in existing pipeline configurations.

Highlights

  • New Reactome Searcher: Introduced a new ReactomeSearcher class to interact with the Reactome database, enabling comprehensive querying of biological pathways and reactions.
  • Unified Search Interface: The ReactomeSearcher provides a unified search method that intelligently handles different query types, including Reactome stable IDs, keywords, and gene lists for enrichment analysis.
  • Gene Overrepresentation Analysis: Added functionality to perform overrepresentation analysis on gene lists, mapping genes to Reactome pathways and providing statistical enrichment results.
  • Integration and Examples: Integrated the new searcher into the existing search service and provided new example input and configuration files (search_reactome_demo.jsonl and search_reactome_config.yaml) to demonstrate its usage.
  • Robust API Interaction: Implemented robust API interaction with retry logic for GET and POST requests, handling network issues and timeouts gracefully.

Changelog

  • examples/input_examples/search_reactome_demo.jsonl
    • Added a new example input file to demonstrate various Reactome query types, including Reactome IDs, keywords, gene symbols, and gene lists.
  • examples/search/search_reactome/search_reactome_config.yaml
    • Added a new configuration file for setting up a Reactome search pipeline, defining read_step and search_step with reactome as the data source and configurable reactome_params.
  • graphgen/models/__init__.py
    • Modified to import ReactomeSearcher and add it to the _module_lookup dictionary, making it accessible within the graphgen.models module.
  • graphgen/models/searcher/db/reactome_searcher.py
    • Added a new Python module defining the ReactomeSearcher class, which encapsulates all logic for interacting with the Reactome Content and Analysis Services, including methods for ID lookup, keyword search, gene analysis, and result normalization.
  • graphgen/operators/search/search_service.py
    • Modified the _init_searcher method to dynamically instantiate the ReactomeSearcher when the data_source is specified as reactome in the search configuration.

Activity

  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a ReactomeSearcher for querying the Reactome database, which is a valuable addition. The implementation is comprehensive, covering various features of the Reactome API. However, I've found issues in graphgen/models/searcher/db/reactome_searcher.py, including a logic bug that could lead to a crash and unparenthesized multi-line conditional expressions that are easy to misread. I've also pointed out some areas where the code can be made more robust and maintainable. Please review the detailed comments.

Comment on lines +217 to +219
"description": data.get("summation", [{}])[0].get("text", "")
    if isinstance(data.get("summation"), list)
    else "",

critical

This code has two issues:

  1. Readability, not a SyntaxError: a conditional expression used as a dictionary value may span multiple lines without parentheses, because line breaks inside the braces of a dict literal are joined implicitly. The code parses as written, but wrapping the expression in parentheses makes its extent easier to see.
  2. IndexError: there's a logic bug. If data.get("summation") returns an empty list [], the condition isinstance([], list) is true, but data.get("summation", [{}]) also returns [] (the default applies only when the key is missing), so accessing [0] raises an IndexError. This will crash the application.

I've suggested a fix that addresses both points using a more robust approach.

            "description": (s[0].get("text", "") if (s := data.get("summation")) and isinstance(s, list) and s else ""),

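To make the empty-list case concrete, here is a minimal sketch. `first_summation_text` is a hypothetical helper, not a function from this PR, and `original_expression` only reproduces the reviewed expression so the failure can be observed. The assumption that summation entries are dicts with a "text" key is taken from the excerpt above.

```python
def original_expression(data: dict) -> str:
    """The expression from the PR, reproduced to show the failure mode."""
    return (
        data.get("summation", [{}])[0].get("text", "")
        if isinstance(data.get("summation"), list)
        else ""
    )


def first_summation_text(data: dict) -> str:
    """Return the text of the first summation entry, or "" if unavailable."""
    summation = data.get("summation")
    # Guard against a missing key, a non-list value, and an empty list.
    if not isinstance(summation, list) or not summation:
        return ""
    first = summation[0]
    # Entries are assumed to be dicts; anything else yields "".
    if not isinstance(first, dict):
        return ""
    # Fall back to "" when "text" is absent or None.
    return first.get("text", "") or ""
```

Calling `original_expression({"summation": []})` raises an IndexError, while `first_summation_text` returns an empty string for the same input.
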
Comment on lines +260 to +264
"reference_entities": [
    ref.get("dbId") for ref in data.get("referenceEntity", [])
]
if isinstance(data.get("referenceEntity"), list)
else [],

critical

This block parses as written, since line breaks inside a dict literal's braces are joined implicitly, so it is not a syntax error. The multi-line conditional is still hard to read without parentheses, and data.get("referenceEntity") is looked up twice, which impacts maintainability. I'm suggesting a more concise version that parenthesizes the expression and looks the value up once.

                    "reference_entities": ([ref.get("dbId") for ref in s] if isinstance(s := data.get("referenceEntity"), list) else []),

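As a quick check, the excerpt can be run through `ast.parse`, which raises SyntaxError on invalid source. The sketch below wraps the excerpt in a dict literal (the surrounding `result = {...}` is an assumption about its context) and adds a hypothetical helper, `reference_entity_ids`, that expresses the same logic with a single lookup. The extra `isinstance(ref, dict)` filter is an added safeguard, not part of the original code.

```python
import ast

# The reviewed lines, placed inside a dict literal as they would appear in context.
SNIPPET = '''
result = {
    "reference_entities": [
        ref.get("dbId") for ref in data.get("referenceEntity", [])
    ]
    if isinstance(data.get("referenceEntity"), list)
    else [],
}
'''

# Succeeds: newlines inside the braces are joined implicitly by the tokenizer.
TREE = ast.parse(SNIPPET)


def reference_entity_ids(data: dict) -> list:
    """Collect dbId values when referenceEntity is a list, else return []."""
    refs = data.get("referenceEntity")
    if not isinstance(refs, list):
        return []
    # Skip non-dict entries instead of failing on .get().
    return [ref.get("dbId") for ref in refs if isinstance(ref, dict)]
```

Pulling the logic into a small function keeps the dict literal short and makes each case (missing key, non-list value, list) easy to test on its own.
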
Comment on lines +47 to +51
def __init__(
    self,
    species: str = "Homo sapiens",
    timeout: int = 30,
    max_retries: int = 3,

medium

The max_retries parameter is defined in the __init__ method but it is not used by any of the tenacity.retry decorators in this class. Instead, the number of retries is hardcoded (e.g., stop_after_attempt(3) in _get and _post, and stop_after_attempt(5) in search). This is misleading and can lead to unexpected behavior. Please either use the max_retries parameter to configure the retry logic or remove it to avoid confusion.
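
One way to make the constructor argument take effect is to build the retry behaviour per instance rather than in a class-level decorator. The sketch below uses a plain loop instead of tenacity so that it stays dependency-free; `RetryingClient`, `call_with_retries`, and `backoff` are hypothetical names for illustration and do not come from the PR.

```python
import time


class RetryingClient:
    """Hypothetical stand-in for the searcher; only the retry wiring is shown."""

    def __init__(self, timeout: int = 30, max_retries: int = 3, backoff: float = 0.0):
        self.timeout = timeout
        self.max_retries = max_retries
        self.backoff = backoff  # seconds to wait between attempts (assumed knob)

    def call_with_retries(self, func, *args, **kwargs):
        """Call func up to self.max_retries times, re-raising the last error."""
        attempts = max(1, self.max_retries)  # always try at least once
        last_error = None
        for attempt in range(1, attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                if attempt < attempts and self.backoff:
                    time.sleep(self.backoff)
        raise last_error
```

With tenacity, the same idea would presumably mean constructing the stop condition from `self.max_retries` at call time, for example through its `Retrying` object, since a decorator on the method is evaluated before any instance exists.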

Comment on lines +598 to +605
elif "\n" in query or "," in query:
    # Parse gene list
    genes = [g.strip() for g in re.split(r"[\n,]", query) if g.strip()]
    if len(genes) > 1 or (len(genes) == 1 and len(genes[0]) < 20):
        # Likely a gene list
        result = self.analyze_genes(
            genes, projection=projection, include_disease=include_disease
        )

medium

The heuristic used to distinguish a gene list from a keyword query is fragile. Splitting by comma or newline can misinterpret queries that are natural language sentences containing commas. For example, a query like "pathways related to TP53, a tumor suppressor" would be incorrectly parsed as a gene list ['pathways related to TP53', 'a tumor suppressor'] (each part is stripped after the split), leading to an incorrect analysis. Consider making this logic more robust, for instance by checking if all split parts conform to a typical gene symbol format before deciding to treat it as a gene list.
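
A minimal sketch of that check follows. The regular expression is an assumed approximation of a gene symbol (a letter followed by letters, digits, or hyphens, at most 15 characters, no spaces), and `parse_gene_list` is a hypothetical helper rather than code from the PR.

```python
import re

# Assumed pattern for a "typical" gene symbol; adjust to the naming scheme in use.
GENE_SYMBOL_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]{0,14}$")


def parse_gene_list(query: str):
    """Return a list of gene symbols if every part looks like one, else None."""
    parts = [p.strip() for p in re.split(r"[\n,]", query) if p.strip()]
    # Treat the query as a gene list only when all parts match the pattern.
    if parts and all(GENE_SYMBOL_RE.match(p) for p in parts):
        return parts
    return None
```

A `None` result signals that the caller should fall back to keyword search. The pattern still accepts single ordinary words such as "apoptosis", so a stricter check (for example, a lookup against known symbols) may be needed for one-element inputs.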
