# Query genes for more info

## Overview

This how-to focuses on looking up gene names in the NCBI databases.
Whilst not JUMP-specific, it is useful for fetching more information on
perturbations that our analysis deems important without having to
search for them manually. We will use [Biopython](https://biopython.org/).
This how-to only explores a subset of the options; the full Entrez
[documentation](https://www.ncbi.nlm.nih.gov/books/NBK25501/), which
covers all of them, is a useful reference to bookmark.

## Procedure

In [1]:
import polars as pl
from Bio import Entrez
from broad_babel.query import get_mapper

We define the fields that we need and an email to provide to the server
we will query.

In [2]:
Entrez.email = "example@email.com"
fields = (
    "Name",
    "Description",
    "Summary",
    "OtherDesignations",  # This gives us synonyms
)
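In the output shown at the end of this how-to, `OtherDesignations` comes back as a single string with synonyms separated by `|`. The sketch below shows one way to split such a string into a list. The helper name `split_designations` is hypothetical (it is not part of Biopython), and the example string is copied from the SCAPER row of that output.

```python
def split_designations(other_designations: str) -> list[str]:
    """Split a pipe-separated designation string into a list of names."""
    # Drop surrounding whitespace and skip empty pieces (e.g. from a trailing "|")
    return [name.strip() for name in other_designations.split("|") if name.strip()]


# Example value taken from the SCAPER row of the printed table
example = (
    "S phase cyclin A-associated protein in the endoplasmic reticulum"
    "|zinc finger protein 291"
)
synonyms = split_designations(example)
print(synonyms)
```

Keeping synonyms as a list makes it easier to match a gene against names used in other resources.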

As an example, we will use a set of genes that we found in a JUMP
cluster.

In [3]:
genes = ("CHRM4", "SCAPER", "GPR176", "LY6K")

Get a dictionary that maps gene symbols to Entrez (NCBI Gene) IDs.

In [4]:
ids = get_mapper(
    query=genes,
    input_column="standard_key",
    output_columns="standard_key,NCBI_Gene_ID",
)

# Fetch the summaries for these genes
entries = []
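for id_ in ids.values():
    stream = Entrez.esummary(db="gene", id=id_)
    record = Entrez.read(stream)
    stream.close()  # Release the connection once the response is parsed

    # Each response wraps a single gene summary; keep only the fields we need
    summary = record["DocumentSummarySet"]["DocumentSummary"][0]
    entries.append({k: summary[k] for k in fields})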
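The loop above assumes every requested field is present in every summary. As a minimal sketch of a more forgiving variant, the hypothetical helper `extract_fields` below fills missing fields with an empty string. It is exercised here on a hand-written stand-in for a parsed response, so the values are placeholders rather than real NCBI output, and it assumes the parsed record behaves like nested dictionaries and lists.

```python
from typing import Any, Mapping, Sequence


def extract_fields(record: Mapping[str, Any], fields: Sequence[str]) -> dict[str, str]:
    """Pull the requested fields from the first summary in a parsed record."""
    summaries = record["DocumentSummarySet"]["DocumentSummary"]
    if not summaries:
        # No summary returned: keep the row shape but leave every field empty
        return {k: "" for k in fields}
    summary = summaries[0]
    # Missing keys become empty strings instead of raising KeyError
    return {k: str(summary.get(k, "")) for k in fields}


# Hand-written stand-in for a parsed response (placeholder values)
mock_record = {
    "DocumentSummarySet": {
        "DocumentSummary": [
            {"Name": "GENE1", "Description": "placeholder description"}
        ]
    }
}

row = extract_fields(mock_record, ("Name", "Description", "Summary"))
print(row)
```

Because every row then has the same keys, the list of rows can still be passed to `pl.DataFrame` without gaps.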

Show the resulting information in a human-readable format.

In [6]:
with pl.Config(fmt_str_lengths=1000):
    print(pl.DataFrame(entries))

shape: (4, 4)
┌────────┬─────────────────────────────┬─────────────────────────────┬─────────────────────────────┐
│ Name   ┆ Description                 ┆ Summary                     ┆ OtherDesignations           │
│ ---    ┆ ---                         ┆ ---                         ┆ ---                         │
│ str    ┆ str                         ┆ str                         ┆ str                         │
╞════════╪═════════════════════════════╪═════════════════════════════╪═════════════════════════════╡
│ SCAPER ┆ S-phase cyclin A associated ┆ Predicted to enable nucleic ┆ S phase cyclin A-associated │
│        ┆ protein in the ER           ┆ acid binding activity and   ┆ protein in the endoplasmic  │
│        ┆                             ┆ zinc ion binding activity.  ┆ reticulum|zinc finger       │
│        ┆                             ┆ Acts upstream of or within  ┆ protein 291                 │
│        ┆                             ┆ retina development in       ┆       