
Note that much more extensive documentation is available in Querying Ensembl.

Connecting

Ensembl provides direct access to its MySQL databases; alternatively, users can download those databases and run them on a local machine. Nothing special needs to be done to query Ensembl’s UK servers, as this is the default setting for PyCogent’s ensembl module. To use a different Ensembl installation, create a HostAccount instance:

>>> from cogent.db.ensembl import HostAccount
>>> account = HostAccount('fastcomputer.topuni.edu', 'username',
...                       'canthackthis')

To connect to MySQL on a specific port:

>>> from cogent.db.ensembl import HostAccount
>>> account = HostAccount('fastcomputer.topuni.edu', 'dude',
...                       'ucanthackthis', port=3306)
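Hard-coding a password in a script is rarely desirable. The following is a minimal standalone sketch of reading the connection settings from environment variables instead. The variable names (ENSEMBL_HOST, ENSEMBL_USER, ENSEMBL_PASSWORD, ENSEMBL_PORT) and the helper function are hypothetical, invented for this example; they are not part of PyCogent.

```python
import os


def account_settings_from_env(environ=None):
    """Return (host, user, password, port) read from environment variables.

    The variable names are hypothetical, chosen for this example only.
    """
    env = os.environ if environ is None else environ
    host = env.get('ENSEMBL_HOST', 'fastcomputer.topuni.edu')
    user = env.get('ENSEMBL_USER', 'username')
    password = env.get('ENSEMBL_PASSWORD', '')
    port = int(env.get('ENSEMBL_PORT', 3306))  # 3306 is the usual MySQL port
    return host, user, password, port


# A plain dict stands in for os.environ so the example is reproducible.
host, user, password, port = account_settings_from_env({
    'ENSEMBL_HOST': 'fastcomputer.topuni.edu',
    'ENSEMBL_USER': 'dude',
    'ENSEMBL_PASSWORD': 'ucanthackthis',
    'ENSEMBL_PORT': '3306',
})

# With PyCogent installed, the values would be passed on as in the
# examples above:
# account = HostAccount(host, user, password, port=port)
```

Calling the helper without an argument reads the real process environment.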

Species to be queried

To see which species are available:

>>> from cogent.db.ensembl import Species
>>> print Species
================================================================================
       Common Name                   Species Name              Ensembl Db Prefix
--------------------------------------------------------------------------------
         A.aegypti                  Aedes aegypti                  aedes_aegypti
        A.clavatus           Aspergillus clavatus           aspergillus_clavatus...

If Ensembl has added a new species which is not yet included in Species, you can add it yourself.

>>> Species.amendSpecies('A latinname', 'a common name')

You can get the common name for a species

>>> Species.getCommonName('Procavia capensis')
'Rock hyrax'

and the Ensembl database name prefix which will be used for all databases for this species.

>>> Species.getEnsemblDbPrefix('Procavia capensis')
'procavia_capensis'

Species common names are used to construct attributes on PyCogent Compara instances. You can get the name that will be used with the getComparaName method. For species with a real common name

>>> Species.getComparaName('Procavia capensis')
'RockHyrax'

or with a shortened species name

>>> Species.getComparaName('Caenorhabditis remanei')
'Cremanei'
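The two results above suggest a simple naming rule: a real common name is title-cased with its spaces removed, while an abbreviated name just loses its dot. The function below is a standalone sketch of that rule, inferred from these two examples only; it is not PyCogent's implementation, and 'C.remanei' is assumed to be the common name recorded for Caenorhabditis remanei, by analogy with the 'A.aegypti' entry in the species table.

```python
def compara_style_name(common_name):
    """Approximate the attribute-name convention shown above.

    A guess at the rule that fits the two documented examples,
    not PyCogent's own code.
    """
    if '.' in common_name:
        # abbreviated names such as 'C.remanei' just lose the dot
        return ''.join(common_name.replace('.', '').split())
    # real common names are capitalised word by word, spaces removed
    return ''.join(word.capitalize() for word in common_name.split())


rock_hyrax = compara_style_name('Rock hyrax')   # 'RockHyrax'
c_remanei = compara_style_name('C.remanei')     # 'Cremanei'
```

These are the names that appear as attributes on a Compara instance, for example compara.Human in the alignment examples further down.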

Get genomic features

Find a gene by gene symbol

We query for the BRCA2 gene for humans.

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> print human
Genome(Species='Homo sapiens'; Release='62')
>>> genes = human.getGenesMatching(Symbol='BRCA2')
>>> for gene in genes:
...     if gene.Symbol == 'BRCA2':
...         print gene
...         break
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer 2,...'; StableId='ENSG00000139618'; Status='KNOWN'; Symbol='BRCA2')
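The loop above checks gene.Symbol explicitly, presumably because the query can return more than one gene. That pattern can be wrapped in a small helper. The sketch below is standalone: FakeGene is a hypothetical stand-in that models only the Symbol and StableId attributes, and the first candidate record is made up.

```python
from collections import namedtuple

# Stand-in for the Gene objects returned by getGenesMatching; only the
# two attributes used here are modelled.
FakeGene = namedtuple('FakeGene', ['Symbol', 'StableId'])


def first_with_symbol(genes, symbol):
    """Return the first gene whose Symbol equals symbol, or None."""
    for gene in genes:
        if gene.Symbol == symbol:
            return gene
    return None


candidates = [
    FakeGene('NOT-BRCA2', 'hypothetical-id'),   # made-up record
    FakeGene('BRCA2', 'ENSG00000139618'),
]
# iter() mimics a result that can only be consumed once
match = first_with_symbol(iter(candidates), 'BRCA2')
```

Returning None when nothing matches lets the caller distinguish "no such gene" from an empty loop that silently did nothing.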

Find a gene by Ensembl Stable ID

We use the stable ID for BRCA2.

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> gene = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print gene
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer 2,...'; StableId='ENSG00000139618'; Status='KNOWN'; Symbol='BRCA2')

Find genes matching a description

We look for breast cancer related genes that are estrogen induced.

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> genes = human.getGenesMatching(Description='breast cancer estrogen')
>>> for gene in genes:
...     print gene
Gene(Species='Homo sapiens'; BioType='processed_transcript'; Description='breast cancer estrogen-induced...'; StableId='ENSG00000181097'; Status='NOVEL'; Symbol='RP11-429J17.2')

We can also require that an exact (case insensitive) match to the word(s) occurs within the description by setting like=False.

>>> genes = human.getGenesMatching(Description='breast cancer estrogen',
...                                  like=False)
>>> for gene in genes:
...     print gene
Gene(Species='Homo sapiens'; BioType='processed_transcript'; Description='breast cancer estrogen-induced...'; StableId='ENSG00000181097'; Status='NOVEL'; Symbol='RP11-429J17.2')
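The difference between the two modes only shows up when the query text occurs inside a longer word. The standalone sketch below illustrates the distinction with Python's re module. It is an analogy, not the query PyCogent sends to MySQL, and both descriptions are made up for the purpose.

```python
import re


def loose_match(description, query):
    """Case-insensitive substring test, in the spirit of a wildcard search."""
    return query.lower() in description.lower()


def whole_word_match(description, query):
    """Case-insensitive test that query occurs as complete word(s)."""
    pattern = r'\b' + re.escape(query) + r'\b'
    return re.search(pattern, description, flags=re.IGNORECASE) is not None


# made-up descriptions, for illustration only
desc_a = 'Breast cancer estrogen-induced example gene'
desc_b = 'breast cancerous tissue example marker'

both_modes_hit = (loose_match(desc_a, 'breast cancer estrogen') and
                  whole_word_match(desc_a, 'breast cancer estrogen'))
loose_only = (loose_match(desc_b, 'breast cancer') and
              not whole_word_match(desc_b, 'breast cancer'))
```

Here desc_b matches the loose search because 'cancer' is a prefix of 'cancerous', but fails the whole-word search.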

Get canonical transcript for a gene

We get the canonical transcript for BRCA2.

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> print transcript
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889610; End=32973347; length=83737; Strand='+')

Get the CDS for a transcript

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> cds = transcript.Cds
>>> print type(cds)
<class 'cogent.core.sequence.DnaSequence'>
>>> print cds
ATGCCTATTGGATCCAAAGAGAGGCCA...
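As a quick sanity check, the start of the CDS can be translated with the standard genetic code. The sketch below is standalone and does not use PyCogent; it builds the standard codon table and translates only the 27 nucleotides printed above.

```python
from itertools import product

BASES = 'TCAG'
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ('FFLLSSSSYY**CC*W'
               'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR'
               'VVVVAAAADDEEGGGG')
CODON_TABLE = dict(zip((''.join(c) for c in product(BASES, repeat=3)),
                       AMINO_ACIDS))


def translate(dna):
    """Translate complete codons of an upper-case DNA string; '*' is stop."""
    usable = len(dna) - len(dna) % 3   # ignore a trailing partial codon
    return ''.join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


cds_start = 'ATGCCTATTGGATCCAAAGAGAGGCCA'   # the prefix printed above
peptide = translate(cds_start)              # 'MPIGSKERP'
```

The sequence begins with the ATG start codon, as expected for a coding sequence.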

Look at all transcripts for a gene

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> for transcript in brca2.Transcripts:
...     print transcript
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889610; End=32973347; length=83737; Strand='+')
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889641; End=32907428; length=17787; Strand='+')...

Get the first exon for a transcript

We show this just for the canonical transcript.

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print brca2.CanonicalTranscript.Exons[0]
Exon(StableId=ENSE00001184784, Rank=1)

Get the introns for a transcript

We show this just for the canonical transcript.

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> for intron in brca2.CanonicalTranscript.Introns:
...     print intron
Intron(TranscriptId=ENST00000380152, Rank=1)
Intron(TranscriptId=ENST00000380152, Rank=2)
Intron(TranscriptId=ENST00000380152, Rank=3)...
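Introns are the gaps between consecutive exons, so a transcript with n exons has at most n - 1 introns. The standalone sketch below derives intron spans from exon spans. The coordinates are hypothetical, and the (start, end) convention with an exclusive end is an assumption of this example, not a statement about PyCogent's objects.

```python
def intron_spans(exon_spans):
    """Return the gaps between consecutive exons as (start, end) tuples.

    exon_spans holds (start, end) pairs on the + strand, end exclusive.
    """
    ordered = sorted(exon_spans)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
            if next_start > prev_end]   # skip exons that touch or overlap


# hypothetical exon coordinates, for illustration only
exons = [(100, 200), (350, 420), (900, 1000)]
introns = intron_spans(exons)   # [(200, 350), (420, 900)]
```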

Inspect the genomic coordinate for a feature

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print brca2.Location.CoordName
13
>>> print brca2.Location.Start
32889610
>>> print brca2.Location.Strand
1
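The lengths printed in the outputs on this page are consistent with length = End - Start; for example, the canonical transcript shown earlier spans 32889610 to 32973347 and reports a length of 83737. The standalone sketch below checks that arithmetic. It also maps the numeric strand to the '+' and '-' symbols used in the printed features; the value -1 for the minus strand is an assumption, since only 1 appears in the output above.

```python
def span_length(start, end):
    """Length implied by the coordinates printed on this page: End - Start."""
    if end < start:
        raise ValueError('end must not be smaller than start')
    return end - start


def strand_symbol(strand):
    """Map a numeric strand to '+' or '-' (-1 for minus is assumed)."""
    return {1: '+', -1: '-'}[strand]


canonical_length = span_length(32889610, 32973347)   # 83737
second_length = span_length(32889641, 32907428)      # 17787
brca2_strand = strand_symbol(1)                       # '+'
```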

Get repeat elements in a genomic interval

We query the genome for repeats within a specific coordinate range on chromosome 13.

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> repeats = human.getFeatures(CoordName='13', Start=32879610, End=32889610, feature_types='repeat')
>>> for repeat in repeats:
...     print repeat.RepeatClass
...     print repeat
...     break
SINE/Alu
Repeat(CoordName='13'; Start=32879362; End=32879662; length=300; Strand='-', Score=2479.0)
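Note that the repeat shown starts at 32879362, before the query start of 32879610. This suggests that features overlapping the interval are reported, not only those fully contained in it. The standalone sketch below spells out the two tests using the coordinates above; the function names are invented for this example.

```python
def overlaps(feature_start, feature_end, query_start, query_end):
    """True if the two spans share at least one position (ends exclusive)."""
    return feature_start < query_end and query_start < feature_end


def contained(feature_start, feature_end, query_start, query_end):
    """True if the feature lies entirely within the query span."""
    return query_start <= feature_start and feature_end <= query_end


query = (32879610, 32889610)
repeat = (32879362, 32879662)   # the SINE/Alu element printed above

repeat_overlaps = overlaps(repeat[0], repeat[1], query[0], query[1])
repeat_contained = contained(repeat[0], repeat[1], query[0], query[1])
```

For this repeat the overlap test is true and the containment test is false, so code that needs strictly contained features should filter the results itself.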

Get CpG island elements in a genomic interval

We query the genome for CpG islands within a specific coordinate range on chromosome 11.

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> islands = human.getFeatures(CoordName='11', Start=2150341, End=2170833, feature_types='cpg')
>>> for island in islands:
...     print island
...     break
CpGisland(CoordName='11'; Start=2158951; End=2162484; length=3533; Strand='-', Score=3254.0)

Get SNPs

For a gene

We find the genetic variants for the canonical transcript of BRCA2.

Note

The output is significantly truncated!

>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> print transcript.Variants
(<cogent.db.ensembl.region.Variation object at ...
>>> for variant in transcript.Variants:
...     print variant
...     break
Variation(Symbol='rs55880202'; Effect=['2KB_upstream_variant', '5_prime_UTR_variant', '5KB_upstream_variant']; Alleles='C/T')...
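Because a transcript can carry many variants, it is often useful to keep only those with a particular effect. The standalone sketch below filters on the Effect attribute. FakeVariant is a hypothetical stand-in for the Variation objects, the second record is made up, and handling Effect as either a single string or a list is a defensive assumption.

```python
from collections import namedtuple

# Stand-in modelling only the attributes printed above.
FakeVariant = namedtuple('FakeVariant', ['Symbol', 'Effect', 'Alleles'])


def has_effect(variant, term):
    """True if term is among the variant's effects (string or list)."""
    effect = variant.Effect
    if isinstance(effect, str):
        effect = [effect]
    return term in effect


variants = [
    FakeVariant('rs55880202',
                ['2KB_upstream_variant', '5_prime_UTR_variant',
                 '5KB_upstream_variant'],
                'C/T'),
    # made-up record, for illustration only
    FakeVariant('hypothetical-snp', 'some_other_effect', 'A/G'),
]

utr_symbols = [v.Symbol for v in variants
               if has_effect(v, '5_prime_UTR_variant')]
```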

Get a single SNP

We get a single SNP and print its allele frequencies.

>>> snp = list(human.getVariation(Symbol='rs34213141'))[0]
>>> print snp.AlleleFreqs
=============================
allele      freq    sample_id
-----------------------------
     A    0.0303         1162
     G    0.9697         1162
     G    1.0000        11437
     G    1.0000        11748
-----------------------------
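Within each sample the allele frequencies should sum to 1. The standalone sketch below checks this for the rows printed above. The rows are copied by hand into plain tuples, so no PyCogent table methods are assumed.

```python
from collections import defaultdict

# (allele, freq, sample_id) rows copied from the table above
rows = [
    ('A', 0.0303, 1162),
    ('G', 0.9697, 1162),
    ('G', 1.0000, 11437),
    ('G', 1.0000, 11748),
]


def totals_by_sample(rows):
    """Sum the allele frequencies for each sample_id."""
    totals = defaultdict(float)
    for allele, freq, sample_id in rows:
        totals[sample_id] += freq
    return dict(totals)


totals = totals_by_sample(rows)
# allow for rounding in the printed frequencies
all_complete = all(abs(total - 1.0) < 1e-6 for total in totals.values())
```

Sample 1162 is the only one in which both alleles were observed; the other two samples are fixed for G.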

What alignment types are available

We create a Compara instance for human, chimpanzee and macaque.

>>> from cogent.db.ensembl import Compara
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=62,
...                  account=account)
>>> print compara.method_species_links
Align Methods/Clades
===================================================================================================================
method_link_species_set_id  method_link_id  species_set_id      align_method                            align_clade
-------------------------------------------------------------------------------------------------------------------
                       508              10           33558             PECAN           19 amniota vertebrates Pecan
                       510              13           33559               EPO               12 eutherian mammals EPO...

Get genomic alignment for a gene region

We first get the syntenic region corresponding to human gene BRCA2.

>>> from cogent.db.ensembl import Compara
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=62,
...                  account=account)
>>> human_brca2 = compara.Human.getGeneByStableId(StableId='ENSG00000139618')
>>> regions = compara.getSyntenicRegions(region=human_brca2, align_method='EPO', align_clade='primates')
>>> for region in regions:
...     print region
SyntenicRegions:
  Coordinate(Human,chro...,13,32889610-32973805,1)
  Coordinate(Chimp,chro...,13,32082473-32166143,1)
  Coordinate(Macaque,chro...,17,11686607-11778803,1)...

We then get a cogent Alignment object, requesting that sequences be annotated for gene spans.

>>> aln = region.getAlignment(feature_types='gene')
>>> print repr(aln)
3 x 99014 dna alignment: Homo sapiens:chromosome:13:3288...