Note that much more extensive documentation is available in Querying Ensembl.
Ensembl provides direct access to its MySQL databases; alternatively, users can download those databases and run them on a local machine. To run queries against Ensembl’s UK servers, nothing special needs to be done, as this is the default setting for PyCogent’s ensembl module. To use a different Ensembl installation, create an account instance:
>>> from cogent.db.ensembl import HostAccount
>>> account = HostAccount('fastcomputer.topuni.edu', 'username',
... 'canthackthis')
To connect to MySQL on a specific port:
>>> from cogent.db.ensembl import HostAccount
>>> account = HostAccount('fastcomputer.topuni.edu', 'dude',
... 'ucanthackthis', port=3306)
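To keep credentials out of scripts, the connection details can be read from the environment before the account is built. The following is a minimal sketch, not part of PyCogent: the function name and the `MY_ENSEMBL_*` variable names are hypothetical, and the function only returns the values that would be handed to `HostAccount`.

```python
import os


def ensembl_account_settings(environ=None):
    """Return (host, user, password, port) read from environment variables.

    The MY_ENSEMBL_* names are hypothetical; use whatever suits your setup.
    Returns None when no host is configured.
    """
    environ = os.environ if environ is None else environ
    host = environ.get("MY_ENSEMBL_HOST")
    if not host:
        return None
    user = environ.get("MY_ENSEMBL_USER", "")
    password = environ.get("MY_ENSEMBL_PASSWORD", "")
    # 3306 is the default MySQL port
    port = int(environ.get("MY_ENSEMBL_PORT", "3306"))
    return host, user, password, port


# A plain dict stands in for os.environ so the sketch runs anywhere.
settings = ensembl_account_settings({
    "MY_ENSEMBL_HOST": "fastcomputer.topuni.edu",
    "MY_ENSEMBL_USER": "dude",
    "MY_ENSEMBL_PASSWORD": "ucanthackthis",
})

# With PyCogent installed, the tuple would feed the constructor shown above:
#   host, user, password, port = settings
#   account = HostAccount(host, user, password, port=port)
```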
To see which species are available:
>>> from cogent.db.ensembl import Species
>>> print Species
================================================================================
Common Name Species Name Ensembl Db Prefix
--------------------------------------------------------------------------------
A.aegypti Aedes aegypti aedes_aegypti
A.clavatus Aspergillus clavatus aspergillus_clavatus...
If Ensembl has added a new species which is not yet included in Species, you can add it yourself.
>>> Species.amendSpecies('A latinname', 'a common name')
You can get the common name for a species:
>>> Species.getCommonName('Procavia capensis')
'Rock hyrax'
and the Ensembl database name prefix which will be used for all databases for this species.
>>> Species.getEnsemblDbPrefix('Procavia capensis')
'procavia_capensis'
Species common names are used to construct attributes on PyCogent Compara instances. You can get the name that will be used with the getComparaName method. For species with a real common name:
>>> Species.getComparaName('Procavia capensis')
'RockHyrax'
or with a shortened species name
>>> Species.getComparaName('Caenorhabditis remanei')
'Cremanei'
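The two outputs above suggest a simple naming rule: a multi-word common name is capitalised word by word and joined, while a shortened name loses its dot. The sketch below reproduces that rule in plain Python. It is an illustration inferred from the two examples, not PyCogent's implementation; `compara_style_name` is a hypothetical helper, and it assumes the common name stored for Caenorhabditis remanei is `'C.remanei'`, following the pattern in the Species table.

```python
def compara_style_name(common_name):
    """Turn a common name into an attribute-friendly name (illustrative only)."""
    if "." in common_name:
        # Shortened species names such as 'C.remanei' just lose the dot.
        return common_name.replace(".", "")
    # Real common names: capitalise each word, then drop the spaces.
    return "".join(word.capitalize() for word in common_name.split())


rock_hyrax = compara_style_name("Rock hyrax")
c_remanei = compara_style_name("C.remanei")
```

The resulting names are valid Python identifiers, which is what allows them to be used as attributes such as `compara.Human`.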
We query the human genome for the BRCA2 gene.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> print human
Genome(Species='Homo sapiens'; Release='62')
>>> genes = human.getGenesMatching(Symbol='BRCA2')
>>> for gene in genes:
... if gene.Symbol == 'BRCA2':
... print gene
... break
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer 2,...'; StableId='ENSG00000139618'; Status='KNOWN'; Symbol='BRCA2')
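The loop above stops at the first result whose Symbol is exactly `'BRCA2'`. The same "first exact match" pattern can be written with `next()`. This sketch uses stand-in records rather than real Gene objects so that it runs without a database connection; `GeneRecord`, `first_exact_symbol`, and the non-BRCA2 record are made up for illustration.

```python
from collections import namedtuple

# Stand-in for a Gene object; only the attributes used here are modelled.
GeneRecord = namedtuple("GeneRecord", ["Symbol", "StableId"])


def first_exact_symbol(genes, symbol):
    """Return the first record whose Symbol equals `symbol`, or None."""
    return next((gene for gene in genes if gene.Symbol == symbol), None)


# An iterator is used so the records are consumed lazily, as in the loop above.
candidates = iter([
    GeneRecord("NOT_BRCA2", "made-up-id"),      # made-up record
    GeneRecord("BRCA2", "ENSG00000139618"),     # stable ID from the output above
])

brca2 = first_exact_symbol(candidates, "BRCA2")
```

Returning `None` when nothing matches avoids a `StopIteration` escaping from `next()`.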
We use the stable ID for BRCA2.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> gene = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print gene
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer 2,...'; StableId='ENSG00000139618'; Status='KNOWN'; Symbol='BRCA2')
We look for breast cancer related genes that are estrogen induced.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> genes = human.getGenesMatching(Description='breast cancer estrogen')
>>> for gene in genes:
... print gene
Gene(Species='Homo sapiens'; BioType='processed_transcript'; Description='breast cancer estrogen-induced...'; StableId='ENSG00000181097'; Status='NOVEL'; Symbol='RP11-429J17.2')
We can also require that an exact (case insensitive) match to the word(s) occurs within the description by setting like=False.
>>> genes = human.getGenesMatching(Description='breast cancer estrogen',
... like=False)
>>> for gene in genes:
... print gene
Gene(Species='Homo sapiens'; BioType='processed_transcript'; Description='breast cancer estrogen-induced...'; StableId='ENSG00000181097'; Status='NOVEL'; Symbol='RP11-429J17.2')
We get the canonical transcript for BRCA2.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> print transcript
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889610; End=32973347; length=83737; Strand='+')
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> cds = transcript.Cds
>>> print type(cds)
<class 'cogent.core.sequence.DnaSequence'>
>>> print cds
ATGCCTATTGGATCCAAAGAGAGGCCA...
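As a quick sanity check on a CDS, its first codons can be translated by hand. The sketch below does this in plain Python with the standard genetic code, using only the 27 bases printed above. The helper names are illustrative and are not PyCogent calls.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip(("".join(codon) for codon in product(BASES, repeat=3)), AMINO_ACIDS)
)


def translate(dna):
    """Translate complete codons of `dna`; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


cds_prefix = "ATGCCTATTGGATCCAAAGAGAGGCCA"  # the bases shown in the output above
protein_prefix = translate(cds_prefix)      # 'MPIGSKERP'
```

A CDS that starts with `ATG` and translates without an internal `*` is a reasonable sign that the reading frame is intact.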
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> for transcript in brca2.Transcripts:
... print transcript
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889610; End=32973347; length=83737; Strand='+')
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889641; End=32907428; length=17787; Strand='+')...
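In both transcripts printed above, the reported length equals End minus Start. That is what one would expect if coordinates are reported Python-style (zero-based start, exclusive end), although that reading is an inference from the numbers rather than something the output states. A short check using only the printed values:

```python
# Values copied from the two Transcript lines above.
transcripts = [
    {"Start": 32889610, "End": 32973347, "length": 83737},
    {"Start": 32889641, "End": 32907428, "length": 17787},
]

# End - Start for each transcript.
spans = [t["End"] - t["Start"] for t in transcripts]

# True when every reported length matches End - Start.
lengths_match = all(
    span == t["length"] for span, t in zip(spans, transcripts)
)
```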
We show just the first exon of the canonical transcript.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print brca2.CanonicalTranscript.Exons[0]
Exon(StableId=ENSE00001184784, Rank=1)
We show the introns for just the canonical transcript.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> for intron in brca2.CanonicalTranscript.Introns:
... print intron
Intron(TranscriptId=ENST00000380152, Rank=1)
Intron(TranscriptId=ENST00000380152, Rank=2)
Intron(TranscriptId=ENST00000380152, Rank=3)...
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> print brca2.Location.CoordName
13
>>> print brca2.Location.Start
32889610
>>> print brca2.Location.Strand
1
We query the genome for repeats within a specific coordinate range on chromosome 13.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> repeats = human.getFeatures(CoordName='13', Start=32879610, End=32889610, feature_types='repeat')
>>> for repeat in repeats:
... print repeat.RepeatClass
... print repeat
... break
SINE/Alu
Repeat(CoordName='13'; Start=32879362; End=32879662; length=300; Strand='-', Score=2479.0)
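Note that the repeat returned starts at 32879362, before the query Start of 32879610. The output is therefore consistent with features being selected when they overlap the query range rather than when they fall entirely inside it. That rule is inferred from this one result. The sketch below shows the overlap test on the printed coordinates; `overlaps` is an illustrative helper, not a PyCogent function.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if the half-open intervals [start_a, end_a) and [start_b, end_b) overlap."""
    return start_a < end_b and start_b < end_a


query = (32879610, 32889610)    # Start, End passed to getFeatures above
repeat = (32879362, 32879662)   # Start, End of the repeat printed above

repeat_overlaps_query = overlaps(query[0], query[1], repeat[0], repeat[1])

# The repeat is not fully contained: it begins upstream of the query start.
repeat_inside_query = query[0] <= repeat[0] and repeat[1] <= query[1]
```

If only fully contained features are wanted, the second test can be applied as a filter after the query.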
We query the genome for CpG islands within a specific coordinate range on chromosome 11.
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> islands = human.getFeatures(CoordName='11', Start=2150341, End=2170833, feature_types='cpg')
>>> for island in islands:
... print island
... break
CpGisland(CoordName='11'; Start=2158951; End=2162484; length=3533; Strand='-', Score=3254.0)
We find the genetic variants for the canonical transcript of BRCA2.
Note
The output is significantly truncated!
>>> from cogent.db.ensembl import Genome
>>> human = Genome('human', Release=62, account=account)
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
>>> transcript = brca2.CanonicalTranscript
>>> print transcript.Variants
(<cogent.db.ensembl.region.Variation object at ...
>>> for variant in transcript.Variants:
... print variant
... break
Variation(Symbol='rs55880202'; Effect=['2KB_upstream_variant', '5_prime_UTR_variant', '5KB_upstream_variant']; Alleles='C/T')...
We get a single SNP and print its allele frequencies.
>>> snp = list(human.getVariation(Symbol='rs34213141'))[0]
>>> print snp.AlleleFreqs
=============================
allele      freq    sample_id
-----------------------------
     A    0.0303         1162
     G    0.9697         1162
     G    1.0000        11437
     G    1.0000        11748
-----------------------------
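Each sample_id in the table is a separate population sample, so the frequencies within one sample should sum to 1. The sketch below checks this on the rows printed above, entered by hand as plain tuples; it does not use the PyCogent table object.

```python
from collections import defaultdict

# (allele, freq, sample_id) rows copied from the table above.
rows = [
    ("A", 0.0303, 1162),
    ("G", 0.9697, 1162),
    ("G", 1.0000, 11437),
    ("G", 1.0000, 11748),
]

# Sum the allele frequencies within each sample.
totals = defaultdict(float)
for allele, freq, sample_id in rows:
    totals[sample_id] += freq

# Allow for floating point rounding when comparing with 1.
all_sum_to_one = all(abs(total - 1.0) < 1e-6 for total in totals.values())

# Samples in which more than one allele was observed.
polymorphic = sorted(
    sample_id for sample_id in totals
    if sum(1 for row in rows if row[2] == sample_id) > 1
)
```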
We create a Compara instance for human, chimpanzee and macaque.
>>> from cogent.db.ensembl import Compara
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=62,
... account=account)
>>> print compara.method_species_links
Align Methods/Clades
===================================================================================================================
method_link_species_set_id    method_link_id    species_set_id    align_method                          align_clade
-------------------------------------------------------------------------------------------------------------------
                       508                10             33558           PECAN         19 amniota vertebrates Pecan
                       510                13             33559             EPO             12 eutherian mammals EPO...
We first get the syntenic region corresponding to human gene BRCA2.
>>> from cogent.db.ensembl import Compara
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=62,
... account=account)
>>> human_brca2 = compara.Human.getGeneByStableId(StableId='ENSG00000139618')
>>> regions = compara.getSyntenicRegions(region=human_brca2, align_method='EPO', align_clade='primates')
>>> for region in regions:
... print region
SyntenicRegions:
Coordinate(Human,chro...,13,32889610-32973805,1)
Coordinate(Chimp,chro...,13,32082473-32166143,1)
Coordinate(Macaque,chro...,17,11686607-11778803,1)...
We then get a cogent Alignment object, requesting that sequences be annotated for gene spans.
>>> aln = region.getAlignment(feature_types='gene')
>>> print repr(aln)
3 x 99014 dna alignment: Homo sapiens:chromosome:13:3288...
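The alignment is longer than any of the three syntenic spans because gaps are inserted to bring the sequences into register. The sketch below works through the arithmetic using the coordinates and alignment length printed above. It assumes the alignment corresponds to exactly the region shown and that span length is End minus Start.

```python
# (start, end) pairs copied from the SyntenicRegions output above.
regions = {
    "Human": (32889610, 32973805),
    "Chimp": (32082473, 32166143),
    "Macaque": (11686607, 11778803),
}
alignment_length = 99014  # from the repr of the alignment above

# Length of each genomic span.
span_lengths = dict(
    (name, end - start) for name, (start, end) in regions.items()
)

# Each sequence is padded with gaps up to the alignment length.
gaps_per_sequence = dict(
    (name, alignment_length - length) for name, length in span_lengths.items()
)
```

Under these assumptions even the longest span (macaque) carries several thousand gap positions in the alignment.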