Ortholog Resources

From GersteinInfo

  • Current Status

Below is version 1 of some modENCODE ortholog resources (modENCODE main page: http://www.modencode.org/). These resources cover only worm, fly and human, and they do not yet make full use of the reference genes and transcripts. The paper describing these builds can be found here: http://papers.gersteinlab.org/

Future plans involve making more use of the reference sets and incorporating yeast.

  • Cross Comparison Results (Use these!)

We used the ortholog pairs and triplets built from each resource to measure how much the different ortholog resources overlap. The file human_fly.op.3 contains the human-fly ortholog pairs that are reported by all three ortholog resources. Similarly, human_fly.op.2 contains the human-fly ortholog pairs that have been reported in at least two of the resources, and human_fly.op.1 contains the human-fly ortholog pairs that were found in only one of the resources. The same analysis was done for the human-worm and worm-fly ortholog pairs, as well as for the human-worm-fly triplets. Use the op.3 files (supported by all three resources) for the most conservative list of ortholog pairs and triplets. The other files can be added if more pairs are desired. The files are listed below.

File Name                   Number of ortholog pairs/triplets   Notes
fly_worm.op.3               1,226                               Use for most conservative fly-worm pairs
human_fly.op.3              1,955                               Use for most conservative human-fly pairs
human_worm.op.3             2,538                               Use for most conservative human-worm pairs
worm_human_fly.triplet.3    829                                 Use for most conservative worm-human-fly triplets
fly_worm.op.2               4,060
human_fly.op.2              5,962
human_worm.op.2             5,109
worm_human_fly.triplet.2    6,856
fly_worm.op.1               9,927
human_fly.op.1              12,653
human_worm.op.1             99,260
worm_human_fly.triplet.1    34,744

Example of how the files were created: Suppose that in InParanoid we see an ortholog group that consists of 5 genes: human-1, human-2, human-3, worm-1 and fly-1. In TreeFam we see 4 of the 5 genes (human-1, human-2, worm-1, fly-1) in the same ortholog group. In OrthoMCL none of the 5 genes exists in an ortholog group. The pairs and triplets formed from the four shared genes are then supported by two resources and would be assigned to the op.2 files.

We would build the following:

2 human-worm-fly ortholog triplets assigned to op.2: (human-1, worm-1, fly-1) (human-2, worm-1, fly-1)

2 human-worm ortholog pairs assigned to op.2: (human-1, worm-1) (human-2, worm-1)

2 human-fly ortholog pairs assigned to op.2: (human-1, fly-1) (human-2, fly-1)

1 worm-fly ortholog pair assigned to op.2: (worm-1, fly-1). Note that in this case the number of triplets (2) is larger than the number of worm-fly pairs (1).

Now change the assumption: suppose an ortholog group also exists in OrthoMCL that consists of human-1, human-2 and worm-1. Some assignments then move from op.2 to op.3.

We would build the following:

2 human-worm ortholog pairs assigned to op.3, leaving 0 human-worm pairs in op.2: (human-1, worm-1) (human-2, worm-1)

2 human-worm-fly ortholog triplets assigned to op.2: (human-1, worm-1, fly-1) (human-2, worm-1, fly-1)

2 human-fly ortholog pairs assigned to op.2: (human-1, fly-1) (human-2, fly-1)

1 worm-fly ortholog pair assigned to op.2: (worm-1, fly-1)

In summary, the "op.3" files contain the ortholog pairs that exist in all three independent ortholog resources (InParanoid, OrthoMCL and TreeFam), which makes them the most conservative set. Combining the op.1, op.2 and op.3 files gives a larger but less conservative set of ortholog pairs.
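
The assignment described above can be sketched in a few lines of Python. This is only an illustration, not the script used to build the files: the function name bin_by_support and the in-memory layout (one set of ID pairs per resource) are assumptions, and each bin holds the pairs reported by exactly that many resources, which is how the worked example behaves.

```python
from collections import Counter

def bin_by_support(resources):
    """Group ortholog pairs by the number of resources that report them.

    resources: dict mapping a resource name to a set of (id_a, id_b) pairs.
    Returns a dict {1: set, 2: set, ...} corresponding to op.1, op.2, op.3.
    """
    counts = Counter()
    for pairs in resources.values():
        counts.update(set(pairs))  # each resource counts at most once per pair
    bins = {n: set() for n in range(1, len(resources) + 1)}
    for pair, n in counts.items():
        bins[n].add(pair)
    return bins

# Human-worm pairs from the first scenario of the example above.
resources = {
    "InParanoid": {("human-1", "worm-1"), ("human-2", "worm-1"), ("human-3", "worm-1")},
    "TreeFam": {("human-1", "worm-1"), ("human-2", "worm-1")},
    "OrthoMCL": set(),
}
bins = bin_by_support(resources)
print(sorted(bins[2]))  # the pairs that would go to human_worm.op.2
```

Adding the pairs (human-1, worm-1) and (human-2, worm-1) to the OrthoMCL set reproduces the second scenario, in which both pairs move from bins[2] to bins[3].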

Resources used to build the ortholog pairs and triplets

  • InParanoid

Data Resource and Website: http://inparanoid.sbc.su.se/download/current/sqltables/

Files downloaded:
sqltable.ensHOMSA.fa-modDROME.fa File containing the human and fly proteins
ensHOMSA.fa-modCAEEL.fa File containing human and worm proteins
modCAEEL.fa-modDROME.fa File containing worm and fly proteins

Date of Download: April 6, 2009

Analysis of downloaded files:
1. In table sqltable.ensHOMSA.fa-modDROME.fa 9,516 human proteins and 6,351 fly proteins are assigned into 5,586 ortholog groups.
2. In table ensHOMSA.fa-modCAEEL.fa 8,900 human proteins and 5,825 worm proteins are assigned into 4,658 ortholog groups.
3. In table modCAEEL.fa-modDROME.fa 5,296 fly proteins and 5,367 worm proteins are assigned into 4,333 ortholog groups.

File Name                          Human Proteins   Worm Proteins   Fly Proteins   Ortholog Groups
sqltable.ensHOMSA.fa-modDROME.fa   9,516            n/a             6,351          5,586
ensHOMSA.fa-modCAEEL.fa            8,900            5,825           n/a            4,658
modCAEEL.fa-modDROME.fa            n/a              5,367           5,296          4,333

4. Based on these 3 tables, we built 12,336 human-fly ortholog pairs in file Inparanoid_raw_human_fly, 113,805 human-worm ortholog pairs in file Inparanoid_raw_human_worm and 10,321 fly-worm ortholog pairs in file Inparanoid_raw_fly_worm. We also built 10,594 worm-human-fly ortholog triplets in file IP.worm_human_fly.triplet.
5. There is an issue with the worm data file. It contains many more protein IDs, which leads to the higher number of worm ortholog pairs.

Mapping to Reference Protein IDs and Summary:
To allow cross comparison, human and fly protein IDs are mapped to the current Ensembl (53) protein IDs, and WormBase IDs are mapped to WormPep IDs, using BioMart. If a human protein ID no longer exists in Ensembl, all ortholog pairs involving it are removed. The ID mapping files are located under the heading ID mapping at the end of the page. After mapping, we got 10,834 human-fly ortholog pairs in file Inparanoid_human_fly_pairs, 96,724 human-worm ortholog pairs in file Inparanoid_human_worm_pairs and 8,876 fly-worm ortholog pairs in file InParanoid_fly_worm_pairs.
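
A minimal sketch of the mapping step, assuming the BioMart mapping has already been loaded into plain dictionaries. The function name map_pairs and the IDs below are made up for illustration. The sketch drops a pair when either ID has no mapping, whereas the text above only states this rule for human IDs, and it ignores one-to-many mappings.

```python
def map_pairs(pairs, map_a, map_b):
    """Translate both IDs of each pair and drop pairs that cannot be mapped.

    pairs: iterable of (id_a, id_b) tuples using the resource's own IDs.
    map_a, map_b: dicts from old ID to reference ID (hypothetical layout).
    """
    mapped = set()  # a set, so pairs that collapse to the same IDs count once
    for id_a, id_b in pairs:
        new_a = map_a.get(id_a)
        new_b = map_b.get(id_b)
        if new_a is None or new_b is None:
            continue  # the ID no longer exists in the reference set
        mapped.add((new_a, new_b))
    return mapped

# Placeholder IDs, not real Ensembl or WormPep identifiers.
human_map = {"old-h1": "ref-h1", "old-h2": "ref-h2"}
worm_map = {"old-w1": "ref-w1"}
raw_pairs = [("old-h1", "old-w1"), ("old-h2", "old-w1"), ("old-h3", "old-w1")]
mapped_pairs = map_pairs(raw_pairs, human_map, worm_map)
print(len(raw_pairs), "raw pairs ->", len(mapped_pairs), "mapped pairs")
```

Dropping unmapped IDs and merging duplicates are both reasons the pair counts after mapping are lower than the raw counts.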

  • OrthoMCL

Data Resource and Website: http://www.orthomcl.org/common/downloads/2/

Files Downloaded: groups_orthomcl-2.txt.gz File containing the different ortholog groups

Date of Download: April 06, 2009

Analysis of downloaded files:
1. From groups_orthomcl-2.txt.gz, we retrieved the following:

Species    Number of proteins   Attached File Name
Human      19,635               OrthoMCL_raw_human
Fly        11,158               OrthoMCL_raw_fly
Worm       17,411               OrthoMCL_raw_worm
In total   48,204

2. Human proteins are identified by Ensembl protein ID. Fly protein IDs are from FlyBase and worm protein IDs are from WormBase.
3. Selecting from groups_orthomcl-2.txt.gz the groups that contain proteins from at least two of the three species (human, fly and worm), we retrieved 6,467 ortholog groups, including 9,505 human proteins, 7,258 fly proteins and 5,931 worm proteins (file OrthoMCL_raw_human_fly_worm_groups).
4. Based on these 6,467 ortholog groups, we built 14,556 human-fly ortholog pairs (file OrthoMCL_raw_human_fly), 11,623 human-worm ortholog pairs (file OrthoMCL_raw_human_worm) and 9,255 fly-worm ortholog pairs (file OrthoMCL_raw_fly_worm). We also built 29,481 human-worm-fly triplets (file OM.worm_human_fly.triplet).
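
Steps 3 and 4 can be sketched as follows. The sketch assumes that each group has already been parsed into a dictionary from species name to a list of protein IDs, and that pairs and triplets are built all-against-all within a group, as in the worked example near the top of the page. The function names and the group layout are hypothetical, and parsing of groups_orthomcl-2.txt.gz itself is not shown.

```python
from itertools import product

def has_two_species(group):
    """True if the group has members from at least two of the species."""
    return sum(1 for ids in group.values() if ids) >= 2

def pairs_and_triplets(group):
    """Expand one ortholog group into cross-species pairs and triplets."""
    human = group.get("human", [])
    worm = group.get("worm", [])
    fly = group.get("fly", [])
    return {
        "human_fly": list(product(human, fly)),
        "human_worm": list(product(human, worm)),
        "fly_worm": list(product(fly, worm)),
        "worm_human_fly": list(product(worm, human, fly)),
    }

# One group shaped like the worked example: two human genes, one worm, one fly.
group = {"human": ["human-1", "human-2"], "worm": ["worm-1"], "fly": ["fly-1"]}
built = pairs_and_triplets(group) if has_two_species(group) else {}
for name, items in built.items():
    print(name, len(items))
```

A group with several members per species expands multiplicatively, which is one reason the number of pairs can exceed the number of proteins involved.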

Mapping to Reference Protein IDs and Summary:
To allow cross comparison, human and fly protein IDs are mapped to the current Ensembl (53) protein IDs, and WormBase IDs are mapped to WormPep IDs, using BioMart. If a human protein ID no longer exists in Ensembl, all ortholog pairs involved are removed. The ID mapping files are located under the heading ID mapping at the end of the page. After mapping, we got 12,784 human-fly ortholog pairs OrthoMCL_human_fly_pairs, 9,979 human-worm ortholog pairs OrthoMCL_human_worm_pairs and 8,047 fly-worm ortholog pairs OrthoMCL_fly_worm_pairs

  • TreeFam

Data Resource and Website: ftp://ftp.sanger.ac.uk/pub/treefam/release-7.0/MySQL/

Files Downloaded:
genes.txt.table.gz Table containing the genes
species.txt.table.gz Table containing the different species
fam_genes.txt.table.gz Table showing the association between genes and their families
ortholog.txt.table.gz Table containing the orthologs

Date of Download: April 06, 2009

Analyses of downloaded Files:
1. From genes.txt.table.gz, we retrieved the following:

Species    Number of proteins   Attached File Name
Human      46,810               human.genes.raw
Fly        19,789               fly.genes.raw
Worm       20,151               tf.worm.genes.raw
In total   86,750

TreeFam was built on protein alignments. Based on the three files, our understanding is that each record is one transcript, and that one gene can correspond to multiple transcripts and thus to multiple records. We interpret one transcript as corresponding to one protein.

2. fam_genes.txt.table.gz gives the assignment of genes to families (both TreeFamA and TreeFamB). 803,408 genes were assigned to 16,141 gene families. Members of the same family should then be regarded as orthologs (together with in-paralogs). However, we found 204,845 IDs in this table that have no records in table genes.txt.table. These 204,845 IDs are attached in the file question_gfam.ids.

3. Ortholog pairs.

3.1 From table fam_genes.txt.table.gz
Selecting the families that contain human, fly or worm records from table genes.txt.table.gz (86,750 IDs in total), we got 8,436 TreeFam families (file human_fly_worm.fam_genes.raw), of which 7,501 contain records from only one of the three species. The remaining 935 families, which contain records from more than one of the three species, are in file human_fly_worm.fam_genes.all. From these 935 families, we got 2,861 human-fly ortholog pairs in file tf.fam_genes.human_fly, 194 human-worm ortholog pairs in file fam_genes.human_worm and only 27 fly-worm ortholog pairs in file tf.fam_genes.fly_worm. We also built 10,872 human-worm-fly triplets in file TF.worm_human_fly.triplet.

3.2 From ortholog.txt.table.gz
7,111,245 ortholog pairs are listed in this table, although we do not know how this table was created. From this table, we retrieved 15,004 human-fly ortholog pairs in file tf.ortho_table.human_fly, 14,768 human-worm ortholog pairs in file tf.ortho_table.human_worm and 12,544 fly-worm ortholog pairs in file tf.ortho_table.fly_worm. We did not filter further by the provided bootstrap value.

Mapping to Ensembl Protein IDs and Summary:
We used the data from 3.2 as the ortholog pairs from TreeFam: 15,004 human-fly, 14,768 human-worm and 12,544 fly-worm gene pairs. These include 12,146 human IDs, 8,557 fly IDs and 7,778 worm IDs from table genes.txt.table.gz. Mapping the 12,146 human IDs to Ensembl protein IDs (Ensembl 53, BioMart, transcript ID to protein ID), we got 11,433 Ensembl protein IDs in file tf.id_mapping.human. From the 8,557 fly IDs, we got 4,588 fly Ensembl protein IDs (Ensembl 53, BioMart, Associated Transcript Name to protein ID), attached as tf.id_mapping.fly. From the 7,778 worm IDs, we got 6,308 WormBase peptide IDs (Ensembl 53, BioMart, transcript ID to WormPep ID), attached as tf.id_mapping.worm. After mapping, we got 6,824 human-fly ortholog pairs in file TF.human_fly.pairs, 10,389 human-worm ortholog pairs in file TF.human_worm.pairs and 4,802 fly-worm ortholog pairs in file TF.fly_worm.pairs.

  • ID mapping

To allow cross comparison, we used BioMart to map the different protein IDs to Ensembl protein IDs and WormBase IDs.

fly_ensembl_idmapping ID mapping using fly Ensembl IDs and the Associated Transcript Name
flybase_ensembl_idmapping ID mapping using fly Ensembl gene IDs and FlyBase protein IDs
worm_ensembl_idmapping ID mapping using worm Ensembl IDs and the Associated Transcript Name
wormbase_ensembl_idmapping ID mapping using Ensembl gene IDs and WormBase protein IDs
human_ensembl_idmapping ID mapping using human Ensembl IDs and the Associated Transcript Name

  • Documentation

These are PowerPoint slides from previous presentations about the ortholog resources.

ID_mapping.ppt

  • Contact People

Gang Fang, Rebecca Robilotto, Lincoln Stein, Mark Gerstein
