Load novel genome

Here is a quick guide to describe how to load a novel genome into GenIE-Sys database.

Download

Let’s assume we need to integrate Populus tremula v2.0 genome into GenIE-System. First, we need to download the required files. The latest version of the GFF3 and FASTA files are available on PlantGenIE FTP.

$ curl -O ftp://plantgenie.org/Data/PopGenIE/Populus_tremula/v2.2/gff/Potra02_genes.gff.gz
$ curl -O ftp://plantgenie.org/Data/PopGenIE/Populus_tremula/v2.2/fasta/Potra02_genome.fasta.gz
$ gzip -d Potra02_genes.gff.gz
$ gzip -d Potra02_genome.fasta.gz

$ awk '!/##/' Potra02_genes.gff  |head
chr1	maker	gene	8865	11259	.	-	.	ID=Potra2n1c1;Name=Potra2n1c1
chr1	maker	mRNA	8865	10802	.	-	.	ID=Potra2n1c1.3;Parent=Potra2n1c1;Name=Potra2n1c1.3;_AED=0.39;_eAED=0.37;_QI=192|0.66|0.75|1|0|0|4|0|115
chr1	maker	CDS	8865	9054	.	-	1	ID=Potra2n1c1.3:cds;Parent=Potra2n1c1.3
chr1	maker	CDS	9487	9559	.	-	2	ID=Potra2n1c1.3:cds;Parent=Potra2n1c1.3
chr1	maker	CDS	9669	9753	.	-	0	ID=Potra2n1c1.3:cds;Parent=Potra2n1c1.3
chr1	maker	exon	9669	9764	.	-	.	ID=Potra2n1c1.3:exon:300;Parent=Potra2n1c1.3
chr1	maker	five_prime_UTR	9754	9764	.	-	.	ID=Potra2n1c1.3:five_prime_utr;Parent=Potra2n1c1.3
chr1	maker	exon	10622	10802	.	-	.	ID=Potra2n1c1.3:exon:299;Parent=Potra2n1c1.3
chr1	maker	five_prime_UTR	10622	10802	.	-	.	ID=Potra2n1c1.3:five_prime_utr;Parent=Potra2n1c1.3
chr1	maker	mRNA	8865	11259	.	-	.	ID=Potra2n1c1.1;Parent=Potra2n1c1;Name=Potra2n1c1.1;_AED=0.22;_eAED=0.21;_QI=896|0.66|0.75|1|0|0|4|0|115

$ head Potra02_genome.fasta
>chr1
AGAGAGCTCTGTGGGTCATTACTGTCACAACTCCTAGCCAGCTTGAATAT
TCCATATAGCACATATCCTGGATGGGAAAGTTTGGTTAATGTGTGCTATT
CTTGCTCGCCTTCAACACGATTATTTCGTTCATACCACAAGAAATAAACA
GTAGTGGATAGTAGAAGGCGAGCTAGCATGTGATCACTGTTATTCTTCTT
CGTGTAGTGAGTGACTGACCAATGAAGCAATTGTGTCCACGGTTTGCATG
GCCAATAATGGTTGGCTCTGCGACAAATGGACTTCCAAACCAAGCTGGTG
TAACTGCATTCAAAAAAGAGGTGTTCATATGTTTCCATGTAAATTCCATA
TAGTATGCAAGTTGTATCTGTGACTCCTCCATGCAATCTATCCATCGTTC
TTAGTCTACCAAGGCTGGCTAACCAGAGTATAAATGAGTGACGAGGGATA

Parse genome

Now we need to parse GFF3 and FASTA files into required formats. There are two primary tables(transcript_info and gene_info) in the database.

## generate file for gene_info table
awk '/gene/{split($9,a,"ID=");split(a[2],b,";");print b[1],$1,$4,$5,$7}' FS='\t' OFS='\t' Potra02_genes.gff > gene_info.txt
$ head gene_info.txt
Potra2n1c1	chr1	8865	11259	-
Potra2n1c2	chr1	21121	21603	+
Potra2n1c3	chr1	22295	24697	-
Potra2n1c4	chr1	30731	32811	+
Potra2n1c5	chr1	33508	33833	+
Potra2n1c6	chr1	50823	54726	-
Potra2n1c7	chr1	50901	51116	+
Potra2n1c8	chr1	54928	62450	-
Potra2n1c9	chr1	69471	73884	-
Potra2n1c10	chr1	74717	75583	+

## create file for transcript_info table
awk 'BEGIN{ OFS = "\t"; }$3~/gene/{g=$4"\t"$5}$3~/RNA$/{split($9,a,/[;=]/);for(i=1;i in a;i+=2)k[a[i]]=a[i+1]; print k["Name"], k["Parent"], "desc", $1, $7, g, "PAC", "PEP", $4,$5}' Potra02_genes.gff > transcript_info.txt
$ head transcript_info.txt
Potra2n1c1.3	Potra2n1c1	desc	chr1	-	8865	11259	PAC	PEP	8865	10802
Potra2n1c1.1	Potra2n1c1	desc	chr1	-	8865	11259	PAC	PEP	8865	11259
Potra2n1c1.2	Potra2n1c1	desc	chr1	-	8865	11259	PAC	PEP	8865	11259
Potra2n1c2.1	Potra2n1c2	desc	chr1	+	21121	21603	PAC	PEP	21121	21603
Potra2n1c3.1	Potra2n1c3	desc	chr1	-	22295	24697	PAC	PEP	22295	24697
Potra2n1c4.1	Potra2n1c4	desc	chr1	+	30731	32811	PAC	PEP	30731	32811
Potra2n1c5.1	Potra2n1c5	desc	chr1	+	33508	33833	PAC	PEP	33508	33833
Potra2n1c6.1	Potra2n1c6	desc	chr1	-	50823	54726	PAC	PEP	50823	54726
Potra2n1c7.1	Potra2n1c7	desc	chr1	+	50901	51116	PAC	PEP	50901	51116
Potra2n1c8.3	Potra2n1c8	desc	chr1	-	54928	62450	PAC	PEP	54928	61609

Create a database

Now we need to create a database. To do this, you need a MySQL username and password. If you use the MAMP installation default username and password would be root.

## Download all required scripts and dump database
$ git clone https://github.com/irusri/scripts.git
## Create database for default root user and root password
$ mysql -u root -proot
mysql> create database my_genie_sys_database;
Query OK, 1 row affected (0.01 sec)
mysql> use my_genie_sys_database;
Database changed
mysql> source scripts/dump.sql;

Loading primary tables

Now we need to load above two files(gene_info.txt and transcript_info.txt) into the newly created database. There is a script(load_data.sh) to do this. We can download the script and enter the correct username, password and database information to DB_USER, DB_PASS and DB parameters respectively.

$ nano scripts/load_data.sh
#load_data.sh script
#!/bin/bash
#load_data.sh
#USAGE: sh load_data.sh [table_name] [filename]
#sh load_data.sh transcript_info_x /tmp/transcript_info.tsv

DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'

mysql  --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
TRUNCATE TABLE $1;
ALTER TABLE $1 AUTO_INCREMENT = 1;
load data local infile '$2' ignore INTO TABLE $1 CHARACTER SET UTF8 fields terminated by '\t' LINES TERMINATED BY '\n' ignore 0 lines;
EOFMYSQL

Run following commands to load gene_info.txt and transcript_info.txt into respective tables.

#Load above generated source file into gene_info table
sh scripts/load_data.sh gene_info gene_info.txt

#Load previously generated source file into transcript_info table
sh scripts/load_data.sh transcript_info transcript_info.txt

Now we need to update the gene_i parameter in transcript_info table. There is a script(update_gene_i.sh) in the scripts directory, we just need to enter the correct username, password and database information to DB_USER, DB_PASS and DB parameters respectively as we did in previous step.

$ nano scripts/update_gene_i.sh
#update_gene_i.sh script
#!/bin/bash
#update_gene_i.sh

DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'

#USAGE: sh update_gene_i.sh

mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
create temporary table add_gene_i(gene_i MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY, genename VARCHAR(40));
ALTER TABLE add_gene_i AUTO_INCREMENT = 1;
INSERT INTO add_gene_i(genename) select DISTINCT(gene_id) from transcript_info;
UPDATE transcript_info INNER join add_gene_i ON add_gene_i.genename = transcript_info.gene_id SET transcript_info.gene_i = add_gene_i.gene_i;
drop temporary table add_gene_i;
EOFMYSQL

Let’s run the following command to update gene_i in transcript_info table.

#Finally update the gene_i in transcript_info table using update_gene_i.sh.
sh scripts/update_gene_i.sh

If above script takes time please try following command on MySQL. This will update the gene_i column in transcript_info table.

$ nano scripts/update_gene_i.sh
#update_gene_i_dev.sh script
#!/bin/bash
#update_gene_i_dev.sh

DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'

#USAGE: sh update_gene_i_dev.sh

mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
update transcript_info,gene_info set transcript_info.gene_i=gene_info.gene_i where gene_info.gene_id=transcript_info.gene_id;
EOFMYSQL

Run following command to execute the above script(update_gene_i_dev.sh)

sh scripts/update_gene_i_dev.sh

Great! We have loaded transcript and gene infortmation properly into the database. Now can we load additional information. For example; description to the transcript_info table.

$ curl -O ftp://plantgenie.org/Data/PopGenIE/Populus_tremula/v2.2/annotation/blast2go/Potra22_blast2go_description.txt
$ head Potra22_blast2go_description.txt
Potra2n765s36715.1	UniRef90_B9GWJ3F-box domain-containing protein n=2 Tax=Populus TaxID=3689 RepID=B9GWJ3_POPTR
Potra2n765s36713.1	Populus trichocarpa uncharacterized LOC112326797 (LOC112326797), ncRNA
Potra2n765s36713.2	Populus trichocarpa uncharacterized LOC112326797 (LOC112326797), ncRNA
Potra2n765s36714.1	UniRef90_A0A2K2BA33FAD-binding PCMH-type domain-containing protein n=40 Tax=Populus TaxID=3689 RepID=A0A2K2BA33_POPTR
Potra2n1433s37070.1	UniRef90_U7E173Protein kinase domain-containing protein (Fragment) n=1 Tax=Populus trichocarpa TaxID=3694 RepID=U7E173_POPTR
Potra2n581s36023.1	UniRef90_UPI00057ABC08probable LRR receptor-like serine/threonine-protein kinase At4g08850 isoform X1 n=1 Tax=Populus euphratica TaxID=75702 RepID=UPI00057ABC08
Potra2n581s36025.1	UniRef90_UPI00057B3C83probable LRR receptor-like serine/threonine-protein kinase At4g08850 n=1 Tax=Populus euphratica TaxID=75702 RepID=UPI00057B3C83
Potra2n581s36024.1	UniRef90_U5GE99Zeta-carotene desaturase n=10 Tax=fabids TaxID=91835 RepID=U5GE99_POPTR
Potra2n707s36547.1	UniRef90_A0A2K1X8T3AMPKBI domain-containing protein n=5 Tax=Populus TaxID=3689 RepID=A0A2K1X8T3_POPTR
Potra2n409s35556.1	UniRef90_UPI000B5D6D9FE3 ubiquitin-protein ligase SHPRH isoform X3 n=1 Tax=Manihot esculenta TaxID=3983 RepID=UPI000B5D6D9F

Here is the script to load description into transcript_info column.

#!/bin/bash
#update_description.sh

DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'

# if less than two arguments supplied, display  error message
        if [  $# -le 0 ]
        then
                start='\033[0;33m'
                start_0='\033[0;33m'
                start_2='\033[0;31m'
                end='\033[0m'
                echo  "\nUsage:\n$0 ${start}[gene_info/transcript_info] [file_name]${end}\nEx: ${start_2}sh update_description.sh transcript_info/gene_info potra_description.tsv${end}\n\nWhat it does?\n${start_0}This script will create a two columns(ids, description) temporary table and load the [file_name] into it.\nThen it will match ids column in temporary table with transcript_ids/gene_ids and update the gene/transcript description.\nFinally delete the temporary table.\n${end}"
                exit 1
        fi

table_name=$(echo $1 | awk '{split($0,a,"_");print a[1]}');
tmp_field_name=$table_name"_id"
/usr/bin/mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB<<EOFMYSQL
CREATE TEMPORARY TABLE tmp_tb(gene_name VARCHAR(60),annotation VARCHAR(1000));
load data local infile '$2' replace INTO TABLE tmp_tb fields terminated by '\t' LINES TERMINATED BY '\n' ignore 0 lines;
UPDATE $1 INNER JOIN tmp_tb on tmp_tb.gene_name = $1.$tmp_field_name SET $1.description = tmp_tb.annotation;
DROP TEMPORARY TABLE tmp_tb;
EOFMYSQL

We just need to run the script to load description into transcript_info table.

sh scripts/update_description.sh transcript_info Potra22_blast2go_description.txt

Loading secondary tables

Following are the tables available with GenIE-Sys default database. However, it is easy to add more tables depending on the user demands. Secondary table conatins annotation related to the primary tables.

$ curl -O ftp://plantgenie.org/Data/PopGenIE/Populus_tremula/v2.2/annotation/blast2go/Potra22_blast2go_GO.txt
$ head Potra22_blast2go_GO.txt
Sequence Name	Annotation GO ID-Annotation GO Term
Potra2n765s36715.1	GO:0005515-protein binding
Potra2n765s36714.1	GO:0009690-cytokinin metabolic process
Potra2n765s36714.1	GO:0016021-integral component of membrane
Potra2n765s36714.1	GO:0019139-cytokinin dehydrogenase activity
Potra2n765s36714.1	GO:0055114-oxidation-reduction process
Potra2n765s36714.1	GO:0071949-FAD binding
Potra2n1433s37070.1	GO:0004674-protein serine/threonine kinase activity
Potra2n1433s37070.1	GO:0005509-calcium ion binding
Potra2n1433s37070.1	GO:0005524-ATP binding

$ awk 'BEGIN{FS="\t";OFS="\t"}{a[$1]=a[$1]?a[$1]";"$2:$2;}END{for (i in a)print i"\t"a[i];}'   Potra22_blast2go_GO.txt > Potrav22_go_desc.txt
$ head Potrav22_go_desc.txt
Potra2n5c11384.4	GO:0019904-protein domain specific binding
Potra2n1c2900.1	GO:0006118-obsolete electron transport;GO:0009055-electron transfer activity;GO:0016021-integral component of membrane;GO:0022900-electron transport chain
Potra2n12c24161.1	GO:0005789-endoplasmic reticulum membrane;GO:0016021-integral component of membrane
Potra2n6c13118.1	GO:0003677-DNA binding;GO:0004724-magnesium-dependent protein serine/threonine phosphatase activity;GO:0005963-magnesium-dependent protein serine/threonine phosphatase complex;GO:0006470-protein dephosphorylation;GO:0046872-metal ion binding
Potra2n6c13118.2	GO:0003677-DNA binding;GO:0004724-magnesium-dependent protein serine/threonine phosphatase activity;GO:0005963-magnesium-dependent protein serine/threonine phosphatase complex;GO:0006470-protein dephosphorylation;GO:0046872-metal ion binding
Potra2n6c13118.3	GO:0003677-DNA binding;GO:0004724-magnesium-dependent protein serine/threonine phosphatase activity;GO:0005963-magnesium-dependent protein serine/threonine phosphatase complex;GO:0006470-protein dephosphorylation;GO:0046872-metal ion binding
Potra2n9c19679.1	GO:0016021-integral component of membrane;GO:0016117-carotenoid biosynthetic process;GO:0016166-phytoene dehydrogenase activity;GO:0016757-transferase activity, transferring glycosyl groups;GO:0055114-oxidation-reduction process
Potra2n14c27340.1	GO:0046872-metal ion binding
Potra2n9c19679.2	GO:0016021-integral component of membrane;GO:0016117-carotenoid biosynthetic process;GO:0016166-phytoene dehydrogenase activity;GO:0016757-transferase activity, transferring glycosyl groups;GO:0055114-oxidation-reduction process
Potra2n14c27340.2	GO:0046872-metal ion binding

As you see the annotation are based on transcript IDs. Therefore, Following script can be used to load secondary table into transcript_go table. Then update transcript_i column using another script as described below.

sh scripts/load_data.sh transcript_go Potrav22_go_desc.txt

Then update the gene_i or transcript_i depending on the primary usint of the annotation dataset using following script.

#!/bin/bash
#update_annotation_gene.sh

DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'

#USAGE sh update_annotation_gene_i.sh transcript_go
display_usage() {
        echo  "\nUsage:\n$0 [table_name] \n"
        }

# if less than one arguments supplied, display usage
        if [  $# -le 0 ]
        then
                display_usage
                exit 1
        fi

count=$(mysql --host=localhost --user=$DB_USER --password=$DB_PASS --database=$DB -sse "SHOW COLUMNS FROM $1 LIKE 'transcript_id';")
if [ ${#count} -gt 0 ]
then
     mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
UPDATE $1 INNER JOIN transcript_info on transcript_info.transcript_id = $1.transcript_id SET $1.transcript_i = transcript_info.transcript_i;
EOFMYSQL
else
     mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
UPDATE $1 INNER JOIN transcript_info on transcript_info.gene_id = $1.gene_id SET $1.gene_i = transcript_info.gene_i;
EOFMYSQL
fi

Let’s run following command to fill the transcript_i or gene_i column.

sh scripts/update_annotation_gene_i.sh transcript_go

Similalrly when we have annotation based on gene IDs, we have to fill gene_annotation tables.