[maker-devel] Status check?

Barry Moore barry.moore at genetics.utah.edu
Thu Oct 22 10:09:38 MDT 2009


Xavier,

RepeatMasker doesn't build a library of repeats by blasting the genome  
against itself.  RepeatMasker uses a curated nucleotide library of  
repeats called RepBase - you would have had to register and download  
RepBase at some point when you were installing RepeatMasker.  In  
addition, MAKER uses a tool called RepeatRunner that uses a curated
library of proteins associated with known transposable
elements.  RepeatMasker uses blastn or cross_match to find the RepBase
repeats in your genome and RepeatRunner uses blastx to find the
te_proteins.fasta genes in your genome.  If you want to find novel
repeats in your genome (that is, repeats that aren't already in
RepBase or RepeatRunner's protein database), then you'll need to run
tools for that purpose.  We've used the PALS/PILER/MUSCLE suite of  
programs from Bob Edgar for this purpose, but there are other tools as  
well - I've heard good things about RepeatScout and intend to try it  
at some point.  It's usually a good idea to run RepeatMasker
and RepeatRunner on your genome first to make PALS's job easier.  After
you have your species-specific repeat library prepared, you tell
MAKER about it (the rmlib option) in your maker_opts config file.
MAKER doesn't incorporate the process of building a custom repeat  
library for you.
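
As a rough illustration of that last step, the entry in the maker_opts
config file might look something like the fragment below.  The path is a
made-up placeholder, and the exact key/value syntax is an assumption here:
it may differ between MAKER versions, so check it against the config file
that your own installation generates.

```
# hypothetical path to the custom repeat library built with PILER or similar
rmlib=/path/to/my_species_repeats.fasta
```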

Barry

On Oct 22, 2009, at 5:38 AM, Xavier Watkins wrote:

> Hi,
> Thanks everyone for your help. I'm already running mpi_maker, planning
> on using more CPUs next time...
>
> I'm using the following:
>
> RepeatMasker with cross_match (running cross match seems to be the bit
> that's taking ages)
>
> snap version 2006-07-28
> GeneMarkS
> NCBI blastall 2.2.20 for blastx
>
> Not really sure what happens in RepeatMasker but from what I
> understand it tries to blast the genome against itself to build a
> library of possible repeats? Is there a way of building this library
> of repeats in a more efficient way?
>
> Many thanks,
> Xavier
>
>
> On 21 Oct 2009, at 16:07, Mark Yandell wrote:
>
>>
>> Hi Xavier,
>>
>> I agree: This seems way too long. I can basically reproduce FlyBase's
>> annotations and blast data in about 3 days on my laptop, so 2 weeks
>> on 5 processors seems way too long.
>>
>> Is there some special, really huge dataset you are running? Are you
>> doing TBLASTX to align hits from a large database of sequences?
>>
>> --mark
>>
>> Mark Yandell
>> Associate Professor of Human Genetics
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>> ph:801-587-7707
>> ________________________________________
>> From: maker-devel-bounces at yandell-lab.org
>> [maker-devel-bounces at yandell-lab.org] On Behalf Of Carson Holt
>> Sent: Wednesday, October 21, 2009 8:57 AM
>> To: Xavier Watkins; maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] Status check?
>>
>> The time spent depends primarily on the size of the protein, EST,
>> and repeat protein databases provided.  BLAST actually makes up
>> about 90% of the run time for MAKER.  If you're using 5 processors, I
>> suggest using mpi_maker instead of regular maker.  It gets better
>> performance on multiprocessor systems.  The number of slices is
>> dependent on what you set max_dna_len to be in the maker_opts.ctl
>> file.  Increasing that value increases memory usage.  Just divide
>> the contig length by that number.  Doing a test run on the entire
>> Drosophila genome could take a while especially if you used large
>> protein and EST databases for the analysis.  It is 120 Megabases in
>> size, and with the default max_dna_len of 100,000,  it would be
>> divided into 1,200 chunks.  It could take anywhere from 4 days to 3
>> weeks depending on the BLAST databases used.
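
The slice arithmetic above can be sketched in a few lines of Python.  This
is only an illustration: the function name count_slices is made up for this
example, and it assumes slices are fixed-length windows with a shorter final
slice, which may be a simplification of what MAKER does internally.

```python
def count_slices(contig_len: int, max_dna_len: int = 100_000) -> int:
    """Estimate how many slices a contig of contig_len bases is cut into.

    Assumption: fixed-length windows of max_dna_len bases, with any
    remainder forming one shorter final slice (integer ceiling division).
    """
    if contig_len < 0 or max_dna_len <= 0:
        raise ValueError("contig_len must be >= 0 and max_dna_len must be > 0")
    return (contig_len + max_dna_len - 1) // max_dna_len


# Treating the ~120 Mb genome as a single sequence, as in the estimate above.
genome_slices = count_slices(120_000_000)
```

For a real genome split across several contigs or chromosome arms, the total
would be the sum of count_slices over each contig, since each one gets its
own final partial slice.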
>>
>> I guess Barry already answered the question on how to check on run
>> status.  Individual contigs also create a file called run.log.
>> These will be under theVoid directory for each individual contig in
>> the MAKER datastore directory.  These files also contain entries
>> with labels like STARTED and FINISHED for each individual analysis.
>> The master_datastore_index.log file has status tags for entire
>> contigs as opposed to individual analyses.
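
One way to tally contig status from the index file is sketched below in
Python.  The layout assumed here (tab-separated lines of contig name,
datastore path, status tag) is an assumption that should be checked against
your own run's output, and latest_status and summarize_index are
hypothetical helpers, not part of MAKER.

```python
from collections import Counter


def latest_status(lines):
    """Map each contig name to the last status tag seen for it.

    Assumed line layout: contig name, datastore path, status tag,
    separated by tabs. Lines with fewer than three fields are skipped.
    """
    status = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        contig, tag = fields[0], fields[2]
        status[contig] = tag  # later lines overwrite earlier ones
    return status


def summarize_index(lines):
    """Count how many contigs currently carry each status tag."""
    return Counter(latest_status(lines).values())


# Made-up sample lines in the assumed layout.
sample = [
    "chr2L\tdatastore/chr2L\tSTARTED\n",
    "chr2L\tdatastore/chr2L\tFINISHED\n",
    "chr2R\tdatastore/chr2R\tSTARTED\n",
]
summary = summarize_index(sample)
```

To use it on a real run, open the master_datastore_index.log file and pass
the file handle to summarize_index in place of the sample list.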
>>
>> I hope that helps.  Let us know how it goes.
>>
>> Thanks,
>> Carson
>>
>>
>> On 10/21/09 3:21 AM, "Xavier Watkins" <xavier at flymine.org> wrote:
>>
>> Hi,
>> I'm currently doing a test run of Maker on the D. mel genome and I
>> would like to estimate the time it takes to run on our system (it
>> has now been running for 2 weeks on 5 processors).
>> Is there a way to know how many processes are left to run when
>> running MAKER, or to know which contigs (chromosomes in my case)
>> have finished running? From what I see it chops up the contigs into
>> slices when running RepeatMasker (currently on .151) is there a way
>> to know the total number of slices?
>>
>> Apologies if I've missed this info in the documentation, I couldn't
>> find it.
>>
>> All the best,
>> Xavier
>>
>>
>>
>
>



