[maker-devel] Status check?
Xavier Watkins
xavier at flymine.org
Thu Oct 22 10:20:06 MDT 2009
Thanks Barry,
I understand better now. I think my mistake was assuming that
RepeatRunner was building the custom library. I did download and install
RepBase when installing RepeatMasker, so I guess it's running with
that at the moment. I will have a look at the tools you have suggested
for building the custom library.
Xavier
On 22 Oct 2009, at 17:09, Barry Moore wrote:
> Xavier,
>
> RepeatMasker doesn't build a library of repeats by blasting the
> genome against itself. RepeatMasker uses a curated nucleotide
> library of repeats called RepBase - you would have had to register
> and download RepBase at some point when you were installing
> RepeatMasker. In addition, MAKER uses a tool called RepeatRunner
> that uses a curated protein library of proteins associated with
> known transposable elements. RepeatMasker uses blastn or cross_match
> to find the RepBase repeats in your genome, and RepeatRunner uses
> blastx to find the te_proteins.fasta genes in your genome. If you
> want to find novel repeats in your genome - that is, repeats that
> aren't already in RepBase or RepeatRunner's protein database - then
> you'll need to run tools for that purpose. We've used the PALS/
> PILER/MUSCLE suite of programs from Bob Edgar for this purpose, but
> there are other tools as well - I've heard good things about
> RepeatScout and intend to try it at some point. It's
> usually a good idea to run RepeatMasker and RepeatRunner on your
> genome first to make PALS's job easier. After you get your
> species-specific repeat library prepared, you tell MAKER about it (the
> rmlib option) in your maker_opts config file. MAKER doesn't
> incorporate the process of building a custom repeat library for you.
>
> Barry
>
> On Oct 22, 2009, at 5:38 AM, Xavier Watkins wrote:
>
>> Hi,
>> Thanks everyone for your help. I'm already running mpi_maker, planning
>> on using more CPUs next time...
>>
>> I'm using the following:
>>
>> RepeatMasker with cross_match (running cross_match seems to be the bit
>> that's taking ages)
>>
>> snap version 2006-07-28
>> GeneMarkS
>> NCBI blastall 2.2.20 for blastx
>>
>> Not really sure what happens in RepeatMasker but from what I
>> understand it tries to blast the genome against itself to build a
>> library of possible repeats? Is there a way of building this library
>> of repeats in a more efficient way?
>>
>> Many thanks,
>> Xavier
>>
>>
>> On 21 Oct 2009, at 16:07, Mark Yandell wrote:
>>
>>>
>>> Hi Xavier,
>>>
>>> I agree: This seems way too long. I can basically reproduce FlyBase's
>>> annotations and blast data in about 3 days on my laptop, so 2 weeks
>>> on 5 processors seems way too long.
>>>
>>> Is there some special, really huge dataset you are running? Are you
>>> doing TBLASTX to align hits from a large database of sequences?
>>>
>>> --mark
>>>
>>> Mark Yandell
>>> Associate Professor of Human Genetics
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>> ph:801-587-7707
>>> ________________________________________
>>> From: maker-devel-bounces at yandell-lab.org [maker-devel-bounces at yandell-lab.org] On Behalf Of Carson Holt
>>> Sent: Wednesday, October 21, 2009 8:57 AM
>>> To: Xavier Watkins; maker-devel at yandell-lab.org
>>> Subject: Re: [maker-devel] Status check?
>>>
>>> The time spent depends primarily on the size of the protein, EST,
>>> and repeat protein databases provided. BLAST actually makes up
>>> about 90% of the run time for MAKER. If you're using 5 processors, I
>>> suggest using mpi_maker instead of regular maker. It gets better
>>> performance on multiprocessor systems. The number of slices is
>>> dependent on what you set max_dna_len to be in the maker_opts.ctl
>>> file: just divide the contig length by that number. Increasing
>>> max_dna_len increases memory usage. Doing a test run on the entire
>>> Drosophila genome could take a while, especially if you used large
>>> protein and EST databases for the analysis. It is 120 megabases in
>>> size, and with the default max_dna_len of 100,000, it would be
>>> divided into 1,200 chunks. It could take anywhere from 4 days to 3
>>> weeks depending on the BLAST databases used.
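
The slice arithmetic above can be sketched in a few lines of Python. This is only an illustration of the division described in the message, not MAKER's own code: the function name count_chunks is hypothetical, and rounding a partial final slice up to a whole chunk is an assumption about how leftover sequence is handled.

```python
def count_chunks(contig_length, max_dna_len=100_000):
    """Estimate how many slices a contig of contig_length bases yields.

    Assumption: a trailing partial slice counts as one extra chunk
    (ceiling division). MAKER's real chunking may differ.
    """
    if contig_length <= 0 or max_dna_len <= 0:
        raise ValueError("contig_length and max_dna_len must be positive")
    # Integer ceiling division, avoids floating-point rounding
    return -(-contig_length // max_dna_len)


# The 120 Mb genome treated as a single sequence, default max_dna_len
print(count_chunks(120_000_000))

# A contig that does not divide evenly gets one extra (shorter) slice
print(count_chunks(250_000))
```

For a genome split over several chromosomes, the total is the sum of count_chunks over the individual contig lengths, so it can come out slightly above the whole-genome estimate.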
>>>
>>> I guess Barry already answered the question on how to check on run
>>> status. Each contig also gets its own file called run.log.
>>> These are under the theVoid directory for each individual contig in
>>> the MAKER datastore directory. These files also contain entries
>>> with labels like STARTED and FINISHED for each individual analysis.
>>> The master_datastore_index.log file has status tags for entire
>>> contigs as opposed to individual analyses.
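
A rough way to tally those per-contig status tags is sketched below in Python. The layout of master_datastore_index.log is an assumption here (one tab-separated line per event, with the contig name first and the status tag last), and the function names, contig names, and paths in the sample are made up for illustration. Check a few lines of your own log before relying on it.

```python
from collections import Counter


def latest_status(lines):
    """Map each contig name to the last status tag seen for it.

    Assumed line layout (not confirmed by the thread):
        contig_name <TAB> datastore_path <TAB> STATUS
    Lines with fewer than three fields are skipped.
    """
    latest = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        latest[fields[0]] = fields[-1]
    return latest


def count_statuses(latest):
    """Count how many contigs currently carry each status tag."""
    return Counter(latest.values())


# Made-up sample lines standing in for a real index log
sample = [
    "chr2L\tdatastore/chr2L\tSTARTED\n",
    "chr2L\tdatastore/chr2L\tFINISHED\n",
    "chr2R\tdatastore/chr2R\tSTARTED\n",
]
status = latest_status(sample)
print(status)
print(count_statuses(status))
```

To run it on a real file, pass an open file handle instead of the sample list, for example latest_status(open(path_to_log)) where path_to_log points at the master_datastore_index.log in your MAKER output directory.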
>>>
>>> I hope that helps. Let us know how it goes.
>>>
>>> Thanks,
>>> Carson
>>>
>>>
>>> On 10/21/09 3:21 AM, "Xavier Watkins" <xavier at flymine.org> wrote:
>>>
>>> Hi,
>>> I'm currently doing a test run of Maker on the D. mel genome and I
>>> would like to estimate the time it takes to run on our system (it
>>> has now been running for 2 weeks on 5 processors).
>>> Is there a way to know how many processes are left to run when
>>> running MAKER, or to know which contigs (chromosomes in my case)
>>> have finished running? From what I see, it chops up the contigs into
>>> slices when running RepeatMasker (currently on .151). Is there a way
>>> to know the total number of slices?
>>>
>>> Apologies if I've missed this info in the documentation, I couldn't
>>> find it.
>>>
>>> All the best,
>>> Xavier
>>>
>>>
>>>
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at yandell-lab.org
>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>
>