[maker-devel] long running maker question
cwilks
cwilks at stanford.edu
Thu Nov 5 18:57:05 MST 2009
Thanks for the help. Given what you and Mark have said and the lesser
importance of the cross species alignments, we'll probably just skip
doing them in Maker. We have other data we'll want to run through it
(and already have) though, so we may have more questions.
Thanks again,
Chris
Carson Holt wrote:
> Yes. Filling up /tmp is the likely culprit. Setting TMP in the control
> files to another location should fix that. Also you may want to clean
> the /tmp directory up manually at least once. MAKER is supposed to
> clean everything up as it goes, or if the user cancels the operation
> with ^C. However, sometimes when MAKER dies (depending on how it dies)
> the process does not get a chance to clean up temporary files. You can
> check this by looking for directories starting with the name “maker”
> in your /tmp directory.
>
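A minimal sketch of the check described above, for anyone who wants to script it. It only lists candidate directories and deletes nothing. The function name and defaults are hypothetical; the "maker" prefix and /tmp location come from the message, and the exact naming of MAKER's temporary directories may differ on your system.

```python
import os


def find_leftover_maker_dirs(tmp_dir="/tmp", prefix="maker"):
    """Return sorted paths of directories directly under tmp_dir
    whose names start with prefix (hypothetical helper)."""
    leftovers = []
    with os.scandir(tmp_dir) as entries:
        for entry in entries:
            # Only real directories count; plain files and symlinks are skipped.
            if entry.is_dir(follow_symlinks=False) and entry.name.startswith(prefix):
                leftovers.append(entry.path)
    return sorted(leftovers)
```

Calling find_leftover_maker_dirs() with no arguments inspects /tmp. Review the returned paths by hand before removing anything, since another MAKER run may still be using them.
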
> Unfortunately TBLASTX is very slow. TBLASTX does six frame
> translations for both the query and subject strands and then aligns
> each one against the other. So alignment time is 6x6 times longer (36
> times longer) than a corresponding BLASTP alignment. Plus there is
> overhead from sorting and filtering all the extra results. When you
> take into account the extra overhead, TBLASTX will take ~38 times
> longer than BLASTP and ~7-8 times longer than BLASTX. With that in
> mind, you may want to filter your alt-EST database if it is quite
> large (1.8 Gb does sound quite large). Also if you already have really
> good ESTs from the species being annotated, there is not as much to be
> gained from cross-species alignments. The alt-EST option is really
> better suited as a supplement for species with lower EST quality and
> depth. That being said, sometimes you just want to do the cross
> species alignments so that they will be included in the final GFF3 and
> can be added as a database track, not because they improve the
> annotations that much. This can generally be said for emerging model
> organism genomes where you want to have homology information from
> related species in the final genome database regardless of the quality
> of your same species EST database. Also if you already have existing
> cross species alignments, you can provide them as a GFF3 file in the
> altest_gff option in the control files. This would have the effect of
> getting all the benefits of cross species alignments without having to
> generate them from scratch via TBLASTX.
>
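The 6x6 arithmetic above can be sketched by enumerating reading-frame pairings. This is only an illustration of the counting argument, not how BLAST is implemented; the names below are hypothetical, and the count ignores the sorting and filtering overhead mentioned in the message.

```python
from itertools import product

# (query translated?, subject translated?) for each program.
PROGRAMS = {
    "BLASTP": (False, False),   # protein query vs protein subject
    "BLASTX": (True, False),    # translated nucleotide query vs protein subject
    "TBLASTX": (True, True),    # translated query vs translated subject
}


def reading_frames(translated):
    """Six frames (2 strands x 3 offsets) when translated, otherwise one."""
    if not translated:
        return ["as-is"]
    return [f"{strand}{offset}" for strand in "+-" for offset in (1, 2, 3)]


def frame_pairs(program):
    """All query-frame/subject-frame combinations the program must compare."""
    query_translated, subject_translated = PROGRAMS[program]
    return list(product(reading_frames(query_translated),
                        reading_frames(subject_translated)))


for name in PROGRAMS:
    print(name, len(frame_pairs(name)))
```

The counts are 1, 6, and 36 pairings, which is where the 36x figure relative to BLASTP comes from before any overhead is added.
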
> There is really no way around TBLASTX being slow. It takes just as
> long running inside of MAKER as outside of MAKER (well technically it
> runs slightly faster inside of MAKER, and I could explain why later if
> you really care). On the bright side, MAKER should not have to
> recompute data it’s already built. So repeat-masking and protein
> alignment steps should just fly by.
>
> One final note, you may need to set retry in the control files to a
> higher number when recovering from a failed MAKER run. Otherwise MAKER
> might skip over previously failed contigs.
>
> Thanks
> Carson
>
> On 11/4/09 7:45 PM, "Chris Wilks" <cwilks at stanford.edu> wrote:
>
> Hi,
>
> We're running Maker right now to align cross-species
> EST and cDNA sequences to the Arabidopsis thaliana genome.
> We've successfully run it recently on protein sequences (both same
> species and plant species in general).
>
> However, with the cross-species nucleotide sequences we saw
> a very long running time for the assignments on
> chromosome 1 (> 96 hours at 100% CPU on 8 concurrent threads),
> much of it apparently spent in tblastx.
> Then the job finished and reported that every chromosome had been
> started and died (and been retried and died again).
>
> From what I've seen, the job dying is probably due to running
> out of /tmp space, so I reset that parameter to point away from the
> system /tmp to a location that should have much more space.
> However, I'm still concerned that the long-running behavior I saw
> with tblastx will recur.
>
> I should also note that we've already generated the repeats, so
> this job is not re-running them (we're feeding them in from a
> maker generated gff file using the rm_gff setting in the
> maker_opts.ctl file).
>
> I picked up the latest version (as of October 6th) and I have the
> quick and dirty installation of BioPerl 1.6.
> Our cross-species FASTA file is ~1.8 gigabytes and we're
> running it against a ~125 Mbp genome.
> There are 2,684,575 sequences in the cross species fasta file.
>
> So I'm wondering 1) whether these are just normal running times for
> large files and we have to live with them, or 2) whether something
> anomalous is going on and the run can be shortened.
>
> Thanks,
> Chris Wilks
> TAIR