Trinity 'best transcript set' from many libraries
Method for processing a large set of RNA-seq libraries to produce a common "best set" of transcripts. The series of commands described below is quite long, but goes all the way from the initial Trinity assemblies through final TSA submission. It is taken from the AAAREADME.TXT file in the transcriptome_tools subdirectory of lemonade-assemble.
#external software references:
# https://sourceforge.net/projects/lemonade-assemble/
# for a tar.gz including most of the needed pieces.
# Includes modified Last (lastal) used here.
#
# (next is for extract, binorder, and many other small utilities used below)
# https://sourceforge.net/projects/drmtools/
# (programs from elsewhere, versions used are indicated)
# NCBI toolbox+ (ncbi-blast-2.7.1+-1.x86_64)
# tbl2asn 25.7 (version probably does not matter)
# blat (apply the patch from lemonade-assemble or minIdentity will not work.)
# Hmmer 3.1b1
# Trinity 2.8.3
# TransDecoder v5.3.0
# InterProscan 5.30-69.0
# cdhit-2017_10_18
# jellyfish 2.2.6
# bowtie2 2.3.4
#CDHIT_PATH is to cdhit directory
Most of the sh scripts must be modified with site-specific information before you run them! Some of the perl scripts may need that as well. Many of the commands need a lot of space in a work directory, up to hundreds of GB, and most are currently set to place that work directory under /tmp. If your root directory is small and /tmp is part of that file system, these commands will fail with an assortment of mysterious error messages, mostly along the lines of "could not create" or "could not write".
Paths are NOT shown before scripts below. If a script/program is not in your PATH, add the path as needed.
This example obtained data from the NCBI as SRR libraries. If your data is not yet stored in that form, temporarily name your files SRR999001.fasta and so forth, as there may be "SRR" dependencies in the code. In that situation the steps in scripts which use ascp to download data, and other programs to unpack it, are not relevant and should be disabled.
Note, for Purple sea urchin TransDecoder was run early, then the adapter/chimera removal scripts were run on the subset of transcripts which had a CDS region called. This reduced the number of transcripts processed by a factor of about 4, which should have sped up most of the subsequent steps. However, at the cost of longer run times it might be better to do steps 11-13 before step 2, as TransDecoder might then pick up some CDS regions on the cleaned transcripts that it missed when they were "raw". The analysis has not yet been done both ways, so a definitive recommendation isn't possible.
Step 0.
Set up some background information, used in various places below. CPUS is used by many scripts. Library information goes into the stages.txt and SRR_all.desc files. This example is from the Purple sea urchin project.
export CPUS=40 #or as many as you have or want to use.
cd /tmp
cat >pre_stages.txt <<EOD
SRR531952|embryo 0h
SRR531949|embryo 10h
SRR531860|embryo 18h
SRR531853|embryo 24h
SRR531948|embryo 24h
SRR532074|embryo 30h
SRR531956|embryo 40h
SRR531964|embryo 48h
SRR531954|embryo 56h
SRR531996|embryo 64h
SRR531950|embryo 72h
SRR532151|larva four-arm stage
SRR532055|larva vestibular invagination stage
SRR532143|larva pentagonal disc stage
SRR533746|larva tube-foot protrusion stage
SRR531957|post-metamorphosis
SRR531843|young juvenile
SRR531953|adult tissue coelomocyte
SRR531955|adult tissue gut
SRR531951|adult tissue axial gland
SRR532046|adult tissue radial nerve
SRR532121|adult tissue testes
SRR531958|adult tissue ovary
EOD
tr '|' '\t' <pre_stages.txt >stages.txt
rm pre_stages.txt
extract -in stages.txt -mt -dl '\t' -fmt '[1]' >SRR_all.desc
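The drmtools `extract` call above simply pulls the first tab-delimited field from each line of stages.txt. If you just want to see what it produces, a minimal coreutils equivalent (run here on a two-line stand-in stages.txt; the real file comes from the here-document above) is:

```shell
# Toy stand-in for stages.txt: accession, tab, stage description.
printf 'SRR531952\tembryo 0h\nSRR531949\tembryo 10h\n' > stages.txt
# The extract call above is equivalent to taking column 1:
cut -f1 stages.txt > SRR_all.desc
cat SRR_all.desc
```

This yields one SRR accession per line, which is the format the later per-library loops expect.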
Step 1.
Run trinity 2.8.3 on each of the SRRs. Names produced are all like "283_trinity_SRR531952.Trinity.fasta". This can take a long time; if multiple machines are available, split up the work by making subsets of SRR.desc.
cd $PATH1
cp /tmp/SRR_all.desc ./SRR.desc
do_many_283.sh >do_many_283.log 2>&1 # this could take days to weeks
Step 2.
Set up a directory to run TransDecoder, and make links with names like "283_SRR531952.fasta".
mkdir $PATH2
cd $PATH2
ls -1 $PATH1/283_trinity_SRR*fasta \
| extract -mt -dl './_' -fmt 'ln -s /[1,] 283_[-3].fasta' \
| execinput
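The extract/execinput pipeline above emits one `ln -s` command per assembly. A plain-shell sketch of the same renaming, using a stand-in directory and an empty stand-in file (the real files live in $PATH1):

```shell
# Stand-in for $PATH1 with one Trinity output file.
mkdir -p path1
: > path1/283_trinity_SRR531952.Trinity.fasta
# Link each assembly here under the shorter 283_<SRR>.fasta name,
# e.g. 283_trinity_SRR531952.Trinity.fasta -> 283_SRR531952.fasta
for f in path1/283_trinity_SRR*.Trinity.fasta; do
    base=${f##*/}             # 283_trinity_SRR531952.Trinity.fasta
    srr=${base#283_trinity_}  # SRR531952.Trinity.fasta
    srr=${srr%%.*}            # SRR531952
    ln -sf "$PWD/$f" "283_${srr}.fasta"
done
ls -1 283_SRR*.fasta
```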
Step 3.
Still in $PATH2, run first part of TransDecoder on each of the input files.
ls -1 283_SRR*.fasta \
| extract -mt -dl '_.' -fmt 'nice TransDecoder.LongOrfs -t [1,] >pass1_[2].log 2>&1 &' \
| execinput
Step 4.
Still in $PATH2, run the blastp and hmm parts, then the final part of TransDecoder for the final protein calls. Produces files with names like "283_SRR531952.fasta.transdecoder.pep".
do_many_transdecoder.sh
Step 5. Make a directory for reducing/merging this data
mkdir $PATH3
cd $PATH3
Step 6.
Copy files in while placing the library name on the front of each sequence name. That is, this >TRINITY_DN0_c0_g1_i4.p5 becomes >SRR531843_TRINITY_DN0_c0_g1_i4.p5.
ls -1 $PATH2/283*er.pep \
| extract -mt -dl '_.' -fmt 'extract -in [1,] -out 283_[-4].tmp.pep -wl 100000000 -if ">" -fmt ">[-4]_[[2,]]"' \
| execinput
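The nested extract call above rewrites each FASTA header to carry the library name. An awk sketch of the same transformation on a one-record demo file (the awk line is an illustrative equivalent, not the pipeline tool; the demo file name is made up):

```shell
# One-record demo peptide file.
printf '>TRINITY_DN0_c0_g1_i4.p5\nMSEQ\n' > demo.pep
srr=SRR531843
# Prefix every header line with the library name, so
# >TRINITY_DN0_c0_g1_i4.p5 becomes >SRR531843_TRINITY_DN0_c0_g1_i4.p5
awk -v srr="$srr" '/^>/ { print ">" srr "_" substr($0, 2); next } { print }' \
    demo.pep > "283_${srr}.tmp.pep"
head -n1 "283_${srr}.tmp.pep"
```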
Step 7.
Make mRNA files with all Trinity results with SRR*_ prefixes.
(cd $PATH1;
ls -1 283_trinity_SRR*fasta \
| extract -mt -dl '_.' -fmt 'extract -wl 10000000 -in [mc:1,] -out $PATH3/283_[3].tmp.fasta -if ">" -fmt ">[3]_[[2,]]"' \
| execinput )
Step 8.
In PATH3, make a file with all of these mRNAs and another with all the peptides
cat 283_SRR*.tmp.fasta >Transcriptome.fasta
cat 283_SRR*.tmp.pep >all_transdecoder.pep
Step 9.
In PATH3, copy in stages.txt and SRR_all.desc from step 0.
cp /tmp/stages.txt .
cp /tmp/SRR_all.desc .
cp SRR_all.desc SRR.desc
Step 10. Pick a set of 6 libraries well distributed by stage/tissue type. More could be used but it would take longer to build the database and to search it. For Purple sea urchin this was:
cat >SRR_lib6.desc <<EOD
SRR531949
SRR531956
SRR531950
SRR532055
SRR531843
SRR531958
EOD
Step 11.
Find and split chimeric transcripts.
# Retrieve the 6 SRRs, unpack to fasta
# Build a lastal database from it.
# Map the mRNAs which have a called peptide CDS (uses a modified lastal
# to find >=93% identity, full read maps, output in Blast table format)
# Splits any mRNA found to be chimeric.
# Keeps track of the splits in the ranges file so that it can be back applied
# to the peptides after this and adapter trimming.
# Creates output files like:
# transcriptome_oneset_SRR531949.supported.fasta
# transcriptome_oneset_SRR531949.supported.ranges
# The lastal database is not deleted, and if it is found at start up
# it is used. So if there are problems after that point a restart
# can skip the slow database build step.
nice do_many_read_mappings_unpaired_6libs.sh >do_many_read_mappings_unpaired_6libs.log 2>&1 &
Step 12.
Make a blast database holding just the relevant adapters.
# an initial scan with a subset of these mRNAs of UniVec showed that
# the adapters were NGB00360:58 NGB00362:61.
# Make a blast database with just those two, for faster processing of the entire set.
# Use values corresponding to your adapters.
cd $UNIVEC # wherever your UniVec database is
(echo 'gnl|uv|NGB00360.1:1-58'; echo 'gnl|uv|NGB00362.1:1-61') \
| fastaselecth -sel - -in UniVec -ht ' ' -hi ' ' >TwoNGB
formatdb -p F -i TwoNGB -o T #important to use that form
Step 13.
Find and remove adapters.
cd $PATH3
nice scan_many_fasta_for_two_contaminants.sh \
'transcriptome_oneset*.supported.fasta' $CPUS \
>scan_many_for_two_contaminants.log 2>&1
nice trim_many_adapters.sh \
'transcriptome_oneset_*.supported' \
'NGB00360:58 NGB00362:61' \
>trim_many_adapters.log 2>&1
# Keep repeating until no more adapters are trimmed. For sea urchin, two passes were enough.
# Note that the input is the output of the preceding round.
# Every round adds one more ".trimmed".
nice scan_many_fasta_for_two_contaminants.sh 'transcriptome_oneset*.supported.trimmed.fasta' $CPUS \
>scan_many_for_two_contaminants2.log 2>&1
nice trim_many_adapters.sh 'transcriptome_oneset_*.supported.trimmed' 'NGB00360:58 NGB00362:61' \
>trim_many_adapters2.log 2>&1
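The growing ".trimmed" suffix is easy to lose track of across rounds. A tiny sketch of just the filename bookkeeping (the round count here is illustrative; run as many rounds as your adapter scans require):

```shell
# Each trimming round appends one more ".trimmed" to the file stem, so
# round 2 reads *.supported.trimmed.fasta and writes *.supported.trimmed.trimmed.fasta.
stem=transcriptome_oneset_SRR531949.supported
for round in 1 2; do
    echo "round $round input:  ${stem}.fasta"
    stem="${stem}.trimmed"
    echo "round $round output: ${stem}.fasta"
done
```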
Step 14.
Process the original peptide calls, adjusting them to match the split/trimmed mRNA sequences. Note that the rangefile name here contains "trimmed.trimmed"; that is because it took 2 cycles in the preceding step to remove all adapters. The number of ".trimmed" suffixes may vary depending on how the adapter trimming goes.
This constructs a pair of commands for each SRR entry and runs them sequentially. The first produces a good and a bad pep output (good is unmodified; bad is modified). The second concatenates the two files onto a single growing result file.
rm -f all_good_badC.pep # Must not exist
extract -in SRR_all.desc \
-fmt "identify_broken_cds.pl \
-pepfile 283_[1,].tmp.pep \
-rangefile transcriptome_oneset_[1,].supported.trimmed.trimmed.ranges \
-goodfile tmp.good.pep \
-badfile tmp.bad.pep; \
cat tmp.good.pep tmp.bad.pep >> all_good_badC.pep" \
| execinput >all_good_badC.log 2>&1
Step 15.
Use CD-hit to cluster the peptide sequences. -b 180 allows many exon size gaps to align together, otherwise they end up in different clusters.
$CD_HIT_PATH/cd-hit \
-i all_good_badC.pep \
-o all_good_badC.pep_pass0 -d 0 -T $CPUS -M 0 -b 180 \
>all_good_badC.pep_pass0.log 2>&1
Step 16.
Make a flat file version of the cluster results too.
extract -in all_good_badC.pep_pass0.clstr -vB "Cluster" -mt -dl '>.,\t ' \
-sect BEFORE \
-op 'hide; poB' \
-sect MAIN \
-op 'xA;+A;poA; ? lgcf(scmn(B[1,1],A[1,1]))' \
-pm '?=:-' ' !; xC; +C<2; brk' \
-op 'xD;+D<[jr:fw3:1];+D<[fw8:jr:2];+D<= ;+D<[3,];<=Cluster; <[fw7:jr:vC:]; <=, ;<[vD:];<=\n' \
> all_good_badC.pep_pass0.clstr_flat.txt
Step 17.
Pick the "best" peptide sequence in each group. Keep this for reference if you want to see how much difference the upcoming recluster operation makes. Optionally, use the "sbc_good_badC.pep" file it produces instead of the "all_good_badC.pep_pass0" file in the next few commands.
select_best_cdhit.pl \
-infile all_good_badC.pep_pass0.clstr \
-outroot sbc_good_badC \
-pepfile all_good_badC.pep \
-workdir /tmp/sbc_work \
-Nprocs $CPUS \
-maxvariants 1 \
-maxaltvariants 1 >all_good_badC_sbc.log 2>&1
Step 18.
Make a lastal database of the all_good_badC peptides. These will be searched with the reference (marked with a "*") sequence from each cluster.
tr -d '*' <all_good_badC.pep | lastdb -p agbC
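The `tr -d '*'` above strips the '*' characters (TransDecoder stop-codon marks) from the peptides before the database build, since they are not amino acid residues. A toy demonstration of what it does:

```shell
# Demo peptide ending in a stop-codon '*'.
printf '>p1\nMKTA*\n' > demo.pep
tr -d '*' < demo.pep > demo_clean.pep
tail -n1 demo_clean.pep
```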
Step 19. Run blat and lastal searches of all "best" peptides. Get sequence sizes into files.
many_blat_1cpu.sh \
all_good_badC.pep \
all_good_badC.pep_pass0 \
pre_map_all_good_badC.pep_pass0B.txt \
$CPUS \
'-prot -minIdentity=90 -minScore=55 -out=blast8' >blat.log 2>&1
(lastal -P${CPUS} -m250 -I80 -Y95 -O50 -8 2 -9 .02 -f BlastTab agbC all_good_badC.pep_pass0 \
| grep -v '^#' \
| sort -k 1,1 -k 2,2 -k 7,7n \
> pre_map_all_good_badC.pep_pass0A.txt 2>pre_map_all_good_badC.pep_pass0A.log)
fastasizes <all_good_badC.pep >all_good_badC.pep.sizes
fastasizes <all_good_badC.pep_pass0 >all_good_badC.pep_pass0.sizes
Step 20.
Merge the two sets of matches (maps) into one final table. This merges a pair of hits like 1->100, 120->200 into a single synthetic hit of 1->200.
merge_lastal_to_final_map.pl \
-file_align1 pre_map_all_good_badC.pep_pass0A.txt \
-file_align2 pre_map_all_good_badC.pep_pass0B.txt \
-ctg_sizes all_good_badC.pep_pass0.sizes \
-trg_sizes all_good_badC.pep.sizes \
-file_out map_all_good_badC.pep_pass0_final.txt \
-ltest_skip 1 \
-min_aln 20 >merge_lastal_to_final_map.log 2>&1
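The merge joins colinear hit pairs such as 1->100 and 120->200 into one synthetic span. A simplified awk illustration of just the coordinate bookkeeping, on a toy four-column (query, target, start, end) file; the real merge_lastal_to_final_map.pl applies gap, overlap, and size checks this toy version omits:

```shell
# Two hits on the same query/target pair become one 1-200 span.
printf 'q1\tt1\t1\t100\nq1\tt1\t120\t200\n' > hits.tsv
awk -F'\t' '
    { key = $1 "_" $2 }
    key == prev { end = $4; next }                  # same pair: extend the span
    { prev = key; name = key; start = $3; end = $4 }
    END { print name, start, end }
' hits.tsv > merged.txt
cat merged.txt
```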
Step 21.
Recluster based on those searches. (This overcomes some limits in CD-HIT's original clustering method and reduces the number of clusters somewhat.)
recluster_cdhit.pl \
-infile all_good_badC.pep_pass0.clstr \
-mapfile map_all_good_badC.pep_pass0_final.txt \
-outroot all_good_badD >recluster.log 2>&1 &
Step 22.
Make a flat file version of this cluster too.
extract -in all_good_badD.clstr -vB "Cluster" -mt -dl '>.,\t ' \
-sect BEFORE \
-op 'hide; poB' \
-sect MAIN \
-op 'xA;+A;poA; ? lgcf(scmn(B[1,1],A[1,1]))' \
-pm '?=:-' ' !; xC; +C<2; brk' \
-op 'xD;+D<[jr:fw3:1];+D<[fw8:jr:2];+D<= ;+D<[3,];<=Cluster; <[fw7:jr:vC:]; <=, ;<[vD:];<=\n' \
> all_good_badD.clstr_flat.txt
Step 23.
First select a new set of "best" peptide sequences, one per cluster group.
Next select the "best" peptide sequence from groups using a representation constraint: keep a cluster only if it has members from at least 2 libraries and at least 3 different sequences. -statfile uses the "best" rather than the "reference" sequence, although these are often the same.
(This will drop some mRNAs which are only briefly expressed, or are very difficult to assemble. It also drops really egregious misassemblies, since those tend not to happen in more than one library.)
select_best_cdhit.pl \
-infile all_good_badD.clstr \
-outroot sbc_good_badD \
-pepfile all_good_badC.pep \
-workdir /tmp/sbc_work \
-Nprocs $CPUS \
-maxvariants 1 \
-maxaltvariants 1 >all_good_badD_sbc.log 2>&1
select_group_cdhit.pl \
-infile all_good_badD.clstr \
-outfile_ref all_good_badD_ref.txt \
-outfile_all all_good_badD_all.txt \
-statfile sbc_good_badD.stat \
-min_members 3 \
-min_types 2 \
-min_chunks 3
cat all_good_badD_ref.txt \
| fastaselecth -sel - -in sbc_good_badD.fasta \
| tr -d '*' > all_good_badD_atleast_3_no_asterisk.pep
Step 24.
Extract the corresponding mRNAs from the processed subset of mRNAs.
cat transcriptome_oneset_*.supported.trimmed.trimmed.fasta >/tmp/all.supported.trimmed.trimmed.fasta
extract -in all_good_badD_atleast_3_no_asterisk.pep \
-wl 1000000 -if '>' -ifonly \
-mt -dl '>.' -fmt '[1]' \
| fastaselecth -sel - -in /tmp/all.supported.trimmed.trimmed.fasta \
> all_good_badD_mRNA.fasta 2>/tmp/oops.log
cat /tmp/oops.log
# It should say something like this, where selectors == emitted:
#
# fastaselecth: status: selectors: 31134, records read: 1094436, emitted: 31134
#
rm /tmp/all.supported.trimmed.trimmed.fasta
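The selectors == emitted check is worth automating rather than eyeballing. A sketch that parses the status line format shown above and compares the two counts (the example line is the one from the comment; adapt the pattern if your build's status wording differs):

```shell
# Parse the fastaselecth status line and compare selector and emitted counts.
line='fastaselecth: status: selectors: 31134, records read: 1094436, emitted: 31134'
sel=$(echo "$line" | sed 's/.*selectors: \([0-9]*\),.*/\1/')
emit=$(echo "$line" | sed 's/.*emitted: \([0-9]*\).*/\1/')
if [ "$sel" -eq "$emit" ]; then
    echo "OK: all $sel selectors emitted"
else
    echo "MISMATCH: $sel selectors but $emit emitted" >&2
fi
```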
Step 25.
Gather information needed to assign names to the products. Make a peptide blast database using sequences from the NCBI. Later scripts will depend on that header format. To speed things up the database includes only proteins taxonomically close to the target organism. Here "Eleutherozoa" is used. A different method might be better for an organism in a branch which has seen more work (and so many more sequences).
# Browser to: https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
# Enter 133551 (or as appropriate for your organism)
# Click on the "proteins" link.
# Click on "send to"
# Select "file"
# Select "fasta"
# Download as "eleutherozoa_proteins.fasta" to $PATH4
cd $PATH4
makeblastdb -in eleutherozoa_proteins.fasta -dbtype prot -out eleutherozoa_proteins
# Same thing for uniprot set
# eleutherozoa is taxonomy code 133551
# Browser to: https://www.uniprot.org/uniprot/?query=taxonomy%3A133551&sort=score
# Click on "Download" in the line just above the table.
# Set "Download all", "Compressed", "fasta (canonical)" and click "go"
# When the download completes unpack it as "uniprot_eleutherozoa.fasta" in $PATH4
cd $PATH4
makeblastdb -in uniprot_eleutherozoa.fasta -dbtype prot -out uniprot_eleutherozoa
Step 26.
Get information for naming peptides by running interpro and blastp. The blast database used is from the NCBI, not swiss-prot. Later scripts depend on that for parsing the header.
interproscan-5.30-69.0/interproscan.sh -appl PfamA -iprlookup \
-goterms -f tsv -dp -i all_good_badD_atleast_3_no_asterisk.pep \
>interproscan.log 2>&1 &
blastp \
-query all_good_badD_atleast_3_no_asterisk.pep \
-db $PATH4/eleutherozoa_proteins \
-num_threads $CPUS \
-evalue 1e-10 \
-max_hsps 1 \
-num_alignments 1 \
-outfmt 6 \
> all_good_badD_atleast_3_no_asterisk_vs_eleutherozoa.outfmt6 \
2> all_good_badD_atleast_3_no_asterisk_vs_eleutherozoa.log
blastp \
-query all_good_badD_atleast_3_no_asterisk.pep \
-db $PATH4/uniprot_eleutherozoa \
-num_threads $CPUS \
-evalue 1e-10 \
-max_hsps 1 \
-num_alignments 1 \
-outfmt 6 \
> all_good_badD_atleast_3_no_asterisk_vs_uniprot_eleutherozoa.outfmt6 \
2> all_good_badD_atleast_3_no_asterisk_vs_uniprot_eleutherozoa.log
Step 27.
Remove contaminant sequences. This does not require that the identities
of the contaminating organisms be known beforehand, but it does require a large machine.
This method uses a 100X (or so) set of Illumina reads for the target genome.
These are used to mask the transcript regions which have low
or no coverage, and then transcripts are discarded if too much is masked.
Preparing the tuple dump can take quite a while and requires a very
large machine. The first script needs up to 100Gb of RAM to complete.
Consider changing DMIN in the first script to 3 to drop tuples which are only
present once or twice per genome. This greatly reduces the size of
the final file, but not the memory usage.
cd $PATH5 # where 100X reads are, in fasta format, called "unpacked.fasta"
make_tuple23_alt.sh >make_tuple23_alt.log 2>&1 #makes GEC_23_tuples.dat
#
# It will emit the number of records produced. Assume it is 7101378205.
# Need to add two more bytes to each record, each with a 1. First make a "skinny"
# file of 2 X 7101378205 = 10 X 1420275641 bytes. Using longer records will
# write faster. This record format was developed for the genome assembly
# part of this package and is just reused here, where some of the fields
# are not relevant.
#
binload -out pad.bin
{ 16 z 1 u 1 1 1 1 1 1 1 1 1 1 } 1420275641 e
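The padding arithmetic above is easy to get wrong: pad.bin supplies 2 extra bytes per tuple record, but is written as larger records for speed, so the byte totals must agree. A quick shell sanity check using the example run's numbers:

```shell
# 7101378205 tuple records need 2 padding bytes each; pad.bin is written
# as 10-byte records, so its record count is the pad byte total divided by 10.
tuple_records=7101378205
pad_total=$((tuple_records * 2))      # bytes of padding needed
skinny_records=$((pad_total / 10))    # pad.bin record count at 10 bytes each
echo "pad bytes: $pad_total, skinny records: $skinny_records"
```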
#
# now merge the 8 byte records from the first file with 2 byte records from the second file.
#
binorder -in GEC_23_tuples.dat,pad.bin -r 8 -sr 2 -combine -out counts.sorted.bin
binorder -preindex 24 -in counts.sorted.bin -r 10 -k 1,6
rm pad.bin
rm unpacked.fasta
rm GEC_23_tuples.dat
cd $PATH4
#
# 17 is minimum acceptable count
# 2147483647 is maximum (== 32 bit int_max)
#
mask_fasta_ploidy.sh \
all_good_badD_mRNA.fasta \
/tmp/masked_all_good_badD_mRNA.fasta 17 2147483647 \
$PATH5/counts.sorted.bin \
>/tmp/all_good_badD_mRNA_masking.log 2>&1
fasta_mask_statistics \
-in /tmp/masked_all_good_badD_mRNA.fasta \
-out /tmp/masked_all_good_badD_mRNA_summary.txt \
-delim ' '
histogram_column.pl -infile masked_all_good_badD_mRNA_summary.txt \
-outfile /tmp/histo_cds.txt -col 3 -max 100 -min 0 -bins 101
gnuplot
set yrange [0:1000]
plot "/tmp/histo_cds.txt" using 1:2 with lines;
The plot should start high on the left, fall quickly, be relatively flat across the middle, and start coming back up at around 90%. Trying various cutoff values for sea urchin, starting at 90% and working down, the BUSCO scores on the extracted subsets were:
All:   C:98.5%[S:92.3%,D:6.2%],F:0.2%,M:1.3%,n:978
cut90: C:98.6%[S:93.7%,D:4.9%],F:0.2%,M:1.2%,n:978
cut80: C:98.6%[S:93.8%,D:4.8%],F:0.2%,M:1.2%,n:978
cut60: C:98.6%[S:93.8%,D:4.8%],F:0.2%,M:1.2%,n:978
cut50: C:98.3%[S:93.5%,D:4.8%],F:0.2%,M:1.5%,n:978
cut40: C:97.6%[S:93.1%,D:4.5%],F:0.2%,M:2.2%,n:978
So the cutoff used was 60%, just above the point where the missing (M) score started to increase.
extract -in /tmp/masked_all_good_badD_mRNA_summary.txt \
-sr 2 -mt -dl ' ' \
-op 'hide; xA; +A; poA; ? lt(A[3],60.0)' \
-pm '?>:-' '<[vA:-1]; <=\n' \
> all_good_badD_below_cutoff60.txt
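The extract call above keeps the names of entries whose masked percentage is below 60. An awk sketch of the same filter on a toy summary file (the column layout, name in column 1 and percent masked in column 3 with a header row, is an assumption read off the extract options):

```shell
# Toy summary: name, length, percent masked; header on line 1 (-sr 2 skips it).
printf 'name len pctmasked\nseqA 900 12.5\nseqB 500 95.0\n' > summary.txt
awk 'NR > 1 && $3 < 60.0 { print $1 }' summary.txt > below_cutoff60.txt
cat below_cutoff60.txt
```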
fastaselecth -sel all_good_badD_below_cutoff60.txt \
-in all_good_badD_atleast_3_no_asterisk.pep -hi '. ' \
> all_good_badD_atleast_3_no_asterisk_pass_masktest60.pep
# Status should have same number of selectors and emitted, like so:
# fastaselecth: status: selectors: 29744, records read: 31134, emitted: 29744
fastaselecth -sel all_good_badD_below_cutoff60.txt \
-in all_good_badD_mRNA.fasta -hi '. ' \
> all_good_badD_mRNA_pass_masktest60.fasta
# Status should have same number of selectors and emitted, and these
# should be the same as for the previous command, like so:
#fastaselecth: status: selectors: 29744, records read: 31134, emitted: 29744
# so far so good, number of mRNA and peptides still match.
Step 28.
Make a comment file for TSA submission
# Start at https://submit.ncbi.nlm.nih.gov/structcomment/nongenomes/
# and fill in the web page. At the end it will download a .cmt file,
# save it as assembly.cmt. It is apparently OK to add comments BEFORE
# the text that gives you. This one seems to have been acceptable,
# at least at the automated stages:
cat >assembly.cmt <<EOD
RNA-seq libraries were separately assembled with Trinity
and proteins predicted with TransDecoder. Lemonade assembler
transcriptome scripts split chimeric sequences, removed adapters,
clustered the proteins, and selected the best sequence from
each cluster group. Tuples from 100X genomic reads were
mapped onto the transcripts and sequence below 17% coverage was masked.
Probable contaminant transcripts, with 60% or more masked sequence,
were dropped. Transcripts were reverse complemented as
needed so that the protein translation frame was on the positive strand.
StructuredCommentPrefix ##Assembly-Data-START##
Assembly Method Trinity 2.8.3; TransDecoder 5.3.0; Lemonade assembler transcriptome scripts 1/28/2019
Assembly Name SPU_TRANSCRIPTOME_C
Sequencing Technology Illumina GAIIx
EOD
Step 29.
Make an sbt file.
# browser to: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/
# fill in the form
# save as: template.sbt
Step 30.
Make a list of entries which are known to be problems. In practice you probably don't know this information yet, but the NCBI will happily point out all the shortcomings of your transcriptome assembly when you submit it! In the sea urchin there were just two: one was an rRNA with a called CDS (which isn't real), and the other was an mRNA which contaminated both the RNA-seq and genomic reads (both sets were produced in the same facility/lab). The format of the file is "entry\tcomment".
cat >/tmp/pre_bad.txt <<EOD
SRR532121_TRINITY_DN41715_c0_g1_i1_f1|mouse contaminant in both genomic and RNA-seq data
SRR531951_TRINITY_DN74_c0_g1_i13_f1|ribosomal RNA
EOD
tr '|' '\t' </tmp/pre_bad.txt >all_good_badD_mRNA_problems.txt
rm /tmp/pre_bad.txt
Step 31.
Generate files for tbl2asn
PATH6=/tmp/tbl2asn
mkdir $PATH6
annotate_transcripts.pl \
-infile all_good_badD_mRNA_pass_masktest60.fasta \
-pepfile all_good_badD_atleast_3_no_asterisk_pass_masktest60.pep \
-reffile /home/mathog/etc/eleutherozoa_proteins.fasta \
-mapreffile all_good_badD_atleast_3_no_asterisk_vs_eleutherozoa.outfmt6 \
-unifile /home/mathog/etc/uniprot_eleutherozoa.fasta \
-mapunifile all_good_badD_atleast_3_no_asterisk_vs_uniprot_eleutherozoa.outfmt6 \
-tsvfile all_good_badD_atleast_3_no_asterisk.pep.tsv \
-outdir $PATH6 \
-exprfile stages.txt \
-clstrfile all_good_badD.clstr \
-rejfile all_good_badD_mRNA_problems.txt \
-organism "Strongylocentrotus purpuratus" \
-biology 1 \
-outmap all_good_badD_map_to_tbl2asn_submission.txt \
-dbg 0 \
>annotate_transcripts.log 2>&1
cp assembly.cmt template.sbt $PATH6
Step 32.
Make one big SQN file
cd $PATH6
tbl2asn -p ./ -r ./ -t template.sbt \
-n 'Strongylocentrotus purpuratus' \
-Y assembly.cmt \
-M t -V t \
-o all_mRNA.sqn >tbl2asn.log 2>&1
Look at the log file and err*val files for problems, and resolve all of them before you try to submit. Some fix files will be produced; these record automatic renamings (input names which did not match NCBI requirements) and are harmless.
Step 33.
Submit it.
# browser to https://submit.ncbi.nlm.nih.gov/subs/tsa/
# Sign in.
# Fill in the web pages and upload the .sqn on the fourth tab.