Genome assembly 6 – making sense of the BLAST output

Post number 4 talked about the process of BLASTing the new assemblies against the reference genes chs.fa, and sorting the resulting BLAST output file into a .csv file. This post will describe the continuation of the process.

A python script CSV_parse.py was written with the intent of generating a short sorted list of the top BLAST hits based on E-values. An interesting observation made was how multiple contigs seemed to be hits for same CHS genes; therefore the list includes the number of times each CHS gene was a hit for each contig:

The decision was made to use the assembly generated by Maria using the AbySS assembler with k-mer 30 which had the longest N50 of 1204. This assembly is however much larger than my previous assembly, and requires the computing prowess of the cluster to complete BLAST. As such my python script CSV_parse.py was transferred to the cluster in an attempt to keep all data on the cluster. This script however requires packages such as pandas and sys, which the default python package on Rackham does not have. The instructions on https://www.uppmax.uu.se/support/user-guides/python-modules-guide/ were followed to enable installation of these packages.

Because local modifications were made to the .csv blast output file to include headers, the following bash command [sed -i.bak 1i”QSEQID,SSEQID,PIDENT, LENGTH,MISMATCH,GAPOPEN,QSTART,QEND,SSTART,SEND,EVALUE,BITSCORE” blast3_sorted_c11.csv] was used to append the headers.

The following screenshot shows the list of hits from the top 100 E-values of the abyss assembly:

The next screenshot shows the list of top 300 E-values:

This list gives evidence as to which CHS genes are most closely related to the contigs from the AbySS assembly, which would be chs_Pram1 and chs_Pinf.

/Daryl