regex - Converting tab-delimited text file into HTML/PDF/LaTeX/knitr report
I have this tab-delimited file:
chr start end ref alt func.refgene gene.refgene genedetail.refgene exonicfunc.refgene aachange.refgene snp138 clinvar_20140929 sift_score sift_pred polyphen2_hdiv_score polyphen2_hdiv_pred polyphen2_hvar_score polyphen2_hvar_pred lrt_score lrt_pred mutationtaster_score mutationtaster_pred mutationassessor_score mutationassessor_pred fathmm_score fathmm_pred radialsvm_score radialsvm_pred lr_score lr_pred vest3_score cadd_raw cadd_phred gerp++_rs phylop46way_placental phylop100way_vertebrate siphy_29way_logodds
chr13 52523808 52523808 c t exonic atp7b nonsynonymous snv atp7b:nm_000053:exon12:c.2855g>a:p.r952k,atp7b:nm_001243182:exon13:c.2522g>a:p.r841k rs732774 clinsig=non-pathogenic|non-pathogenic;clndbn=wilson's_disease|not_specified;clnrevstat=single|single;clnacc=rcv000029357.1|rcv000078044.1;clndsdb=genereviews:medgen:omim:orphanet:snomed_ct|.;clndsdbid=nbk1512:c0019202:277900:orpha905:88518009|. 0.99 t 0.04 b 0.03 b 0.000 n 0.000 p -1.04 n -3.73 d -0.965 t 0.000 t 0.214 1.511 11.00 6.06 1.111 2.781 12.356
chr13 52523867 52523867 t g exonic atp7b synonymous snv atp7b:nm_000053:exon12:c.2796a>c:p.s932s,atp7b:nm_001243182:exon13:c.2463a>c:p.s821s
I have a bash script that takes an ABI file as input and uses ANNOVAR to annotate the variants. The tab-delimited text file it produces contains the annotated variants. Every time the bash script is executed with different ABI files, the number of columns in the tab-delimited file is fixed, but the number of rows and the individual annotations may vary for each resulting variant.
Attempts so far -->
I have tried to write a bash script that extracts [for the first variant] the different fields from the tab-delimited text file, saves each one to a text file, combines the resulting individual text files, and then uses an awk script to assign a different variable to each of the fields in the combined text file. I have created an HTML page using awk and have used these variables in the awk script to print the values in their respective tags in the HTML. This works fine for a file that follows the same pattern in the tab-delimited text file. But when a particular field is not present, or other annotated results have a different pattern, the script prints different fields from the ones each variable was assigned for.
If the first variant contains a clinically significant mutation, there is an annotation present in the "clinvar" column, and it needs to be reported in a different section along with the other details.
The order of the combined text file is not the same for each variant, hence the report generated is not correct.
Expected result -->
Since the format of the tab-delimited file is not uniform, is there a way that each row can be tested against multiple conditions? For example, if a specific column [for ex: clinvar] has a value, print it in between HTML tags; if it is not present, check another column [for ex: rsid], and if that value is present, print it in other HTML tags, and so on for the other columns as well.
variant position: chr13:52523808c>t
variant type: nonsynonymous-snv
rsid: rs732774
amino acid change: p.r952k
gene name: atp7b
disease: wilsons disease
result: non-pathogenic
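Most of the values listed above can be read from a single row once the columns are addressed by header name rather than by position. The following is only a minimal sketch in POSIX awk wrapped in a shell function. The names `summarize_variants`, `field` and `info_value` are made up for this example, and it assumes that the first transcript listed in aachange.refgene and the first "|"-separated ClinVar entry are the ones to report.

```shell
#!/bin/sh
# Sketch only: summarize_variants, field and info_value are hypothetical names.
# Reads a tab-delimited table (header line first) on standard input.
summarize_variants() {
    awk -F '\t' '
    # Value of the column called name in the current row, "" if no such column.
    function field(name) { return (name in col) ? $(col[name]) : "" }

    # First "|"-separated value stored under key in a "k=v;k=v" string,
    # or "" when the key is absent.
    function info_value(info, key,    parts, n, i, alts) {
        n = split(info, parts, ";")
        for (i = 1; i <= n; i++)
            if (index(parts[i], key "=") == 1) {
                split(substr(parts[i], length(key) + 2), alts, "|")
                return alts[1]
            }
        return ""
    }

    # Header line: remember the position of every column by its name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }

    {
        # aachange.refgene: comma-separated transcripts, each made of
        # colon-separated parts gene:transcript:exon:cDNA:protein.
        split(field("aachange.refgene"), transcripts, ",")
        split(transcripts[1], aa, ":")
        clinvar = field("clinvar_20140929")

        printf "variant position: %s:%s%s>%s\n", field("chr"), field("start"), field("ref"), field("alt")
        printf "amino acid change: %s\n", aa[5]
        printf "gene name: %s\n", field("gene.refgene")
        printf "disease: %s\n", info_value(clinvar, "clndbn")
        printf "result: %s\n", info_value(clinvar, "clinsig")
    }'
}

# Usage: a cut-down table holding only the columns read above
# (values copied from the first variant in the question).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr start ref alt gene.refgene aachange.refgene clinvar_20140929 \
    chr13 52523808 c t atp7b \
    'atp7b:nm_000053:exon12:c.2855g>a:p.r952k,atp7b:nm_001243182:exon13:c.2522g>a:p.r841k' \
    "clinsig=non-pathogenic|non-pathogenic;clndbn=wilson's_disease|not_specified" \
  | summarize_variants
```

Because every lookup goes through the header name, a missing or reordered column yields an empty string instead of a value from the wrong field.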
The format of the HTML page, and the values in it, should be like this:
<html> <title></title><head>
<style type="text/css">
body {background-color:lightgray}
h1 {background-color:slategray}
</style>
</head><body bgcolor="lightgray">
<table border=1><th align=>test code</th><th align=>gene name</th><th align=>condition tested</th><th align=>result</th>
<tr><td width=750 align=></td><td width=750 align=>atp7b(refseq id: nm_000053)</td><td width=750 align=>wilson's_disease</td><td width=750 align=>non-pathogenic</td></tr>
<h1 align=>test details</h1>
<table border=1><th align=centre>genomic location of mutation</th><th align=centre>mutation type</th><th align=centre>dbsnp identifier</th><th align=centre>amino acid change</th><th align=centre>omim identifier</th>
<h1 align=>significant findings</h1>
<tr><td width=750 align=>chr13:52523808c>t</td><td width=750 align=>nonsynonymous-snv</td><td width=750 align=>rs732774</td><td width=750 align=>p.r952k</td><td width=750 align=>http://www.omim.org/entry/277900</td></tr>
<p> The identified variant is located in the <strong> exonic </strong> region of the <strong> chr13 </strong> chromosome, and is a <strong> nonsynonymous-snv </strong> that causes an amino acid change from <strong> arginine </strong> to <strong> lysine </strong>. The mutation has been reported in the dbSNP database (http://www.ncbi.nlm.nih.gov/snp/) with an accession number of <strong> rs732774 </strong>. </p>
</table></body> </html>
In a similar manner, when there is a novel variant wherein the exonicfunc.refgene column contains "non-synonymous" and there is no value in the snp138 column, it should print the sift_score along with the other details in between HTML tags. These are some of the conditions needed; if anyone can give an idea of how to go about this, it would be helpful.
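The conditions described in this question form a chain of fallbacks, which awk can express as an ordinary if / else if sequence evaluated once per row. Below is a minimal sketch under stated assumptions: `classify_variants` and `field` are hypothetical names, the `class` attributes are placeholders for whatever HTML sections the report needs, and the regular expression accepts both the "nonsynonymous" spelling seen in the file and the "non-synonymous" spelling used in the text.

```shell
#!/bin/sh
# Sketch only: classify_variants and field are hypothetical names, and the
# class attributes stand in for the real report sections.
classify_variants() {
    awk -F '\t' '
    # Value of the column called name in the current row, "" if no such column.
    function field(name) { return (name in col) ? $(col[name]) : "" }

    # Header line: remember the position of every column by its name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }

    {
        clinvar = field("clinvar_20140929")
        rsid    = field("snp138")
        effect  = field("exonicfunc.refgene")

        if (clinvar != "")                       # clinically annotated variant
            printf "<p class=\"clinvar\">%s</p>\n", clinvar
        else if (rsid != "")                     # known in dbSNP only
            printf "<p class=\"dbsnp\">%s</p>\n", rsid
        else if (effect ~ /non-?synonymous/)     # novel, protein-changing
            printf "<p class=\"novel\">sift_score: %s</p>\n", field("sift_score")
        else                                     # anything else
            printf "<p class=\"other\">%s</p>\n", effect
    }'
}

# Usage with three made-up rows; rs0000000 and the score 0.02 are invented
# placeholders, not real annotations.
printf '%s\t%s\t%s\t%s\n' \
    snp138 clinvar_20140929 exonicfunc.refgene sift_score \
    rs732774 'clinsig=non-pathogenic' 'nonsynonymous snv' 0.99 \
    rs0000000 '' 'synonymous snv' '' \
    '' '' 'nonsynonymous snv' 0.02 \
  | classify_variants
```

The order of the branches is the priority order, so adding a rule means inserting one more `else if` at the right place.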
Thank you for reading such a long issue; any help on the problem is appreciated.
The awk program I show here splits the headers and the data into corresponding rows. I think you can modify it to suit the needs you have. Bear in mind that the prickly rules you have (when this doesn't appear, show that instead) are better implemented by yourself than by asking for an implementation.
#
# processor.awk
# (needs gawk: gensub and IGNORECASE are gawk extensions)
#
BEGIN {
    IGNORECASE = 1;
    header = "";
    html_template = "<tr><td>##fieldname</td><td>##fieldvalue</td></tr>"
}

{
    if( header == "" && $0 != "" ) {
        # the first non-empty line is the header
        header = $0;
        # put every element of the header into an array
        split( header, fields, "\t" );
        # debug: print the fields found
        #for( elem in fields )
        #    print "field" elem ": " fields[elem];
    } # if
    else {
        # normal lines
        # split the line into elements
        split( $0, content, "\t" );
        # for every element in the content line....
        for( elem = 1; fields[elem] != ""; elem++ ) {
            # print the column number
            print elem;
            out_line = html_template;
            out_line = gensub( /##fieldname/, fields[elem], "g", out_line );
            out_line = gensub( /##fieldvalue/, content[elem], "g", out_line );
            # print the result
            print out_line;
        } # for
    } # if
}

END {
}
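Two details of the script above are worth keeping in mind. `gensub` is a gawk extension, and in its replacement text `&` stands for the matched text, so a value containing `&` would not be copied literally. Values such as c.2855g>a also contain `>`, which should be escaped in HTML. A possible variant that sidesteps both by using `printf` and an escaping helper is sketched below. It is not the answer's original method, and `rows_to_html` and `html_escape` are names invented for this example.

```shell
#!/bin/sh
# Sketch only: rows_to_html and html_escape are hypothetical names.
# Emits one two-column HTML table (field name, field value) per data row.
rows_to_html() {
    awk -F '\t' '
    # Escape the characters that are special in HTML text.
    # "&" is handled first so the entities added later are not escaped again.
    function html_escape(s) {
        gsub(/&/, "\\&amp;", s)
        gsub(/</, "\\&lt;", s)
        gsub(/>/, "\\&gt;", s)
        return s
    }

    # The first non-empty line is the header: remember the column names.
    !seen_header && $0 != "" { n = split($0, fields, "\t"); seen_header = 1; next }

    # Every later non-empty line is one variant.
    $0 != "" {
        print "<table border=\"1\">"
        for (i = 1; i <= n; i++)
            printf "<tr><td>%s</td><td>%s</td></tr>\n", html_escape(fields[i]), html_escape($i)
        print "</table>"
    }'
}

# Usage with a two-column excerpt of the table from the question.
printf 'gene.refgene\taachange.refgene\natp7b\tatp7b:nm_000053:exon12:c.2855g>a:p.r952k\n' \
  | rows_to_html
```

The original script would be run along the lines of `gawk -f processor.awk annotated.txt`, where annotated.txt is a placeholder for the ANNOVAR output file.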