regex - Converting tab-delimited text file into HTML/PDF/LaTeX/knitr report


I have this tab-delimited file:

chr    start    end    ref    alt    func.refgene    gene.refgene    genedetail.refgene    exonicfunc.refgene    aachange.refgene    snp138    clinvar_20140929    sift_score    sift_pred    polyphen2_hdiv_score    polyphen2_hdiv_pred    polyphen2_hvar_score    polyphen2_hvar_pred    lrt_score    lrt_pred    mutationtaster_score    mutationtaster_pred    mutationassessor_score    mutationassessor_pred    fathmm_score    fathmm_pred    radialsvm_score    radialsvm_pred    lr_score    lr_pred    vest3_score    cadd_raw    cadd_phred    gerp++_rs    phylop46way_placental    phylop100way_vertebrate    siphy_29way_logodds
chr13    52523808    52523808    c    t    exonic    atp7b        nonsynonymous snv    atp7b:nm_000053:exon12:c.2855g>a:p.r952k,atp7b:nm_001243182:exon13:c.2522g>a:p.r841k    rs732774    clinsig=non-pathogenic|non-pathogenic;clndbn=wilson's_disease|not_specified;clnrevstat=single|single;clnacc=rcv000029357.1|rcv000078044.1;clndsdb=genereviews:medgen:omim:orphanet:snomed_ct|.;clndsdbid=nbk1512:c0019202:277900:orpha905:88518009|.    0.99    t    0.04    b    0.03    b    0.000    n    0.000    p    -1.04    n    -3.73    d    -0.965    t    0.000    t    0.214    1.511    11.00    6.06    1.111    2.781    12.356
chr13    52523867    52523867    t    g    exonic    atp7b        synonymous snv    atp7b:nm_000053:exon12:c.2796a>c:p.s932s,atp7b:nm_001243182:exon13:c.2463a>c:p.s821s

I have a bash script that takes an ABI file as input and uses ANNOVAR to annotate the variants. The tab-delimited text file it produces contains the annotated variants. Every time the bash script is executed with different ABI files, the number of columns in the tab-delimited file is fixed, but the number of rows and the individual annotations may vary for each resulting variant.

My attempts so far:

I have tried to write a bash script that extracts [for the first variant] the different fields from the tab-delimited text file, saves each one as a text file, combines the resulting text from the individual files, and then, using an awk script, assigns a different variable to each of the fields in the combined text file. I have created an HTML page using awk and have used these variables in the awk script to print them in their respective tags in the HTML. This works fine for a file that follows the same pattern in the tab-delimited text file. But when a particular field is not present, or the other annotated results have a different pattern, the script prints fields other than the ones the variables were assigned for.

If the first variant contains a clinically significant mutation, there will be an annotation present in the "clinvar" column, and it needs to be reported in a different section along with the other details.

The order of the combined text file is not the same for each variant, hence the report generated is not correct.

Expected result:

Since the format of the tab-delimited file is not uniform, is there a way to apply multiple conditions to each row? For example: if a specific column [e.g. clinvar] has a value, print it in between HTML tags; if it is not present, check another column [e.g. rsid], and if that has a value, print it in other HTML tags; and so on for the other columns as well.

Variant position: chr13:52523808c>t

Variant type: nonsynonymous-snv

rsid: rs732774

Amino acid change: p.r952k

Gene name: atp7b

Disease: wilsons disease

Result: non-pathogenic
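As a rough illustration, the values listed above could be derived from a data row along the following lines. This is only a sketch: `variant_summary` and `get` are hypothetical names, the column names are taken from the sample header, and it assumes that the first transcript in `aachange.refgene` and the first `|`-separated clinvar entry are the ones to report.

```shell
# Hypothetical helper: reads the annotated table on stdin and prints the
# summary fields for every data row. Uses only POSIX awk features.
variant_summary() {
    awk -F '\t' '
    # Return the value of a named column, or "" if the header lacks it.
    function get(name) { return (name in col) ? $(col[name]) : "" }

    $0 == "" { next }                        # skip empty lines

    !have_header {                           # first non-empty line is the header
        for (i = 1; i <= NF; i++) {
            col[$i] = i
            # The clinvar column name carries a date suffix, so match its prefix.
            if ($i ~ /^clinvar/) col["clinvar"] = i
        }
        have_header = 1
        next
    }

    {
        # Parse "key=a|b;key2=c|d" and keep the first alternative of each key.
        split("", cv)                        # reset the per-row clinvar values
        m = split(get("clinvar"), kv, ";")
        for (i = 1; i <= m; i++) {
            eq = index(kv[i], "=")
            if (eq == 0) continue
            split(substr(kv[i], eq + 1), alt, "|")
            cv[substr(kv[i], 1, eq - 1)] = alt[1]
        }

        # First transcript entry; its last colon-separated part is the protein change.
        split(get("aachange.refgene"), tx, ",")
        n = split(tx[1], part, ":")

        vtype = get("exonicfunc.refgene")
        gsub(/ /, "-", vtype)

        printf "Variant position: %s:%s%s>%s\n", get("chr"), get("start"), get("ref"), get("alt")
        print "Variant type: " vtype
        print "rsid: " get("snp138")
        print "Amino acid change: " part[n]
        print "Gene name: " get("gene.refgene")
        print "Disease: " cv["clndbn"]
        print "Result: " cv["clinsig"]
    }'
}
```

It would be called as `variant_summary < annotated.txt`, where `annotated.txt` stands in for the ANNOVAR output file. Because columns are looked up by header name rather than by position, a missing value shows up as an empty string instead of shifting the other fields.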

The format of the HTML page and the values in it should be like this:

<html>
<title></title>
<head>
<style type="text/css">
body {background-color:lightgray}
h1   {background-color:slategray}
</style>
</head>
<body bgcolor="lightgray">
<table border=1>
<th align=>test code</th><th align=>gene name</th><th align=>condition tested</th><th align=>result</th>
<tr><td width=750 align=></td><td width=750 align=>atp7b(refseq id: nm_000053)</td><td width=750 align=>wilson's_disease</td><td width=750 align=>non-pathogenic</td></tr>
<h1 align=>test details</h1>
<table border=1>
<th align=centre>genomic location of mutation</th><th align=centre>mutation type</th><th align=centre>dbsnp identifier</th><th align=centre>amino acid change</th><th align=centre>omim identifier</th>
<h1 align=>significant findings</h1>
<tr><td width=750 align=>chr13:52523808c>t</td><td width=750 align=>nonsynonymous-snv</td><td width=750 align=>rs732774</td><td width=750 align=>p.r952k</td><td width=750 align=>http://www.omim.org/entry/277900</td></tr>
<p> The identified variant is located in the <strong> exonic </strong> region of the <strong> chr13 </strong> chromosome and is a <strong> nonsynonymous-snv </strong> which causes an amino acid change from <strong> arginine </strong> to <strong> lysine </strong>. This mutation has been reported in the dbSNP database (http://www.ncbi.nlm.nih.gov/snp/) with an accession number of <strong> rs732774 </strong>. </p>
</table></body>
</html>

In a similar manner, when there is a novel variant wherein the exonicfunc.refgene column contains "non-synonymous" and there is no value in the snp138 column, it should print the sift_score along with the other details in between HTML tags. These are some of the conditions needed; if anyone can give an idea of how to go about this, it would be helpful!

Thank you for reading such a long issue. Any help on this problem is appreciated.

The awk program I show here splits the header and the data in the corresponding rows. I think you can modify it and customize it to the needs you have. Bear in mind that the prickly rules you have ("when this doesn't appear, show that instead") are better implemented by you than by asking for an implementation.

#
# processor.awk
# (IGNORECASE and gensub are gawk extensions, so run this with gawk)
#
BEGIN {
    IGNORECASE = 1;
    header = "";
    html_template = "<tr><td>##fieldname</td><td>##fieldvalue</td></tr>"
}
{
    if ( header == "" && $0 != "" ) {
        # the first non-empty line is the header
        header = $0;

        # put every element of the header into an array
        split( header, fields, "\t" );

        # debug: print the fields found
        #for ( elem in fields )
        #    print "field" elem ": " fields[elem];
    }
    else {
        # normal lines: split the line into elements
        split( $0, content, "\t" );

        # for every element in the content line...
        for ( elem = 1; fields[elem] != ""; elem++ ) {
            # print the field number
            print elem;
            out_line = html_template;

            out_line = gensub( /##fieldname/, fields[elem], "g", out_line );
            out_line = gensub( /##fieldvalue/, content[elem], "g", out_line );

            # print the result
            print out_line;
        }
    }
}
END {
}
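The "when this doesn't appear, show that instead" rules from the question could be layered on top of the same header splitting, for example as in the sketch below. It is one possible reading of the rules (clinvar first, then the dbSNP identifier, then sift_score for a nonsynonymous variant without an identifier), not a tested implementation of the full report; `variant_rows`, `get`, `esc`, `row` and the row labels are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: reads the annotated table on stdin and prints HTML
# table rows, choosing what to report for each variant. POSIX awk only.
variant_rows() {
    awk -F '\t' '
    # Return the value of a named column, or "" if the header lacks it.
    function get(name) { return (name in col) ? $(col[name]) : "" }

    # Escape characters that are special in HTML (values such as c.2855g>a).
    function esc(s) {
        gsub(/&/, "\\&amp;", s)
        gsub(/</, "\\&lt;", s)
        gsub(/>/, "\\&gt;", s)
        return s
    }

    function row(label, value) {
        printf "<tr><td>%s</td><td>%s</td></tr>\n", label, esc(value)
    }

    $0 == "" { next }                        # skip empty lines

    !have_header {                           # first non-empty line is the header
        for (i = 1; i <= NF; i++) {
            col[$i] = i
            # The clinvar column name carries a date suffix, so match its prefix.
            if ($i ~ /^clinvar/) col["clinvar"] = i
        }
        have_header = 1
        next
    }

    {
        clinvar = get("clinvar")
        rsid    = get("snp138")
        effect  = get("exonicfunc.refgene")

        row("gene name", get("gene.refgene"))

        if (clinvar != "")                        # clinically annotated variant
            row("clinvar annotation", clinvar)
        else if (rsid != "")                      # known variant without clinvar entry
            row("dbsnp identifier", rsid)
        else if (effect ~ /^non-?synonymous/)     # novel nonsynonymous variant
            row("sift score", get("sift_score"))
        else
            row("note", "no clinvar, dbsnp or sift rule matched")
    }'
}
```

Each further rule then becomes one more `else if` branch that tests a named column, so the order of the branches is the only place where the priority between columns is decided.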
