python - How to extract start and end sites based on capital letter in a sequence?


I want to extract the start and end site of the part of each sequence that is written in capital letters. Counting the whole sequence length with the code below does not return accurate site information. The P-Match result I need to process gives the start site based on the first letter of the sequence, but the start site I need is the position where the first capital letter occurs in each site. How can I retrieve the accurate start and end sites? Can anyone help me?

text file a.txt

    scanning sequence id:   best1_human
                  150 (-)  1.000  0.997  GGAAAggccc                                   r05891
                  354 (+)  0.988  0.981  gtgtAGACAtt                                  r06227
    v$crel_01c-relv$evi1_05evi-1

    scanning sequence id:   4f2_human
                  365 (+)  1.000  1.000  gggacCTACA                                   r05884
                  789 (-)  1.000  1.000  gcgCGAAA                                       r05828; r05834; r05835; r05838; r05839
    v$crel_01c-relv$e2f_02e2f

expected output:

    sequence id    start  end
    best1_human    150    155
    best1_human    358    363
    4f2_human      370    370
    4f2_human      792    797

file b.txt

    scanning sequence id: hg17_ct_er_er_142
                  512 (-)  0.988  0.981  taTAGCTaagc                        evi-1          r06227
    v$evi1_05

    scanning sequence id: hg17_ct_er_er_1
                  213 (-)  1.000  0.989  aggggcaggGGTCA                     coup-tf, hnf-4 r07445
    v$coup_01

expected output:

    hg17_ct_er_er_142    514    519
    hg17_ct_er_er_1      222    227

example code:

    output_file = open('output.bed', 'w')
    with open('a.txt') as f:
        text = f.read()
        chunks = text.split('scanning sequence id:')
        for chunk in chunks:
            if chunk:
                lines = chunk.split('\n')
                sequence_id = lines[0].strip()
                for line in lines:
                    if line.startswith('              '):
                        start = int(line.split()[0].strip())
                        sequence = line.split()[-2].strip()
                        stop = start + len(sequence)
                        #print sequence_id, start, stop
                        seq = '%s\t%i\t%i\n' % \
                              (sequence_id, start, stop)
                        output_file.write(seq)
    output_file.close()

This code extracts the label and the first start value of each record:

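    import re

    # re.DOTALL lets ".*?" run across line breaks between the label and the first number
    p = r"scanning sequence id:\s*(?P<label>[a-z0-9]+_[a-z0-9]+).*?(?P<start_value>\d+)"

    with open("a.txt", "r") as f:
        s = f.read()

    print(re.findall(p, s, re.DOTALL))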

sample output:

[('best1_human', '150'), ('4f2_human', '365')] 
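As a side note, re.findall returns plain tuples, so the group names in the pattern are not visible in the result. The sketch below shows one way to keep them, using re.finditer and groupdict(). The text variable is a hypothetical two-record excerpt standing in for the contents of a.txt, not the real file.

```python
import re

pattern = re.compile(
    r"scanning sequence id:\s*(?P<label>[a-z0-9]+_[a-z0-9]+).*?(?P<start_value>\d+)",
    re.DOTALL,
)

# Hypothetical excerpt standing in for a.txt (assumption, for illustration only).
text = (
    "scanning sequence id:   best1_human\n"
    "              150 (-)  1.000  0.997  GGAAAggccc   r05891\n"
    "scanning sequence id:   4f2_human\n"
    "              365 (+)  1.000  1.000  gggacCTACA   r05884\n"
)

# Each match object exposes the named groups as a dictionary.
records = [m.groupdict() for m in pattern.finditer(text)]
print(records)
```

Each entry of records is a dictionary with the keys label and start_value, which can be easier to read than positional tuples once more groups are added.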

Then there's the calculation of the second number (the "end site"). In the code in the opening post I see: sequence = line.split()[-2].strip(); stop = start + len(sequence). Hence I conclude that you want to increment the start value by the string length of the second-to-last column (GGAAAggccc etc.).

I can capture that column as well, using the following modified regexp:

    # skip strand and the two score columns (\S+ each), then capture the sequence
    p = r"scanning sequence id:\s*(?P<label>[a-z0-9]+_[a-z0-9]+).*?(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)"
    print(re.findall(p, s, re.DOTALL))

sample output:

    [('best1_human', '150', 'GGAAAggccc'), ('4f2_human', '365', 'gggacCTACA')]
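Once the sequence is captured, the start and end of its capital-letter part can be worked out from the span of the first run of capitals. The following is a minimal sketch of that arithmetic. The helper name capital_span is hypothetical, and the capitalisation of the example sequences is an assumption inferred from the expected coordinates in the question.

```python
import re

def capital_span(start, sequence):
    """Return (start, end) of the first run of capitals, offset by `start`.

    Returns None when the sequence contains no capital letters.
    """
    match = re.search(r"[A-Z]+", sequence)
    if match is None:
        return None
    begin, end = match.span()
    return start + begin, start + end

print(capital_span(150, "GGAAAggccc"))   # (150, 155)
print(capital_span(354, "gtgtAGACAtt"))  # (358, 363)
```

Adding len(sequence) to the start, as in the opening post, would give 160 and 365 here, which is why the counts did not match the expected sites.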

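Now you want to handle the situation where one label has more than one data line. For this, we need to drop re.findall and go for iteration over the lines. Within each data line, the first run of capital letters is located and its span is added to the start value: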

    import re

    with open("a.txt", "r") as f:
        lines = f.readlines()

    label_ptrn = re.compile(r"^scanning sequence id:\s*(?P<label>[a-z0-9]+_[a-z0-9]+)$")
    line_ptrn = re.compile(r"^\s+(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+).*$")
    inner_ptrn = re.compile("[A-Z]+")  # first run of capital letters

    all_matches = []
    for line in lines:
        m = label_ptrn.match(line)
        if m:
            # remember the current label for the data lines that follow
            label = m.groupdict().get("label")
            continue
        m = line_ptrn.match(line)
        if m:
            start = m.groupdict().get("start_value")
            sequence = m.groupdict().get("sequence")
            mi = inner_ptrn.search(sequence)
            if not mi:
                continue
            span = mi.span()
            all_matches.append((label, int(start) + span[0], int(start) + span[1]))

Then you can write the matches out as follows:

    # text mode, since the formatted lines are str
    with open("output.bed", "w") as f:
        for m in all_matches:
            f.write('%s\t%i\t%i\n' % m)

sample output:

    best1_human    150    155
    best1_human    358    363
    4f2_human      370    375
    4f2_human      792    797
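The label pattern above expects exactly one underscore, so IDs such as hg17_ct_er_er_142 from b.txt would not be picked up by it. Below is a hedged sketch of a more permissive variant that takes any non-blank token as the label and ignores extra trailing columns. The names LABEL, SITE, CAPITALS and parse_sites are hypothetical, and the capitalisation in the sample is an assumption inferred from the expected output for b.txt.

```python
import re

# Label line: any non-blank token after the colon (more permissive, an assumption).
LABEL = re.compile(r"^scanning sequence id:\s*(?P<label>\S+)\s*$", re.IGNORECASE)
# Data line: start, strand, two scores, then the sequence; later columns are ignored.
SITE = re.compile(r"^\s+(?P<start>\d+)\s+\([+-]\)\s+\S+\s+\S+\s+(?P<sequence>\S+)")
CAPITALS = re.compile(r"[A-Z]+")

def parse_sites(text):
    """Return (label, start, end) for the capital-letter run of every data line."""
    sites = []
    label = None
    for line in text.splitlines():
        m = LABEL.match(line)
        if m:
            label = m.group("label")
            continue
        m = SITE.match(line)
        if m and label is not None:
            core = CAPITALS.search(m.group("sequence"))
            if core:
                start = int(m.group("start"))
                sites.append((label, start + core.start(), start + core.end()))
    return sites

# Sample mirroring b.txt; capitalisation inferred from the expected coordinates.
sample = """\
scanning sequence id: hg17_ct_er_er_142
              512 (-)  0.988  0.981  taTAGCTaagc                        evi-1          r06227
scanning sequence id: hg17_ct_er_er_1
              213 (-)  1.000  0.989  aggggcaggGGTCA                     coup-tf, hnf-4 r07445
"""

for site in parse_sites(sample):
    print("%s\t%i\t%i" % site)
```

Anchoring the data-line pattern on the strand column, (+) or (-), is a design choice made here so that factor names and multiple accession numbers after the sequence do not shift which column is read.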

I think the problem is solved ;)

