python - How to extract start and end sites based on capital letters in a sequence?
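I want to extract start and end site information based on the capital letters in each sequence. Counting the sequence length with the code below does not return the site information accurately. The P-Match result needs to be processed so that, given the reported start position, the start site is the position where the first capital letter occurs in each sequence, and the end site is where the run of capital letters ends. How can I retrieve accurate start and end sites? Can anyone help me?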
Text file a.txt:

    scanning sequence id: best1_human
      150 (-) 1.000 0.997 GGAAAggccc r05891
      354 (+) 0.988 0.981 gtgtAGACAtt r06227
    v$crel_01c-relv$evi1_05evi-1
    scanning sequence id: 4f2_human
      365 (+) 1.000 1.000 gggacCTACA r05884
      789 (-) 1.000 1.000 gcgCGAAA r05828; r05834; r05835; r05838; r05839
    v$crel_01c-relv$e2f_02e2f

Expected output:
    sequence id    start    end
    best1_human    150      155
    best1_human    358      363
    4f2_human      370      375
    4f2_human      792      797

File b.txt:
    scanning sequence id: hg17_ct_er_er_142
      512 (-) 0.988 0.981 taTAGCTaagc evi-1 r06227 v$evi1_05
    scanning sequence id: hg17_ct_er_er_1
      213 (-) 1.000 0.989 aggggcaggGGTCA coup-tf, hnf-4 r07445 v$coup_01

Expected output:

    hg17_ct_er_er_142    514    519
    hg17_ct_er_er_1      222    227

Example code:
    output_file = open('output.bed', 'w')
    with open('a.txt') as f:
        text = f.read()
        # Each chunk holds one sequence id followed by its data lines
        chunks = text.split('scanning sequence id:')
        for chunk in chunks:
            if chunk:
                lines = chunk.split('\n')
                sequence_id = lines[0].strip()
                for line in lines:
                    if line.startswith(' '):
                        start = int(line.split()[0].strip())
                        sequence = line.split()[-2].strip()
                        # End is computed from the full column length, not from the capital letters
                        stop = start + len(sequence)
                        # print sequence_id, start, stop
                        seq = '%s\t%i\t%i\n' % (sequence_id, start, stop)
                        output_file.write(seq)
    output_file.close()
This code captures the label and the start values:

    import re

    p = r"scanning sequence id\:\s*(?P<label>[a-z0-9]+\_[a-z0-9]+).*?(?P<start_value>\d+)"
    with open("a.txt", "r") as f:
        s = f.read()
    re.findall(p, s, re.DOTALL)

Sample output:
    [('best1_human', '150'), ('4f2_human', '365')]

Then there's the calculation of the second number (the "end site"). In the code in the opening post I see: sequence = line.split()[-2].strip(); stop = start + len(sequence). Hence I conclude that you want to increment the start value by the string length of the second-to-last column (GGAAAggccc etc.).
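Note that adding the length of the whole column gives a different end than the capital-letter run the question asks about. A minimal sketch of the difference, assuming the sequence column mixes lower-case and upper-case letters as in P-Match output; the helper name capital_span and the casing of the sample string are assumptions for illustration, not taken from the original code:

```python
import re


def capital_span(start, sequence):
    """Return (start, end) of the first run of capital letters, shifted by `start`.

    Hypothetical helper: returns None when the sequence has no capital letters.
    """
    m = re.search(r"[A-Z]+", sequence)
    if m is None:
        return None
    return start + m.start(), start + m.end()


# Assumed casing: the capital run begins at offset 4 and ends at offset 9
print(capital_span(354, "gtgtAGACAtt"))   # (358, 363)

# Using the full column length instead overshoots the end site
print(354 + len("gtgtAGACAtt"))           # 365
```

The match object's start() and end() are offsets within the column, so adding them to the reported position gives the site coordinates directly.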
I can capture that column as well, using the following modified regexp:

    p = r"scanning sequence id\:\s*(?P<label>[a-z0-9]+\_[a-z0-9]+).*?(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+)"
    re.findall(p, s, re.DOTALL)

Sample output:
    [('best1_human', '150', 'GGAAAggccc'), ('4f2_human', '365', 'gggacCTACA')]

Now you want to handle the situation where one label has more than one data line. For this, we need to drop re.findall and iterate over the lines instead:
    import re

    with open("a.txt", "r") as f:
        lines = f.readlines()

    # Header line that introduces a new sequence id
    label_ptrn = re.compile(r"^scanning sequence id\:\s*(?P<label>[a-z0-9]+\_[a-z0-9]+)$")
    # Data line: start position, three more columns, then the sequence
    line_ptrn = re.compile(r"^\s+(?P<start_value>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<sequence>\S+).*$")
    # First run of capital letters inside the sequence
    inner_ptrn = re.compile(r"[A-Z]+")

    all_matches = []
    for line in lines:
        m = label_ptrn.match(line)
        if m:
            label = m.groupdict().get("label")
            continue
        m = line_ptrn.match(line)
        if m:
            start = m.groupdict().get("start_value")
            sequence = m.groupdict().get("sequence")
            mi = inner_ptrn.search(sequence)
            if not mi:
                continue
            span = mi.span()
            all_matches.append((label, int(start) + span[0], int(start) + span[1]))

Then you can print the matches as follows:
    # Text mode, since formatted strings are written
    with open("output.bed", "w") as f:
        for m in all_matches:
            f.write('%s\t%i\t%i\n' % m)

Sample output:
    best1_human    150    155
    best1_human    358    363
    4f2_human      370    375
    4f2_human      792    797

I think the problem is solved ;)
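As a follow-up, the same idea can be wrapped in a function that also copes with ids containing several underscores, such as the ones in b.txt. This is only a sketch under assumptions: the function name extract_sites is made up for illustration, the label pattern is loosened to any run of non-whitespace characters, and the casing of the sample sequences is inferred from the expected output in the question rather than copied from a real file.

```python
import re

# Header line; IGNORECASE so the exact casing of the header text does not matter
LABEL = re.compile(r"^scanning sequence id:\s*(?P<label>\S+)\s*$", re.IGNORECASE)
# Indented data line: start position, three more columns, then the sequence
DATA = re.compile(r"^\s+(?P<start>\d+)\s+\S+\s+\S+\s+\S+\s+(?P<seq>\S+)")
# First run of capital letters inside the sequence column
CORE = re.compile(r"[A-Z]+")


def extract_sites(text):
    """Return a list of (label, start, end) tuples, one per data line with capitals."""
    sites = []
    label = None
    for line in text.splitlines():
        m = LABEL.match(line)
        if m:
            label = m.group("label")
            continue
        m = DATA.match(line)
        if m and label is not None:
            core = CORE.search(m.group("seq"))
            if core:
                start = int(m.group("start"))
                sites.append((label, start + core.start(), start + core.end()))
    return sites


# Sample modelled on b.txt; the upper-case runs are an assumption
sample = (
    "scanning sequence id: hg17_ct_er_er_142\n"
    "   512 (-) 0.988 0.981 taTAGCTaagc evi-1 r06227 v$evi1_05\n"
    "scanning sequence id: hg17_ct_er_er_1\n"
    "   213 (-) 1.000 0.989 aggggcaggGGTCA coup-tf, hnf-4 r07445 v$coup_01\n"
)

for row in extract_sites(sample):
    print("%s\t%d\t%d" % row)
```

Working on a string instead of an open file keeps the parsing separate from the I/O, so the result can be checked before anything is written to output.bed.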