python - Reading a CSV file into a Pandas DataFrame with invalid characters (accents)
I am trying to read a CSV file into a pandas DataFrame. However, the CSV contains accents. I am using Python 2.7.
I've run into a UnicodeDecodeError
because there is an accent in the first column. I've read on a bunch of sites: this question on UTF-8 in CSV files, this blog post on CSV errors related to newlines, and this blog post on UTF-8 issues in Python 2.7.
I used the answers I found there to try to modify my code. Originally I had:
    import pandas as pd

    # Create a DataFrame with the data I am interested in
    df = pd.DataFrame.from_csv('mydata.csv')
    mode = lambda ts: ts.value_counts(sort=True).index[0]
    cols = df['companyname'].value_counts().index
    df['calls'] = df.groupby('companyname')['companyname'].transform(pd.Series.value_counts)
And so on. This worked, but passing in "nÍ" and "nê" as customer names gives this error:
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 7: invalid continuation byte
I tried changing the line to df = pd.read_csv('mydata.csv', encoding='utf-8'), but it gives the same error.
So I tried the suggestions I found while researching, but that is not working either, and I am getting the same error.
    import pandas as pd
    import csv

    def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
        # Decode every cell as UTF-8 while iterating over the rows
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]

    reader = unicode_csv_reader(open('mydata.csv', 'rU'), dialect=csv.reader)

    # Create a DataFrame with the data I am interested in
    df = pd.DataFrame(reader)
I feel it should not be this difficult to read CSV data into a pandas DataFrame. Does anyone know of an easier way?
Edit: What is strange is that if I delete the row with the accented characters, I still get an error:
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 960: invalid continuation byte.
This is strange, as my test CSV has only 19 rows and 27 columns. I hope that decoding the entire CSV as UTF-8 will fix the problem.
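The byte values in the two error messages hint at the file's actual encoding. In Latin-1 and Windows-1252, 0xea is "ê" and 0xd0 is "Ð", whereas in UTF-8 either byte can only start a multi-byte sequence, so a file that contains them on their own is probably not UTF-8 at all. Below is a minimal sketch for checking which encodings can decode a sample of the data. The helper name decodable_encodings, the candidate list, and the sample bytes are assumptions made for illustration, not part of the original code:

```python
def decodable_encodings(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the candidate encodings that decode `raw` without an error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence; skip it.
            continue
        ok.append(name)
    return ok


# Hypothetical sample: a header line and one name ending in byte 0xea.
sample = b"companyname\nN\xea\n"
print(decodable_encodings(sample))
```

Latin-1 maps every possible byte to a character, so it never raises. Success here only shows that an encoding is possible, not that it is the right one, and the decoded text still needs a visual check. If one of the non-UTF-8 candidates turns out to be correct, its name can be passed as the encoding argument of pd.read_csv in place of 'utf-8'.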
Try adding this to the top of your script:
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')
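Note that reload as a builtin and sys.setdefaultencoding exist only in Python 2. An alternative that leaves the interpreter's default encoding alone is to rewrite the file as UTF-8 once and then read the converted copy. The sketch below assumes the source file is Latin-1, which is a guess based on the 0xea byte in the error message. The function name transcode_to_utf8 and the file names are hypothetical:

```python
import io
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, source_encoding="latin-1"):
    """Rewrite src_path as UTF-8, assuming it is stored in source_encoding."""
    # newline="" keeps the original line endings untouched in both directions.
    with io.open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo on a throwaway file so the sketch does not touch real data.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "mydata_latin1.csv")
dst_path = os.path.join(tmp_dir, "mydata_utf8.csv")
with io.open(src_path, "wb") as f:
    f.write(b"companyname,calls\nN\xea,3\n")

transcode_to_utf8(src_path, dst_path)

with io.open(dst_path, "rb") as f:
    converted = f.read()
print(repr(converted))
```

After the conversion, the copy should load with encoding='utf-8'. If the accented names come out wrong, the assumed source encoding was not the right one.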