python - pySpark - DataFrame groupBy error -
i have spark 1.3 set in virtualbox unbuntu 14 32 bit vm. have taken csv file spark dataframe , attempting operations giving me error messages can't troubleshoot.
the pyspark code below
from pyspark.sql import sqlcontext, row pyspark.sql.types import * datetime import * dateutil.parser import parse sqlcontext = sqlcontext(sc) elevfile = sc.textfile('file:////sharefolder/jones lake.csv') header = elevfile.first() schemastring = header.replace('"','') fields = [structfield(field_name, stringtype(), true) field_name in schemastring.split(',')] fields[0].datatype = integertype() fields[1].datatype = timestamptype() fields[2].datatype = floattype() schema = structtype(fields) elevheader = elevfile.filter(lambda l: "hour" in l) elevheader.collect() elevnoheader = elevfile.subtract(elevheader) print elevnoheader.take(5) elev_df = (elevnoheader.map(lambda k: k.split(",")) .map(lambda p: (int(p[0]), parse(p[1]), float(p[2]))) .todf(schema)) everything works fine point. prints out top 5 rows of new dataframe no problem:
print elev_df.head(5) [row(hour=6, date=datetime.datetime(1989, 9, 19, 0, 0), value=641.6890258789062), row(hour=20, date=datetime.datetime(1992, 4, 30, 0, 0), value=633.7100219726562), row(hour=10, date=datetime.datetime(1987, 7, 26, 0, 0), value=638.6920166015625), row(hour=1, date=datetime.datetime(1991, 2, 26, 0, 0), value=634.2100219726562), row(hour=2, date=datetime.datetime(1984, 7, 28, 0, 0), value=639.8779907226562)] but when try simple group , count, getting errors can't troubleshoot.
elev_df.groupby("hour").count().show() gives error (top few lines of error below).
--------------------------------------------------------------------------- py4jjavaerror traceback (most recent call last) <ipython-input-209-6533c596fac9> in <module>() ----> 1 elev_df.groupby("hour").count().show() /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py in show(self, n) 271 5 bob 272 """ --> 273 print self._jdf.showstring(n).encode('utf8', 'ignore') 274 275 def __repr__(self): /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 temp_arg in temp_args: /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise py4jjavaerror( 299 'an error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise py4jerror( any ideas on troubleshooting further?
it seems ur csv has blank value. can see replacing blank values groupby not accepting believe. handle ur csv blank values using spark dataframe easy way-
fillna(value, subset=none) replace null values, alias na.fill(). dataframe.fillna() , dataframenafunctions.fill() aliases of each other. parameters: value – int, long, float, string, or dict. value replace null values with. if value dict, subset ignored , value must mapping column name (string) replacement value. replacement value must int, long, float, or string. subset – optional list of column names consider. columns specified in subset not have matching data type ignored. example, if value string, , subset contains non-string column, non-string column ignored.
Comments
Post a Comment