Scala - Spark: How to transform an RDD into a Seq to be used in a Pipeline
I want to use the Pipeline implementation in MLlib. Previously, I had an RDD file that I passed directly to the model creation, but to use a Pipeline, a sequence of LabeledDocument has to be passed to the pipeline.
I have an RDD created as follows:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("/test.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  // first column is the label, the remaining columns are the features
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()
In the Pipeline example of the Spark programming guide, the pipeline needs the following data:
// Prepare training documents, which are labeled.
val training = sparkContext.parallelize(Seq(
  LabeledDocument(0L, "a b c d e spark", 1.0),
  LabeledDocument(1L, "b d", 0.0),
  LabeledDocument(2L, "spark f g h", 1.0),
  LabeledDocument(3L, "hadoop mapreduce", 0.0),
  LabeledDocument(4L, "b spark who", 1.0),
  LabeledDocument(5L, "g d y", 0.0),
  LabeledDocument(6L, "spark fly", 1.0),
  LabeledDocument(7L, "was mapreduce", 0.0),
  LabeledDocument(8L, "e spark program", 1.0),
  LabeledDocument(9L, "a e c l", 0.0),
  LabeledDocument(10L, "spark compile", 1.0),
  LabeledDocument(11L, "hadoop software", 0.0)))
I need a way to change my RDD (parsedData) into a sequence of LabeledDocuments (like training in the example).
I would appreciate any help.
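For comparison, here is a minimal sketch of how an RDD of the guide's LabeledDocument case class could be built from a text file. The file path and the "label,text" line layout are made up for illustration; they are not from the original question:

case class LabeledDocument(id: Long, text: String, label: Double)

// Hypothetical input: each line is "label,document text ..."
val docs = sc.textFile("/docs.csv")
  .zipWithIndex()                    // attach a running id to each line
  .map { case (line, id) =>
    val parts = line.split(",", 2)   // parts(0) = label, parts(1) = text
    LabeledDocument(id, parts(1), parts(0).toDouble)
  }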
I found the answer to my question.
I can transform my RDD (parsedData) into a SchemaRDD, a sequence of LabeledDocuments, with the following code:
val rddSchema = parsedData.toSchemaRDD
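For this line to compile, a SQLContext and its implicit conversion have to be in scope. A minimal sketch of the surrounding setup, assuming the Spark 1.2-era API this post appears to use:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit conversion from RDD of case classes to SchemaRDD

val rddSchema = parsedData.toSchemaRDD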
Now the problem has changed! I want to split the new rddSchema into training (80%) and test (20%) sets. If I use randomSplit, it returns an Array[RDD[Row]] instead of SchemaRDDs.
New problem: how can I transform an Array[RDD[Row]] into a SchemaRDD -- or -- how can I split a SchemaRDD so that the results are SchemaRDDs?
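One possible approach, assuming the Spark 1.2-era SchemaRDD API: split the underlying rows with randomSplit, then re-attach the original schema to each piece with applySchema. This is a sketch, not a verified recipe, and the seed value is arbitrary:

// Split the rows 80/20, then re-apply the schema of the original SchemaRDD
// so that both pieces are SchemaRDDs again.
val Array(trainRows, testRows) = rddSchema.randomSplit(Array(0.8, 0.2), seed = 11L)
val trainingData = sqlContext.applySchema(trainRows, rddSchema.schema)
val testData = sqlContext.applySchema(testRows, rddSchema.schema)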