scala - Spark: How to transform an RDD to a Seq to be used in a Pipeline


I want to use the Pipeline implementation in MLlib. Previously, I had an RDD built from a file and passed it to model creation, but to use a Pipeline there should be a sequence of LabeledDocument passed to the pipeline.

I have an RDD created as follows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("/test.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()

In the Pipeline example of the Spark programming guide, the pipeline needs the following data:

// Prepare training documents, which are labeled.
val training = sparkContext.parallelize(Seq(
  LabeledDocument(0L, "a b c d e spark", 1.0),
  LabeledDocument(1L, "b d", 0.0),
  LabeledDocument(2L, "spark f g h", 1.0),
  LabeledDocument(3L, "hadoop mapreduce", 0.0),
  LabeledDocument(4L, "b spark who", 1.0),
  LabeledDocument(5L, "g d y", 0.0),
  LabeledDocument(6L, "spark fly", 1.0),
  LabeledDocument(7L, "was mapreduce", 0.0),
  LabeledDocument(8L, "e spark program", 1.0),
  LabeledDocument(9L, "a e c l", 0.0),
  LabeledDocument(10L, "spark compile", 1.0),
  LabeledDocument(11L, "hadoop software", 0.0)))
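For reference, LabeledDocument in that example is just a small case class; as far as I can tell from the guide's pipeline example, it is defined roughly like this:

case class LabeledDocument(id: Long, text: String, label: Double)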

I need a way to change my RDD (parsedData) into a sequence of LabeledDocuments (like training in the example).
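For illustration only: if the CSV happened to contain text documents laid out as id,text,label (an assumption about the file, not how my actual /test.csv looks), one way to build such records directly from the lines could be a sketch like this:

// Minimal sketch, assuming each line is "id,text,label" and the
// LabeledDocument case class shown above.
val documents = sc.textFile("/test.csv").map { line =>
  val parts = line.split(',')
  LabeledDocument(parts(0).toLong, parts(1), parts(2).toDouble)
}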

I appreciate any help.

I found the answer to my question.

I can transform my RDD (parsedData) into a SchemaRDD, which is a sequence of LabeledDocuments, with the following code:

val rddSchema = parsedData.toSchemaRDD
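Note that toSchemaRDD only resolves once a SQLContext and its implicit conversion are in scope; a minimal setup, assuming the pre-1.3 SchemaRDD API, would be something like:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Brings the implicit conversion from an RDD of case classes
// (LabeledPoint is a case class) to a SchemaRDD into scope.
import sqlContext.createSchemaRDD
// After these imports, parsedData.toSchemaRDD compiles as above.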

Now the problem has changed! I want to split the new rddSchema into training (80%) and test (20%) sets. If I use randomSplit, it returns an Array[RDD[Row]] instead of a SchemaRDD.

New problem: how can I transform an Array[RDD[Row]] into a SchemaRDD -- or -- how can I split a SchemaRDD so that the results are SchemaRDDs?
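One approach I am considering (a sketch only, assuming randomSplit and SQLContext.applySchema behave as documented for SchemaRDD) is to split the rows and then re-apply the original schema:

// Sketch: split the rows, then rebuild SchemaRDDs by re-applying the schema.
val Array(trainRows, testRows) = rddSchema.randomSplit(Array(0.8, 0.2), seed = 11L)
val trainingSet = sqlContext.applySchema(trainRows, rddSchema.schema)
val testSet = sqlContext.applySchema(testRows, rddSchema.schema)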

