- Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). Since Spark 2.0, the RDD API has been superseded by the Dataset API, which is strongly typed like an RDD but comes with richer optimizations under the hood.
- A Dataset is a distributed collection of domain-specific objects.
- Dataset[Row] = DataFrame: a DataFrame is simply a Dataset whose elements are generic Row objects.
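A minimal Scala sketch of that relationship, assuming a local SparkSession; the `Person` case class and the `people.json` path are hypothetical placeholders, not from the original notes:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class used for illustration.
case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .appName("dataset-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A DataFrame is just Dataset[Row]: fields are untyped at compile time.
val df: DataFrame = spark.read.json("people.json")

// Converting to a typed Dataset gives compile-time field checking.
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 21)   // lambda over Person, checked by the compiler
```

With the DataFrame, a misspelled column name only fails at runtime; with the typed Dataset, `_.age` is checked at compile time.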
- Operations on a Dataset are either transformations or actions:
  - actions trigger computation and return results, e.g. count, show, or writing data out to a file system.
  - transformations produce new Datasets, e.g. map, filter, select, and aggregations such as groupBy.
- Datasets are "lazy": transformations only build up a plan, and computation is triggered when an action is invoked.
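The two bullets above can be sketched as follows (Scala, assuming an existing SparkSession named `spark`; variable names are illustrative):

```scala
import spark.implicits._

val nums = spark.range(1, 1000000)   // Dataset of Longs

// Transformations: nothing is computed yet, only a logical plan is built.
val evens   = nums.filter(_ % 2 == 0)
val squared = evens.map(n => n * n)

// Actions: trigger the whole pipeline and materialize results.
val total = squared.count()   // returns a value to the driver
squared.show(5)               // prints the first 5 rows
```

Until `count()` or `show()` runs, no data is read and no executors do any work.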
- A Dataset represents a logical plan that describes the computation required to produce the data. Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner.
- Use the explain function to inspect the logical plan as well as the optimized physical plan.
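A quick sketch of explain on a small query (again assuming an existing SparkSession `spark`; the query itself is an arbitrary example):

```scala
import spark.implicits._

val df = spark.range(100).toDF("id")
val query = df.filter($"id" > 10).groupBy($"id" % 2).count()

// With extended = true, prints the parsed, analyzed, and optimized
// logical plans followed by the physical plan Spark will execute.
query.explain(true)
```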
- To efficiently support domain-specific objects, an Encoder is required; it maps the domain type to and from Spark's internal binary representation.
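A short sketch of where encoders come from (Scala; the `Event` case class is a hypothetical example). Usually they are supplied implicitly via `spark.implicits._`, but they can also be obtained explicitly:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical domain class for illustration.
case class Event(id: Long, kind: String)

// An Encoder maps JVM objects to Spark's compact internal binary format,
// letting Spark operate on serialized data without full deserialization.
val enc: Encoder[Event] = Encoders.product[Event]

// More commonly, encoders are resolved implicitly:
import spark.implicits._
val ds = Seq(Event(1L, "click"), Event(2L, "view")).toDS()
```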