• Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). Since Spark 2.0, the Dataset is the primary interface: it is strongly typed like an RDD, but with richer optimizations under the hood.
  • a Dataset is a distributed collection of domain-specific objects
  • Dataset[Row] = DataFrame (a DataFrame is simply a Dataset whose elements are generic Row objects)
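A minimal sketch of the Dataset/DataFrame relationship, assuming a running SparkSession bound to the name `spark` (as in `spark-shell`):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// A DataFrame is just a type alias for Dataset[Row],
// so the assignment below compiles without any conversion.
val df: DataFrame = spark.range(3).toDF("id")
val ds: Dataset[Row] = df
```

In Scala, `type DataFrame = Dataset[Row]` is literally how the alias is declared, which is why the two names are interchangeable at the type level.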
  • transformations and actions: actions trigger computation and return results (e.g. count, show, or writing data out to a file system); transformations produce new Datasets (e.g. map, filter, select, and aggregations such as groupBy).
  • Datasets are "lazy": computation is only triggered when an action is invoked.
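The transformation/action split and the resulting laziness can be sketched as follows, assuming a SparkSession named `spark` with `spark.implicits._` imported:

```scala
import spark.implicits._

val nums = spark.range(1, 1000)                        // Dataset[java.lang.Long]
val evens = nums.filter($"id" % 2 === 0)               // transformation: nothing executes yet
val doubled = evens.select(($"id" * 2).as("doubled"))  // transformation: still nothing executes

val n = doubled.count()   // action: triggers the whole pipeline and returns a count
doubled.show(5)           // action: re-evaluates the plan and prints the first 5 rows
```

Until `count()` or `show()` is called, Spark only records the chain of transformations; no data is read or computed.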
  • A Dataset represents a logical plan that describes the computation required to produce the data. Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner.
  • use the explain function to inspect the logical plan as well as the optimized physical plan.
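A short sketch of inspecting plans, again assuming a SparkSession named `spark` with `spark.implicits._` in scope:

```scala
import spark.implicits._

val q = spark.range(100)
  .filter($"id" > 10)
  .groupBy(($"id" % 3).as("r"))
  .count()

// extended mode prints the parsed and analyzed logical plans,
// the optimized logical plan, and the chosen physical plan
q.explain(true)
```

Comparing the optimized logical plan with the physical plan is a quick way to see what the Catalyst optimizer actually changed (e.g. pushed-down filters).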
  • To efficiently support domain-specific objects, an Encoder is required: it converts the domain type T to and from Spark's internal binary format, letting many operations run without deserializing whole objects.
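A sketch of a typed Dataset backed by an Encoder; importing `spark.implicits._` provides the implicit `Encoder` for the case class (assumes a SparkSession named `spark`):

```scala
import org.apache.spark.sql.Dataset

// A domain-specific type; Spark derives its Encoder automatically
// for case classes via spark.implicits._
case class Person(name: String, age: Int)

import spark.implicits._

val people: Dataset[Person] = Seq(Person("Ann", 30), Person("Bo", 17)).toDS()

// typed transformation: the lambda works on Person objects,
// and the Encoder handles conversion to the internal format
val adults = people.filter(_.age >= 18)
adults.show()
```

Without an Encoder in scope, `toDS()` would not compile, which is how the strong typing of the Dataset API is enforced.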