This document summarizes Apache Spark, covering the following points:
- Spark is a framework for large-scale data processing across clusters. It is typically faster than Hadoop MapReduce because it keeps intermediate data in memory and minimizes disk I/O (see the caching call in the RDD sketch after this list).
- Spark transformations are lazy operations that produce new resilient distributed datasets (RDDs); actions trigger actual computation and return results to the driver program (sketched below).
- Spark can read data from many sources, such as files, databases, and sockets, through its data source APIs, and can process both batch and streaming data (see the reader sketch below).
- Spark Streaming divides a live data stream into micro-batches represented as DStreams (discretized streams) and integrates with messaging systems such as Kafka. Structured Streaming is a newer API that expresses streaming computations on DataFrames/Datasets (see the streaming sketch at the end).
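
A minimal sketch of the transformation/action split, and of the in-memory caching that underlies Spark's speed advantage. It assumes a local Spark installation; the object name, app name, and data are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object RddBasics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-basics")
      .master("local[*]") // local mode, for the sketch only
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations (filter, map) are lazy: they only build a lineage graph.
    val numbers = sc.parallelize(1 to 100)
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)

    // cache() keeps the computed partitions in memory, so the two actions
    // below reuse them instead of recomputing from scratch.
    squares.cache()

    // Actions (count, reduce) trigger execution and return results to the driver.
    println(s"count = ${squares.count()}")
    println(s"sum   = ${squares.reduce(_ + _)}")

    spark.stop()
  }
}
```

Nothing runs on the cluster until `count()` is called; the lineage graph built by the transformations is what lets Spark recompute lost partitions without checkpointing to disk.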
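A sketch of the batch data source APIs, reading the same `SparkSession` from a file and from a database. The file path, JDBC URL, table name, and credentials are all placeholders, not real endpoints:

```scala
import org.apache.spark.sql.SparkSession

object DataSources {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-sources")
      .master("local[*]")
      .getOrCreate()

    // File source: path and options are illustrative.
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/events.csv")

    // JDBC source: URL, table, and credentials are hypothetical,
    // and the matching JDBC driver must be on the classpath.
    val jdbcDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/shop")
      .option("dbtable", "orders")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    csvDf.printSchema()
    jdbcDf.show(5)

    spark.stop()
  }
}
```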
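And a Structured Streaming sketch consuming from Kafka into a streaming DataFrame. The broker address and topic name are assumptions, and the `spark-sql-kafka` connector package must be supplied at submit time:

```scala
import org.apache.spark.sql.SparkSession

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming")
      .master("local[*]")
      .getOrCreate()

    // Kafka source: bootstrap server and topic are placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Each micro-batch arrives as rows of a streaming DataFrame;
    // the Kafka payload sits in the binary `value` column.
    val lines = events.selectExpr("CAST(value AS STRING) AS line")

    // Print each micro-batch to the console until the query is stopped.
    val query = lines.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

The same DataFrame operations used in batch jobs apply here, which is the main appeal of Structured Streaming over the older DStream API.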