Apache Spark
Definition
Apache Spark is an open-source, distributed computing engine designed for large-scale data processing. It was created at UC Berkeley's AMPLab in 2009 and donated to the Apache Software Foundation, where it became the dominant framework for big data analytics. Spark executes computations in memory across a cluster of machines, dramatically outperforming disk-based predecessors like Hadoop MapReduce for iterative workloads — a property that makes it especially well-suited to machine learning.
Spark provides a unified API across four primary workloads: batch processing, SQL analytics, streaming (Structured Streaming), and machine learning (MLlib). This unification means teams can use a single cluster and a single programming model for the entire ML data pipeline — from raw log ingestion and feature engineering at petabyte scale all the way to distributed model training with MLlib. PySpark, the Python API, is the most widely used interface in the data science community.
The programming model is built around transformations and actions on distributed collections. Transformations (map, filter, join, groupBy) are lazy: they build an execution plan but do not run until an action (count, collect, write) is called. Spark's optimizer (Catalyst) rewrites and optimizes this plan before execution, so straightforward DataFrame code often performs on par with carefully hand-tuned queries. The result is an expressive, high-level API that scales from a laptop (local mode) to thousands of cores without code changes.
How it works
RDDs and DataFrames
Resilient Distributed Datasets (RDDs) are Spark's low-level abstraction: immutable, fault-tolerant, partitioned collections of records distributed across a cluster. RDDs support arbitrary transformations in Python, Scala, or Java but offer no schema information, so the optimizer has limited visibility into the data. DataFrames (and their typed counterpart, Datasets in Scala/Java) add a named schema on top of RDDs and expose a SQL-like API. The Catalyst optimizer can apply predicate pushdown, column pruning, and join reordering to DataFrame queries in ways that are not possible with raw RDDs. In practice, use DataFrames unless you need functionality that only RDDs expose.
Spark SQL
Spark SQL lets you query DataFrames with standard SQL syntax, mixing SQL and DataFrame API calls in the same program. It integrates with the Hive metastore and with open table formats such as Delta Lake and Iceberg, enabling Spark to act as a query engine over a data lakehouse. In ML pipelines, Spark SQL is used for feature aggregation queries (rolling windows, user-level aggregations, and joins across large tables) that would be prohibitively slow on a single machine.
MLlib
MLlib is Spark's distributed machine learning library. It provides algorithms for classification, regression, clustering, collaborative filtering, and feature engineering, all implemented to run in parallel across the cluster. The Pipeline API (pyspark.ml) mirrors scikit-learn's design: Transformer steps (scalers, encoders) and Estimator steps (model fitting) are chained into a Pipeline object that can be fit and serialized. MLlib is best used when training data is too large to fit in memory on a single machine, or when you need distributed hyperparameter tuning via CrossValidator.
Driver and executor architecture
A Spark application has one driver process and one or more executor processes. The driver runs the user's program, constructs the logical plan, negotiates resources with the cluster manager (YARN, Kubernetes, or Spark Standalone), and divides the work into tasks. Executors are JVM processes running on worker nodes; each is allocated a configurable number of CPU cores (one task slot per core) and a fixed amount of memory. Tasks are serialized and shipped to executors, which process their assigned data partitions and write results back to memory, disk, or an output sink. Fault tolerance is achieved by recomputing lost partitions from their lineage (the sequence of transformations that produced them).
When to use / When NOT to use
| Use when | Avoid when |
|---|---|
| Dataset does not fit in a single machine's RAM (> ~50 GB) | Data fits comfortably in memory — pandas or Polars will be faster |
| Feature engineering requires joins or aggregations across billions of rows | Your team lacks experience with distributed systems and cluster management |
| You need distributed ML training (MLlib) or hyperparameter search | Job startup overhead (JVM, cluster allocation) is unacceptable for latency-sensitive tasks |
| You already have a Spark cluster (Databricks, EMR, Dataproc) | You need real-time, sub-second processing (use Flink or Kafka Streams) |
| You want a unified engine for batch, SQL, and streaming on the same data | Your transformations are complex custom logic that benefits from single-threaded debugging |
Comparisons
| Criterion | Apache Spark | Pandas |
|---|---|---|
| Data scale | Petabytes — partitioned across a cluster | Gigabytes — limited by single-machine RAM |
| Execution model | Distributed, parallel, lazy evaluation | In-process, eager, single-threaded by default |
| Setup complexity | High — cluster, JVM, dependency management | Low — pip install pandas and run |
| Performance on small data | Slow due to serialization and task scheduling overhead | Fast — minimal overhead, cache-friendly |
| API familiarity | PySpark is similar to pandas but with distributed semantics | Widely known; the standard for data science in Python |
| ML integration | MLlib for distributed training; integrates with XGBoost on Spark | scikit-learn ecosystem; best-in-class for single-machine ML |
Pros and cons
| Pros | Cons |
|---|---|
| Scales to petabytes without code changes | Significant infrastructure and operational complexity |
| Unified engine for batch, SQL, and streaming | JVM startup and task scheduling add latency — not suitable for small jobs |
| In-memory processing dramatically faster than disk-based MapReduce | Memory management (spills, GC pressure) requires careful tuning |
| Rich ecosystem: Delta Lake, Iceberg, Hudi, MLflow integration | Python UDFs shuttle rows between the JVM and Python workers, which can be slow for custom logic |
| Fault tolerance via lineage recomputation | Debugging distributed jobs is harder than single-machine pandas code |
Code examples
"""
PySpark examples:
1. DataFrame operations for feature engineering at scale
2. MLlib Pipeline for training a logistic regression model
Requires: pyspark >= 3.4
Run locally with: spark-submit spark_ml_example.py
or in a notebook with a SparkSession already available.
"""
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# --- 1. Create a SparkSession (local mode for demo) ---
spark = (
SparkSession.builder
.appName("MLPipelineExample")
.master("local[*]") # Use all local CPU cores
.config("spark.sql.shuffle.partitions", "8") # Reduce for small data
.getOrCreate()
)
spark.sparkContext.setLogLevel("WARN")
# --- 2. Load raw data ---
# In production: spark.read.parquet("s3://bucket/path/") or .format("delta")
raw_df = spark.createDataFrame(
[
(1, 25.0, "engineer", 55000.0, 0),
(2, 32.0, "manager", 85000.0, 1),
(3, 28.0, "engineer", 62000.0, 0),
(4, 45.0, "manager", 110000.0, 1),
(5, 38.0, "analyst", 72000.0, 1),
(6, 22.0, "analyst", 48000.0, 0),
(7, 51.0, "manager", 125000.0, 1),
(8, 29.0, "engineer", 59000.0, 0),
],
schema=["id", "age", "role", "salary", "high_earner"],
)
print("=== Raw data ===")
raw_df.show()
# --- 3. DataFrame feature engineering ---
from pyspark.sql.window import Window

feature_df = (
    raw_df
    # Log-transform salary to reduce skew
    .withColumn("log_salary", F.log1p(F.col("salary")))
    # Center age on its mean within each role group (window function)
    .withColumn(
        "age_norm",
        (F.col("age") - F.mean("age").over(Window.partitionBy("role")))
        .cast(DoubleType()),
    )
    # Drop the original salary column (replaced by log_salary)
    .drop("salary")
)
print("=== Engineered features ===")
feature_df.show()
# --- 4. MLlib Pipeline: encode → assemble → scale → train ---
# 4a. Encode categorical column 'role' to numeric index
role_indexer = StringIndexer(inputCol="role", outputCol="role_idx")
# 4b. Assemble feature columns into a single vector
assembler = VectorAssembler(
    inputCols=["age", "age_norm", "log_salary", "role_idx"],
    outputCol="raw_features",
)
# 4c. Standardize feature vector (zero mean, unit variance)
scaler = StandardScaler(
inputCol="raw_features",
outputCol="features",
withMean=True,
withStd=True,
)
# 4d. Logistic regression classifier
lr = LogisticRegression(
featuresCol="features",
labelCol="high_earner",
maxIter=100,
regParam=0.01,
)
# 4e. Build and fit the pipeline
pipeline = Pipeline(stages=[role_indexer, assembler, scaler, lr])
# Split the engineered DataFrame: the pipeline expects log_salary and age_norm,
# which exist on feature_df but not on raw_df
train_df, test_df = feature_df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)
# --- 5. Evaluate ---
predictions = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol="high_earner")
auc = evaluator.evaluate(predictions)
print(f"=== AUC on test set: {auc:.4f} ===")
predictions.select("id", "high_earner", "prediction", "probability").show()
# --- 6. Save model for serving ---
model.write().overwrite().save("/tmp/spark_lr_model")
print("Model saved to /tmp/spark_lr_model")
spark.stop()
Practical resources
- Apache Spark documentation — Official reference for all APIs including PySpark, SQL, Streaming, and MLlib
- PySpark API reference — Complete Python API docs for DataFrames, SQL, and MLlib
- Databricks — Spark tutorials — Hands-on notebooks covering Spark at scale with Delta Lake
- Learning Spark, 2nd edition (O'Reilly) — Comprehensive book covering Spark 3.x, Structured Streaming, and Delta Lake