4 September 2023
Apache Spark has emerged as a top choice for developing machine learning applications thanks to its speed, scalability and versatile APIs. Combined with the managed Databricks platform, Spark allows data scientists to quickly build complete ML pipelines for preparing data, training models and deploying into production.
In this comprehensive guide, we’ll walk through the end-to-end steps for constructing a ML workflow with Databricks - from ingesting data to automating training to deployment. We’ll specifically focus on leveraging the capabilities of Apache Spark MLlib throughout.
If you are considering apache spark training for yourself or your team please feel free to get in touch and we'd be delighted to find a solution for you.
The first task when developing any analytics application is getting access to data in our workspace. Databricks provides several options for ingesting datasets:
Import Files - For small to medium sized datasets, we can directly upload files from our local machines into DBFS using the Databricks UI. This is handy for initial prototyping.
Cloud Storage - For larger production data, we should load from cloud storage like Amazon S3 buckets. Databricks provides high-speed connectors for S3 and other object stores.
Streaming - Spark Streaming allows ingesting continuous real-time data over HTTP, Kafka, Kinesis and other streaming sources.
For this guide, we’ll upload a sample labeled dataset in CSV format to DBFS:
# Load labeled data for model training data = spark.read.format("csv")\ .option("header", "true")\ .load("/FileStore/tables/training_data.csv")
We can optionally do any cleansing, preprocessing or feature engineering at this stage using Spark DataFrame transformations:
from pyspark.sql.functions import * # Apply any cleaning logic clean_data = data.withColumn("Label", trim(col("Label"))) # Extract feature columns features = clean_data.select([c for c in clean_data.columns if c != "Label"])
Now our data is ready for ML model building.
Spark MLlib provides a suite of machine learning algorithms for common tasks like classification, regression, clustering, collaborative filtering etc.
Some popular algorithms include:
LogisticRegression- Binary classification
RandomForestClassifier- Ensemble trees for classification
GBTRegressor- Gradient boosted trees for regression
KMeans- Clustering for unsupervised learning
We can train a model by calling the
fit() method on a dataset:
# Import model class from pyspark.ml.classification import LogisticRegression # Define logistic regression model lr = LogisticRegression() # Fit model to training data model = lr.fit(features, clean_data["Label"])
To find the best model, we should tune hyperparameters via cross-validation:
# Import tuning module from pyspark.ml.tuning import CrossValidator # Define parameter grid grid = (ParamGridBuilder() .addGrid(lr.regParam, [0.01, 0.1, 1.0]) .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) .build()) # Build 5-fold cross validator cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=MulticlassClassificationEvaluator(), numFolds=5) # Find best model best_model = cv.fit(features, clean_data["Label"])
We can track experiments with MLflow to compare models:
# Import MLflow import mlflow # Start run with mlflow.start_run(run_name="logistic_reg") as run: # Log model metrics lr = LogisticRegression() model = lr.fit(features, clean_data["Label"]) mlflow.log_metric("accuracy", evaluator.evaluate(model.transform(features))) # Log model artifact mlflow.spark.log_model(model, "model")
Now we have a trained model ready for deployment.
Databricks allows deploying models via batch scoring jobs, real-time streaming predictions, or as MLflow model serving endpoints.
For batch deployments, we can schedule a job to run our scoring logic on new data:
# Load new scoring data test_data = spark.read.... # Make predictions predictions = model.transform(test_data) # Save predictions to output predictions.write.format(...).save(...)
For streaming data, we can use Structured Streaming:
streaming_data = (spark.readStream... .withColumn(...)) # Make predictions predictions = model.transform(streaming_data) # Stream predictions query = (predictions.writeStream .format(...) .outputMode(...) .start())
To serve real-time predictions via REST API, we can use MLflow model serving:
# Log MLflow model mlflow.spark.log_model(model, "model") # Deploy model to staging mlflow.set_tracking_uri(...) mlflow.spark.deploy(.., "staging") # Transition to prod mlflow.spark.transition_model_version(...)
Monitoring deployed models helps catch any drift or degradation.
To repeatably retrain and deploy models on new data, we can setup scheduled jobs:
Chaining these jobs creates an automated ML pipeline. We can also implement a delta architecture - separate batch and streaming paths that merge before deployment.
For more advanced workflows, we can utilize Airflow, MLflow Pipelines and other orchestration frameworks.
In this guide, we built an end-to-end ML application on Apache Spark's Databricks - from ingesting data to training models to deployment. Here are some ways we can enhance our pipeline further:
Databricks provides data scientists a feature-rich and scalable platform for realizing complete ML lifecycle workflows on Apache Spark. I hope this tutorial provided useful patterns for developing your own intelligent applications. Let me know if you have any other questions!
Here are some common FAQs about building machine learning pipelines on Apache Spark's Databricks:
Q: Does Databricks support other ML frameworks like TensorFlow or PyTorch?
A: Yes, Databricks offers integration with TensorFlow, PyTorch, Keras and other frameworks in addition to Spark MLlib.
Q: Can I build real-time streaming ML applications on Databricks?
A: Yes, you can leverage Spark Structured Streaming to build real-time predictive applications that act on streaming data.
Q: How do I manage dependencies and environments for ML projects?
A: Use MLflow for managing dependencies, versioning models and packaging environments to deploy.
Q: Does Databricks offer MLOps capabilities?
A: Databricks supports the full MLOps stack - data management, experiment tracking, model management, deployment, monitoring etc.
Q: Can I use third-party orchestration tools like Airflow or Kubeflow?
A: Definitely, you can integrate Databricks ML workflows with various orchestration platforms.
If you found this article useful you might enjoy Getting Started with Apache Spark’s Databricks or Optimizing Apache Spark Jobs and Workloads on Databricks or Discover how to leverage the potential of Apache Spark and Databricks through our tailored courses at JBI Training, designed to empower you with the skills to construct and deploy machine learning pipelines effectively.
Embark on your journey to construct a powerful machine learning pipeline with the combined might of Apache Spark and Databricks. Equip yourself with the tools, techniques, and certifications needed to harness the full potential of machine learning and make data-driven decisions that drive innovation and success.