spark

31 Aug
Connect Jupyter to Remote Spark Clusters With Apache Toree

Scala [https://www.scala-lang.org/] is a fun language which gives you all the power of Java [https://www.
3 min read
21 Oct
Transpose data with Spark

A short user-defined function written in Scala that lets you transpose a DataFrame without performing any aggregation.
1 min read
16 Sep
Convert Spark Vectors to DataFrame Columns

Vectors are typically required for Machine Learning tasks, but are otherwise not commonly used. Sometimes you end up with an
2 min read
15 Sep
Pivoting data with Spark

One of the common data engineering tasks is taking a deep dataset and turning it into a wide dataset with some
3 min read
29 Aug
Renaming All Columns In A Spark DataFrame

Here's an easy example of how to rename all columns in an Apache Spark DataFrame. Technically, we'
1 min read
21 Aug
Using Spark, Scala and XGBoost On The Titanic Dataset from Kaggle

The Titanic: Machine Learning from Disaster [https://www.kaggle.com/c/titanic] competition on Kaggle [https://www.kaggle.com/] is
11 min read
10 Aug
List All Additional Jars Loaded in Spark

Once in a while, you need to verify the versions of the jars that have been loaded into your Spark
1 min read
09 Aug
Spark Vector of Vectors

I recently ran into a problem with creating a features vector for a machine learning project. If the number of
6 min read
09 Aug
Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

When performing joins in Spark, one question keeps coming up: When joining multiple dataframes, how do you prevent ambiguous column
2 min read
08 Aug
Selecting Dynamic Columns In Spark DataFrames (aka Excluding Columns)

I often need to perform an inverse selection of columns in a dataframe, or exclude some columns from a query.
1 min read