Selecting Dynamic Columns In Spark DataFrames (aka Excluding Columns)

I often need to perform an inverse selection of columns in a dataframe, that is, exclude certain columns from a query. This is a straightforward technique, and I use it frequently when arranging features into vectors for machine learning tasks.

import org.apache.spark.sql.Column

// Create an example dataframe
val dataDF = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 10, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 10, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("colToExclude", "col1", "col2", "col3", "col4", "col5", "col6")

// Get an array of all columns in the dataframe, then
// filter out the columns you want to exclude from the 
// final dataframe.
val colsToSelect = dataDF.columns.filter(_ != "colToExclude")

// Take a look at the array as comma-separated values.
colsToSelect.mkString(",")
// Returns "col1,col2,col3,col4,col5,col6"
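If you need to exclude more than one column, the same filtering approach works against a Set of names. This is a small sketch; the names excludedCols and keptCols are illustrative, not part of the example above:

```scala
// Hypothetical extension: exclude several columns at once.
val excludedCols = Set("colToExclude", "col6")

// filterNot keeps only the columns whose names are NOT in the set.
val keptCols = dataDF.columns.filterNot(excludedCols.contains)
// keptCols is Array("col1", "col2", "col3", "col4", "col5")
```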

// This method allows you to perform a simple selection
dataDF.select(colsToSelect.head, colsToSelect.tail: _*).show()

// This method creates a new dataframe using your column list.
// Filter the dataframe's columns using the colsToSelect array,
// and map the results into Column objects.
dataDF.select(
  dataDF.columns
    .filter(colName => colsToSelect.contains(colName))
    .map(colName => new Column(colName)): _*
).show()

In the simple selection method, note that we had to use colsToSelect.head and colsToSelect.tail: _*. The reason is that the String-based overload of select() has the signature select(col: String, cols: String*): it takes one column name followed by a varargs list of additional names, so the array has to be split into its head and tail. If you just pass the whole array as colsToSelect: _*, the compiler cannot match it to that overload (the varargs-only overload of select() takes Column arguments, not Strings), and you'll get an overloaded method error.
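As an alternative sketch that sidesteps the head/tail split, you can map the names to Column objects and use the Column* overload of select(). Spark (1.4 and later) also provides drop() when you only need to remove a column or two by name:

```scala
import org.apache.spark.sql.functions.col

// Map each name to a Column and use the select(Column*) overload,
// so no head/tail split is needed.
dataDF.select(colsToSelect.map(col): _*).show()

// drop() returns a new dataframe without the named column.
dataDF.drop("colToExclude").show()
```

Which form to prefer is mostly taste: drop() reads best for excluding one or two known columns, while the filter-and-select approach scales better when the exclusion list is computed dynamically.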