spark

Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

James Conner

09 Aug 2017 — 2 min read

When performing joins in Spark, one question keeps coming up: When joining multiple dataframes, how do you prevent ambiguous column name errors?

1) Let's start off by preparing a couple of simple example dataframes

// Create first example dataframe
val firstDF = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 7, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 8, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("uid", "col1", "col2", "col3", "col4", "col5", "col6")

// Display the dataframe
firstDF.show()

// Create second example dataframe
val secondDF = spark.createDataFrame(Seq(
  (1, 3, 2, 0, 4, 2, 8),
  (2, 3, 3, 2, 6, 5, 4),
  (3, 8, 5, 1, 2, 3, 5),
  (4, 9, 8, 2, 4, 9, 2),
  (5, 3, 4, 8, 0, 6, 2),
  (6, 3, 9, 8, 8, 9, 3)
)).toDF("uid", "colA", "colB", "colC", "colD", "colE", "colF")

// Display the dataframe
secondDF.show()

2) Next let's perform a simple inner join.

// Perform simple inner join on uid column
val joinedDF = firstDF.join(secondDF)

// Show the dataframe schema
joinedDF.printSchema

// Display the dataframe
joinedDF.select("uid","col1","colA")

Results:

joinedDF.printSchema
root
 |-- uid: integer (nullable = false)
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: integer (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: integer (nullable = false)
 |-- uid: integer (nullable = false)
 |-- colA: integer (nullable = false)
 |-- colB: integer (nullable = false)
 |-- colC: integer (nullable = false)
 |-- colD: integer (nullable = false)
 |-- colE: integer (nullable = false)
 |-- colF: integer (nullable = false)


joinedDF.select("uid","col1","colA")
org.apache.spark.sql.AnalysisException: Reference 'uid' is ambiguous, could be: uid#298, uid#337.;

In the schema, notice that there are two "uid" columns, which is what causes the "ambiguous column error" in the following select statement.

3) And finally let's perform a join that removes the ambiguous column error

// Perform inner join on uid column
val joinedDF = firstDF.join(secondDF, Seq("uid"))

// Show the dataframe schema
joinedDF.printSchema

// Display the dataframe
joinedDF.show()

// Display the dataframe
joinedDF.select("uid","col1","colA")

Results:

joinedDF.printSchema
root
 |-- uid: integer (nullable = false)
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: integer (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: integer (nullable = false)
 |-- colA: integer (nullable = false)
 |-- colB: integer (nullable = false)
 |-- colC: integer (nullable = false)
 |-- colD: integer (nullable = false)
 |-- colE: integer (nullable = false)
 |-- colF: integer (nullable = false)


joinedDF.show()
+---+----+----+----+----+----+----+----+----+----+----+----+----+
|uid|col1|col2|col3|col4|col5|col6|colA|colB|colC|colD|colE|colF|
+---+----+----+----+----+----+----+----+----+----+----+----+----+
|  1|   1|   2|   3|   8|   4|   5|   3|   2|   0|   4|   2|   8|
|  2|   4|   3|   8|   7|   9|   8|   3|   3|   2|   6|   5|   4|
|  3|   6|   1|   9|   2|   3|   6|   8|   5|   1|   2|   3|   5|
|  4|   7|   8|   6|   9|   4|   5|   9|   8|   2|   4|   9|   2|
|  5|   9|   2|   7|   8|   7|   3|   3|   4|   8|   0|   6|   2|
|  6|   1|   1|   4|   2|   8|   4|   3|   9|   8|   8|   9|   3|
+---+----+----+----+----+----+----+----+----+----+----+----+----+


joinedDF.select("uid","col1","colA").show()
+---+----+----+
|uid|col1|colA|
+---+----+----+
|  1|   1|   3|
|  2|   4|   3|
|  3|   6|   8|
|  4|   7|   9|
|  5|   9|   3|
|  6|   1|   3|
+---+----+----+

Notice the lack of duplicated "uid" columns (yay)! Now if you perform a select using the "uid" column, there is no issue with "ambiguous column name" errors.

Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

James Conner

1) Let's start off by preparing a couple of simple example dataframes

2) Next let's perform a simple inner join.

Results:

3) And finally let's perform a join that removes the ambiguous column error

Results:

Read more

Controlling NVIDIA GPU Fans on a headless Ubuntu system

Google's 'Coral' Edge TPU Dev Boards

Google's 'Coral' Edge TPU Accelerator

Pandas - Opening and Selecting Data