Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

When performing joins in Spark, one question keeps coming up: When joining multiple dataframes, how do you prevent ambiguous column name errors?

1) Let's start off by preparing a couple of simple example dataframes

// Create first example dataframe
val firstDF = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 7, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 8, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("uid", "col1", "col2", "col3", "col4", "col5", "col6")

// Display the dataframe
firstDF.show()

// Create second example dataframe
val secondDF = spark.createDataFrame(Seq(
  (1, 3, 2, 0, 4, 2, 8),
  (2, 3, 3, 2, 6, 5, 4),
  (3, 8, 5, 1, 2, 3, 5),
  (4, 9, 8, 2, 4, 9, 2),
  (5, 3, 4, 8, 0, 6, 2),
  (6, 3, 9, 8, 8, 9, 3)
)).toDF("uid", "colA", "colB", "colC", "colD", "colE", "colF")

// Display the dataframe
secondDF.show()



2) Next let's perform a simple inner join.

// Perform simple inner join on uid column
val joinedDF = firstDF.join(secondDF)

// Show the dataframe schema
joinedDF.printSchema

// Display the dataframe
joinedDF.select("uid","col1","colA")

Results:

joinedDF.printSchema
root
 |-- uid: integer (nullable = false)
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: integer (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: integer (nullable = false)
 |-- uid: integer (nullable = false)
 |-- colA: integer (nullable = false)
 |-- colB: integer (nullable = false)
 |-- colC: integer (nullable = false)
 |-- colD: integer (nullable = false)
 |-- colE: integer (nullable = false)
 |-- colF: integer (nullable = false)

joinedDF.select("uid","col1","colA") org.apache.spark.sql.AnalysisException: Reference 'uid' is ambiguous, could be: uid#298, uid#337.;

In the schema, notice that there are two "uid" columns, which is what causes the "ambiguous column error" in the following select statement.



3) And finally let's perform a join that removes the ambiguous column error

// Perform inner join on uid column
val joinedDF = firstDF.join(secondDF, Seq("uid"))

// Show the dataframe schema
joinedDF.printSchema

// Display the dataframe
joinedDF.show()

// Display the dataframe
joinedDF.select("uid","col1","colA")

Results:

joinedDF.printSchema
root
 |-- uid: integer (nullable = false)
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: integer (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: integer (nullable = false)
 |-- colA: integer (nullable = false)
 |-- colB: integer (nullable = false)
 |-- colC: integer (nullable = false)
 |-- colD: integer (nullable = false)
 |-- colE: integer (nullable = false)
 |-- colF: integer (nullable = false)

joinedDF.show() +---+----+----+----+----+----+----+----+----+----+----+----+----+ |uid|col1|col2|col3|col4|col5|col6|colA|colB|colC|colD|colE|colF| +---+----+----+----+----+----+----+----+----+----+----+----+----+ | 1| 1| 2| 3| 8| 4| 5| 3| 2| 0| 4| 2| 8| | 2| 4| 3| 8| 7| 9| 8| 3| 3| 2| 6| 5| 4| | 3| 6| 1| 9| 2| 3| 6| 8| 5| 1| 2| 3| 5| | 4| 7| 8| 6| 9| 4| 5| 9| 8| 2| 4| 9| 2| | 5| 9| 2| 7| 8| 7| 3| 3| 4| 8| 0| 6| 2| | 6| 1| 1| 4| 2| 8| 4| 3| 9| 8| 8| 9| 3| +---+----+----+----+----+----+----+----+----+----+----+----+----+
joinedDF.select("uid","col1","colA").show() +---+----+----+ |uid|col1|colA| +---+----+----+ | 1| 1| 3| | 2| 4| 3| | 3| 6| 8| | 4| 7| 9| | 5| 9| 3| | 6| 1| 3| +---+----+----+

Notice the lack of duplicated "uid" columns (yay)! Now if you perform a select using the "uid" column, there is no issue with "ambiguous column name" errors.

Read more