Question
How to get the name of the column with the second-highest value in a PySpark dataframe
I have a PySpark dataframe looking like this:
id | trx_holiday | trx_takeout | trx_pet | max_value | MAX |
---|---|---|---|---|---|
1 | 12.5 | 5.5 | 9.5 | 12.5 | trx_holiday |
2 | 3.0 | 14.0 | 6.7 | 14.0 | trx_takeout |
I want to create a second column, `MAX_2`, that contains the category on which the client spent the second most. My idea is to tweak the code below (see: how to get the name of column with maximum value in pyspark dataframe) so that the column named in `MAX` is excluded from the `withColumn` call that computes `max_value`:
```python
import pyspark.sql.functions as psf

amt_cols = ["trx_holiday", "trx_takeout", "trx_pet"]  # spending columns only, not id
# Build a chained when() expression mapping max_value back to its column name
cond = "psf." + ".".join(
    "when(psf.col('{0}') == psf.col('max_value'), psf.lit('{0}'))".format(c) for c in amt_cols
)
df = df.withColumn("max_value", psf.greatest(*amt_cols)) \
       .withColumn("MAX", eval(cond))
```
Ideally, the new dataframe should look like this:
id | trx_holiday | trx_takeout | trx_pet | max_value | MAX | max_value_2 | MAX_2 |
---|---|---|---|---|---|---|---|
1 | 12.5 | 5.5 | 9.5 | 12.5 | trx_holiday | 9.5 | trx_pet |
2 | 3.0 | 14.0 | 6.7 | 14.0 | trx_takeout | 6.7 | trx_pet |