Error about column literals using udf function on a dataframe with pyspark

I'm trying to use a UDF on a DataFrame with PySpark, but I get an error about column literals that suggests using the 'lit', 'array', 'struct' or 'create_map' function. I'm not clear on how to do this.

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def compareElem(elem):
    return elem[1]

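def getSmallest(type, final_list):
    # Sort the (index, value) pairs by value, smallest first
    final_list.sort(key=compareElem)
    print(final_list)
    l = final_list[0][0]
    print('idx=', l)
    if type == 1:
        # // keeps this an integer division on both Python 2 and Python 3
        l = (((l // 4) + 1) * 4) - 1

    return l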

The function works fine on a plain list:

getSmallest(0, [ ( 0, 1), (1, 1.1), (2, 0.5) ])

prints

[(2, 0.5), (0, 1), (1, 1.1)] 
('idx=', 2)

But it fails when used as a UDF with DataFrame columns:

func_udf = udf(getSmallest, IntegerType())

raw_dataset_df = raw_dataset_df.withColumn(
    'result',
    func_udf(
        raw_dataset_df['type'],
        [(0, raw_dataset_df['Icorr_LBT01_R']), (1, raw_dataset_df['Icorr_LBT01_S'])]))

I get the following error:

TypeError: Invalid argument, not a string or column: [(0, Column<Icorr_LBT01_R>), (1, Column<Icorr_LBT01_S>)] of type <type 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I'm not sure what this means or how to fix it. I tried wrapping lit around each Column element, but it's not clear to me what that should do, and it doesn't work for me.

1 Answer

Every argument you pass to a UDF must be a column. In your case, [ ( 0, raw_dataset_df['Icorr_LBT01_R'] ), (1, raw_dataset_df['Icorr_LBT01_S']) ] is not a column; it is a plain Python list of tuples, so you cannot pass it to your UDF.
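
One way to turn that list into a single column, as the error message hints, is to build it from 'lit', 'struct' and 'array'. The following is only a sketch: get_smallest is a trimmed variant of your function, add_result_column is a hypothetical helper name, and it assumes both Icorr columns have the same numeric type and contain no nulls. Inside the UDF each struct arrives as a Row, which can be indexed like a tuple.

```python
def get_smallest(type_, pairs):
    """Return the index of the (index, value) pair with the smallest value."""
    # min() returns the first smallest pair, like taking element 0 after a stable sort
    idx = min(pairs, key=lambda p: p[1])[0]
    if type_ == 1:
        idx = ((idx // 4) + 1) * 4 - 1
    return idx


def add_result_column(df):
    # Hypothetical helper. The imports are local so that get_smallest
    # can be tried on plain lists without a Spark installation.
    from pyspark.sql.functions import array, col, lit, struct, udf
    from pyspark.sql.types import IntegerType

    func_udf = udf(get_smallest, IntegerType())

    # lit(0) and lit(1) turn the Python ints into literal columns,
    # struct(...) pairs each index with its value column, and
    # array(...) packs both structs into one array column.
    pairs = array(
        struct(lit(0).alias("idx"), col("Icorr_LBT01_R").alias("val")),
        struct(lit(1).alias("idx"), col("Icorr_LBT01_S").alias("val")),
    )
    return df.withColumn("result", func_udf(col("type"), pairs))
```

With this in place, raw_dataset_df = add_result_column(raw_dataset_df) would replace the failing withColumn call. If only the two columns are ever compared, passing them to the UDF as two separate column arguments would avoid the array altogether.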
