Why does combine_first() display this behavior, when substituting values from one column into another column in the same DataFrame?

I am new to stackoverflow.

I noticed this behavior of pandas combine_first() and would simply like to understand why. With the following DataFrame, the empty strings in column A are not filled in from column B:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df['A'].combine_first(df['B'])
Out[1]: 
0    6
1     
2    7
3     
Name: A, dtype: object

Initializing with np.nan instead of '', on the other hand, gives the expected behavior of combine_first():

df = pd.DataFrame({'A':[6,np.nan,7,np.nan], 'B':[1, 3, 5, 3]})
df['A'].combine_first(df['B'])
Out[2]: 
0    6.0
1    3.0
2    7.0
3    3.0
Name: A, dtype: float64

Replacing the '' with np.nan and then applying combine_first() doesn't seem to work either:

df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df.replace('', np.nan)
df['A'].combine_first(df['B'])
Out[3]: 
0    6
1     
2    7
3     
Name: A, dtype: object

I would like to understand why this happens before using an alternate method for this purpose.

Use df = df.replace('', np.nan). replace() returns a new DataFrame rather than modifying df, so you have to assign it back.

In the first case, the empty string is not a null value recognized by pandas, so every value in column A takes priority over the values in column B.
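
To illustrate the comment above, here is a minimal sketch (the Series names a and b are made up for the example, and exact dtypes may vary by pandas version) that checks which values pandas treats as null before and after swapping '' for np.nan:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for columns A and B from the question.
a = pd.Series([6, '', 7, ''])
b = pd.Series([1, 3, 5, 3])

# An empty string is an ordinary, non-null object value to pandas.
empty_is_null = a.isna().tolist()

# combine_first() only takes values from b where a is null, so nothing changes.
unchanged = a.combine_first(b).tolist()

# Once the empty strings are replaced by NaN, those positions count as null.
a_nan = a.replace('', np.nan)
nan_is_null = a_nan.isna().tolist()

print(empty_is_null)
print(unchanged)
print(nan_is_null)
```

Only the positions where isna() is True are candidates for filling, which is why the original column comes back untouched.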

I see. That makes sense now.

1 Answer

This seems to have been pretty obvious to people here, but thank you for posting the comments!

My mistake in the third DataFrame I posted, as pointed out by @W-B, was not assigning the result of replace() back to df. The corrected version:

df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df = df.replace('', np.nan)
df['A'].combine_first(df['B'])

Also, as @ALollz pointed out, the empty strings '' in df['A'] are not null values, so combine_first() has nothing to fill. It does sound simple in hindsight, but I couldn't figure it out earlier!
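
Putting both points together, a self-contained sketch of the corrected flow might look like this (the names result and values are just for illustration, and the cast to float is only there so the values compare cleanly whatever dtype pandas infers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [6, '', 7, ''], 'B': [1, 3, 5, 3]})

# replace() returns a new DataFrame by default, so keep the result.
df = df.replace('', np.nan)

# Column A now holds NaN where it held '', so those rows are taken from column B.
result = df['A'].combine_first(df['B'])

# Cast to float only to make the comparison independent of the inferred dtype.
values = result.astype(float).tolist()
print(values)
```

The values should line up with the second example in the question, where column A was built with np.nan from the start.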

Thank you!
