Question

How to compare lists in two Pandas dataframes to get the common elements?

I want to compare lists from columns set_1 and set_2 in df_2 with ins column in df_1 to find all common elements.

I've started doing it for one row and one column but I have no idea how to compare all rows between two dfs to get the desired result.

Here is my code comparing set_1 and ins in the first row:

import pandas as pd

d1 = {'chr': [1, 1], 'start': [64, 1000], 'end': [150, 2000], 'family': ['a', 'b'],
      'ins': [['P1-12', 'P1-22', 'P1-25', 'P1-28', 'P1-90'],
              ['P1-6', 'P1-89', 'P1-92', 'P1-93']]}

df1 = pd.DataFrame.from_dict(data=d1)

d2 = {'set_1': [['P1-12', 'P1-25', 'P1-28'], ['P1-6', 'P1-89', 'P1-93']],
      'set_2': [['P1-89', 'P1-92', 'P1-93'], ['P1-25', 'P1-28', 'P1-90']]}

df2 = pd.DataFrame.from_dict(data=d2)

matches = [x for x in df2.iloc[0, 0] if x in df1.iloc[0, 4]]

There is a tiny part of my input data (in original input, df1 contains ~13k rows and df2 ~90):

df1:

   chr  start   end family                                  ins
0    1     64   150      a  [P1-12, P1-22, P1-25, P1-28, P1-90]
1    1   1000  2000      b          [P1-6, P1-89, P1-92, P1-93]

df2:

                   set_1                  set_2
0  [P1-12, P1-25, P1-28]  [P1-89, P1-92, P1-93]
1   [P1-6, P1-89, P1-93]  [P1-25, P1-28, P1-90]

The desired output should look like this:

   chr  start   end family  df2_index               ins_set1               ins_set2
0    1     64   150      a          0  [P1-12, P1-25, P1-28]                     []
1    1     64   150      a          1                     []  [P1-25, P1-28, P1-90]
2    1   1000  2000      b          0                     []  [P1-89, P1-92, P1-93]
3    1   1000  2000      b          1   [P1-6, P1-89, P1-93]                     []

2 37 2

1 Jan 1970

Solution

Since you have objects you'll need to loop. I would first perform a cross-merge, then use a set for efficiency:

out = df1.merge(df2, how='cross')
cols = list(df2)
ins = out.pop('ins').apply(set)

for c in cols:
    out[c] = [[x for x in lst if x in ref]
              for ref, lst in zip(ins, out[c])]

Variant that should be a bit more efficient:

out = df1.assign(ins=df1['ins'].apply(set)).merge(df2, how='cross')
cols = list(df2)
ins = out.pop('ins')

for c in cols:
    out[c] = [[x for x in lst if x in ref]
              for ref, lst in zip(ins, out[c])]

Output:

   chr  start   end family                  set_1                  set_2
0    1     64   150      a  [P1-12, P1-25, P1-28]                     []
1    1     64   150      a                     []  [P1-25, P1-28, P1-90]
2    1   1000  2000      b                     []  [P1-89, P1-92, P1-93]
3    1   1000  2000      b   [P1-6, P1-89, P1-93]                     []

2024-07-25

mozway

Solution

Here is a vectorial approach using sets and numpy broadcasting. Note that the outputs are sets not lists and thus the order of items might not be conserved. Nevertheless, this alternative has the advantage of being concise and elegant (and faster):

# manually replicate the rows
out = df1.drop(columns='ins').loc[df1.index.repeat(len(df2))]

# use broadcasting to perform set intersections
out[list(df2)] = (
    df2.map(set).to_numpy()[..., None] & df1['ins'].map(set).to_numpy()
).reshape(-1, df2.shape[1])

Output:

   chr  start   end family                  set_1                  set_2
0    1     64   150      a  {P1-28, P1-25, P1-12}                     {}
0    1     64   150      a                     {}  {P1-93, P1-92, P1-89}
1    1   1000  2000      b                     {}   {P1-6, P1-89, P1-93}
1    1   1000  2000      b  {P1-28, P1-90, P1-25}                     {}

timings

This also looks a few times faster than a loop.

using 200 rows for each input DataFrame:

# cross-merge + loop
60.5 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# numpy broadcasting
16.4 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

With 2K rows for each:

# cross-merge + loop
6.05 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# numpy broadcasting
1.89 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2024-07-25

mozway