Question

Creating random pairs from a column

The problem

I'm trying to create random pairs from a column using DuckDB.

I have a column of protein accession numbers which looks like this:

┌──────────────┐
│ protein_upkb │
│   varchar    │
├──────────────┤
│ G1XNZ0       │
│ G1XP19       │
│ G1XP66       │
│ G1XP70       │
│ G1XPL1       │
│ G1XPQ7       │
│ G1XQ23       │
│ G1XQ44       │
│ G1XQ89       │
│ G1XQH2       │
├──────────────┤
│   10 rows    │
└──────────────┘

I'm trying to create random pairs of these protein ids such that they look like this:

┌────────────┬────────────┐
│     p1     │     p2     │
│  varchar   │  varchar   │
├────────────┼────────────┤
│ G1XNZ0     │ G1XP19     │
│ G1XP19     │ G1XP66     │
│ G1XP66     │ G1XP70     │
│ G1XP70     │ G1XPL1     │
│ G1XPL1     │ G1XPQ7     │
│ G1XPQ7     │ G1XQ23     │
│ G1XQ23     │ G1XQ44     │
│ G1XQ44     │ G1XQ89     │
│ G1XQ89     │ G1XQH2     │
│ G1XQH2     │ G1XNZ0     │
├────────────┴────────────┤
│ 10 rows       2 columns │
└─────────────────────────┘

N.B: This is just an example, I've thousands of IDs in the table in question.

Some things I've tried

Subqueries

I began by scrambling the order of the proteins by assigning some random number to each row and sorting by it.

CREATE VIEW proteins AS
  SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet');

SELECT * FROM proteins ORDER BY x DESC LIMIT 10;

Which results in

┌──────────────┬────────────────────┐
│ protein_upkb │         x          │
│   varchar    │       double       │
├──────────────┼────────────────────┤
│ A0A1H6HM63   │ 0.9999986232724041 │
│ A0A1G6CK58   │ 0.9999978158157319 │
│ A0A2C5XBA3   │ 0.9999923389405012 │
│ A0A1H9T955   │ 0.9999855090864003 │
│ Q05Q16       │ 0.9999655580613762 │
│ R5PE70       │  0.999956940067932 │
│ R5GUN0       │ 0.9999453630298376 │
│ A0A0L0UJ42   │ 0.9999357375781983 │
│ W6ZJY1       │ 0.9999311361461878 │
│ F6D0F2       │ 0.9999301459174603 │
├──────────────┴────────────────────┤
│ 10 rows                 2 columns │
└───────────────────────────────────┘

I then tried to create random pairs using subqueries. One column would be sorted by x descending, the other by x ascending.

Confusingly (to me), this only creates one random pair rather than the 255,622 I both expected and need.

cursor = duckdb.sql("""
SELECT
(SELECT protein_upkb FROM proteins ORDER BY x DESC) as p1,
(SELECT protein_upkb FROM proteins ORDER BY x ASC) as p2,
LIMIT 10;
""").show()

┌─────────┬─────────┐
│   p1    │   p2    │
│ varchar │ varchar │
├─────────┼─────────┤
│ Q28RH7  │ D8LJ06  │
└─────────┴─────────┘

SELECTing FROM two VIEWs

I figured that I can create two VIEWs, proteins1 and proteins2. I can then independently randomly sort them using random() as I've done before.

Finally, I can create pairs by selecting the protein_upkb column from each table.

Once more, I'm a bit surprised by the outcome.

p2 is a sequence of random proteins, while p1 is just one of the proteins.

cursor = duckdb.sql("""
CREATE VIEW proteins1 AS
  SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet') 
  ORDER BY x ASC;

CREATE VIEW proteins2 AS
  SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet') 
  ORDER BY x ASC;

SELECT ps1.protein_upkb as p1, ps2.protein_upkb as p2,
FROM proteins1 as ps1, proteins2 as ps2
LIMIT 10;
""").show()

┌────────────┬────────────┐
│     p1     │     p2     │
│  varchar   │  varchar   │
├────────────┼────────────┤
│ A0A394DPL7 │ A0A1I3L166 │
│ A0A394DPL7 │ A0A0Q3WJP1 │
│ A0A394DPL7 │ A0A093SP34 │
│ A0A394DPL7 │ A0A127EQY9 │
│ A0A394DPL7 │ K6UP11     │
│ A0A394DPL7 │ A0A1I6M9F9 │
│ A0A394DPL7 │ A0A0Q3SWF8 │
│ A0A394DPL7 │ A0A069RD68 │
│ A0A394DPL7 │ S9ZHA8     │
│ A0A394DPL7 │ Q5P5L0     │
├────────────┴────────────┤
│ 10 rows       2 columns │
└─────────────────────────┘

Notebook

You can test this out in this Colab notebook.

3 77 3

1 Jan 1970

Solution

There are multiple ways of doing it. The exact way would depend on your additional requirements.

This one is very simple, get random sample twice, concatenate them together. You can get same protein twice, but you can exclude it with additional where clause though.

duckdb.sql("""
  with cte as (
    select protein_upkb from proteins using sample(10)
  )
  select *
  from cte as c1
    positional join cte as c2
""")

┌──────────────┬──────────────┐
│ protein_upkb │ protein_upkb │
│   varchar    │   varchar    │
├──────────────┼──────────────┤
│ A0A0F6TCU1   │ A0A4C1ULV9   │
│ D4YJT4       │ A0A3Q3FTK5   │
│ A0A319DTU8   │ C6LIN2       │
│ A0A1Q3D9X9   │ A0A1B3BCY8   │
│ M5F4R3       │ M1NUZ3       │
│ A0A553PJQ2   │ A0A165P0W9   │
│ G7M9F2       │ A0A182JZX3   │
│ A0A0Q1CIG2   │ G3HMK9       │
│ C7YU85       │ A0A3Q2E7T6   │
│ A0A199VI77   │ A0A0R1JQR6   │
├──────────────┴──────────────┤
│ 10 rows           2 columns │
└─────────────────────────────┘

Or assign row_number at random, and then pivot in a way that even an odd rows create pairs:

duckdb.sql("""
with cte1 as (
  select ps1.protein_upkb, row_number() over(order by random()) as rn
  from proteins as ps1
), cte2 as (
  select
    protein_upkb,
    rn % 2 as col,
    rn // 2 as r
  from cte1
)
pivot cte2
on col
using any_value(protein_upkb)
limit 10
""")

───────┬────────────┬────────────┐
│   r   │     0      │     1      │
│ int64 │  varchar   │  varchar   │
├───────┼────────────┼────────────┤
│ 66322 │ A0A1N7AVA0 │ A0A175R4H7 │
│ 66325 │ K9FKM7     │ D8QP02     │
│ 66327 │ A0A1I5KRT3 │ W0V524     │
│ 66328 │ A0A4U2YU79 │ A0A452RP46 │
│ 66334 │ A8RCK1     │ A0A165U1L8 │
│ 66335 │ A0A3Q3QVI9 │ C7MCJ1     │
│ 66336 │ Q3SLR9     │ A0A3B4B0Q2 │
│ 66338 │ A0A1W1XBB2 │ A0A0B7J5C1 │
│ 66339 │ A0A1I4KH70 │ A0A3S4SEU1 │
│ 66340 │ A0A1W0D573 │ Q4ZR49     │
├───────┴────────────┴────────────┤
│ 10 rows               3 columns │
└─────────────────────────────────┘

2024-07-12

Roman Pekar

Solution

I don't know duckdb. But this would raise a runtime error in almost all RDBMS:

SELECT
  (SELECT protein_upkb FROM proteins ORDER BY x DESC) as p1,
  (SELECT protein_upkb FROM proteins ORDER BY x ASC) as p2,
LIMIT 10

Well, first of all, there is a comma too many after p2. But what this query is doing is select one row, because you select from nothing (the main SELECT clause has no FROM clause). This one row has two columns: p1 and p2. Now everything is fine, as long as the two subqueries return only one row, but we know from your table, they do not. The query is trying to fill each of the two cells in the result row with all values in the table. That would usually produce an error. I see, though, that duckdb supports array data types, so I suppose that the two columns result in being some sort of arrays. (Then even the ORDER BY may have some effect, which is not the case in standard SQL where an ORDER BY in a subquery is superfluous, because subquery results are unordered data sets by definition.)

As to the views: In SQL everything is a table :-D There exist stored tables like your proteins table. There exists result tables, as the result from all our queries. There exist subquery results, which are also tables. And there are views, which are just the same. And tables are unordered data sets. An ORDER BY in a view makes no sense hence, as it can be ignored completely by the DBMS.

Anyway, you cross join the two views, i.e. combine every row of the first view with every row of the second view. What you see is part of the result, namely the first row of the first view combined with the first ten rows of the second view. (So maybe the DBMS even took the hard work to please your ORDER BY clauses :-) You only see the first 10 results, because of the LIMIT clause you apply, of course.

What you want to do instead is join random rows with other random rows from the same table. Then it depends how much randomness you want. If you just want to get some random pairs where you don't get any protein twice, you might do:

WITH 
  data AS 
  (
    SELECT protein_upkb, ROW_NUMBER() OVER (ORDER BY random()) AS sortkey 
    FROM proteins 
    ORDER BY random()
  )
SELECT 
  FIRST(protein_upkb ORDER BY sortkey) AS code1,
  LAST(protein_upkb ORDER BY sortkey) AS code2
FROM data
GROUP BY ((sortkey - 1) // 2) -- Integer division 0, 0, 1, 1, 2, 2, etc.
LIMIT 10;

Or, if you are fine with duplicates, you could just

SELECT 
  a.protein_upkb AS code1,
  b.protein_upkb AS code2
FROM proteins a CROSS JOIN proteins b
ORDER BY random()
LIMIT 10;

which can get you a result like

CODE1	CODE2
A0A394DPL7	A0A394DPL7
S9ZHA8	S9ZHA8
A0A394DPL7	S9ZHA8
S9ZHA8	A0A394DPL7
...	...

2024-07-12

Thorsten Kettner