Question
Creating random pairs from a column
The problem
I'm trying to create random pairs from a column using DuckDB.
I have a column of protein accession numbers which looks like this:
┌──────────────┐
│ protein_upkb │
│ varchar │
├──────────────┤
│ G1XNZ0 │
│ G1XP19 │
│ G1XP66 │
│ G1XP70 │
│ G1XPL1 │
│ G1XPQ7 │
│ G1XQ23 │
│ G1XQ44 │
│ G1XQ89 │
│ G1XQH2 │
├──────────────┤
│ 10 rows │
└──────────────┘
I'm trying to create random pairs of these protein ids such that they look like this:
┌────────────┬────────────┐
│ p1 │ p2 │
│ varchar │ varchar │
├────────────┼────────────┤
│ G1XNZ0 │ G1XP19 │
│ G1XP19 │ G1XP66 │
│ G1XP66 │ G1XP70 │
│ G1XP70 │ G1XPL1 │
│ G1XPL1 │ G1XPQ7 │
│ G1XPQ7 │ G1XQ23 │
│ G1XQ23 │ G1XQ44 │
│ G1XQ44 │ G1XQ89 │
│ G1XQ89 │ G1XQH2 │
│ G1XQH2 │ G1XNZ0 │
├────────────┴────────────┤
│ 10 rows 2 columns │
└─────────────────────────┘
N.B: This is just an example, I've thousands of IDs in the table in question.
Some things I've tried
Subqueries
I began by scrambling the order of the proteins by assigning some random number to each row and sorting by it.
CREATE VIEW proteins AS
SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet');
SELECT * FROM proteins ORDER BY x DESC LIMIT 10;
Which results in
┌──────────────┬────────────────────┐
│ protein_upkb │ x │
│ varchar │ double │
├──────────────┼────────────────────┤
│ A0A1H6HM63 │ 0.9999986232724041 │
│ A0A1G6CK58 │ 0.9999978158157319 │
│ A0A2C5XBA3 │ 0.9999923389405012 │
│ A0A1H9T955 │ 0.9999855090864003 │
│ Q05Q16 │ 0.9999655580613762 │
│ R5PE70 │ 0.999956940067932 │
│ R5GUN0 │ 0.9999453630298376 │
│ A0A0L0UJ42 │ 0.9999357375781983 │
│ W6ZJY1 │ 0.9999311361461878 │
│ F6D0F2 │ 0.9999301459174603 │
├──────────────┴────────────────────┤
│ 10 rows 2 columns │
└───────────────────────────────────┘
I then tried to create random pairs using subqueries. One column would be sorted by x
descending, the other by x
ascending.
Confusingly (to me), this only creates one random pair rather than the 255,622 I both expected and need.
cursor = duckdb.sql("""
SELECT
(SELECT protein_upkb FROM proteins ORDER BY x DESC) as p1,
(SELECT protein_upkb FROM proteins ORDER BY x ASC) as p2,
LIMIT 10;
""").show()
┌─────────┬─────────┐
│ p1 │ p2 │
│ varchar │ varchar │
├─────────┼─────────┤
│ Q28RH7 │ D8LJ06 │
└─────────┴─────────┘
SELECTing FROM two VIEWs
I figured that I can create two VIEWs, proteins1
and proteins2
. I can then independently randomly sort them using random()
as I've done before.
Finally, I can create pairs by selecting the protein_upkb
column from each table.
Once more, I'm a bit surprised by the outcome.
p2
is a sequence of random proteins, while p1
is just one of the proteins.
cursor = duckdb.sql("""
CREATE VIEW proteins1 AS
SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet')
ORDER BY x ASC;
CREATE VIEW proteins2 AS
SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet')
ORDER BY x ASC;
SELECT ps1.protein_upkb as p1, ps2.protein_upkb as p2,
FROM proteins1 as ps1, proteins2 as ps2
LIMIT 10;
""").show()
┌────────────┬────────────┐
│ p1 │ p2 │
│ varchar │ varchar │
├────────────┼────────────┤
│ A0A394DPL7 │ A0A1I3L166 │
│ A0A394DPL7 │ A0A0Q3WJP1 │
│ A0A394DPL7 │ A0A093SP34 │
│ A0A394DPL7 │ A0A127EQY9 │
│ A0A394DPL7 │ K6UP11 │
│ A0A394DPL7 │ A0A1I6M9F9 │
│ A0A394DPL7 │ A0A0Q3SWF8 │
│ A0A394DPL7 │ A0A069RD68 │
│ A0A394DPL7 │ S9ZHA8 │
│ A0A394DPL7 │ Q5P5L0 │
├────────────┴────────────┤
│ 10 rows 2 columns │
└─────────────────────────┘
Notebook
You can test this out in this Colab notebook.