Question
Efficiently find the number of different classmates from course-level data
I have been stuck with computing efficiently the number of classmates for each student from a course-level database.
Consider this data.frame, where each row represents a course that a student has taken during a given semester:
dat <-
data.frame(
student = c(1, 1, 2, 2, 2, 3, 4, 5),
semester = c(1, 2, 1, 2, 2, 2, 1, 2),
course = c(2, 4, 2, 3, 4, 3, 2, 4)
)
# student semester course
# 1 1 1 2
# 2 1 2 4
# 3 2 1 2
# 4 2 2 3
# 5 2 2 4
# 6 3 2 3
# 7 4 1 2
# 8 5 2 4
Students are going to courses in a given semester. Their classmates are other students attending the same course during the same semester. For instance, across both semesters, student 1 has 3 classmates (students 2, 4 and 5).
How can I get the number of unique classmates each student has combining both semesters? The desired output would be:
student n
1 1 3
2 2 4
3 3 1
4 4 2
5 5 2
where n
is the value for the number of different classmates a student has had during the academic year.
I sense that an igraph
solution could possibly work (hence the tag), but my knowledge of this package is too limited. I also feel like using joins
could help, but again, I am not sure how.
Importantly, I would like this to work for larger datasets (mine has about 17M rows). Here's an example data set:
set.seed(1)
big_dat <-
data.frame(
student = sample(1e4, 1e6, TRUE),
semester = sample(2, 1e6, TRUE),
course = sample(1e3, 1e6, TRUE)
)