Question

Find index of two identical values in succession for the first time

These are some example vectors to reproduce:

a <- c(14,26,38,64,96,127,152,152,152,152,152,152)
b <- c(4,7,9,13,13,13,13,13,13,13,13,13,13,13)
c <- c(62,297,297,297,297,297,297,297,297,297,297,297)

It is obvious that at some point a certain value is repeated until the end. I need to get exactly the index where this value appears for the first time.

So in this case the output would be 7,4,2, since 152 starts at the 7th position in a, 13 starts at the 4th position in b, and 297 starts at the 2nd position in c. I hope this is clear.

Anybody with a hint how to get this automatically?

Edit: the data is always increasing and once it starts repeating it continues until the end. In this kind of analysis there will always be a repetition at least at the last two values.

 4  142  4
1 Jan 1970

Solution

 7

You could use rle() to compute the run-length encoding, sum the lengths of every run except the final one, and add 1:

get_index  <- \(x) sum(head(rle(x)$lengths, -1)) + 1
sapply(list(a, b, c), get_index)
# [1] 7 4 2
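
To make the intermediate steps visible, here is a minimal sketch that unpacks the one-liner above for the vector a from the question. The variable name runs is only illustrative.

```r
a <- c(14, 26, 38, 64, 96, 127, 152, 152, 152, 152, 152, 152)

runs <- rle(a)
runs$lengths
# [1] 1 1 1 1 1 1 6
runs$values
# [1]  14  26  38  64  96 127 152

# Drop the final run, count every element before it, then step to the next position.
sum(head(runs$lengths, -1)) + 1
# [1] 7
```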

Rcpp solution

If your vectors are really long and the last value is only repeated towards the end, you don't need to check the length of every run, so the above will be inefficient. It's better to start from the end of the vector and work backwards until you find a different value:

Rcpp::cppFunction('
int get_index2(NumericVector x) {
    int n = x.size();
    double last_value = x[n - 1];
    for (int i = n - 2; i >= 0; --i) {
        if (x[i] != last_value) {
            return i + 2; // +1 as it is next element; +1 for 1-indexing
        }
    }
    return 1; // all elements are the same
}
')

sapply(list(a,b,c), get_index2)
# [1] 7 4 2
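
If compiling C++ is not an option, the same backward scan can be sketched in plain R. The name get_index_r is hypothetical, and an interpreted loop will likely be much slower per element than the compiled version, although it keeps the early exit.

```r
a <- c(14, 26, 38, 64, 96, 127, 152, 152, 152, 152, 152, 152)
b <- c(4, 7, 9, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13)
c <- c(62, 297, 297, 297, 297, 297, 297, 297, 297, 297, 297, 297)

# Hypothetical plain-R counterpart of get_index2(): walk back from the end
# while the elements still equal the last value.
get_index_r <- function(x) {
  n <- length(x)
  last_value <- x[n]
  i <- n - 1L
  while (i >= 1L && x[i] == last_value) {
    i <- i - 1L
  }
  i + 1L  # position right after the last differing element (1 if none differ)
}

sapply(list(a, b, c), get_index_r)
# [1] 7 4 2
```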

data.table solution

Given your update to the question, another way to approach this would be:

sapply(list(a,b,c), data.table::uniqueN)
# [1] 7 4 2

This is conceptually the same as the nice answer by zx8754. With vectors of this size it is unlikely to be meaningfully different in speed and could even be slower; however, it is faster for very large vectors.

2024-07-17
SamR

Solution

 4

If you know the last value is the repeated value then you can pass it to match(), which returns the index of the first match:

first <- \(x) match(x[length(x)], x)
sapply(list(a, b, c), first)
# 7 4 2

If you're looking for the first pair of identical successive values then you can use diff() and which():

first_conseq <- \(x) which(diff(x) == 0)[1]
sapply(list(a, b, c), first_conseq)
# 7 4 2

By default, diff() returns the difference between successive values. If two values are the same then their difference will be 0. which() will return the index of all TRUE values in a logical vector so we use [1] to take the first case.
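
As a small illustration of those steps, here are the intermediate results for the vector b from the question. Element i of diff(b) compares b[i + 1] with b[i], so the index returned is that of the first element of the repeated pair.

```r
b <- c(4, 7, 9, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13)

diff(b)
# [1] 3 2 4 0 0 0 0 0 0 0 0 0 0
which(diff(b) == 0)
# [1]  4  5  6  7  8  9 10 11 12 13
which(diff(b) == 0)[1]
# [1] 4
```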

2024-07-17
LMc

Solution

 4

As clarified by OP, if the data is always increasing and only the last value is duplicated, we just need to count the unique values:

lengths(lapply(list(a, b, c), unique))
# [1] 7 4 2
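
This shortcut relies on every value before the final run being distinct. A minimal sketch of what happens when that assumption does not hold, using a vector with an earlier repeat:

```r
x <- c(1, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5)

length(unique(x))        # counts distinct values, not positions
# [1] 5
match(x[length(x)], x)   # first position of the final value
# [1] 6
```

For the data described in the question the two results agree, so the difference only matters if earlier repeats can occur.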
2024-07-18
zx8754

Solution

 3

Another base R solution:

f <- \(x) (length(x) - which.max(rev(x) != x[length(x)]) + 1L)%%length(x) + 1L
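
A step-by-step sketch of what that one-liner computes may help. The names len, k, d_vec and k_d are only illustrative. The modulo is what maps the all-equal case back to index 1.

```r
a <- c(14, 26, 38, 64, 96, 127, 152, 152, 152, 152, 152, 152)
len <- length(a)

# Position, counted from the end, of the first element that differs from the last value.
k <- which.max(rev(a) != a[len])
k
# [1] 7
(len - k + 1L) %% len + 1L
# [1] 7

# All elements equal: which.max() on an all-FALSE vector returns 1,
# so (12 - 1 + 1) %% 12 is 0 and the result is 1.
d_vec <- numeric(12)
k_d <- which.max(rev(d_vec) != d_vec[12])
(12L - k_d + 1L) %% 12L + 1L
# [1] 1
```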

I'll compare it to the other options along with some benchmarking. Tossing in a couple edge cases:

a <- c(14,26,38,64,96,127,152,152,152,152,152,152)
b <- c(4,7,9,13,13,13,13,13,13,13,13,13,13,13)
c <- c(62,297,297,297,297,297,297,297,297,297,297,297)
d <- numeric(12)
e <- 1:14

Testing the proposed answers, including the edge cases:

get_index  <- \(x) sum(head(rle(x)$lengths, -1)) + 1L
Edward <- \(a) length(a) - min(which(diff(rev(a))!=0)) + 1L
first_conseq <- \(x) which(diff(x) == 0)[1]

sapply(list(a, b, c, d, e), f)
#> [1]  7  4  2  1 14
sapply(list(a, b, c, d, e), get_index)
#> [1]  7  4  2  1 14
sapply(list(a, b, c, d, e), Edward)
#> Warning in min(which(diff(rev(a)) != 0)): no non-missing arguments to min;
#> returning Inf
#> [1]    7    4    2 -Inf   14
sapply(list(a, b, c, d, e), first_conseq)
#> [1]  7  4  2  1 NA

And SamR's Rcpp function (modified slightly for speed):

Rcpp::cppFunction('
  int get_index2(const NumericVector& x) {
      const int n = x.size();
      const double last_value = x[n - 1];
      for (int i = n - 2; i >= 0; --i) {
          if (x[i] != last_value) {
              return i + 2; // +1 as it is next element; +1 for 1-indexing
          }
      }
      return 1; // all elements are the same
  }
')

sapply(list(a, b, c, d, e), get_index2)
#> [1]  7  4  2  1 14

Only f, get_index, and get_index2 behave well with the edge cases.

Benchmarking with a larger dataset:

n <- sample(1e5, 1e3, 1)
x <- lapply(n, \(n) c(sample(1e4, n, 1), 0L, sample(1e5 - n, 1))[-1:-2])
identical(n, vapply(x, f, 0L))
#> [1] TRUE

bench::mark(
  f = vapply(x, f, 0L),
  get_index = vapply(x, get_index, 0L),
  get_index2 = vapply(x, get_index2, 0L)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f           306.6ms 316.29ms     3.16   580.07MB     12.6
#> 2 get_index     2.46s    2.46s     0.406    4.91GB     13.8
#> 3 get_index2   62.4ms  67.14ms    14.6    404.34MB     42.0
2024-07-17
jblood94

Solution

 2

Since the data is always increasing and once it starts repeating it continues until the end, you can simply do:

min(which(diff(a)==0))
#[1] 7

sapply(list(a, b, c), \(x) min(which(diff(x)==0)))
# [1] 7 4 2

If that condition is relaxed, so that earlier values may also repeat, you can reverse the vector and use diff() to find the first non-zero difference, which marks the start of the final run.

length(a) - min(which(diff(rev(a))!=0)) + 1
# [1] 7

x <- c(1,2,2,3,4,5,5,5,5,5,5)
length(x) - min(which(diff(rev(x))!=0)) + 1
#[1] 6
2024-07-17
Edward

Solution

 2

You can try

f <- \(x) {
    length(x) - which.min(replace(rev(duplicated(x, fromLast = TRUE)), 1, TRUE)) + 2
}

such that

> lapply(list(a, b, c), f)
[[1]]
[1] 7

[[2]]
[1] 4

[[3]]
[1] 2
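
To see why this works, here is a minimal sketch on a short illustrative vector (not from the question) that prints each intermediate step of f:

```r
x <- c(1, 3, 5, 5, 5)

# TRUE for every element that occurs again later in the vector.
duplicated(x, fromLast = TRUE)
# [1] FALSE FALSE  TRUE  TRUE FALSE

# Reverse, then force the first entry (the last element of x) to TRUE,
# so the final run shows up as a leading block of TRUE values.
flags <- replace(rev(duplicated(x, fromLast = TRUE)), 1, TRUE)
flags
# [1]  TRUE  TRUE  TRUE FALSE FALSE

# which.min() gives the first FALSE, one step past the run when counted from the end.
length(x) - which.min(flags) + 2
# [1] 3
```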
2024-07-18
ThomasIsCoding

Solution

 1

Another base R solution. Applying duplicated() gives a logical vector whose first TRUE value sits at the target index plus 1, so subtracting 1 from the position of that first TRUE gives the index. I've added the "edge" cases considered by @jblood94 above. Although these cases are not included in the OP's question, it seems that if there are no repeats the function should return NA.

a <- c(14, 26, 38, 64, 96, 127, 152, 152, 152, 152, 152, 152)
b <- c(4, 7, 9, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13)
c <- c(62, 297, 297, 297, 297, 297, 297, 297, 297, 297, 297, 297)
d <- 12
e <- 1:14
pull_index <- \(x) which(duplicated(x))[1] - 1
sapply(list(a, b, c, d, e), pull_index)
# [1]  7  4  2 NA NA
2024-07-18
rgt47