Question

How to change old date format in a file to a new format

I have a huge file, and it has around 200 lines like this:

started at Wed Jun  5 08:45:01 PM +0330 2024 -- ended at Wed Jun  5 10:35:34 PM +0330 2024.
started at Thu Jun  6 01:30:01 AM +0330 2024 -- ended at Thu Jun  6 03:17:18 AM +0330 2024.
started at Thu Jun  6 07:30:01 AM +0330 2024 -- ended at Thu Jun  6 09:19:19 AM +0330 2024.
started at Thu Jun  6 01:30:01 PM +0330 2024 -- ended at Thu Jun  6 03:19:16 PM +0330 2024.
started at Thu Jun  6 07:30:01 PM +0330 2024 -- ended at Thu Jun  6 09:16:15 PM +0330 2024.
started at Fri Jun  7 01:30:01 AM +0330 2024 -- ended at Fri Jun  7 03:17:47 AM +0330 2024.
started at Fri Jun  7 07:30:01 AM +0330 2024 -- ended at Fri Jun  7 09:03:05 AM +0330 2024.
started at Fri Jun  7 01:30:01 PM +0330 2024 -- ended at Fri Jun  7 03:19:55 PM +0330 2024.
started at Fri Jun  7 07:30:01 PM +0330 2024 -- ended at Fri Jun  7 09:17:41 PM +0330 2024.
started at Sat Jun  8 01:30:01 AM +0330 2024 -- ended at Sat Jun  8 03:18:12 AM +0330 2024.
started at Sat Jun  8 07:30:01 AM +0330 2024 -- ended at Sat Jun  8 09:20:31 AM +0330 2024.
started at Sat Jun  8 01:30:01 PM +0330 2024 -- ended at Sat Jun  8 03:19:16 PM +0330 2024.
started at Sat Jun  8 07:30:01 PM +0330 2024 -- ended at Sat Jun  8 09:20:01 PM +0330 2024.
started at Sun Jun  9 01:30:01 AM +0330 2024 -- ended at Sun Jun  9 03:15:19 AM +0330 2024.
started at Sun Jun  9 07:30:01 AM +0330 2024 -- ended at Sun Jun  9 09:19:07 AM +0330 2024.
started at Sun Jun  9 01:30:01 PM +0330 2024 -- ended at Sun Jun  9 03:16:44 PM +0330 2024.
started at Sun Jun  9 07:30:01 PM +0330 2024 -- ended at Sun Jun  9 09:15:16 PM +0330 2024.
started at Mon Jun 10 01:30:01 AM +0330 2024 -- ended at Mon Jun 10 03:17:37 AM +0330 2024.
started at Mon Jun 10 07:30:01 AM +0330 2024 -- ended at Mon Jun 10 09:16:38 AM +0330 2024.
started at Mon Jun 10 01:30:01 PM +0330 2024 -- ended at Mon Jun 10 03:17:45 PM +0330 2024.

I changed the date format, and after a while the file looks like this:

started at Thu Jul 18 01:30:01 PM +0330 2024 -- ended at Thu Jul 18 05:48:36 PM +0330 2024.
started at Fri Jul 19 01:30:01 AM +0330 2024 -- ended at Fri Jul 19 04:47:38 AM +0330 2024.
started at Fri Jul 19 07:30:01 AM +0330 2024 -- ended at Fri Jul 19 10:43:25 AM +0330 2024.
started at Fri Jul 19 01:30:01 PM +0330 2024 -- ended at Fri Jul 19 05:51:24 PM +0330 2024.
started at 2024-07-19 19:30 -- ended at 2024-07-19 23:43.
started at 2024-07-20 01:30 -- ended at 2024-07-20 04:48.
started at 2024-07-20 07:30 -- ended at 2024-07-20 10:55.

I want to change the lines in the old format, which is the output of $(date), to the new one, $(date '+%Y-%m-%d %H:%M').

How can I do that? Is that even possible?

The old-format dates run from Wed Jun  5 08:45:01 PM to Fri Jul 19 05:51:24 PM.


Solution


Since the whole log file references +0330, I use TZ=Asia/Tehran, which seems to match your time zone. Better still, use your own locale settings.

If your log file contains exactly two dates to convert per line, you could try something like:

sed < datedlogs.txt 's/^started at \(.*\) +0330 \(.*\) -- ended at \(.*\) +0330 \(.*\)\./TZ="Asia\/Tehran" \1 \2\nTZ="Asia\/Tehran" \3 \4/' |
    TZ="Asia/Tehran" date -f - +'%F %T' |
    paste -d + - - |
    sed 's/^\(.*\)+\(.*\)$/started at \1 -- ended at \2/'

Based on your sample, this produces:

started at 2024-06-05 20:45:01 -- ended at 2024-06-05 22:35:34
started at 2024-06-06 01:30:01 -- ended at 2024-06-06 03:17:18
started at 2024-06-06 07:30:01 -- ended at 2024-06-06 09:19:19
started at 2024-06-06 13:30:01 -- ended at 2024-06-06 15:19:16
started at 2024-06-06 19:30:01 -- ended at 2024-06-06 21:16:15
started at 2024-06-07 01:30:01 -- ended at 2024-06-07 03:17:47
started at 2024-06-07 07:30:01 -- ended at 2024-06-07 09:03:05
started at 2024-06-07 13:30:01 -- ended at 2024-06-07 15:19:55
started at 2024-06-07 19:30:01 -- ended at 2024-06-07 21:17:41
started at 2024-06-08 01:30:01 -- ended at 2024-06-08 03:18:12
started at 2024-06-08 07:30:01 -- ended at 2024-06-08 09:20:31
started at 2024-06-08 13:30:01 -- ended at 2024-06-08 15:19:16
started at 2024-06-08 19:30:01 -- ended at 2024-06-08 21:20:01
started at 2024-06-09 01:30:01 -- ended at 2024-06-09 03:15:19
started at 2024-06-09 07:30:01 -- ended at 2024-06-09 09:19:07
started at 2024-06-09 13:30:01 -- ended at 2024-06-09 15:16:44
started at 2024-06-09 19:30:01 -- ended at 2024-06-09 21:15:16
started at 2024-06-10 01:30:01 -- ended at 2024-06-10 03:17:37
started at 2024-06-10 07:30:01 -- ended at 2024-06-10 09:16:38
started at 2024-06-10 13:30:01 -- ended at 2024-06-10 15:17:45

This is quick, as it runs the date command only once!
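The `paste -d + - -` step is what re-pairs the two converted dates: with two `-` operands, paste consumes stdin two lines at a time and joins each pair with the `+` delimiter, which the final sed then rewrites. A minimal illustration:

```shell
# paste with two '-' operands reads stdin two lines at a time,
# joining each pair with the delimiter '+'
printf '%s\n' start1 end1 start2 end2 | paste -d + - -
# prints:
# start1+end1
# start2+end2
```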

... Or better:

sed 's/^started at \(.*\) \([+-][0-2][0-9][0-5][0-9]\) \(.*\) -- ended at \(.*\) \([+-][0-2][0-9][0-5][0-9]\) \(.*\)\./TZ="\2" \1 \3\nTZ="\5" \4 \6/' |
    TZ="Asia/Tehran" date -f - +'%F %T' |
    paste -d + - - |
    sed 's/^\(.*\)+\(.*\)$/started at \1 -- ended at \2/'

Here the original TZ offsets are extracted from the input.

2024-07-23
F. Hauri - Give Up GitHub

Solution


Here is the awk script:

function pad(n) {
    return sprintf("%02d", n)
}

function convert_date(y,m,d) {
    return int(y)"-"pad(index("JanFebMarAprMayJunJulAugSepOctNovDec", m) / 3 + 1)"-"pad(d)
}

function convert_time(t,ampm) {
    split(t, a, ":")
    if (ampm == "PM" && a[1] != 12) a[1] += 12
    if (ampm == "AM" && a[1] == 12) a[1] = 0
    return pad(a[1])":"pad(a[2])
}

{print $1, $2, convert_date($9, $4, $5), convert_time($6, $7), $10, $11, $12, convert_date($19, $14, $15), convert_time($16, $17) "."}
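A sketch of running it, assuming the script above is saved as convert.awk (the filename is my own choice) and the log is the datedlogs.txt from the first answer:

```shell
# Run the conversion over the log; plain awk, no GNU extensions needed
awk -f convert.awk datedlogs.txt
```

Note that it relies on every line having the old-format fields; lines already in the new format would need to be passed through separately.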
2024-07-23
Stas Simonov

Solution


Perl can handle this:

Create the log file with the mixed timestamps

cat >logfile <<END
started at Thu Jul 18 01:30:01 PM +0330 2024 -- ended at Thu Jul 18 05:48:36 PM +0330 2024.
started at Fri Jul 19 01:30:01 AM +0330 2024 -- ended at Fri Jul 19 04:47:38 AM +0330 2024.
started at Fri Jul 19 07:30:01 AM +0330 2024 -- ended at Fri Jul 19 10:43:25 AM +0330 2024.
started at Fri Jul 19 01:30:01 PM +0330 2024 -- ended at Fri Jul 19 05:51:24 PM +0330 2024.
started at 2024-07-19 19:30 -- ended at 2024-07-19 23:43.
started at 2024-07-20 01:30 -- ended at 2024-07-20 04:48.
started at 2024-07-20 07:30 -- ended at 2024-07-20 10:55.
END

and then normalize them:

perl -MTime::Piece -pe '
    s/[+-]\d{4} //g;
    s{(started|ended) at \K(\w{3} \w{3} \d{2} [\d:]{8} .. \d{4})}
     { Time::Piece->strptime($2, "%a %b %e %r %Y")->strftime("%F %H:%M") }ge;
' logfile
started at 2024-07-18 13:30 -- ended at 2024-07-18 17:48.
started at 2024-07-19 01:30 -- ended at 2024-07-19 04:47.
started at 2024-07-19 07:30 -- ended at 2024-07-19 10:43.
started at 2024-07-19 13:30 -- ended at 2024-07-19 17:51.
started at 2024-07-19 19:30 -- ended at 2024-07-19 23:43.
started at 2024-07-20 01:30 -- ended at 2024-07-20 04:48.
started at 2024-07-20 07:30 -- ended at 2024-07-20 10:55.

The first s/// command removes the timezone offset from the old-format timestamps. With the offset left in, Time::Piece would use it to shift the parsed timestamp to UTC, so I was seeing 10:00 instead of 13:30.

2024-07-23
glenn jackman

Solution


A native bash implementation -
My data:

$: cat log
original data
started at Wed Jun  5 08:45:01 PM +0330 2024 -- ended at Wed Jun  5 10:35:34 PM +0330 2024.
started at Thu Jun  6 01:30:01 AM +0330 2024 -- ended at Thu Jun  6 03:17:18 AM +0330 2024.
occasonal
started at Thu Jun  6 07:30:01 AM +0330 2024 -- ended at Thu Jun  6 09:19:19 AM +0330 2024.
started at Thu Jun  6 01:30:01 PM +0330 2024 -- ended at Thu Jun  6 03:19:16 PM +0330 2024.
other
started at Thu Jun  6 07:30:01 PM +0330 2024 -- ended at Thu Jun  6 09:16:15 PM +0330 2024.
started at Fri Jun  7 01:30:01 AM +0330 2024 -- ended at Fri Jun  7 03:17:47 AM +0330 2024.
started at Fri Jun  7 07:30:01 AM +0330 2024 -- ended at Fri Jun  7 09:03:05 AM +0330 2024.
lines
started at Fri Jun  7 01:30:01 PM +0330 2024 -- ended at Fri Jun  7 03:19:55 PM +0330 2024.
started at Fri Jun  7 07:30:01 PM +0330 2024 -- ended at Fri Jun  7 09:17:41 PM +0330 2024.
started at Sat Jun  8 01:30:01 AM +0330 2024 -- ended at Sat Jun  8 03:18:12 AM +0330 2024.
embedded
started at Sat Jun  8 07:30:01 AM +0330 2024 -- ended at Sat Jun  8 09:20:31 AM +0330 2024.
started at Sat Jun  8 01:30:01 PM +0330 2024 -- ended at Sat Jun  8 03:19:16 PM +0330 2024.
started at Sat Jun  8 07:30:01 PM +0330 2024 -- ended at Sat Jun  8 09:20:01 PM +0330 2024.
just
started at Sun Jun  9 01:30:01 AM +0330 2024 -- ended at Sun Jun  9 03:15:19 AM +0330 2024.
started at Sun Jun  9 07:30:01 AM +0330 2024 -- ended at Sun Jun  9 09:19:07 AM +0330 2024.
in
started at Sun Jun  9 01:30:01 PM +0330 2024 -- ended at Sun Jun  9 03:16:44 PM +0330 2024.
started at Sun Jun  9 07:30:01 PM +0330 2024 -- ended at Sun Jun  9 09:15:16 PM +0330 2024.
started at Mon Jun 10 01:30:01 AM +0330 2024 -- ended at Mon Jun 10 03:17:37 AM +0330 2024.
case
started at Mon Jun 10 07:30:01 AM +0330 2024 -- ended at Mon Jun 10 09:16:38 AM +0330 2024.
started at Mon Jun 10 01:30:01 PM +0330 2024 -- ended at Mon Jun 10 03:17:45 PM +0330 2024.
new data
started at 2024-07-19 19:30 -- ended at 2024-07-19 23:43.
started at 2024-07-20 01:30 -- ended at 2024-07-20 04:48.
started at 2024-07-20 07:30 -- ended at 2024-07-20 10:55.

My code -

$: cat tmp
#! /usr/bin/bash

declare -A mo=( [Jan]='1' [Feb]='2' [Mar]='3' [Apr]='4'  [May]='5'  [Jun]='6'
                [Jul]='7' [Aug]='8' [Sep]='9' [Oct]='10' [Nov]='11' [Dec]='12' )
declare -A H24=( [AM]=0 [PM]=12 )

#             1              2                   3            4            5                  6                                            7
pat='(start|end)ed.at.....(...)[[:space:]]+([0-9]+).([0-9][0-9]):([0-9][0-9]):[0-9][0-9].([AP]M)..[0-9][0-9][0-9][0-9].([0-9][0-9][0-9][0-9])'
#             8              9                  10           11           12                 13                                           14
pat="^$pat.--.$pat.\$"

while read -r line; do
  if [[ $line =~ $pat ]]; then
    syr="${BASH_REMATCH[7]}"
    printf -v smo '%02s' "${mo[${BASH_REMATCH[2]}]}"
    printf -v sdy '%02s' "${BASH_REMATCH[3]}"
    printf -v shr '%02d' "$(( 10#${BASH_REMATCH[4]} + ${H24[${BASH_REMATCH[6]}]} ))"
    smi="${BASH_REMATCH[5]}"
    eyr="${BASH_REMATCH[14]}"
    printf -v emo '%02s' "${mo[${BASH_REMATCH[9]}]}"
    printf -v edy '%02s' "${BASH_REMATCH[10]}"
    printf -v ehr '%02s' "$(( 10#${BASH_REMATCH[11]} + ${H24[${BASH_REMATCH[13]}]} ))"
    emi="${BASH_REMATCH[12]}"
    printf '%s\n' "started at $syr-$smo-$sdy $shr:$smi -- ended at $eyr-$emo-$edy $ehr:$emi."
  else echo "$line"
  fi
done < log

The output:

$: ./tmp
original data
started at 2024-06-05 20:45 -- ended at 2024-06-05 22:35.
started at 2024-06-06 01:30 -- ended at 2024-06-06 03:17.
occasonal
started at 2024-06-06 07:30 -- ended at 2024-06-06 09:19.
started at 2024-06-06 13:30 -- ended at 2024-06-06 15:19.
other
started at 2024-06-06 19:30 -- ended at 2024-06-06 21:16.
started at 2024-06-07 01:30 -- ended at 2024-06-07 03:17.
started at 2024-06-07 07:30 -- ended at 2024-06-07 09:03.
lines
started at 2024-06-07 13:30 -- ended at 2024-06-07 15:19.
started at 2024-06-07 19:30 -- ended at 2024-06-07 21:17.
started at 2024-06-08 01:30 -- ended at 2024-06-08 03:18.
embedded
started at 2024-06-08 07:30 -- ended at 2024-06-08 09:20.
started at 2024-06-08 13:30 -- ended at 2024-06-08 15:19.
started at 2024-06-08 19:30 -- ended at 2024-06-08 21:20.
just
started at 2024-06-09 01:30 -- ended at 2024-06-09 03:15.
started at 2024-06-09 07:30 -- ended at 2024-06-09 09:19.
in
started at 2024-06-09 13:30 -- ended at 2024-06-09 15:16.
started at 2024-06-09 19:30 -- ended at 2024-06-09 21:15.
started at 2024-06-10 01:30 -- ended at 2024-06-10 03:17.
case
started at 2024-06-10 07:30 -- ended at 2024-06-10 09:16.
started at 2024-06-10 13:30 -- ended at 2024-06-10 15:17.
new data
started at 2024-07-19 19:30 -- ended at 2024-07-19 23:43.
started at 2024-07-20 01:30 -- ended at 2024-07-20 04:48.
started at 2024-07-20 07:30 -- ended at 2024-07-20 10:55.

It should even be reasonably fast, though it won't be as fast as awk on any large dataset. There are no subcalls to date, since what you have is already (I assume) valid data, so it's just string parsing on the fields. read is kinda slow, but the regex should fail early on lines that don't match, and it's a reasonably efficient way to parse the lines that do. The script uses both %02s and %02d to show that both work, 10#... to force base-10 on values with leading zeroes in a math context, and small lookup tables for fast(-ish), easy conversions.
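The `10#` prefix matters because bash treats a leading zero as octal in arithmetic contexts, so the hour fields `08` and `09` would otherwise raise an error. A minimal demonstration:

```shell
# Without 10#, $(( 08 + 1 )) fails with "value too great for base"
# because a leading zero selects octal; 10# forces base-10.
echo $(( 10#08 + 1 ))
# prints: 9
```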

2024-07-24
Paul Hodges