Question

Extracting specific substrings from a text field in SQL

Consider this snippet code to regenerate the result set:


CREATE SCHEMA "clean";


CREATE TABLE "clean"."InvoiceFact" (
  "InvoiceTitle" TEXT
);


INSERT INTO "clean"."InvoiceFact" ("InvoiceTitle") VALUES
('Renewal of advanced payment gateway services (12/22/2023 to 01/20/2024) -limoome1| Lemon'),
('The difference in payment and settlement services (05/15/2024 to 07/15/2024) -fxbotshop | Forex bot shop'),
('Advanced payment gateway services (01/15/2024 to 04/15/2024)-Khan.academy|Khan Academy'),
('The difference in payment and settlement portal services (10/24/2023 to 12/21/2023) -Frenchattractivemen2 | Jack J. Merc'),
('Advanced payment gateway services (09/04/2023 to 12/04/2023) -shakibaazemati'),
('The difference in advanced payment gateway services (04/21/2023 to 07/21/2023) - rFdbb | Toure Du France'),
('Renewal of advanced settlement services (05/25/2024 to 06/25/2024) + difference payment until 05/24 + Yalux hotel extension until 04/07-Ariaz|Ariaz Ads'),
('Renewal of advanced payment gateway services (11/23/2023 to 12/22/2023) - Broca'),
('Renewal of payment and settlement portal services (05/10/2024 to 06/09/2024) - inria2 | Educational Research Collection of Basic Education'),
('Renewal of payment portal services (03/22/2024 to 04/20/2024) - YooEnglishtube | Wow English Tube (settlement 0.2%)'),
('Renewal of advanced payment and settlement portal services (from 10/22/2023 to 01/20/2024)-googleads|Google advertising agency (first payment)'),
('Renewal of advanced payment gateway services (from 01/21/2024 to 04/21/2024) - accsell2| Axel store (first payment)'),
('Renewal +0.2% advanced payment and settlement gateway services (12/20/2023 to 03/20/2024) Yasamankumari | Dr. Kumarii training group (second payment)'),
('Advanced payment and settlement portal services (04/15/2023 to 08/13/2023) - YEnglishtube | Wow English Tube (settlement 0.2%)'),
('Renewal +0.2% advanced payment and settlement gateway services - (09/20/2023 to 12/20/2023) -Yasamankumari |Dr. Kumarii''s educational group'),
('Renewal of advanced payment gateway services (07/06/2024 to 10/06/2024) - Tesmino'),
('Renewal +0.2% advanced payment and settlement gateway services - (09/20/2023 to 12/20/2023) -Yasamankumari | Dr. Kumarii''s educational group'),
('settlement0.2% payment gateway services and advanced settlement (from 05/21/2024 to 06/20/2024) - komolife | Kumolife'),
('Advanced payment gateway services (09/13/2022 to 03/20/2024) -awajr9 | Ava Vincent Jr'),
('Advanced payment gateway services (07/15/2024 to 08/14/2024) - Cityfortnite'),
('Advanced payment gateway services (01/14/2024 to 04/14/2024) -Gifkart|Gifkart');



SELECT 
  "InvoiceTitle",
  CASE
    WHEN "InvoiceTitle" ILIKE '%difference%' OR 
         "InvoiceTitle" ILIKE '%first payment%' OR 
         "InvoiceTitle" ILIKE '%second payment%' THEN 'Installments'
    WHEN "InvoiceTitle" ILIKE '%Renewal%' THEN 'Renewal'
    ELSE 'New'
  END AS ServiceType,
  SUBSTRING("InvoiceTitle" FROM '\((.*?) to') AS StartDate,
  SUBSTRING("InvoiceTitle" FROM 'to (.*?)\)') AS EndDate,
  SUBSTRING("InvoiceTitle" FROM '-(.*?)(\||$)') AS UserName, 
  SUBSTRING("InvoiceTitle" FROM '\|(.*?)$') AS MerchantName
FROM "clean"."InvoiceFact"


This table has a text field named InvoiceTitle which contains various pieces of information delimited by special characters. Here's an example of the data:

'Renewal +0.2% advanced payment and settlement gateway services - (09/20/2023 to 12/20/2023) -Yasamankumari |Dr. Kumarii's educational group'

Renewal +0.2% advanced payment and settlement gateway services (12/20/2023 to 03/20/2024) Yasamankumari | Dr. Kumarii training group (second payment)

I'm trying to extract specific substrings from this field, namely the service type, start date, end date, username, and merchant name.

However, I'm facing an issue with the extraction of the username. For records like the example above, the username is extracted as (09/20/2023 to 12/20/2023) -Yasamankumari, but I want it to be Yasamankumari.

How can I modify my query to correctly extract the username, as well as MerchantName dose not contains extra info with parenthesis?

Additionally, an invoice may be paid in multiple installments, not just one or two. So, I need to generalize the '%second payment%' condition to support any number of installments.

How can I modify my query to handle any number of installments?


EDIT

enter image description here

Additionally, need to consider such cases:

Advanced payment gateway services - (06/19/1402 to 09/19/1402) -komoliVe | Kumolife
Fixed fee of Square advanced settlement service (4 months) - (KarmaP) - second installment
Advanced payment and settlement portal services - (06/28/1402 to 09/28/1402) - TaraRz |TaraRz
 3  69  3
1 Jan 1970

Solution

 3

For the user name you could require it doesn't contain any parentheses nor pipe symbols, nor hyphens (and drop the requirement for one) -- extend this list of undesired characters as needed:

  SUBSTRING("InvoiceTitle" FROM '([^()|-]*)(?:\||$)') AS UserName, 

For the merchant name you could take a similar approach and drop the requirement that it should match at the very end of the input ($):

  SUBSTRING("InvoiceTitle" FROM '\|([^()-]*)') AS MerchantName
2024-07-21
trincot