Removing old quotes from imported data

nrep

Well-known member
I'm considering importing data from old custom forum software, where some of the really old posts contain poorly formatted data (primarily because this data has already been imported several times before).

The problem is that many of the old posts have nested quote tags below the main content. For example:

Code:
Here is the main part of the post - useful information.

[QUOTE=User 2]
Old quote post, not useful to keep below the text
     [QUOTE=User 3]
     A useless nested quote
     [/QUOTE]
[/QUOTE]
[QUOTE=User 4]Separate useless quote[/QUOTE]

I'm struggling to figure out a way to remove these. I've not been able to figure out a regex that can filter out any quotes or nested quotes placed at the end of the post content. The closest I've found is to use the following, but it only works when there are no nested quotes at the end:

/(.*)\n\[QUOTE=(.+)\](.+?)\[\/QUOTE\]$/is

I would keep all of the data returned from the first (.*).

However, on posts with multiple nested quotes at the end, I can't figure out a regex that fully works: it becomes too greedy and fails under certain conditions.

I'd be grateful for a fresh pair of eyes to consider this problem and see if I've missed an easier way to do this.
 
I'd say regex is one of the least suitable tools for this, as it wasn't designed for nested patterns like yours. There are ways to write recursive regexes, but I never got one of them working in a way that felt really satisfying.
 
Thanks for the reply.

Have you got any suggestions for other ways to approach this? I've had so many problems with regex, as I can find ones that work 95% of the time but will then break something when unexpected formatting occurs.

I need a way to remove all [QUOTE] content from the end of a post, whether those trailing quotes appear individually, nested, or one after the other, while leaving quotes intact if they don't appear at the end.
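One way to meet that requirement without a single all-in-one regex is to use a very small regex only to find the tags, and let ordinary code count the nesting depth. The following is a minimal sketch in Python, not a tested importer: the names `TAG_RE`, `top_level_quote_spans` and `strip_trailing_quotes` are made up for illustration, and it assumes tags look like `[QUOTE]`, `[QUOTE=Name]` and `[/QUOTE]`. Posts with unbalanced tags at the end are deliberately left untouched.

```python
import re

# Matches an opening tag ([QUOTE] or [QUOTE=Name]) or a closing tag ([/QUOTE]).
# Group 1 is "/" for a closing tag and "" for an opening tag.
TAG_RE = re.compile(r"\[(/?)QUOTE(?:=[^\]]*)?\]", re.IGNORECASE)


def top_level_quote_spans(text):
    """Return (start, end) offsets of every balanced top-level quote block."""
    spans = []
    depth = 0
    start = None
    for m in TAG_RE.finditer(text):
        if m.group(1) != "/":
            # Opening tag: remember where the outermost block begins.
            if depth == 0:
                start = m.start()
            depth += 1
        elif depth > 0:
            # Closing tag: a block is complete when depth returns to zero.
            depth -= 1
            if depth == 0:
                spans.append((start, m.end()))
        # A closing tag seen at depth 0 is a stray tag and is ignored here.
    return spans


def strip_trailing_quotes(text):
    """Remove top-level quote blocks that sit at the very end of the post."""
    cut = len(text)
    # Walk the blocks from last to first, dropping each one as long as
    # only whitespace separates it from the current end of the post.
    for start, end in reversed(top_level_quote_spans(text)):
        if text[end:cut].strip():
            break  # real content follows this block, so keep it and stop
        cut = start
    return text[:cut].rstrip()
```

Calling `strip_trailing_quotes(post)` on a post that ends in nested or consecutive quotes should return only the text before them, while a quote followed by any non-whitespace content is kept. A post that consists of nothing but quotes comes back empty, so that case may need special handling. The logic is a plain loop over tag matches, so it should port to whatever language the importer is written in.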
 
Hmm, maybe with a DOM parser and a somewhat hackish approach: convert the quote BBCode into an HTML tag, strip all the children, and convert it back. That might not improve the situation much, but a DOM would at least be a suitable structure for nesting. Someone else probably has a better approach.
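The tree idea can also be tried without going through HTML at all, by building a tiny tree straight from the BBCode. This is only a rough sketch in Python under stated assumptions: the `Quote` class and the `parse`, `render` and `drop_trailing_quotes` functions are hypothetical names, tags are assumed to be `[QUOTE]`, `[QUOTE=Name]` and `[/QUOTE]`, and rendering always writes the closing tag as `[/QUOTE]`, so an unclosed or lowercase tag would not round-trip exactly.

```python
import re
from dataclasses import dataclass, field

# Matches either an opening quote tag or a closing quote tag.
TOKEN_RE = re.compile(r"\[QUOTE(?:=[^\]]*)?\]|\[/QUOTE\]", re.IGNORECASE)


@dataclass
class Quote:
    open_tag: str                                  # "" for the root node
    children: list = field(default_factory=list)   # str or Quote items


def parse(text):
    """Build a tree of text chunks and nested Quote nodes."""
    root = Quote(open_tag="")
    stack = [root]
    pos = 0
    for m in TOKEN_RE.finditer(text):
        if m.start() > pos:
            stack[-1].children.append(text[pos:m.start()])
        tag = m.group(0)
        if tag[1] != "/":
            # Opening tag: start a child node and descend into it.
            node = Quote(open_tag=tag)
            stack[-1].children.append(node)
            stack.append(node)
        elif len(stack) > 1:
            stack.pop()                        # closing tag ends current quote
        else:
            stack[-1].children.append(tag)     # stray closer kept as plain text
        pos = m.end()
    if pos < len(text):
        stack[-1].children.append(text[pos:])
    return root


def render(node):
    """Turn a tree back into BBCode text."""
    inner = "".join(c if isinstance(c, str) else render(c) for c in node.children)
    return inner if not node.open_tag else f"{node.open_tag}{inner}[/QUOTE]"


def drop_trailing_quotes(text):
    """Drop quote nodes (and whitespace) from the end of the root node only."""
    root = parse(text)
    kids = root.children
    while kids and (isinstance(kids[-1], Quote) or not kids[-1].strip()):
        kids.pop()
    return render(root).rstrip()
```

Because only the root's last children are removed, a quote in the middle of a post, nested or not, is rendered back unchanged. Whether the extra structure is worth it over simply counting depth is a judgment call; the tree mostly pays off if other clean-up passes need to walk the same posts.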
 