Find duplicate content MySQL Query

nrep

Well-known member
It looks like I've got problems with duplicate content on one of my very old forums. There are some threads posted in different forums with the same title, username and first post text (but have differing replies).

I'd like to try and run a MySQL query that finds duplicate threads by matching titles, matching thread username and matching first post message text.

It's the matching first post message text that trips me up. I know that I need to get the "xf_post.message" content and see if there are duplicates where xf_thread.title and xf_thread.username have matches.

Can anyone help me nail this query please?

Here is what I have, without matching the xf_post.message content:

Code:
SELECT DISTINCT *
  FROM xf_thread t1
 WHERE EXISTS (SELECT * FROM xf_thread t2
                WHERE t2.thread_id <> t1.thread_id
                AND t2.title = t1.title
                AND t2.username = t1.username
               );
 
I'd like to try and run a MySQL query that finds duplicate threads by matching titles, matching thread username and matching first post message text.
SQL:
SELECT
    t.thread_id,
    t.title, COUNT(t.title),
    t.username, COUNT(t.username),
    p.message, COUNT(p.message)
FROM
    xf_thread t
INNER JOIN
    xf_post p ON p.thread_id = t.thread_id
WHERE
    p.position = 0
GROUP BY
    t.title,
    t.username,
    p.message
HAVING
    COUNT(t.title) > 1
    AND COUNT(t.username) > 1
    AND COUNT(p.message) > 1;
 
Oh boy, there are a few hundred threads I need to remove!

I think the best way I should deal with this is to delete all duplicates, leaving only the thread with the most number of views.

Is there a way I can get this query to return only threads that don't have the max(xf_thread.view_count) value? I tried adding the following to the WHERE clause, but that doesn't work:

t.view_count != max(t.view_count)

If I can get this working, it'll make it easy to delete all duplicates, leaving only 1 thread (based on view count).
 
Is there a way I can get this query to return only threads that don't have the max(xf_thread.view_count) value?
Not really easily, to exclude the non-max rows also results in excluding for duplicate comparison. Here's the next best, displaying the thread_id's and view_counts .. I'm sure with an export into a spreadsheet, you could do some sorting on the view counts.
SQL:
SELECT
    GROUP_CONCAT(t.thread_id),
    GROUP_CONCAT(t.view_count),
    t.title, COUNT(t.title),
    t.username, COUNT(t.username),
    p.message, COUNT(p.message)
FROM
    xf_thread t
INNER JOIN
    xf_post p ON p.thread_id = t.thread_id
WHERE
    p.position = 0
GROUP BY
    t.title,
    t.username,
    p.message
HAVING
    COUNT(t.title) > 1
    AND COUNT(t.username) > 1
    AND COUNT(p.message) > 1;
 
Many thanks @Mouth, I will have a play with that and see if I can figure out a way to do it. I appreciate the help (y).
 
I think this new query (below) may work, but it's awfuly slow. Is there anything I can do optimise it?

SQL:
SELECT DISTINCT *
FROM   xf_thread t1
WHERE  EXISTS (SELECT *
               FROM   xf_thread t2
               WHERE  t2.thread_id <> t1.thread_id
                      AND t2.title = t1.title
                      AND t2.username = t1.username
                      AND t2.view_count > t1.view_count
                      AND (SELECT message
                           FROM   xf_post
                           WHERE  post_id = t2.first_post_id) = (SELECT message
                                                                 FROM   xf_post
                                                                 WHERE
                          post_id = t1.first_post_id));
 
I think the best way I should deal with this is to delete all duplicates, leaving only the thread with the most number of views.

I would merge the smaller threads into the bigger one in order not to lose some useful replies. The posts that were OPs of the threads that got merged will need to be duplicated.

Alternatively move all posts from the smaller threads (except the duplicate OP) into the biggest thread.

It will take time but will help with SEO probably.

It's worth from now on ask your members to report any cross posting they see (make cross posting against your T & C).

Useful addon is Similar Threads by @AndyB this can find duplicate titles
 
These threads are so old and I'm not sure how to merge them with a MySQL query, at least not in a way to avoid more duplicates - so I'm aiming for a "least worst" result. The threads are over 15 years old now and only kept for legacy reasons, but there are a lot of duplicates in some cases.

Thankfully in the past 10 years there aren't really any duplicates :)
 
I meant merge them (or move posts) with a moderator action on the forum

There's just too many unfortunately, although if any of the threads have have replies in the last few years, I'll do those manually as you suggest :).
 
Back
Top Bottom