XF 2.2 Finding duplicated thread titles

briansol

Well-known member
That might be an XF thing. In sql, generally you'd continue to use the 'reference' after it's declared.

like,

if you do var x = 10 + 2.

you wouldn't say z = y + 10 + 2. you would say z = y + x.
 

Anatoliy

Well-known member
so what do I do about pagination?
as I understand when using sql query I can't use ->limitByPage(), $page, $perPage etc... those things that are used for pagination when using finder.
 

briansol

Well-known member
the abstract controller has:
Code:
public function getPaginationData($results, $page, $perPage, $total)

are you extending the abstractcontroller?


you should be able to use that. Several examples in XF native code end in "...Paginated" like:

Code:
protected function getCommentsOnProfilePostPaginated(\XF\Entity\ProfilePost $profilePost, $page = 1, $perPage = null)
	{
		$perPage = intval($perPage);
		if ($perPage <= 0)
		{
			$perPage = $this->options()->messagesPerPage;
		}

		$commentFinder = $this->setupCommentsFinder($profilePost);

		$total = $commentFinder->total();
		$this->assertValidApiPage($page, $perPage, $total);

		$commentFinder->limitByPage($page, $perPage);
		$postResults = $commentFinder->fetch()->toApiResults();

		/** @var \XF\Repository\Attachment $attachmentRepo */
		$attachmentRepo = $this->repository('XF:Attachment');
		$attachmentRepo->addAttachmentsToContent($postResults, 'profile_post_comment');

		return [
			'comments' => $postResults,
			'pagination' => $this->getPaginationData($postResults, $page, $perPage, $total)
		];
	}
 

Kirby

Well-known member
Error: Call to undefined method XF\Db\Mysqli\Statement::limitByPage() in src\addons\AV\ThreadTitles\Admin\Controller\TitleController.php at line 65
limitByPage() is a method of class XF\Mvc\Entity\Finder but as can be seen from the error message, $db->query() returns an instance of XF\Db\Mysqli\Statement.

This class does not have methods limitByPage() or total().

For your usecase (finding duplicate thread titles you've got two options

  1. Use the Finder system to fetch all thread titles and group them in PHP
  2. Use raw queries and group in MySQL
Option 1) isn't realistically usable due to the amount of data so your are basically forced to use option 2).

This means tht you have to everything manually, convenience ORM stuff like limitByPage() or groupBy() isn't possible with "low level" DB access.
Code:
<xf:foreach loop="$duplicates" value="$duplicate">
    <div class="block" style="margin-bottom: 3px">
        <div class="block-container">
            <div style="padding:10px;">
                <h3 style="margin:0">{$duplicate.title}</h3>
            </div>
        </div>
    </div>
</xf:foreach>

And I don't really understand why {$duplicate.title} works as $duplicates is array of arrays, not array of objects.

This is basically the equivalent of
PHP:
$duplicates = [
    [
        'title' => 'Some title',
        'COUNT(title)' => 2,
    ],
    [
        'title' => 'Another title ',
        'COUNT(title)' => 4,
    ],
];

$output = '';

foreach ($uplicates as $duplicate)
{
$output .= '<div class="block" style="margin-bottom: 3px">

        <div class="block-container">

            <div style="padding:10px;">

                <h3 style="margin:0">' . $duplicate['title'] . '

</h3>

            </div>

        </div>

    </div>';

}

(Btw, your should never-ever use inline styles - unless you know 110% waht you are doing and it is unavoidable).

As you can see, $duplicates is an array of arrays so foreach does work just fine.
In case of a Finder result the variable would be an instance of \XF\Mvc\Entity\FinderCollection whis is a class that implements ArrayAccess so it can be used in foreach as well.
Every etnry would be an instance of \XF\Mvc\Entity\Entity - which again implements ArrayAccess.

So how can you use raw queries from a repository? Just by using the DB class instance like your are doing in an controller.

This can be done in a myriad of ways:
PHP:
$db = $this->db();
$db = $this->em->getDb();
$db = \XF::em()->getDb();
$db = \XF::app()->em()->getDb();
$db = \XF::app()->em->getDb();
$db = \XF::db();
$db = \XF::app()->db;
$db = \XF::app()->db();

$app = \XF::app();
$app['db'];

$container = \XF::app()->container();
$db = $container['db'];
$db = $container->offsetGet('db');
$db = $container->__get('db');
// probably a dozen more variants ... I am too lazy to list them all :)

Now that you've got $db you can perform raw queries using method query.
In order to implement pagiation you have to use two queries (one to get the date and one to get the count, this is what Finder does behind the scenes if you call total()) and build pagination based on that.

In MySQL, COUNT() is a grouping function you can GROUP BY on the result of a grouping function, so
Code:
SELECT title, COUNT(title) AS mycount
FROM xf_thread
GROUP BY mycount
HAVING mycount > 1
ORDER BY mycount DESC
doesn't work.
 
Last edited:

Anatoliy

Well-known member
are you extending the abstractcontroller?
yes
Code:
<?php

namespace AV\ThreadTitles\Admin\Controller;
use XF\Admin\Controller\AbstractController;
class TitleController extends AbstractController
    {
you should be able to use that.
PHP:
        public function actionDuplicates()
        {
            $db = $this->app->db();

            $myQuery = $db->query('
            SELECT title, COUNT(title) AS mycount
            FROM xf_thread
            GROUP BY title
            HAVING mycount > 1
            ORDER BY mycount DESC
            ');

            $page = $this->filterPage();
            $perPage = 20;
            $total = $myQuery->total();
            $duplicates = $myQuery->fetchAll();

            $viewParams = [
            'duplicates' => $duplicates,
            'page' => $page,
            'perPage' => $perPage,
            'total' => $total,

           ];

               return $this->view('AV\ThreadTitles:TitleController\Duplicates', 'av_thread_titles_duplicates', $viewParams);           
        }

Error: Call to undefined method XF\Db\Mysqli\Statement::total() in src\addons\AV\ThreadTitles\Admin\Controller\TitleController.php at line 69
  1. AV\ThreadTitles\Admin\Controller\TitleController->actionDuplicates() in src\XF\Mvc\Dispatcher.php at line 352
  2. XF\Mvc\Dispatcher->dispatchClass() in src\XF\Mvc\Dispatcher.php at line 259
  3. XF\Mvc\Dispatcher->dispatchFromMatch() in src\XF\Mvc\Dispatcher.php at line 115
  4. XF\Mvc\Dispatcher->dispatchLoop() in src\XF\Mvc\Dispatcher.php at line 57
  5. XF\Mvc\Dispatcher->run() in src\XF\App.php at line 2352
  6. XF\App->run() in src\XF.php at line 524
  7. XF::runApp() in admin.php at line 13
🤷‍♂️
 

Kirby

Well-known member
the abstract controller has:
Code:
public function getPaginationData($results, $page, $perPage, $total)
\XF\Api\Controller\AbstractController has this method, not \XF\Pub\Controller\AbstractController or \XF\Admin\Controller\AbstractController!
 
Last edited:

Anatoliy

Well-known member
@Kirby I feel like you hit me in a head with a thick "Quantum physics" book.
:)

will read tomorrow again, trying to find familiar letters...
 

Kirby

Well-known member
Sorry :)

Unfortunately XenForo does require somewhat decent PHP knowledge (MVC, ORM, DI, Design Patterns, ...) so at first it has a somewhat steep learning curve for beginners that are not familiar with those.

Your repository method could be smth. like
PHP:
public function findDuplicateThreadTitles(int $page = 1, int $perPage = 20): array
{
    $db = $this->db();

    if ($page < 1)
    {
        $page = 1;
    }
 
    if ($perPage < 1)
    {
        $perPage = 20;
    }
 
    $offset = ($page - 1) * $perPage;

    $duplicates = $db->fetchAll('
        SELECT title, COUNT(title) AS dupes
        FROM xf_thread
        GROUP BY title
        HAVING dupes > 1
        ORDER BY dupes DESC
        LIMIT ?, ?
    ', [$offset, $perPage]);
 
    $total = $db->fetchOne('
        SELECT COUNT(*)
        FROM xf_thread
        GROUP BY title
        HAVING COUNT(title) > 1
    ');
 
    return [
        'duplicates' => $duplicates,
        'total' => $total,
    ];
}

and your controller code
PHP:
public function actionDuplicates()
{
    $page = $this->filterPage();
    $perPage = 20;

    $duplicateThreadsRepo = $this->getDuplicateThreadsRepo();
    $duplicates = $duplicateThreadsRepo->findDuplicateThreadTitles($page, $perPage);

    $this->assertValidPage($page, $perPage, $duplicates['total'], 'duplicated-thread-titles');

    $viewParams = [
        'duplicates' => $duplicates['duplicates'],
        'page' => $page,
        'perPage' => $perPage,
        'total' => $duplicates['total'],
    ];

    return $this->view('AV\ThreadTitles:TitleController\Duplicates', 'av_thread_titles_duplicates', $viewParams);
}

(Btw: The route violates resource standards rule #25 as well).
 
Last edited:

Anatoliy

Well-known member
and your controller code

$duplicateThreadsRepo = $this->getDuplicateThreadsRepo();

Error: Call to undefined method AV\ThreadTitles\Admin\Controller\TitleController::getDuplicateThreadsRepo() in src\addons\AV\ThreadTitles\Admin\Controller\TitleController.php at line 60
  1. AV\ThreadTitles\Admin\Controller\TitleController->actionDuplicates() in src\XF\Mvc\Dispatcher.php at line 352
  2. XF\Mvc\Dispatcher->dispatchClass() in src\XF\Mvc\Dispatcher.php at line 259
  3. XF\Mvc\Dispatcher->dispatchFromMatch() in src\XF\Mvc\Dispatcher.php at line 115
  4. XF\Mvc\Dispatcher->dispatchLoop() in src\XF\Mvc\Dispatcher.php at line 57
  5. XF\Mvc\Dispatcher->run() in src\XF\App.php at line 2352
  6. XF\App->run() in src\XF.php at line 524
  7. XF::runApp() in admin.php at line 13
🤕
 

Anatoliy

Well-known member
I'm really appreciate for your help, but this piece is way bigger than I can chew. )

I found a quick fix. I fetch all sorted by mycount desc in $duplicates
Code:
            $duplicates = $db->query('
            SELECT title, COUNT(title) AS mycount
            FROM xf_thread
            GROUP BY title
            HAVING mycount > 1
            ORDER BY mycount DESC
            ')->fetchAll();

then I grab 100 and put it in another array
Code:
            $myDuplicates = array();

            for ($x = 1; $x <= 100; $x++) {
                foreach ($duplicates as $title => $mycount){
                    $myDuplicates[$title] = $mycount;
                    }
            }

and pass $myDuplicates to a template. and now I'm not afraid of thousands results so it will do just fine without a pagination.
Correct? Or I'm missing something?
 

Kirby

Well-known member
Error: Call to undefined method AV\ThreadTitles\Admin\Controller\TitleController::getDuplicateThreadsRepo() in src\addons\AV\ThreadTitles\Admin\Controller\TitleController.php at line 60
🤕
Well, I though you knew how to use a repository as your Add-on does have one :)

As you can see from the error message, method getDuplicateThreadsRepo does not exist.

Now the question I would like you to answer is:
Why is it missing and how can this be fixed?

PHP:
for ($x = 1; $x <= 100; $x++)
{
	foreach ($duplicates as $title => $mycount)
	{
		$myDuplicates[$title] = $mycount;
	}
}
Pretty complicated approach, using array_slice would be a lot easier.
(But IMHO you shouldn't like this in PHP anyway)

and pass $myDuplicates to a template. and now I'm not afraid of thousands results so it will do just fine without a pagination.
Correct? Or I'm missing something?
On a cufficiently large board (eg. a few 100K threads), $duplicates could easily be a few 10K rows.
That is a lot of data for a PHP array and requires quite a bit of RAM (+ network fransfer from DB + CPU).
IMHO this at least somewhat borderlines resource standards rule #8

Therefore I'd at least implement pagination with SQL, but even that might be too resource heavy - ideally the data should be aggregated somehow (maybe once/week or so via cron?) and only the aggregated data should be queries (via repository) from acp controllers.
 
Last edited:

Anatoliy

Well-known member
Now the question I would like you to answer is:
Why is it missing and how can this be fixed?
didn't I ask that by posting a single line from your code following by error "call to undefined method"? )
so why is it missing and how can this be fixed?
 

Anatoliy

Well-known member
Pretty complicated approach, using array_slice would be a lot easier.
(But IMHO you shouldn't like this in PHP anyway)
I'll google for slice. I guess it will drop items > limit I would indicate (say 100), so yeah, no need to run a loop 100 times. but I heard that php makes 1m operations in a second, so i guess the loop of 100 doesn't take much time.
On a cufficiently large board (eg. a few 100K threads), $duplicates could easily be a few 10K rows.
That is a lot of data for a PHP array and requires quite a bit of RAM (+ network fransfer from DB + CPU).
IMHO this at least somewhat borderlines resource standards rule #8

Therefore I'd at least implement pagination with SQL, but even that might be too resource heavy - ideally the data should be aggregated somehow (maybe once/week or so via cron?) and only the aggregated data should be queries (via repository) from acp controllers.
this part I don't understand much yet. if I understood correctly, I have to go though all the rows so $mycount will show actual numbers of duplicates. when I had limit 100 in my sql query it returned 100 items with $mycount == 2.

my forum has 17k threads and a page "duplicates" opens as fast as any other xF page. maybe LIMIT 20000 in sql query will do the thing? it will protect from cases where forum has 100sK, and will show realistic $mycount if a board has less than 20k of threads and somewhat realistic $mycount for bigger ones.
 

Kirby

Well-known member
didn't I ask that by posting a single line from your code following by error "call to undefined method"? )
so why is it missing and how can this be fixed?
You should be able to answer that yourself ;)
The method is missing because your controller has to implement it and apparently you didn't do that (yet).

Here is a a smiliar method from a XenForo Admin Controller that should give you an idea how to implement the missing method in your controller.
PHP:
/**
 * @return \XF\Repository\ErrorLog
 */
protected function getErrorLogRepo()
{
    return $this->repository('XF:ErrorLog');
}

my forum has 17k threads and a page "duplicates" opens as fast as any other xF page.
17K isn't that much so you probably won't feel that it is slow.
Run an EXPLAIN on that query - what does it show?

Adding a LIMIT doesn't affect the dupe counts at all, it just limits the number of duplicate titles being returned.
It also doesn't affect query perfomance much, the runtime will pretty much be the same for LIMIT 10 or LIMIT 1000.
So I don't really understand " will show realistic $mycount if a board has less than 20k of threads and somewhat realistic $mycount for bigger ones".
 
Last edited:

Anatoliy

Well-known member
Here is a a smiliar method from a XenForo Admin Controller that should give you an idea how to implement the missing method in your controller.
PHP:
/**
 * @return \XF\Repository\ErrorLog
 */
protected function getErrorLogRepo()
{
    return $this->repository('XF:ErrorLog');
}

that's cheating ) without this example there is no way I understood that I have to fetch a xf method. I thought that you named it wrong, or place the same method twice or something like that, because it was in the middle of the night.
I'll try. now it looks understandable.

So I don't really understand " will show realistic $mycount if a board has less than 20k of threads and somewhat realistic $mycount for bigger ones".
I have to say that I read new things that I barely understand and - "well, probably this..." then I do and results are not what I expected so I make assumptions - "well, probably that". so now my head is fool of a mix of "probably this" and "probably that", instead of knowledge. )

when I had LIMIT 100 in my sql query, I believe template showed 100 titles with mycount 2. when without LIMIT it shows titles with mycount 6 & 5. so I made an assumption that limit stops queiry when it have 100 items with count > 1. and that's why mycount is 2. but if the query would go all the way, mycounts of some titles would become 3, 4, 5, 6.
 

Anatoliy

Well-known member
And by the way, I don't see any need in pagination for dups as I don't see any need to list all duplicates. 20-50 with biggest mycount is a big enough list to work with. I made it 100 just to be sure that it's double big enough. And as I deal with the biggest mycount, new pops up in a list, so on the top of my list will be mycount 6, after some time of me fixing it - 5, then 4 etc...

So I guess I'll keep it as it is, will just add LIMIT to my sql query. If limit 100 as you are saying returns 100 items with real mycount, then it's just what I need - no keepeng 100sK in array, no need in pagination.
 

Anatoliy

Well-known member
ideally the data should be aggregated somehow (maybe once/week or so via cron?) and only the aggregated data should be queries (via repository) from acp controllers.
ok, after digesting this information for some time I understand that that's how it should be.
so I'm planning to add a custom table and trying to figure out an architecture or whatever you call it.
let's say there is a cron that runs hourly, checks for duplicates by small chunks and put data into that custom table. and when an admin opens the "Duplicates" page of the add-on, query fetches just 20 items from that custom table regardless if a forum is small or huge. everything's perfect.

The next thought that hits my head is that I'm going to put in that custom table a lot of info from xf_thread - id, title, reply_count, post_date, last_post_date, probably node_id and maybe something else. And when the cron jobs make a fool loop, I'll have this my custom table that is almost a clone of xf_thread table. Maybe I want not a custom table but a custom column in xf_thread that will hold dups_count?
 
Top