Fixed Query to fetch threads in a forum

digitalpoint

Well-known member
The XenForo_Model_Thread::prepareThreadFetchOptions() method has some less than ideal joins and calculations going on...

INNER JOINs are noticeably slower than LEFT JOINs (because they need to actually be looked at for exclusions). This gets *really* slow when you have a forum with hundreds (or thousands) of pages of threads... All the threads from previous pages need to actually perform the join in order to see if they are to be excluded from the list.

In the case of this method, the INNER JOINs are really serving no purpose other than to do a double check on data integrity (don't show threads unless the node exists, don't show threads unless the forum exists, don't show the thread unless it has a first post, etc) Doing data integrity checks at the query level is really not worth the huge amount of resources needed when you start getting into high page numbers.

The other issue in this method is SQL IF() statements are even more resource intensive than the JOINs because they also need to be evaluated for all threads in previous pages. This part in particular:

Code:
IF(thread_user_post.user_id IS NULL, 0, thread_user_post.post_count) AS user_post_count

I've modified the method so all the INNER JOINs are LEFT JOINs and changed the above IF() to just return the count regardless if it's NULL or not by changing it to this:

Code:
thread_user_post.post_count AS user_post_count

Doing this made the query to get threads in a forum at a very high page number go from 4-5 MINUTES to 2-3 seconds with no noticeable ill effects. The JOINS are just data integrity JOINs that really should never happen... and if something WAS wrong with a thread, wouldn't you want to know/see it? The user_post_count is used for hovering over the small avatar, and it works the same after the change.
 
Also confirmed.

Before fix loading page 854:
Timing: 15.4578 seconds Memory: 6.539 MB DB Queries: 17

After fix:
Timing: 1.1214 seconds Memory: 6.649 MB DB Queries: 17


I have a question, however.

Why didn't they use LEFT JOIN's instead of INNER in the first place?

They did choose to use LEFT JOIN in some circumstances in this function, but others are INNER.

I can only assume there must have been some sort of reason.

Nice work though, Shawn. It clearly works and makes a huge difference.
 
Why didn't they use LEFT JOIN's instead of INNER in the first place?

They did choose to use LEFT JOIN in some circumstances in this function, but others are INNER.

I can only assume there must have been some sort of reason.
I skimmed over it... it looks like they are using LEFT JOIN for records that may or may not be there (things like if the thread was read or not, if the forum was read or not, deletion logs, etc.)

The INNER JOINs are used on stuff that *should* be there, but if parts are not there, you have a data integrity problem so the INNER JOIN essentially hides threads that have data integrity problems.

I ended up using this method to extend the default XenForo_Model_Thread::prepareThreadFetchOptions() method... optimizes the query without needing to edit the files:


PHP:
	public function prepareThreadFetchOptions(array $fetchOptions)
	{
		$response = parent::prepareThreadFetchOptions($fetchOptions);

		$response['joinTables'] = str_replace ('INNER JOIN', 'LEFT JOIN', $response['joinTables']);
		$response['selectFields'] = str_replace (array('IF(thread_user_post.user_id IS NULL, 0, thread_user_post.post_count) AS user_post_count', 'IF(user.username IS NULL, thread.username, user.username) AS username'), array('thread_user_post.post_count AS user_post_count', 'thread.username AS username'), $response['selectFields']);
		
		return $response;
	}

When it comes down to it, INNER JOIN should never be used on large sets of data, as it simply can't scale well (since you have to actually perform the JOIN on every record before the record you are actually looking at). IF() calculations don't scale well either because the records all need to be evaluated (and dumped to a temporary table) for every record up to the one you are looking at in case you are doing an evaluation of the result of the calculation.

While there are cases where you can't avoid using INNER JOIN (hopefully just on small sets of data), using it for full data integrity checking every time you are getting records is just not a good idea if you want to be able to scale out.

Maybe I'll put together an article (or add-on) that optimizes stuff like I did for vBulletin 4 (see this).
 
Back
Top Bottom