Fixed PostEmbedMetadata::getIdsToRebuild has horrible performance characteristics

Xon

Well-known member
Affected version
2.0.9
XF\Job\PostEmbedMetadata::getIdsToRebuild does a select and in the where clause has "attach_count > 0"

This causes what is effectively a full-table scan if there is statistically very few attachments. Worse, the batch-size will be reduced, except this makes the performance worse are it means more table-scans to get usable post-ids.

Instead of;
SQL:
SELECT post_id
FROM xf_post
WHERE post_id > 19209
   AND attach_count > 0
ORDER BY post_id
LIMIT 293

Explain
Code:
+------+-------------+---------+-------+---------------+---------+---------+------+---------+-------------+
| id   | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows    | Extra       |
+------+-------------+---------+-------+---------------+---------+---------+------+---------+-------------+
|    1 | SIMPLE      | xf_post | range | PRIMARY       | PRIMARY | 4       | NULL | 5935051 | Using where |
+------+-------------+---------+-------+---------------+---------+---------+------+---------+-------------+

Why not use;
SQL:
select distinct content_id
from xf_attachment
where content_type = 'post'
and content_id > 19209
order by content_id
limit 293;

Explain
Code:
+------+-------------+---------------+-------+----------------------+----------------------+---------+------+------+--------------------------+
| id   | select_type | table         | type  | possible_keys        | key                  | key_len | ref  | rows | Extra                    |
+------+-------------+---------------+-------+----------------------+----------------------+---------+------+------+--------------------------+
|    1 | SIMPLE      | xf_attachment | range | content_type_id_date | content_type_id_date | 31      | NULL |  414 | Using where; Using index |
+------+-------------+---------------+-------+----------------------+----------------------+---------+------+------+--------------------------+

In my environment this is the difference of ~4 minutes to run the 1st query (per iteration!!) vs 0.01 seconds for the 2nd


This will then be accurate where as attach_count may not be, and is a horrible selector. If needed a join from xf_attachment => xf_post can be added for additional filtering criteria.
 
It's worth noting that generally, I haven't seen the horrible performance characteristics you've mentioned, though I accept that it will vary based on data size and server configuration. I'm curious if the 4 minutes you saw was on a production-tuned server (where the data is likely in memory). While our data set isn't as big, the first query takes < 0.01 seconds here. Even if I tune it to effectively cover our full data set (`attach_count > 50`), it's still approximately 0.3 seconds.

While this does effectively need to fully scan the post table over time, it should scan an index for the ordering and then it's a matter of finding the first X posts that have an attachment (and then stopping). Fewer attachments will take more time, though equally, they should also take far fewer passes.

As a general comment, this tool is designed to be generic, though we don't inherently have anything else with it right now. Saying that, if we didn't use a condition, it could be used to retroactively detect unfurls or to detect some media embeds more accurately. I suppose we could change the query significantly for the specific attachment case, though I haven't heard of other issues with this. It is run after a 2.0 upgrade, though it's very possible that people aren't aware that it's happening.
 
One thing which makes this worse is that AbstractEmbedMetadataJob records the start time. it queries for IDs, then it checks to see if the max run-time has expired. Then it actually processes a loop, except if the query takes too long it does no work except for repeatedly querying the same ids until the batch size becomes small enough it might finish.

Running SELECT post_id FROM xf_post WHERE post_id > 19209 AND attach_count > 0 ORDER BY post_id LIMIT 293; my new dev machine that query takes ~45 seconds. On the older machine it takes ~3 minutes.

Changing innodb from 512mb -> 2gb (increasing the VM's ram) doesn't make a huge difference. I suspect the SSD in my older machine needs updating.
 
Bit of bug necromancy going on here, but @Xon could you post an EXPLAIN SELECT post_id FROM xf_post WHERE post_id > 19209 AND attach_count > 0 ORDER BY post_id LIMIT 293 ?
 
Bit of bug necromancy going on here, but @Xon could you post an EXPLAIN SELECT post_id FROM xf_post WHERE post_id > 19209 AND attach_count > 0 ORDER BY post_id LIMIT 293 ?

Grabbed from a more recent snapshot, rather than the hilariously old snapshot I was testing against;
Code:
EXPLAIN
    -> SELECT post_id
    -> FROM xf_post WHERE post_id > 19209 AND attach_count > 0
    -> ORDER BY post_id
    -> LIMIT 293;
+------+-------------+---------+-------+---------------+---------+---------+------+---------+-------------+
| id   | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows    | Extra       |
+------+-------------+---------+-------+---------------+---------+---------+------+---------+-------------+
|    1 | SIMPLE      | xf_post | range | PRIMARY       | PRIMARY | 4       | NULL | 6928588 | Using where |
+------+-------------+---------+-------+---------------+---------+---------+------+---------+-------------+

(EXPLAIN format=json has more details)
Code:
EXPLAIN format=json
    -> SELECT post_id
    -> FROM xf_post WHERE post_id > 19209 AND attach_count > 0
    -> ORDER BY post_id
    -> LIMIT 293;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "xf_post",
      "access_type": "range",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["post_id"],
      "rows": 6928588,
      "filtered": 100,
      "attached_condition": "xf_post.post_id > 19209 and xf_post.attach_count > 0"
    }
  }
} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

ANALYZE format=json is Mariadb's run query & show execution plan, with extra details
Code:
ANALYZE format=json
    -> SELECT post_id
    -> FROM xf_post WHERE post_id > 19209 AND attach_count > 0
    -> ORDER BY post_id
    -> LIMIT 293;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ANALYZE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 116550,
    "table": {
      "table_name": "xf_post",
      "access_type": "range",
      "possible_keys": ["PRIMARY"],
      "key": "PRIMARY",
      "key_length": "4",
      "used_key_parts": ["post_id"],
      "r_loops": 1,
      "rows": 6928588,
      "r_rows": 1.15e7,
      "r_total_time_ms": 114504,
      "filtered": 100,
      "r_filtered": 0.0025,
      "attached_condition": "xf_post.post_id > 19209 and xf_post.attach_count > 0"
    }
  }
} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1 min 56.436 sec)

These where run on a non-compressed xf_post table! Running the ANALYZE format=json against the live forums.spacebattles.com database takes over 2 minutes 30 seconds with ~34.8 million rows
 
Thank you for reporting this issue. It has now been resolved and we are aiming to include it in a future XF release (2.1.4).

Change log:
Not exactly a fix, but a workaround as we currently have no need of the content-table-specific queries while we only support attachments. We will need to revisit this if we extend the metadata rebuild to support stuff other than just attachments, most likely with an index on `attach_count` or something similar.
Any changes made as a result of this issue being resolved may not be rolled out here until later.
 
Last edited by a moderator:
Top Bottom