XF 2.2 Unused attachments at bottom of posts - do they get removed over time?

Fullmental

Active member
We've been moving a lot of user generated images over to attachment files, and one of the things we've run into occasionally is if those attachment files get unlisted from the post (user copies and pastes an out of date version of their post from an external document, or they the post history is reverted), those attachments wind up with no reference to the post and they all get dumped at the bottom of the post like this:

1683407329030.webp

In some cases these folks may have hundreds of duplicated images in a single post, just chilling at the bottom of their post in an "attachments" section. It becomes impossible to manage.

Do these unreferenced attachments get cleaned up over time, or do they just sit there for eternity? Need to know if we have to go through manual effort to find and remove these or not.

I caught this post referencing something similar but not exact in XF 1.X, does it apply in 2.X, and if it does, does it still apply if the attachments are technically associated with a post, just unused in the post content?

 
OK. Perhaps you can help me understand the database relationship here then? Let's assume the number of posts and attachments we're talking about number in the 100,000's due to a change in how we handled content and proxying. It's a long story. Anyway, we can't reconcile it manually, but if we can find and remove the attachment's post reference, they would clean up, yes? We could do that if we were able to target the batch of posts that were affected, compare the content to find attachment IDs in the message bbcode vs attachment IDs associated with the post, and remove the reference to those attachments and posts where there is no bbcode ID match. At least, that's the theory.

I can see there is an "attach count" in the xf_post table, and the below information in xf_attachment, but I don't see how the two reference the post ID to actually link an attachment to a post:

1683408685833.png



Should I be looking elsewhere for this relationship? I thought it might be "content_id", but that doesn't seem to correspond with a post ID, because if I follow content ID of 1 for example, there is no attachment in post 1.
 
I think I found something. In xf_post there seems to be an embed_metadata field as a BLOB which contains attachments by ID.:

1683409742762.webp

If I remove a specific attachment from this BLOB and run the cleanup job, will it detect and remove that attachment I removed the reference from in xf_post? Any other adjustments I need to make to the database here or anywhere to ensure best practices/no errors? I would assume the attach_count might need to be updated manually too if there is no rebuild option for it.
 
I recommend using the function in the ACP to delete attachments: /admin.php?attachments/ .
That page doesn't seem have any filtering options or tools to mass delete other than a "select all" for 20 attachments at a time. That's helpful for managing maybe a few dozen problem posts, but we're talking about 100,000+ posts, and apparently 1.7 million attachments when filtering by the date range for said posts, with legitimate attachments likely dispersed in between going sequentially by attachment ID:

1683410280812.png

I really think a database driven approach is the best way to handle this given the sheer scale of the problem. I just need to be sure we're accounting for any other required updates at the same time.

I already have a custom database handler coded in python that can compare values within various cells in a table or set of tables, it would not be difficult to use that as a framework for a new script to compare attachment values and then update the blob and attach_count data accordingly. Handling the JSON-encoded BLOB values and doing a string compare to values within ATTACH bbcode is simple enough. I just need to know what the relevant relationships are between the data and know that I'm not missing anything vitally important for the function of the attachment system. We would obviously do this in a version of xf_post that was dumped from the live site to a local test instance, then we can do an update on the affected rows based on the results later during a maintenance window after we've confirmed both intended behavior and desired result.

In other words, I accept full responsibility for meddling with the database, but ask for guidance because we don't know what we don't know about the inner workings of the attachment system itself.
 
Last edited:
Still looking for a solution to this. Would the API be a better option? There's no documentation there so it's been a bit of a challenge to figure it out, but it looks like there is a delete endpoint. What we could do is run a get each post ID to see attachments for each affected post, check for a) image attachments that are b) not inserted by attachment ID, and then run an attachment/delete on the attachment ID. Would that be safer than a direct database manipulation, and would that achieve the goal of letting the XF install clean up the directories and restore our disk space? We'd really love to resolve this quickly as the extra storage space currently stuck on this server is quite literally costing us $10 a day vs our entire server before costing us $20 a month...
 
You would need to inspect all posts (or the message contents I suppose) and extract all of the attachment IDs which have been embedded in them, and then iterate over all of your attachments, skipping over the ones extracted previously, and delete them. If you don't want to do this via the XF framework (ie. using the entity system via an add-on or PHP script), then the API is probably your best since it will handle all of the pertinent record-keeping automatically.

Once the attachment records are cleaned up, the underlying data records and files will be removed by the hourly clean-up cron at a rate of 1,000 per hour. You can run the cron manually from the control panel, or (given the number of attachments involved) come up with a script to call \XF\Repository\Attachment::deleteUnusedAttachmentData more frequently until they are all removed.
 
You would need to inspect all posts (or the message contents I suppose) and extract all of the attachment IDs which have been embedded in them, and then iterate over all of your attachments, skipping over the ones extracted previously, and delete them. If you don't want to do this via the XF framework (ie. using the entity system via an add-on or PHP script), then the API is probably your best since it will handle all of the pertinent record-keeping automatically.

Once the attachment records are cleaned up, the underlying data records and files will be removed by the hourly clean-up cron at a rate of 1,000 per hour. You can run the cron manually from the control panel, or (given the number of attachments involved) come up with a script to call \XF\Repository\Attachment::deleteUnusedAttachmentData more frequently until they are all removed.
Thank you! This sounds very doable.

Is it possible to temporarily raise the limit of this job so we don't have to call it manually 1,700 times?
 
Not without editing files (to change the batch size, 1k) and/or enabling development mode (to change the cron frequency, each hour). If you script it, I would still recommend some sort of rate limiting. It'd probably be easiest to call that method in a loop with a delay in between each call.
 
Not without editing files (to change the batch size, 1k) and/or enabling development mode (to change the cron frequency, each hour). If you script it, I would still recommend some sort of rate limiting. It'd probably be easiest to call that method in a loop with a delay in between each call.
Would there be any negative effects if we took the hourly cleanup cron job and set it to run every minute instead? Presumably it does not take terribly long to run? Would be cleaner than writing a custom PHP script, and with a 1 minute delay in between, that should clear out everything within about 30 hours, which is fast enough for me.
 
Did you get an answer for this at all?
I realize this is very late, but no I wound up pulling a database backup to revert the action that lead to the excess attachments.

In fact we ran into a storage runaway situation tonight on a similar issue related to the image cache - apparently it is also limited to an arbitrarily low number during the hourly cleanup cron, which lead to us saturating our storage volume without knowing it. I had to manually trigger that hourly cleanup about 60 times to get it to catch up again. I'm extremely disappointed in how these cleanup jobs were implemented with no controls to handle runaway situations, and there's no overall storage limit for the image cache tool either.

I will have to figure out a way to invoke these PHP commands, but there seems to be no easy way to just directly reference the specific script to run from a crontab entry, and there isn't a CLI command for the hourly cleanup either, only one for "outstanding" jobs, but that's not going to trigger it any more frequently than once per hour.
 
Last edited:
You have a huge forum and only pay $20 a month for the server? In which country is that possible? Here, a server costs more like $250 a month, and it can only handle one large forum, not two.

What’s your problem?

You have lots of images on the server that no longer have corresponding posts? If so ...

The tedious approach would be to go through all the images and check the database to see if there’s still a reference. If not, don’t delete them immediately, but move them somewhere else first.

Alternatively, you could maybe do something smarter. Perhaps. For example, you could regenerate all the images. After that, they should have a current timestamp, and then you could move all the images with a significantly older timestamp.

If you’re sure everything matches, you can delete what you moved.

The remaining question is, what gets rewritten when you regenerate the images? Maybe everything that’s resized but not the original? In that case, you’ll need to cleverly query to ensure you get everything from the thumbnails.

Alternatively, and this is what I’ll try:

I’ll convert everything to AVIF with 2.3; then I’ll move all the JPGs away. If I understood correctly, we now have AVIF in 2.3, but I haven’t checked yet.


But maybe I didn’t fully understand the problem? Hopefully, I’ve at least solved mine in theory. I’ll go ahead and try it now.
 
Back
Top Bottom