XF 2.2 Is there a way to stop repeated/redundant attachment data? (taking up a lot of space)

Feanor

Active member
I'm trying to find ways to free up space on my server, and the first place I looked was the internal_data folder since it's so large. I noticed some attachments were repeated (same file_hash), so I downloaded the files and confirmed they are in fact exactly the same image.

Sometimes users post the same image across the forum and it can add up. Is there a reason XF needs to create a new file for the attachment in internal_data if it's exactly the same?

I decided to use a SQL query to check how much space could be freed if these all used the same image file, and it's at least 7 GB for my forum. That's a lot of space.

This is the query I used:
SQL:
SELECT SUM(redundant_space) AS total_redundant_bytes
FROM
(
    -- one row per duplicated file; all copies share file_hash and file_size
    SELECT file_hash,
           file_size,
           COUNT(*) AS copies,
           file_size * (COUNT(*) - 1) AS redundant_space
    FROM xf_attachment_data
    GROUP BY file_hash, file_size
    HAVING COUNT(*) > 1
) attachdata

It could actually be more than that because I found some additional duplicate images in internal_data that weren't in the xf_attachment_data table for some reason.
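In case anyone wants to reproduce that on-disk check: a read-only scan with standard GNU tools (md5sum/sort/uniq; the path is the default XF layout) groups duplicate files by content hash, including ones the table doesn't know about:

Code:
# group attachment files by MD5 so on-disk duplicates appear together
# (read-only: nothing is modified)
find internal_data/attachments -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate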
 
Disk space is so cheap these days, does it even make sense to try and track down even a 7GB saving?

You also have issues of “ownership”: say user A uploads an image and then user B uploads the same image later. You can’t really have them both use the same one, because what if user A deletes the image he originally uploaded?

FWIW, you could move that 7GB off your server completely into a cloud storage system (Cloudflare’s R2) for a cost of $0.015 per GB per month, so the 7GB you are trying to save has a cost of about 10 cents per month.
 
Disk space is so cheap these days, does it even make sense to try and track down even a 7GB saving?
Indeed it does. I have a few sites, and one is a very small non-profit on a 500 MB quota. I don't yet have a forum on there, but if I did (which is highly possible) I would want to be as economical as possible with image sizes. 7GB would make a huge difference to the hosting costs.
 
Well, if the cost is more than $0.015 per GB per month, just move it off your server? Then you don’t have to pay the hosting company for storage. Unlimited attachments/avatars, etc. taking up 100GB would cost $1.35 per month if you were to move them off your server (R2’s first 10GB are free, so you only pay for 90GB).
 
For me it's not about the cost but about efficiency. Why store a second copy when you already have one?

You also have issues of “ownership”: say user A uploads an image and then user B uploads the same image later. You can’t really have them both use the same one, because what if user A deletes the image he originally uploaded?
That makes sense. I also think that is the reason. But I believe there is a way around it. You can store who used the image and where in the DB. But it might not be worth all the changes.
 
Ya, basically the attachment permission system would need a redesign from the ground up to support multiple users “owning” a single attachment. Doesn’t seem like it would be worth the effort when things like cloud-based storage are so cheap they’re effectively free.
 
So 3.0 then. That's where they do major rewrites. 😄
This would actually go hand in hand with an attachment manager where users can reuse previously uploaded images. :p
 
Code:
# replace duplicate files under the attachments tree with hardlinks to a single copy
rdfind -makehardlinks true ./internal_data/attachments
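If you'd rather see what it would do before letting it loose, rdfind also has a dry-run mode that only reports:

Code:
# preview: report what would be hardlinked without changing anything
rdfind -dryrun true -makehardlinks true ./internal_data/attachments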
Using hardlinks is one option (if the local driver is used as the storage backend), but it's not without risk of application-level data corruption or data loss:
If there is code that modifies an existing attachment on disk without also changing the hash (this is not the case with standard XenForo, but there are add-ons that do this), the hardlinked file would be overwritten and all copies would be affected by the change (instead of just one, as would be the case without hardlinks).
Furthermore, this would cause all hardlinked files to have the same metadata (permissions, owner, timestamps).

Other options that don't touch XenForo code (example commands after the list):
  • Use a filesystem that has support for in-band/online deduplication (for example ZFS)
    Pro ZFS: Mature technology, can also do compression
    Con ZFS: Out-of-tree modules (e.g. might not be easily available), possible license issues, requires quite a bit of CPU and RAM for deduplication, usually requires "reformatting the volume" (if it is not already ZFS)
  • Use a filesystem that has support for out-of-band/offline deduplication (for example Btrfs)
    Pro Btrfs: Available with mainstream kernels/distributions, can also do compression
    Con Btrfs: Might not be that stable, requires additional tools for deduplication, usually requires "reformatting the volume" (if it is not already Btrfs)
  • Use a filesystem with support for reflinks (for example XFS)
    Pro XFS: Mature technology, available with mainstream kernels/distributions
    Con XFS: Requires additional tooling for "deduplication", usually requires "reformatting the volume" (if it is not already XFS)
  • Use LVM VDO
    Pro LVM VDO: Filesystem-independent, can also do compression
    Con LVM VDO: Relatively "new" technology, so might not be easily available, usually requires reformatting the volume (if it is not already backed by VDO)
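To make the ZFS and Btrfs/XFS options above concrete, here's roughly what each looks like (pool and path names are made up for illustration; duperemove is one of several offline dedup tools):

Code:
# ZFS: enable inline dedup per dataset (RAM-hungry; weigh the cost first)
zfs set dedup=on tank/internal_data

# Btrfs/XFS: offline dedup of existing files with duperemove
# -d actually performs the dedup, -r recurses into subdirectories
duperemove -dr /srv/xenforo/internal_data/attachments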

Ya, basically the attachment permission system would need a redesign from the ground up to support multiple users “owning” a single attachment.
Not necessarily.

Table xf_attachment_data already has a reference counter (attach_count), so theoretically it would already be possible to use the same data_id for multiple attachment_ids; there is code in place to update this reference counter and to only delete from xf_attachment_data (and the files) via cron when this counter reaches 0.

Though that would cause the same issues as hardlinked files, and this is probably one of the reasons why XenForo chose not to implement this (yet).
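You can see the counter in your own data; with the standard XF 2.2 schema, this lists any data rows already shared by more than one attachment (there may be none on a given install):

SQL:
SELECT data_id, attach_count, file_size
FROM xf_attachment_data
WHERE attach_count > 1
ORDER BY attach_count DESC
LIMIT 20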
 
I decided to use a SQL query to check how much space could be freed if these all used the same image file, and it's at least 7 GB for my forum. That's a lot of space.
While I entirely agree with you and others that XF could be modified to store only one copy of any given attachment (storing a little metadata for each use, so that permissions etc. work as they should), I'd be stunned if it was worthwhile on 99.9% of installations - i.e. I'd expect it to affect a tiny fraction of the total stored data.

On that note, I'm curious to know what fraction of your total attachment data that 7 GB represented @Feanor ?
 
Using hardlinks is one option (if the local driver is used as the storage backend), but it's not without risk of application-level data corruption or data loss:
If there is code that modifies an existing attachment on disk without also changing the hash (this is not the case with standard XenForo, but there are add-ons that do this), the hardlinked file would be overwritten and all copies would be affected by the change (instead of just one, as would be the case without hardlinks).
Furthermore, this could cause all hardlinked files to have the same metadata (permissions, owner, timestamps).

Not necessarily.

Table xf_attachment_data already has a reference counter (attach_count), so theoretically it would already be possible to use the same data_id for multiple attachment_ids.

Though that would cause the same issues as hardlinked files, and this is probably one of the reasons why XenForo chose not to implement this (yet).

Xenforo does it the way they do because they based XF on VB's intellectual property. When I inspected the database for the first time, I was not the least bit surprised that VB filed a lawsuit. The database schema was (and still is) a near clone of VB 4.x, with some table and column names renamed. The actual differences are minor and few. VB 4.x supported multiple attachment records pointing to the same data record, and if you uploaded a duplicate, it would only store one copy.

Xenforo already supports multiple xf_attachment records pointing to a single xf_attachment_data record in existing data. :-)
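You can confirm that on any install; with the standard schema, counting xf_attachment rows per data_id shows the many-to-one mapping directly (and cross-checks the attach_count counter):

SQL:
SELECT a.data_id, COUNT(a.attachment_id) AS uses, d.attach_count
FROM xf_attachment a
JOIN xf_attachment_data d ON d.data_id = a.data_id
GROUP BY a.data_id, d.attach_count
HAVING COUNT(a.attachment_id) > 1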

I acknowledge there might be poorly coded add-ons that could later change an image in a way that would alter the other hardlinked files when it shouldn't. But in actual practice I have not encountered a problem in the 10+ years I've used hardlinks for deduplication.

  • The ownership is going to be the same, since they are in the same directory tree.
  • Xenforo doesn't care about the attachment file timestamps; it stores Unix timestamps in the database.
  • Add-ons that optimize a file when it is uploaded work with this, since new uploads are not hardlinked to existing files.
  • If an add-on that changes images at a later time is coded properly, it's not going to overwrite the shared file when it changes the file content, since it would:
a) delete the existing attachment file, which only removes that one link (the other hardlinked copies are untouched); and
b) create a new file to match the new hash.

I still re-run this technique every few months on my big boards, and have used it on attachment file stores nearing 3TB in size. :-)
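If you want to check how much a run actually linked up, files with more than one hardlink are easy to spot with GNU find:

Code:
# count attachment files currently sharing their data blocks via hardlinks
find internal_data/attachments -type f -links +1 | wc -l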
 
Not necessarily.

Table xf_attachment_data already has a reference counter (attach_count), so theoretically it would already be possible to use the same data_id for multiple attachment_ids; there is code in place to update this reference counter and to only delete from xf_attachment_data (and the files) via cron when this counter reaches 0.

Though that would cause the same issues as hardlinked files, and this is probably one of the reasons why XenForo chose not to implement this (yet).
Oh ya, I forgot the data table was decoupled. It probably would be fairly easy now that I think about it. The guts are already there, and even deleting attachments doesn't delete the record from the data table. It even already has an MD5 hash of the contents, to do a quick check of whether something incoming is already in the system.
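That incoming check could be as simple as this sketch (column names are from the standard schema; the literal hash and size stand in for the new upload's values):

SQL:
SELECT data_id
FROM xf_attachment_data
WHERE file_hash = '0123456789abcdef0123456789abcdef' -- MD5 of the incoming file (example value)
  AND file_size = 123456                             -- also compare size to guard against collisions
LIMIT 1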
 