[bd] Attachment Store [Deleted]

Hi xfrocks

I have a question... :)

My client is hosted on a shared host that also provides some of her other business services, so she cannot change hosts. We have installed XF and the RM, and we want to upload large PDF files of up to 80 MB into the RM. Uploading large files into the RM fails because of the low values of the upload_max_filesize and post_max_size PHP settings.

One solution we are considering is storing the Resource Manager's large PDF files on Amazon S3.
Now the question is: does this add-on still upload to the site's server and then copy the file to S3, or can the upload to the site be bypassed entirely?
Can this add-on upload the file directly to S3 from the RM? If we could upload directly to S3, we could possibly bypass the host's restrictions.

I hope my question is clear...

Regards
 
One enhancement I would like to see in this is the ability to add custom items to 'meta' when saving to S3, so that I can do things like add Cache-Control and Expires values.
 
Now the question is: does this add-on still upload to the site's server and then copy the file to S3, or can the upload to the site be bypassed entirely?
Can this add-on upload the file directly to S3 from the RM? If we could upload directly to S3, we could possibly bypass the host's restrictions.
Having looked at the code, the answer is no. While what you want could technically be coded, it is not straightforward: whatever upload mechanism you use would need to be issued IAM credentials for one-time use only. It really isn't practical. Dropbox, for example, runs off S3, but they don't allow uploading directly to S3 either.
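
To give an idea of what such a one-time credential involves, here's a rough sketch (illustrative only, not something this add-on does) of a signature v2 pre-signed PUT URL built in plain PHP. The bucket, key and lifetime are made up, and newer S3 regions require the v4 signing scheme instead.
Code:
// Illustrative only: a pre-signed PUT URL that lets a browser upload one
// specific object straight to S3 for a limited time (AWS signature v2).
$accessKey = 'AKIAEXAMPLE';
$secretKey = 'example-secret';
$bucket    = 'example-bucket';
$object    = 'resources/big-manual.pdf';
$expires   = time() + 900; // URL stays valid for 15 minutes

// String to sign: VERB \n Content-MD5 \n Content-Type \n Expires \n Resource
$stringToSign = "PUT\n\napplication/pdf\n{$expires}\n/{$bucket}/{$object}";
$signature = urlencode(base64_encode(
    hash_hmac('sha1', $stringToSign, $secretKey, true)
));

$url = "https://{$bucket}.s3.amazonaws.com/{$object}"
     . "?AWSAccessKeyId={$accessKey}&Expires={$expires}&Signature={$signature}";

// The client then PUTs the file to $url with "Content-Type: application/pdf";
// the forum server only hands out the URL and never receives the 80 MB file.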

If you are not uploading many files like this, then upload a small dummy file (with the same name and type) and replace it with the correct file directly via the AWS console, S3 Browser, or S3Fox.
 
Hi xfrocks

I have a question... :)

My client is hosted on a shared host that also provides some of her other business services, so she cannot change hosts. We have installed XF and the RM, and we want to upload large PDF files of up to 80 MB into the RM. Uploading large files into the RM fails because of the low values of the upload_max_filesize and post_max_size PHP settings.

One solution we are considering is storing the Resource Manager's large PDF files on Amazon S3.
Now the question is: does this add-on still upload to the site's server and then copy the file to S3, or can the upload to the site be bypassed entirely?
Can this add-on upload the file directly to S3 from the RM? If we could upload directly to S3, we could possibly bypass the host's restrictions.

I hope my question is clear...

Regards
It is not possible unfortunately, at least for now. Exactly as @Jim Boy said, you can work around it like that.

One enhancement I would like to see in this is the ability to add custom items to 'meta' when saving to S3, so that I can do things like add Cache-Control and Expires values.
That's a good idea. Maybe I will add a new option for S3 for that.
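
Roughly, it could look something like the sketch below, using the $meta argument of Zend_Service_Amazon_S3::putObject(). Nothing here is final; the header values and object name are just examples, not the add-on's current code.
Code:
// Hedged sketch, not the add-on's current code: extra headers supplied as
// "meta" when writing to S3, so the object is stored with caching hints.
$s3 = new Zend_Service_Amazon_S3('YOUR-ACCESS-KEY', 'YOUR-SECRET-KEY');

$fileContents = file_get_contents('/path/to/example.png');

$meta = array(
    'Content-Type'  => 'image/png',
    'Cache-Control' => 'public, max-age=31536000',                  // one year
    'Expires'       => gmdate('D, d M Y H:i:s', time() + 31536000) . ' GMT',
);

// Zend_Service_Amazon_S3 sends the $meta entries as request headers, and S3
// replays Cache-Control/Expires when the object is later served.
$s3->putObject('example-bucket/attachments/1/example.png', $fileContents, $meta);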
 
I was playing around with the add-on before deploying it. I noticed that using the tool to move attachments to S3 deletes them locally, which seems unusual. Why would I want to immediately delete them locally? What if something goes wrong and I want to revert without delay?

I also noticed that once files are uploaded to S3, they won't be uploaded again, even if they're restored locally and deleted from S3. This is concerning, as we'll likely be using a combination of local storage and S3 during the transition to prevent downtime. We'd like to synchronize twice, but with this design, we're fairly certain that won't do any good.

I'm considering migrating to local storage in /data/ first, which will be quick, then using s3cmd or a similar tool to synchronize with S3 twice while switching over (once before switching, once after switching). Is there any reason that wouldn't work? It seems like a more durable solution. On the same note, it would be convenient to be able to move data between storage options before switching where they are served from in order to avoid downtime--that is, with the built-in tool, instead of this hackish method.
 
I was playing around with the add-on before deploying it. I noticed that using the tool to move attachments to S3 deletes them locally, which seems unusual. Why would I want to immediately delete them locally? What if something goes wrong and I want to revert without delay?
Keep a backup? Personally, I didn't bother moving our large collection of existing attachments over; people don't tend to view the older stuff much. I've written my own script for migrating, but I haven't bothered to use it.
I also noticed that once files are uploaded to S3, they won't be uploaded again, even if they're restored locally and deleted from S3. This is concerning, as we'll likely be using a combination of local storage and S3 during the transition to prevent downtime. We'd like to synchronize twice, but with this design, we're fairly certain that won't do any good.
Why would you do that? S3 is rock solid and web servers are inherently ephemeral. If you are hosting on EC2, expect your web server to disappear at any time; if you aren't prepared for that, then you aren't using EC2 correctly.
I'm considering migrating to local storage in /data/ first, which will be quick, then using s3cmd or a similar tool to synchronize with S3 twice while switching over (once before switching, once after switching). Is there any reason that wouldn't work? It seems like a more durable solution. On the same note, it would be convenient to be able to move data between storage options before switching where they are served from in order to avoid downtime--that is, with the built-in tool, instead of this hackish method.
I don't see why you would have any downtime in relation to attachments. Just turn it on and it works: attachments that were local will continue to be served from the local server, and new files will be stored on and served from S3.
 
Keep a backup? Personally, I didn't bother moving our large collection of existing attachments over; people don't tend to view the older stuff much. I've written my own script for migrating, but I haven't bothered to use it.

...

Why would you do that? S3 is rock solid and web servers are inherently ephemeral. If you are hosting on EC2, expect your web server to disappear at any time; if you aren't prepared for that, then you aren't using EC2 correctly.

Our web servers autoscale, but they currently store attachments on NFS--admittedly not the most reliable solution due to lack of redundancy, but we had other priorities at the time. We'd like to keep the files in NFS until we're sure S3 is working well in production so that we can revert if necessary. The backup is only temporary. Of course, we'll be making intermittent backups from S3 as well: even S3 is not perfectly invulnerable, and even though the SLA is well in excess of five nines, we'd rather not take any chances.

I don't see why you would have any downtime in relation to attachments. Just turn it on and it works: attachments that were local will continue to be served from the local server, and new files will be stored on and served from S3.

In order to migrate the attachments with the built-in tool, you have to first change where your attachments are being served from. Between the time you switch the settings from default to S3 and the time migration completes, some attachments will be unavailable, with older attachments becoming available before more recent attachments. In my testing, the add-on refused to serve local files while in S3 mode; it would only use S3 and would not fall back to local storage.

I should note that this add-on's directory structure is different from XenForo's default. I didn't look too much into it, but it doesn't seem likely that you can map the URIs to local files in /internal_data/ with Nginx trickery.

Right now I'm using aufs to mount a bridge at /internal_data/; after the add-on "deletes" all of the attachments, I just remove the whiteout files (rm -f **/.wh.* in the writable directory). It's a bit cumbersome to configure, though.
 
In order to migrate the attachments with the built-in tool, you have to first change where your attachments are being served from. Between the time you switch the settings from default to S3 and the time migration completes, some attachments will be unavailable, with older attachments becoming available before more recent attachments. In my testing, the add-on refused to serve local files while in S3 mode; it would only use S3 and would not fall back to local storage.
Why bother migrating existing attachments? The add-on does serve existing local files if the system wasn't set up to use S3 at the time the attachment was uploaded. Attachments are flagged in the database as being on S3 or not; if they are not on S3, they get served locally. Switching on the add-on means existing attachments will continue to be served from the local server and any new attachments will be served from S3. Bake the old data into your AMI and you will be fine. When I switched to this arrangement we had 4 GB of attachments; not that that is a large amount, but it was significant enough, and we've had zero problems with this add-on on a very large installation. Chopping and changing is a really bad idea.
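
In other words, the switching works roughly like this (a sketch of the behaviour only, not the add-on's actual code; the helper function names are made up):
Code:
// Rough sketch of the per-attachment switching described above. Not the
// add-on's real code: the helper names below are illustrative only.
if (!empty($attachmentData['bdattachmentstore_engine'])) {
    // Uploaded after the add-on was enabled: serve (or redirect) from S3.
    serveFromS3($attachmentData);
} else {
    // Older attachment with no engine flag: serve from local internal_data,
    // exactly as stock XenForo would.
    serveFromLocalFile($attachmentData);
}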

That's why I've written my own script to migrate: it will copy the existing locally held data over into S3. Once it has copied everything over, I'll test on my test machine and, if happy, I'll make the database change to get the attachment add-on to use the S3-held data rather than the locally held data. But I'm really in no hurry to do that, as it works perfectly well now anyway.

Right now I'm using aufs to mount a bridge at /internal_data/; after the add-on "deletes" all of the attachments, I just remove the whiteout files (rm -f **/.wh.* in the writable directory). It's a bit cumbersome to configure, though.
That sounds like over-engineering and could potentially lead to other issues related to how XF uses data within the internal_data directory. I've found it best just to leave them alone on a per-web-server basis.

I hope you aren't doing anything like that with the external data directory; the simplest option there is to register an S3 stream.
 
Why bother migrating existing attachments? The add-on does serve existing local files if the system wasn't set up to use S3 at the time the attachment was uploaded. Attachments are flagged in the database as being on S3 or not; if they are not on S3, they get served locally. Switching on the add-on means existing attachments will continue to be served from the local server and any new attachments will be served from S3. Bake the old data into your AMI and you will be fine. When I switched to this arrangement we had 4 GB of attachments; not that that is a large amount, but it was significant enough, and we've had zero problems with this add-on on a very large installation. Chopping and changing is a really bad idea.

That's why I've written my own script to migrate: it will copy the existing locally held data over into S3. Once it has copied everything over, I'll test on my test machine and, if happy, I'll make the database change to get the attachment add-on to use the S3-held data rather than the locally held data. But I'm really in no hurry to do that, as it works perfectly well now anyway.

Alright, I didn't quite understand the role the database played in serving attachments; that is, I didn't know options were saved per-attachment. That explains the second cache rebuild tool--and why files weren't being served locally after I ran it.

Putting them on the AMI is definitely an enticing option, but I'd much rather have them all in one place. It'll make it much easier for anyone else looking at the infrastructure to understand. Now that I understand how the add-on works, it should be easy to migrate them from NFS without any downtime.

That sounds like over-engineering and could potentially lead to other issues related to how XF uses data within the internal_data directory. I've found it best just to leave them alone on a per-web-server basis.

I hope you aren't doing anything like that with the external data directory; the simplest option there is to register an S3 stream.

We've been using aufs for both directories on testing/development servers for about a year without any issues. For local development servers, we often use docker, which puts most of the system on aufs. For our shared development servers, we often update to snapshots of production's state; aufs makes this process much easier.

It's unlikely that we'd use aufs as a permanent solution on production (with the exception of docker and similar deployment mechanisms), but we've used it in the past while migrating data so we can easily revert to different stages if anything goes wrong. I've toyed around with using a low-level filesystem like btrfs instead, but it's a bit more heavyweight and cumbersome.

We will certainly either disable the local file deletion step in the add-on or use something like aufs during migration. We're not willing to risk losing data. Even if we take an EBS snapshot before the migration, we'd risk longer downtime in the event of failure and loss of any attachments uploaded during the migration process. (It's quite likely that additional attachments will be uploaded during the migration.)
 
It's unlikely that we'd use aufs as a permanent solution on production
I just don't know why you need any solution. I've used Gluster and the like in the past, but XenForo wasn't really designed to be distributed. I've found that there is no need for a shared internal_data directory as long as you use this add-on. If you do share the directory, it is just another thing that could fail, and if you are running multi-AZ as well, it just adds to your bill. We scale from one to as many as eight web servers in a day on the most punishing of XenForo sites, and I have never seen any issues at all related to each web server maintaining its own internal_data directory.
We're not willing to risk losing data.
You seem to be seriously underestimating the reliability of S3: it's eleven nines, not five as you earlier stated. Plus you can turn on versioning and, if you're really keen, run an s3cmd sync from a non-AWS box to store a copy in a third-party location. Add in a regular backup of your core software (e.g. add-ons) and a daily backup of the database and you'll be protected in the most catastrophic of circumstances. Not to mention appropriate use of IAM to guard against acts of stupidity.
 
I was playing around with the add-on before deploying it. I noticed that using the tool to move attachments to S3 deletes them locally, which seems unusual. Why would I want to immediately delete them locally? What if something goes wrong and I want to revert without delay?

I also noticed that once files are uploaded to S3, they won't be uploaded again, even if they're restored locally and deleted from S3. This is concerning, as we'll likely be using a combination of local storage and S3 during the transition to prevent downtime. We'd like to synchronize twice, but with this design, we're fairly certain that won't do any good.

I'm considering migrating to local storage in /data/ first, which will be quick, then using s3cmd or a similar tool to synchronize with S3 twice while switching over (once before switching, once after switching). Is there any reason that wouldn't work? It seems like a more durable solution. On the same note, it would be convenient to be able to move data between storage options before switching where they are served from in order to avoid downtime--that is, with the built-in tool, instead of this hackish method.
There is an option called "keep local copy" that will do what you want. It basically keeps a copy in the default XenForo internal_data directory, so you can disable the add-on at any time and files will still be served without disruption.

And yes, files won't be tracked on the file system; they are tracked in the database only. This is done to make sure performance does not suffer when there are a lot of files.

You don't need to migrate that way. You can enable the local copy option and start using the tool to upload files to Amazon. FYI, you may need double the hard drive space during the run.
 
Alright, deployed successfully after tweaking. Ran into a few problems:

The add-on is written such that images uploaded to S3 do not have Content-Disposition headers. However, when migrating data, each image takes on the Content-Disposition header of the last non-image in the batch. Images uploaded before any non-images correctly lack a Content-Disposition header. For example, if a batch uploads alpha.png, beta.txt, and gamma.png, then alpha.png and beta.txt will download as expected, but gamma.png will download with the filename beta.txt. This is actually a bug in Zend, and there is a comment in the Zend source code acknowledging it (Zend/Service/Amazon/S3.php, lines 609-610). However, the acknowledgment is mistaken: the bug is not in Zend_Http_Client, but at the location of the comment deferring blame. Zend_Http_Client has a resetParameters method that must be called between requests; for it to reset headers, true must be passed as the first parameter, but Zend_Service_Amazon_S3 omits that argument. A simple fix is to extend Zend_Service_Amazon_S3, override _makeRequest, and call self::getHttpClient()->resetParameters(true) prior to calling parent::_makeRequest(...). Example:
Code:
class bdAttachmentStore_Zend_Service_Amazon_S3 extends Zend_Service_Amazon_S3
{
    public function _makeRequest($method, $path = '', $params = null, $headers = array(), $data = null)
    {
        if (isset(self::$_httpClient)) {  // Don't bother if we're creating a new client
            self::getHttpClient()->resetParameters(true);
        }

        return parent::_makeRequest($method, $path, $params, $headers, $data);
    }
}
Note that this is also a potential issue with bdDataStorage, since bdAttachmentStore and bdDataStorage combined can result in multiple S3 uploads using the same Zend_Http_Client instance. The same fix can be applied there as well.


The migration tool has an SQL statement in bdAttachmentStore_CacheRebuilder_AttachmentData::rebuild that would benefit from an additional WHERE condition. Currently, the SELECT includes rows that are already on the current engine. I've added the following as a quick hack:
Code:
AND (
    attachment_data.bdattachmentstore_engine NOT LIKE ?
    ' . (empty($defaultEngine) ? 'AND attachment_data.bdattachmentstore_engine IS NOT NULL' : 'OR attachment_data.bdattachmentstore_engine IS NULL') . '
)

...

array(
    $position,
    empty($defaultEngine) ? '' : $defaultEngine,
    $options['batch']
)

$defaultEngine needs to be declared a bit earlier for this to work. This could probably be improved by excluding attachments with an attach_count of 0 (attachments that have already been migrated, but have yet to be cleaned up).
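
Something like this, assuming the query keeps being assembled as a plain string the way the hack above is ($whereSql is a made-up name for that string):
Code:
// Hypothetical addition: also skip attachment data rows with attach_count = 0,
// i.e. rows that are only waiting to be cleaned up. $whereSql stands in for
// whatever SQL string the rebuild method is actually building.
$whereSql .= ' AND attachment_data.attach_count > 0';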

Also: both bdDataStorage and bdAttachmentStore appear to be saving attachment thumbnails. Is there an easy way to disable this for bdDataStorage? I haven't looked at that code as much.
 
I have some beginner questions here.

1. Does it make sense to use this add-on for a forum from the start (so no members, no attachments at this moment)? Or should someone wait until they have GBs of attachments?
2. What is this CDN you are talking about? Some enable it, some don't. What do I need it for?
3. Let's say a forum has 30 GB of attachments and 5,000 users. Roughly how much would Amazon S3 charge?
4. Can we use this to save the attachments on our own personal computer, which is not online 24/7?
 
I have some beginner questions here.

1. Does it make sense to use this add-on for a forum from the start (so no members, no attachments at this moment)? Or should someone wait until they have GBs of attachments?
Do it from the start. Apart from anything else, it means that should anything happen to your server, the attachments will be safe; you'll only have to back up the core code and the database. Any good architecture for a web application stores static assets in a location that is accessible from multiple servers, much like the database is. I'd argue that XF's biggest design fault is that it doesn't do this out of the box.
2. What is this CDN you are talking about? Some enable it, some don't. What do I need it for?
A CDN means data is served to users from a location closer to them; for some sites it has advantages, for others it doesn't. It can be enabled later without major headaches.
3. Let's say a forum has 30 GB of attachments and 5,000 users. Roughly how much would Amazon S3 charge?
It depends a bit on how much gets downloaded; let's say 50 GB of transfer, about $5 a month.
4. Can we use this to save the attachments on our own personal computer, which is not online 24/7?
I don't believe so; it's not a backup tool.
 