Hashed or Tokenised Data

BIG LLC

Active member
A feature that provides extra variables containing hashed, tokenised, or partially masked versions of user fields would be great.

We already salt and hash passwords. But information that will or might become public, or that is simply shared with a third party like Google Analytics or an advertising platform, could be shared with more regard for GDPR and other legislation if it were salted and hashed, or otherwise made more secure.

Nothing is perfect, but extra security options to abide by platform rules and laws around the world would be a good direction to take.
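
To make the suggestion a bit more concrete, here is a rough Python sketch of what hashed and partially masked versions of an email field could look like. Everything in it (the SALT value and the hashed and masked_email helpers) is made up for illustration; it is not XenForo code or any existing API.

```python
import hashlib

# Hypothetical salt; a real one would be a long random secret kept out of the database.
SALT = b"example-forum-salt"


def hashed(value: str) -> str:
    """Salted SHA-256 of a field, normalised to lower case first (an assumption)."""
    return hashlib.sha256(SALT + value.strip().lower().encode("utf-8")).hexdigest()


def masked_email(email: str) -> str:
    """Keep the first character of the local-part and the whole domain."""
    local_part, at, domain = email.rpartition("@")
    if not at or not local_part:
        return "***"  # not a usable address, so show nothing of it
    return f"{local_part[0]}{'*' * (len(local_part) - 1)}@{domain}"


# The kind of extra variables a template or addon could be handed
# instead of the clear-text field.
email = "john@example.com"
safe_fields = {"email_hash": hashed(email), "email_masked": masked_email(email)}
```

A third-party script would then only ever be given safe_fields, never the address itself.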
 
TL;DR: Tokenise the local-part and the domain of a user email separately, so mods see the tokens rather than the clear text and can still use the email address to identify sock puppets, spammers, and banned users re-registering.


Spitballing on this:

YES! I do have too much time to waste today :D

Comparing email addresses is a useful tool for hunting spammers and sock puppet accounts.

But for user privacy, most forum owners would turn off any display of email addresses to all other staff accounts. (If you haven't, fellow forum people - go and do it now!)

To keep the usefulness of email addresses in manually comparing suspicious accounts, easy to remember tokens might make sense.

To do this you could salt and SHA-256 hash the email address local-part separately from the domain, and make an easy to read token of each hash.

So, while John.Smiths.Fake.Email@protonmail.com (plus a salt) might hash to:

551eff8d9d5d557535d124c21a48f0c6633e39e5a0d0eb8d2387535c094166a0

John.Smiths.Fake.Email might salt and hash to:

75c19d54bf908e93695136aed98cde098ef8cd64bd714f81fbbf84c7bd990ce4

Protonmail.com would then salt and hash to:

8b23c042130fb8768209d717cbfade176f454477d603473a89573561762bc430

But 75c19d54bf908e93695136aed98cde098ef8cd64bd714f81fbbf84c7bd990ce4@8b23c042130fb8768209d717cbfade176f454477d603473a89573561762bc430 is really hard to read.

If you took the hash 75c19d54bf908e93695136aed98cde098ef8cd64bd714f81fbbf84c7bd990ce4 and randomly assigned it the token (for instance) SpeakTrenchPortal5017 and similarly tokenised 8b23c042130fb8768209d717cbfade176f454477d603473a89573561762bc430 to PlacementBeamRavage3333...

The email's tokenised two-part salted SHA-256 hash would look like:

SpeakTrenchPortal5017@PlacementBeamRavage3333
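
A minimal Python sketch of the steps above, with some assumptions: the salt, the tiny word list, and the function names are invented for the example, and the token is derived from the hash itself rather than randomly assigned and stored in a lookup table. Both parts are lower-cased before hashing so that Protonmail.com and protonmail.com end up with the same token.

```python
import hashlib

# Hypothetical salt; a real one would be a long random secret.
SALT = b"example-forum-salt"

# Tiny illustrative word list; a real one would hold thousands of words.
WORDS = ["Speak", "Trench", "Portal", "Placement", "Beam", "Ravage",
         "Copper", "Lantern", "Meadow", "Orbit", "Quartz", "Timber"]


def token_for(value: str) -> str:
    """Salt and SHA-256 hash one part of an address, then turn it into a readable token."""
    digest = hashlib.sha256(SALT + value.strip().lower().encode("utf-8")).hexdigest()
    # Read four 8-hex-digit chunks of the hash: three pick words, one picks the digits.
    picks = [int(digest[i:i + 8], 16) for i in range(0, 32, 8)]
    words = "".join(WORDS[p % len(WORDS)] for p in picks[:3])
    return f"{words}{picks[3] % 10000:04d}"


def tokenise_email(email: str) -> str:
    """Tokenise the local-part and the domain separately and rejoin them with @."""
    local_part, at, domain = email.rpartition("@")
    if not at or not local_part or not domain:
        raise ValueError("expected an address of the form local-part@domain")
    return f"{token_for(local_part)}@{token_for(domain)}"


print(tokenise_email("John.Smiths.Fake.Email@protonmail.com"))
```

Deriving the token from the hash means no mapping table has to be stored, but it also means two different hashes can land on the same token if the word list is small.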

Three words and four digits gives about 10^19 possible tokens for either side of the @ if you pick from 100,000 English words (100,000 x 100,000 x 100,000 x 10,000), and there are more words than that. Even a short list of 1,000 easy words gives about 10 trillion. Move the numbers around and you can multiply that range.

That's far more than any forum would need.

Why do this?

If a bunch of indicators pointed to that account being a sock for another - any mod could easily see that they were both also on the same domain (PlacementBeamRavage3333) or had the same local part (SpeakTrenchPortal5017).

A forum admin could even make available to all staff a list of the tokens that represent fake email domains to reference.
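
As a rough illustration of that comparison, here is a small Python sketch that groups accounts by shared tokens and checks them against a staff list of flagged domain tokens. The usernames, tokens, and the flagged list are all invented sample data, not output from any real tool.

```python
from collections import defaultdict

# Hypothetical data: (username, tokenised email) pairs as a mod tool might show them.
accounts = [
    ("alice", "SpeakTrenchPortal5017@PlacementBeamRavage3333"),
    ("bob", "CopperLanternMeadow0042@PlacementBeamRavage3333"),
    ("carol", "SpeakTrenchPortal5017@OrbitQuartzTimber9001"),
    ("dave", "MeadowOrbitCopper7777@QuartzTimberLantern1234"),
]

# Hypothetical admin-maintained list of tokens known to stand for fake email domains.
flagged_domain_tokens = {"PlacementBeamRavage3333"}


def only_shared(groups):
    """Keep only the tokens that more than one account has in common."""
    return {token: users for token, users in groups.items() if len(users) > 1}


def shared_tokens(accounts):
    """Group usernames by local-part token and by domain token."""
    by_local, by_domain = defaultdict(list), defaultdict(list)
    for username, tokenised in accounts:
        local_token, _, domain_token = tokenised.partition("@")
        by_local[local_token].append(username)
        by_domain[domain_token].append(username)
    return only_shared(by_local), only_shared(by_domain)


same_local, same_domain = shared_tokens(accounts)
on_flagged_domain = [user for user, tokenised in accounts
                     if tokenised.partition("@")[2] in flagged_domain_tokens]

print(same_local)         # accounts sharing a local-part token
print(same_domain)        # accounts sharing a domain token
print(on_flagged_domain)  # accounts on a flagged domain token
```

None of this needs the clear-text address; the comparison works on the tokens alone.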

That is not a totally robust system. A determined mod could make their own translation table out of the popular email domains and local-parts (just register a new account with a protonmail.com address and go look for the tokens to pop up on the output of an addon like Tickets). Once they found it, they would know that PlacementBeamRavage3333 is always the token for protonmail.com.

But big deal, IMO.

A hacker with a leaked database might efficiently make their own as well - just check for posts of mods mentioning clear-text versions of known tokens. That sort of thing. But they would never get all the emails with this system, while right now they would get everything if they got access to the database with clear text emails.

A solution like this would mean mods would not have to be shown clear text email addresses to use them for sock puppet identification, and damage from a data leak would be minimal.

Now...

Tell me why this is overly complicated, with glaring holes in the logic :D
 
One-way functions such as hashing are considered destructive, and make the data unusable or at least very difficult to use. It also makes it completely useless to your staff, as no one is going to go through and compile a full cheat sheet for emails the way you used as an example; I say that as someone who literally makes cheat sheets and databases for everything, even for purchasing decisions.

Encryption wouldn't matter, as most of the time being hacked means they've gained access to your server (unless you're absolute garbage at securing your backups), which would (usually) give them the ability to decrypt the data.

I can honestly think of more reasons why not to do this, than I can think of the benefits as to why to do this, or at least why it isn't necessarily needed by XF (other than an expansion of custom fields). Sites that do need to highly secure their data are likely going to be using a custom solution to do so.
 
This is a tangent from the OP suggestion of core XenForo making available separate hashed or tokenised versions of variables for PII fields. I don't see any downsides to that as a cheap control against PII leakage.

I will add to the OP that another good idea would be to expose these hashed versions to and via addons as the default core XF setting.

For example, force the default to $email.hash instead of $email. Then badly written code or badly set privileges don't automatically result in PII leakage.

Insider data hoarders can't just grab all the emails they see, save them into a text file on their hard drive, and then lose them when their local machines are compromised.

The wrong person getting added to the Moderator user group won't automatically expose PII built up in the private mods boards. Same goes with a single bug that lets people elevate privileges from registered user to moderator. If that sort of thing happened, anyone could get ChatGPT to write a simple Python script using Beautiful Soup and other libraries to scrape threads in boards they shouldn't have access to.
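
Here is a hedged sketch of the default-to-hash idea in Python. The PiiField class, its reveal method, and the viewer_is_super_admin flag are hypothetical names for illustration only; XenForo itself is PHP and nothing here reflects its real template variables or permission system.

```python
import hashlib

# Hypothetical salt; a real one would be a long random secret.
SALT = b"example-forum-salt"


class PiiField:
    """Hypothetical wrapper: rendering it gives a salted hash, never the clear text."""

    def __init__(self, value: str):
        self._value = value

    @property
    def hash(self) -> str:
        return hashlib.sha256(SALT + self._value.lower().encode("utf-8")).hexdigest()

    def __str__(self) -> str:
        # Default rendering is the hash, so a careless template or addon
        # that just prints the field leaks nothing readable.
        return self.hash

    def __repr__(self) -> str:
        return f"PiiField(hash={self.hash[:12]}...)"

    def reveal(self, viewer_is_super_admin: bool) -> str:
        # Clear text has to be asked for explicitly, and only by a privileged viewer.
        if not viewer_is_super_admin:
            raise PermissionError("clear-text PII is restricted")
        return self._value


email = PiiField("user@example.com")
print(f"Registered with: {email}")  # prints the hash, not the address
```

The point of the design is that the lazy path (just printing the variable) is the safe one, and the clear text takes a deliberate extra step.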


Onwards into the weeds:


One-way functions such as hashing are considered destructive, and make the data unusable or at least very difficult to use.

This is the point. The suggestion of exposing hashed or tokenised personal data to addons, and moderators and admins, is to keep the clear text personal data out of the public-facing forum.

Put a fence around it and only allow privileged accounts to see it. Maybe even force an indicator on the public forum that says whether the PII is tokenised so the user can make their own choice about whether they want a forum mod to see their email address or IP addresses - though of course that can be faked unless you get more complicated with hashes of the code.

Just occurred to me that we already have file health check, so something like that could be used? Don't know.

This stems from what I guarantee you - without a shadow of a doubt - is a flow of PII from forums all over the world via well-meaning or malicious insiders. For our part, we lock our PII up, but that takes away the usefulness of extra comparison dimensions for a number of admin and moderator responsibilities. Hence the tangential suggestion of human-usable tokens.

It also makes it completely useless to your staff, as no one is going to go through and compile a full cheat sheet for emails the way you used as an example;

You'd be surprised what sort of spreadsheets some mods make to help them in the role. Many would absolutely note down the token for the major free email domains like gmail.com or hotmail.com, and a few of the common spam email domains like outlook.com and protonmail.com. They would discuss it on their private mods boards and in PMs. No doubt in my mind.

Encryption wouldn't matter, as most of the time being hacked means they've gained access to your server (unless you're absolute garbage at securing your backups), which would (usually) give them the ability to decrypt the data.

This is the more complicated bit - true tokenisation down to the database level would require a second database. This is an impractical control for most sites, that's true.

But to my mind the only people who should see clear text of things like email and date of birth should be the Super Admin and maybe trusted Admins if really needed.

I can honestly think of more reasons why not to do this, than I can think of the benefits as to why to do this, or at least why it isn't necessarily needed by XF (other than an expansion of custom fields). Sites that do need to highly secure their data are likely going to be using a custom solution to do so.

Big or small, the forum users individually are impacted in the same way by exposure of personal data. It should all be highly secured.

As that level of security is too expensive, a cheap alternative is swapping out personal information like date of birth and email address when that is useful to display through moderating tools.


On the topic of existing settings on forums, I have another suggestion - I'll add a new thread if I get time.
 