
Not planned utf8mb4_unicode_ci instead of utf8mb4_general_ci

#1
For v2.0, utf8mb4_unicode_ci would be a better collation for international boards. It handles searches much better when people leave out accents, as your users probably do.

We were originally going to use the utf8mb4_unicode_ci collation but this poses some problems when converting tables containing existing data.
In that case, I suggest using it just for thread titles and message contents. Converting other columns can cause problems, but converting those two should give you most of the advantages without the trouble. My forums have been using unicode_ci for the last 8 years or so.
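
A minimal sketch of what converting only those two columns might look like, written as a small Python helper that does nothing but build the SQL text. The table names, column names and column types used here (`xf_thread.title`, `xf_post.message`) are assumptions for illustration and should be checked against the real schema, because `MODIFY` replaces the whole column definition.

```python
import re

# Only plain identifiers are accepted, since they are pasted into SQL text.
_IDENT = re.compile(r"^[A-Za-z0-9_]+$")


def modify_collation_sql(table, column, column_type,
                         collation="utf8mb4_unicode_ci", nullable=False):
    """Build an ALTER TABLE statement that changes one column's collation.

    column_type must repeat the column's existing type (assumed by the
    caller). Any DEFAULT or other attributes would also have to be repeated;
    this sketch ignores them.
    """
    for name in (table, column, collation):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    null_sql = "NULL" if nullable else "NOT NULL"
    return (
        f"ALTER TABLE `{table}` MODIFY `{column}` {column_type} "
        f"CHARACTER SET utf8mb4 COLLATE {collation} {null_sql}"
    )


# Hypothetical targets: thread titles and message bodies.
STATEMENTS = [
    modify_collation_sql("xf_thread", "title", "VARCHAR(150)"),
    modify_collation_sql("xf_post", "message", "MEDIUMTEXT"),
]
```

The statements are only generated, not executed, so they can be reviewed and tried on a copy of the database first.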

Stack Overflow Google Search

Cheers!
 

Mike

XenForo developer
Staff member
#2
In terms of accents, unicode and general are not significantly different. They're both accent insensitive. There may be a few types of accents where they're different, but there are also numerous other places where the differences are unexpected. (As a random example, "™" and "tm" are treated as identical.)
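
To illustrate how visibly different strings can end up "identical" once accents, case and compatibility characters are folded away, here is a rough Python approximation. It is not MySQL's collation algorithm, and `loose_key` is a hypothetical helper name; it only mimics the general idea using Unicode normalization.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Rough stand-in for an accent- and case-insensitive comparison key."""
    # NFKD splits accented letters into base letter + combining marks and
    # expands compatibility characters (the trademark sign becomes "TM").
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so accents no longer matter.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() removes case distinctions.
    return stripped.casefold()


def loosely_equal(a: str, b: str) -> bool:
    """True when both strings fold to the same key."""
    return loose_key(a) == loose_key(b)
```

Under this approximation, `loosely_equal("™", "tm")` and `loosely_equal("Café", "cafe")` both hold, which is the kind of equality that is harmless for searching but awkward for a column that must stay unique.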

Converting just those areas won't really do much. The differences mostly come up with unique indexes and sorting. While you can order by thread titles, it's not particularly common. Neither column has a unique index, so those differences won't apply. Converting usernames is probably one of the most obvious places that could benefit, but it's also a very good example of a column that can't easily be converted, because of the additional things that are considered identical. Arguably, the other most beneficial change could be to the search index, to catch those additional accent insensitivities/equalities. That is an area that could probably be converted safely.

However, we've taken the view that we don't want to mix collations like that out of the box. If the benefits are worth it for you, then go for utf8mb4_unicode_ci across the board. We may revisit making that an option in the future, but at this time, we wanted to keep it limited to 2 collations with very similar behaviors being used out of the box (and for add-ons to consider).
 
#3
I forget exactly which language we ran into sorting or matching issues with under general_ci. Changing the usernames to unicode_ci didn't work when I tried it years ago. I had to leave them in general_ci because quite a few usernames "matched" under unicode_ci, whereas they were still unique under general_ci.
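
For anyone wanting to check this before converting, one possible approach is to ask MySQL which existing values would collide under the target collation. The sketch below only builds the query text; the table and column names (`xf_user`, `username`) are assumptions, and `collision_query` is a hypothetical helper.

```python
import re

# Only plain identifiers are accepted, since they are pasted into SQL text.
_IDENT = re.compile(r"^[A-Za-z0-9_]+$")


def collision_query(table="xf_user", column="username",
                    collation="utf8mb4_unicode_ci"):
    """Build a query listing values that compare equal under `collation`."""
    for name in (table, column, collation):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    # CONVERT ... USING utf8mb4 lets the COLLATE clause apply even if the
    # column is currently stored in a different character set.
    folded = f"CONVERT(`{column}` USING utf8mb4) COLLATE {collation}"
    return (
        f"SELECT GROUP_CONCAT(`{column}` SEPARATOR ', ') AS clashing, "
        f"COUNT(*) AS total "
        f"FROM `{table}` GROUP BY {folded} HAVING COUNT(*) > 1"
    )
```

Any rows the query returns are groups of names that are distinct today but would violate a unique index after the conversion.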

Thanks