Discussion in 'Resolved Bug Reports' started by Lin, Aug 13, 2010.
Even though the forum uses UTF-8 encoding, I still cannot search for Chinese text.
These are limitations of MySQL full-text search and its approach to tokenizing words. CJK (Chinese, Japanese, Korean) searching is a challenging thing for any Western language-based search system to support.
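To make the tokenizing point concrete, here is a minimal Python sketch. It is not MySQL's actual parser, just an illustration of any tokenizer that finds words by splitting on spaces and punctuation; the function names are made up for this example. Chinese text has no delimiters between words, so a whole sentence comes out as a single "word" and a whole-word lookup for a term inside it finds nothing.

```python
import re


def delimiter_tokens(text):
    """Split text into tokens wherever a non-word character (space, punctuation) occurs."""
    return re.findall(r"\w+", text)


def whole_word_match(query, text):
    """Return True if the query equals one of the delimiter-based tokens."""
    return query in delimiter_tokens(text)


english = "search for Chinese posts"
chinese = "我想搜索中文帖子"  # "I want to search for Chinese posts"

print(delimiter_tokens(english))              # ['search', 'for', 'Chinese', 'posts']
print(delimiter_tokens(chinese))              # ['我想搜索中文帖子']
print(whole_word_match("Chinese", english))   # True
print(whole_word_match("中文", chinese))       # False
```

The last line is the failure reported in this thread: "中文" is clearly in the post, but it never becomes a token of its own.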
I hope you can solve this problem, so that we can use XenForo just by translating the language file.
CJK searching has long been a big problem for Chinese, Japanese, and Korean admins using VBB or IPB. I hope XenForo can work this out.
As far as I know, China would be a big potential market for forum platforms. Most admins there would be willing to pay for a license if the forum supported CJK well.
Not really, no... Sadly, only a small handful would be willing to pay; the vast majority would rather stay with Discuz or pirate... Though, after the official version of XenForo gets released, if it still doesn't work then, I'll see what hacks I can try to devise in my spare time...
People all over the world use pirated software, but official technical support is something pirates cannot get; this is the reason I am willing to pay.
If the search function cannot be used properly, then XenForo loses a customer.
Yes, I need the Chinese search support too.
The whole 'lose a customer' or 'miss out on a market' idea needs to go, really. The XenForo team and I have seen it first hand with vB China. It's a lost cause trying to enter those markets with commercial discussion forum packages, because of the well-rooted market for Discuz and the general practice of piracy.
Yes, a few potential customers are lost, and yes, it is really sad that those who are willing to pay probably will not get everything they want. But regardless of whether or not search actually works, the tiny fraction of potential customers in that market simply does not justify the time, effort, and cost involved in having a presence in that market.
As much as I hate to put it so bluntly, XenForo's team will most likely benefit more by not worrying about CJK searches and instead focusing their effort on addressing issues brought up by English users, where they know the majority of their customers are.
But, that said, the XF team may tread those dangerous waters again should they choose to do so. That's entirely up to them to decide.
Hmm, I don't think it's "cannot support Chinese" (不能支持中文) but
So it just means it would take more time, as CJK is not native to the programmers and requires a lot of effort to figure out (meanwhile, the software is still in alpha).
Hmm, a good point. My main question is: if Discuz can get it to work, would it also be possible (and manageable) for the XF team to support CJK?
I would think so, as I see:
I remember what a hassle it was trying to get support for searching in VB, trying to use iconv on a large database, etc. It would really, REALLY be cool if Mike and Kier did somehow manage to nail this, as it seems it can be done [from the example of Discuz]. I'm just not sure if they'll see it as worth the effort.
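Please support Chinese search as soon as possible; this is where XenForo would be better than vbb and ipb. It would definitely be a highlight!!!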
It's 3am on a school night, so I shouldn't be replying, but @#$% fml. There are so many points to be made here, I don't even know where to start....
First: vBulletin 4's search was not done by one person, but by a team, however small it may have been. I was also involved, to some minuscule extent. Additionally, it was not done in a short time; it was done over weeks and weeks of development time. Certain development team members were stuck on that project for several weeks straight.
Also, the said search system has its problems. Content is not searchable until it is indexed by a scheduled task, the indexer uses a fair bit of resources, and result counts are limited; those are just a few problems I can remember. I don't have the luxury of going into details, but I think most can rest easy just taking my word for it.
Next: implementations of CJK search. In vBulletin China's modified distribution, it goes back to at least 3.5; although not "official", I'm pretty sure the 3.0.x series (I know I translated the hack and posted it on vb.org) and the 2.x series were also covered to some extent. This was achieved by taking every single word in posts and bundling them together into 2-CJK-character indexes. This method, while it may work, is limited by MySQL's index size limitation. As a result, busy forums and long posts often encountered errors or simply didn't work properly. My 2-minute scan of Discuz's code also suggests a similar implementation, but I'll need to read further when I have time to find out for sure.
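For anyone wondering what a 2-character index looks like, here is a rough Python sketch of one reading of it: overlapping pairs of CJK characters (bigrams) used as search terms. This is an assumption about the general technique, not the vB China or Discuz code; the names, the in-memory index, and the character range (only the main CJK Unified Ideographs block) are all made up to keep the example short.

```python
import re
from collections import defaultdict

# Assumption: only the main CJK Unified Ideographs block counts as CJK here.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(text):
    """Return overlapping 2-character terms for each run of CJK characters."""
    terms = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            terms.append(run)  # a lone character is indexed as-is
        terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


def build_index(posts):
    """Map each bigram to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for term in cjk_bigrams(text):
            index[term].add(post_id)
    return index


def search(index, query):
    """Return ids of posts that contain every bigram of the query."""
    terms = cjk_bigrams(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


posts = {1: "我想搜索中文帖子", 2: "论坛支持中文", 3: "hello world"}
index = build_index(posts)
print(search(index, "中文"))      # {1, 2}
print(search(index, "搜索中文"))  # {1}
```

Two things to notice: an 8-character post already produces 7 index terms, which is why long posts and busy forums make the index grow so quickly, and matching on "all bigrams present" can return posts where the pairs appear but not next to each other, so a real implementation would need a verification pass.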
Adding to the CJK search problem, this only works if we know the character encoding used on the forum (read: UTF-8). Neither vB (the Jelsoft version anyway) nor IPB has strictly forced everyone to use Unicode UTF-8. There are many valid reasons for this; I had several long posts on vb.com's forum about these issues. But what this does mean is that by the time you import to XenForo, you'll probably have to discard some content because the converter cannot convert multiple source encodings (e.g. a Chinese forum with BIG5 and GBK for Traditional and Simplified Chinese sub-forums). If we were to make some sort of indexer that indexes CJK text for search, it would only really work well for a freshly installed XenForo with no content, or if you're lucky and already have your source all sorted out in UTF-8.
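To show why a mixed-encoding import is painful, here is a small Python sketch. It assumes you somehow know which encoding each sub-forum used; the forum ids and the mapping are hypothetical, and this is not how any real XenForo importer works. The conversion itself is easy once the source encoding is known. The hard part is that the bytes don't tell you which one it is.

```python
# Hypothetical mapping of sub-forum id to the legacy encoding its posts were stored in.
FORUM_ENCODINGS = {10: "big5", 20: "gbk"}


def to_utf8(raw, source_encoding):
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_post(forum_id, raw):
    """Convert one post's raw bytes using the encoding recorded for its sub-forum."""
    encoding = FORUM_ENCODINGS.get(forum_id)
    if encoding is None:
        raise ValueError(f"no known encoding for forum {forum_id}")
    return to_utf8(raw, encoding)


traditional = "論壇".encode("big5")  # "forum" in Traditional Chinese, stored as BIG5
simplified = "论坛".encode("gbk")    # "forum" in Simplified Chinese, stored as GBK

print(convert_post(10, traditional).decode("utf-8"))  # 論壇
print(convert_post(20, simplified).decode("utf-8"))   # 论坛
```

If the GBK bytes are decoded as BIG5 instead, you either get an error or different characters entirely, which is why a converter that assumes a single source encoding ends up discarding or corrupting part of the content.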
Database configuration is also a big issue. I don't remember the variable name, but long story short, we have several stages where things can go wrong:
- HTML's character encoding
- MySQL's connection encoding
- MySQL database's charset setting
- MySQL database's collation
Several combinations thereof can work together and present what the end user would call "Chinese". But they all mean different things, and would require different handling. I still recall that changing one variable in vBulletin's config.php could cause your database to spew out garbage and blank pages... And changing the said variable at the wrong time, or attempting to make a backup inappropriately, can result in full, irreversible data loss.
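The garbage and the data loss can both be simulated in a few lines of Python. This is only a sketch of what happens when two layers disagree about the encoding, done with Python's own codecs; it does not talk to MySQL, and latin-1 is just picked here as a typical mismatched charset.

```python
original = "中文"

# The page or client sends the text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Mismatch 1: another layer reads those same bytes as latin-1.
# Every byte maps to some character, so nothing fails; it just looks like garbage.
garbled = utf8_bytes.decode("latin-1")
print(len(original), len(garbled))  # 2 6

# As long as the bytes themselves were preserved, the damage can be undone.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered == original)  # True

# Mismatch 2: the text is converted into a charset that cannot represent it.
# Each character is replaced by "?", and the original is gone for good.
lossy = original.encode("latin-1", errors="replace")
print(lossy)  # b'??'
```

The first case is the "looks wrong but is still there" situation; the second is the one where a badly timed setting change or a backup made with the wrong charset leaves nothing to recover.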
Oh, also, for the record, having CJK search in vB China's modified distro did not particularly help penetrate the Chinese market. Interestingly, contrary to the popular voice, people frankly don't care. They have their Discuz; they're happy and not planning to change. They have their pirate version running with our modified code and have no intention of purchasing a license.
Anyways, 4am now... I've spent about 1 hour writing and deleting... I don't even know if this makes any sense. I'm just gonna hit post reply, get burned for any mistakes, and check over again with a clear mind. so much for waking up at 6 to go in early and work out thesis stuff with my prof...
Such a headache indeed! I for one am glad I don't have to deal with this - but indeed, having such support would be another positive tick for XenForo.
Andy, thanks so much for providing all of the input (at the cost of your sleep, grades, and possible future plans to ever own your own house/car/clothing)~
It seems that implementing search isn't just a drag-and-drop procedure (if only it were that easy), but a painstaking task that can also be resource intensive and prone to bugs. While it would be cool if the work were put in, with the limitations and the number of hoops to jump through, I can see where they would have more critical issues to deal with--even just making sure XF is up to their quality for shipping out the door.
Hopefully, perhaps when XF is released, a community of those of us interested in CJK XF could try to work around the limitations of MySQL and even release something upstream. But that's a big hope, haha.
Thanks again, Andy
Andy, thanks for the input. What kind of steps are there to take, if any, against the seemingly rampant piracy of licensed software in Asia?
So you've come over here too, Andy Huang.
If XenForo can provide an API or something similar, then its CJK fans may be able to do something about this themselves. Let's see what happens then...
XenForo should support CJK searching.