
Service issue Can not search Chinese

Status
Not open for further replies.

Mike

XenForo developer
Staff member
#2
These are limitations of MySQL full-text search and its approach to tokenizing words. CJK (Chinese, Japanese, Korean) searching is a challenging thing for any Western language-based search system to support.
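As an illustration only (a minimal Python sketch, not MySQL's actual parser), the tokenizing problem can be shown with a splitter that finds words by whitespace and punctuation. The function name `delimiter_tokens` and the sample strings are assumptions made up for this example.

```python
import re

def delimiter_tokens(text):
    """Split on whitespace and common punctuation, roughly the way a
    delimiter-based full-text parser decides where words begin and end."""
    return [t for t in re.split(r"[\s,.;:!?,。!?、]+", text) if t]

english = "Can not search Chinese"
chinese = "不能支持中文搜索"

# English has spaces between words, so each word becomes its own token.
print(delimiter_tokens(english))

# The Chinese sentence has no delimiters, so it comes out as one long token
# and a search for the word "中文" inside it finds no matching token.
print(delimiter_tokens(chinese))
print("中文" in delimiter_tokens(chinese))
```

Because the whole sentence is indexed as a single "word", a query for any word inside it has nothing to match against.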
 

bookmark

Well-known member
#4
These are limitations of MySQL full-text search and its approach to tokenizing words. CJK (Chinese, Japanese, Korean) searching is a challenging thing for any Western language-based search system to support.
CJK searching has already been a big problem for Chinese, Japanese, and Korean admins using VBB or IPB. I hope XenForo can work this out.
As far as I know, China would be a big potential market for a forum platform. Most admins there would be willing to pay for a license if the forum supported CJK well.
 

Andy Huang

Well-known member
#5
Not really, no... Sadly, only a small handful would be willing to pay; the vast majority would rather stay with Discuz or pirate... Though, after the official version of XenForo gets released, if it still doesn't work then, I'll see what hacks I can try to devise in my spare time...
 
#6
People all over the world use pirated copies, but then they can't get official technical support, and that is the reason I am willing to pay.

If the search function can't be used properly, then XenForo loses a customer.
 

Andy Huang

Well-known member
#8
The whole 'lose a customer' or 'miss out on a market' idea needs to go, really. The XenForo team and I have seen it first hand with vB China. It's a lost cause trying to enter those markets with commercial discussion forum packages because of the well-rooted market for Discuz and the general practice of piracy.

Yes, there are a few potential customers lost, and yes, it is really sad that those who are willing to pay probably will not get everything they want. But regardless of whether or not search actually works, the tiny fraction of potential customers in that market simply does not justify the time, effort, and cost involved in having a presence in that market.

As much as I hate to put it so bluntly, XenForo's team will most likely benefit more by not worrying about CJK searches and instead focusing their effort on addressing issues brought up by English users, where they know the majority of their customers are.

But, that said, the XF team may tread those dangerous waters again should they choose to do so. That's entirely up to them to decide.
 
#9
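It can't support Chinese? As far as I know, the Chinese search problem in vBB 4 was solved by a single person in a very short time. It isn't as polished as English search, but it is fully adequate for practical use.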
 

chousho

Well-known member
#10
It can't support Chinese? As far as I know, the Chinese search problem in vBB 4 was solved by a single person in a very short time. It isn't as polished as English search, but it is fully adequate for practical use.
Hmm, I don't think it's a matter of "can't support Chinese", but rather:
These are limitations of MySQL full-text search and its approach to tokenizing words. CJK (Chinese, Japanese, Korean) searching is a challenging thing for any Western language-based search system to support.
So it just means it would take more time to figure out, as CJK is not native to the programmers (meanwhile, the software is still in alpha).


It's a lost cause trying to enter those markets with commercial discussion forum packages because of the well-rooted market for Discuz...
Hmm, a good point. My main question is: if Discuz can get it to work, would it still be possible (and manageable) for the XF team to support CJK?
I would think so, as I see:
Discuz! forum (BBS) is one built with PHP and MySQL...
I remember what a hassle it was trying to get search working in VB, trying to use iconv on a large database, etc. It would really, REALLY be cool if Mike and Kier did somehow manage to nail this, as it seems it can be done [from the example of Discuz]. I'm just not sure if they'll see it as worth the effort :p
 

Andy Huang

Well-known member
#12
It's 3am on a school night, so I shouldn't be replying, but @#$% fml. There are so many points to be made here, I don't even know where to start....

First; vBulletin 4's search was not done by one person, but by a team. However small my part may have been, I was also involved to some minuscule extent. Additionally, it was not done in a short time; it was done over weeks and weeks of development time. Certain development team members were stuck on that project for several weeks straight.

Also, the said search system has its problems. Content is not searchable until it has been indexed by a scheduled task; the indexer uses a fair bit of resources; and result counts are limited. Those are just a few problems I can remember. I don't have the luxury of going into details, but I think most can rest easy just taking my word for it.

Next; implementations of CJK search. In vBulletin China's modified distribution, it goes back to at least 3.5; although not "official", I'm pretty sure 3.0.x (I know I've translated the hack and posted it on vb.org) and the 2.x series were also covered to some extent. This was achieved by taking every single word in posts and bundling them into two-character CJK index entries. This method, while it may work, is limited by MySQL's index size limitation. As a result, busy forums and long posts often encountered errors or simply didn't work properly. My 2-minute scan of Discuz's code also suggests a similar implementation, but I'll need to read further when I have time to find out for sure.
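As an illustration of the two-character indexing idea (a minimal Python sketch, not the actual vBulletin China or Discuz code), each run of CJK characters can be cut into overlapping pairs. The helper names `is_cjk`, `cjk_bigrams`, and `matches` are hypothetical, and `is_cjk` only checks the main CJK Unified Ideographs block, which is a simplification.

```python
def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block (simplified check)."""
    return "\u4e00" <= ch <= "\u9fff"

def cjk_bigrams(text):
    """Return overlapping two-character tokens for every run of CJK characters."""
    tokens = []
    run = []
    for ch in text + " ":  # trailing space flushes the final run
        if is_cjk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            tokens.append(run[0])  # a lone character is kept as-is
        tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run = []
    return tokens

def matches(query, document):
    """A document matches when it contains every bigram of the query."""
    indexed = set(cjk_bigrams(document))
    return all(token in indexed for token in cjk_bigrams(query))

print(cjk_bigrams("中文搜索"))
print(matches("中文", "不能支持中文搜索"))
```

A run of n characters produces n - 1 index entries, so the index grows almost as fast as the text itself, which fits the size problems on busy forums and long posts described above.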

On top of the CJK search problem, this only works if we know the character encoding used on the forum (read: UTF-8). Neither vB (the Jelsoft version anyway) nor IPB has strictly forced everyone to use UTF-8. There are many valid reasons for this; I had several long posts on vb.com's forum about these issues. But what this does mean is that by the time you import to XenForo, you'll probably have to discard some content because the converter cannot convert multiple source encodings (e.g., a Chinese forum with BIG5 and GBK for Traditional and Simplified Chinese sub-forums). If we were to make some sort of indexer that indexes CJK text for search, it would only really work well for a freshly installed XenForo with no content, or if you're lucky and already have your source all sorted out in UTF-8.
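As an illustration of why the source encoding has to be known (a minimal Python sketch, not any real importer's code), the same bytes only convert cleanly when decoded with the encoding they were written in. The function name `to_utf8` and the sample strings are assumptions for this example.

```python
def to_utf8(raw, source_encoding):
    """Convert bytes from a known source encoding to UTF-8 bytes."""
    return raw.decode(source_encoding).encode("utf-8")

# Two sub-forums storing text in different legacy encodings.
simplified_raw = "中文搜索".encode("gbk")
traditional_raw = "中文搜尋".encode("big5")

# With the right encoding per sub-forum, both convert cleanly.
print(to_utf8(simplified_raw, "gbk").decode("utf-8"))
print(to_utf8(traditional_raw, "big5").decode("utf-8"))

# Guessing one encoding for the whole forum goes wrong: the GBK bytes either
# fail to decode as BIG5 or decode into different characters.
try:
    wrong_guess = simplified_raw.decode("big5")
except UnicodeDecodeError:
    wrong_guess = None
print(wrong_guess)
```

A converter that assumes a single source encoding for the whole database hits the last case on every post from the other sub-forum.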

Database configuration is also a big issue. I don't remember the variable name, but long story short, there are several stages where things can go wrong:
- HTML's character encoding
- MySQL's connection encoding
- MySQL database's charset setting
- MySQL database's collation

Several combinations thereof can work together and present what the end user would call "Chinese". But they all mean different things, and would require different handling. I still recall that changing one variable in vBulletin's config.php could cause your database to spew out garbage and blank pages... And changing the said variable at the wrong time, or attempting to make a backup inappropriately, can result in full, irreversible data loss.
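As an illustration of how a mismatch between these layers produces garbage and then data loss (a minimal Python sketch, not MySQL's or vBulletin's actual behavior), UTF-8 bytes can be misread by a layer that assumes a single-byte charset. Python's `latin-1` codec stands in for the mismatched layer here, and all function names are hypothetical.

```python
def misread_as_latin1(text):
    """UTF-8 bytes interpreted by a layer that believes they are latin-1."""
    return text.encode("utf-8").decode("latin-1")

def repair(mojibake):
    """Undo the misreading while the original bytes are still intact."""
    return mojibake.encode("latin-1").decode("utf-8")

def lossy_store(mojibake):
    """Re-encode the garbage into a charset that cannot represent it."""
    return mojibake.encode("ascii", errors="replace")

garbled = misread_as_latin1("中文")

# Two characters turn into six garbage characters, one per UTF-8 byte.
print(len(garbled))

# As long as the bytes survive, the mistake is reversible.
print(repair(garbled))

# After a lossy conversion only question marks remain, and nothing can be recovered.
print(lossy_store(garbled))
```

The first mistake only looks bad; it is the second conversion, applied on top of already-garbled text, that destroys the data for good.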

Oh, also, for the record, having CJK search in vB China's modified distro did not particularly help penetrate the Chinese market. Interestingly, contrary to popular opinion, people frankly don't care. They have their Discuz, they're happy, and they're not planning to change. They have their pirated version running with our modified code and have no intention of purchasing the license.

Anyways, 4am now... I've spent about 1 hour writing and deleting... I don't even know if this makes any sense. I'm just gonna hit post reply, get burned for any mistakes, and check over again with a clear mind. so much for waking up at 6 to go in early and work out thesis stuff with my prof...
 

p4guru

Well-known member
#13
Such a headache indeed! I for one am glad I don't have to deal with this, but having such support would indeed be another positive tick for XenForo :)
 

chousho

Well-known member
#14
Andy, thanks so much for providing all of the input (at the cost of your sleep, grades, and possible future plans to ever own your own house/car/clothing)~
It seems that implementing search isn't just a drag-and-drop procedure (if only it were that easy) but a painstaking task that can also be resource intensive and prone to bugs. While it would be cool if the work were put in, with the limitations and the number of hoops to jump through, I can see why they would have more critical issues to deal with, even just making sure XF is up to their quality bar for shipping out the door.

Hopefully, perhaps when XF is released, a community of those of us interested in CJK XF could try to work around the limitations of MySQL and even release something upstream. But that's a big hope, haha.

Thanks again, Andy :D
 
#16
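1. Thank you for the technical write-up.
2. There are still many Chinese forums running licensed vBB, mine included. Take a look at the vBB forums and you'll see that Chinese customers are complaining loudly about vBB right now. If XenForo can solve the CJK search problem well, I believe it will bring in quite a few customers.
3. Also, as far as I know, Chinese support for vBB 4 (including Chinese search) was originally solved by just two people, and they were not official staff.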
 
#17
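Just passing by~~~ having a look~~~
This Chinese search is not a perfect approach, at least as far as search goes.
But seriously, it is the best CJK search method available right now.
Unfortunately, with this approach the index can still fail to work on large forums. But it is already much better than the earlier approach.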
 
Status
Not open for further replies.