Fixed vBulletin 4.x Import: Only imports the first 200 posts per thread

Steffen

Well-known member
Affected version
2.0.0
The importer tries to import posts in batches of 200 but only the first 200 posts per thread ever make it into XenForo. In the method "stepPosts", when the inner loop foreach ($posts AS $oldId => $post) { ... } finishes then the importer will move on with the next thread and never consider the remaining posts of the last thread again.

My test import has resulted in lots of threads with exactly 200 posts each. :D

Maybe it would be better to use a single loop over all posts instead of two nested loops over threads and their posts? That would simplify things and solve an edge case that would still exist even if the mentioned bug was fixed: I suspect posts could still get lost if there are two posts in the same thread with exactly the same dateline and if these posts fall onto a batch border (if the last post of batch n has dateline x then in batch n+1 the importer only considers posts with dateline > x). Ok, I admit it's an edge case (although not impossible). But it could be solved by using a single loop.

You seem to have chosen the nested loop to more easily compute the "position" of each post. But isn't the "position" recomputed during the "finalize" step anyway? If it isn't then a single loop over all posts sorted by "threadid ASC, dateline ASC" could be the solution (you'd only have to remember a single "position" value).

(For performance reasons: Can we maybe skip updating "xf_thread_user_post" during the import because that table will be rebuilt during the finalize step, too?)
 
Last edited:
I’ll need to verify this - I did suspect that there may have been a flaw in 2.0.0 related to continuing the post loop when a thread is not completed, but the code has changed somewhat for 2.0.1.

As for the recalculation of post position, I had actually assumed that we do not adjust that in the post rebuild step of ‘finalize’, but if in fact we do, then you would be correct in that it would be much easier to loop through all posts rather than trying to perform the double-loop I currently employ.

Watch this space...
 
Having double-checked the finalize system, we don’t actually rebuild post position or xf_thread_post_user, so that will need to remain in the import system.

The 200 post thing is legit though, I’ll get to work on that.
 
@Steffen I haven't fully tested this yet, but would you try this with your import that has threads with > 200 posts please?
Diff:
Index: trunk/src/XF/Import/Importer/vBulletin.php
===================================================================
--- a/trunk/src/XF/Import/Importer/vBulletin.php
+++ b/trunk/src/XF/Import/Importer/vBulletin.php
@@ -2888,77 +2888,77 @@
                         LIMIT {$limit}
                 ", 'threadid', [$startAfter, $end]);
         }
 
         protected function getThreadSubscriptions($oldThreadId)
         {
                 return $this->sourceDb->fetchPairs("
                         SELECT userid, emailupdate
                         FROM {$this->prefix}subscribethread
                         WHERE threadid = ?
                 ", $oldThreadId);
         }
 
         // ########################### STEP: POSTS ###############################
 
         public function getStepEndPosts()
         {
                 return $this->getStepEndThreads();
         }
 
-        public function stepPosts(StepState $state, array $stepConfig, $maxTime, $limit = 200, $threadLimit = 50)
+        public function stepPosts(StepState $state, array $stepConfig, $maxTime, $limit = 50)
         {
                 if ($state->startAfter == 0)
                 {
                         // just in case these are lying around, get rid of them before continue...
                         unset($state->extra['postStartDate'], $state->extra['postPosition']);
                 }
 
                 $timer = new \XF\Timer($maxTime);
 
                 $threadIds = $this->getThreadIds($state->startAfter, $state->end, $limit);
 
                 if (!$threadIds)
                 {
                         return $state->complete();
                 }
 
                 $this->lookup('thread', $threadIds);
 
                 foreach ($threadIds AS $oldThreadId)
                 {
                         if (!$newThreadId = $this->lookupId('thread', $oldThreadId))
                         {
                                 $state = $this->setStateNextThread($state, $oldThreadId);
                                 continue;
                         }
 
                         $total = 0;
 
                         if (empty($state->extra['postDateStart']))
                         {
                                 // starting a new thread, so initialize the variables that tell us we are mid-thread
                                 $state->extra['postDateStart'] = 0;
                                 $state->extra['postPosition'] = 0;
                         }
 
-                        $posts = $this->getPosts($oldThreadId, $state->extra['postDateStart'], $limit);
+                        $posts = $this->getPosts($oldThreadId, $state->extra['postDateStart']);
 
                         if (!$posts)
                         {
                                 $state = $this->setStateNextThread($state, $oldThreadId);
                                 continue;
                         }
 
                         $mapUserIds = [];
 
                         foreach ($posts AS $post)
                         {
                                 $mapUserIds[] = $post['userid'];
                                 if ($post['edituserid'])
                                 {
                                         $mapUserIds[] = $post['edituserid'];
                                 }
                         }
 
                         $this->lookup('user', $mapUserIds);
 
@@ -3015,57 +3015,56 @@
                         {
                                 break;
                         }
                 }
 
                 return $state->resumeIfNeeded();
         }
 
         protected function getThreadIds($startAfter, $end, $threadLimit)
         {
                 return $this->sourceDb->fetchAllColumn("
                         SELECT threadid
                         FROM {$this->prefix}thread
                         WHERE threadid > ? AND threadid <= ?
                         AND open <> 10
                         ORDER BY threadid
                         LIMIT {$threadLimit}
                 ", [$startAfter, $end]);
         }
 
-        protected function getPosts($threadId, $startDate, $limit)
+        protected function getPosts($threadId, $startDate)
         {
                 return $this->sourceDb->fetchAllKeyed("
                         SELECT post.*,
                                 IF(user.username IS NULL, post.username, user.username) AS username,
                                 editlog.dateline AS editdate,
                                 editlog.userid AS edituserid
                         FROM {$this->prefix}post AS
                                 post
                         LEFT JOIN {$this->prefix}user AS
                                 user ON(user.userid = post.userid)
                         LEFT JOIN {$this->prefix}editlog AS
                                 editlog ON(editlog.postid = post.postid)
                         WHERE post.threadid = ?
                         AND post.dateline > ?
                         ORDER BY post.dateline
-                        LIMIT {$limit}
                 ", 'postid', [$threadId, $startDate]);
         }
 
Yep, removing the limit works!

(After adjusting vBulletin5.php, too, otherwise PHP raises a warning because of incompatible function signatures. But I guess you had that covered.)
 
Btw, I think you can now remove $state->extra['postStartDate'] and turn $state->extra['postPosition'] into a simple local variable.

One more thing: vBulletin5.php is missing from hashes.json.
 
I think this should handle the posts with identical dateline issues:
Diff:
Index: trunk/src/XF/Import/Importer/vBulletin.php
===================================================================
--- a/trunk/src/XF/Import/Importer/vBulletin.php
+++ b/trunk/src/XF/Import/Importer/vBulletin.php
@@ -2986,41 +2986,50 @@
                                         'message_state' => $this->decodeVisibleState($post['visible']),
                                         'last_edit_date' => $post['editdate'] ?: 0,
                                         'last_edit_user_id' => $post['edituserid'] ? $this->lookupId('user', $post['edituserid'], 0) : 0,
                                         'edit_count' => $post['editdate'] ? 1 : 0,
                                         'position' => $state->extra['postPosition'],
                                 ]);
 
                                 $import->setLoggedIp($post['ipaddress']);
 
                                 if ($import->message_state == 'visible')
                                 {
                                         $state->extra['postPosition']++;
                                 }
 
                                 if ($newId = $import->save($oldId))
                                 {
                                         $state->imported++;
                                         $total++;
                                 }
 
-                                if ($timer->limitExceeded())
+                                /*
+                                 * Only allow the timer to break the loop if the next post in the array
+                                 * has a dateline different from that of the current post, because when we
+                                 * pick up the loop again, we will only fetch posts that have a date that
+                                 * is greater than the current post, so in the event that the next post
+                                 * has a dateline that is the same as the current one, it would otherwise
+                                 * be omitted.
+                                 */
+                                $next = next($posts);
+                                if ($next['dateline'] != $post['dateline'] && $timer->limitExceeded())
                                 {
                                         break 2; // end both the post loop, AND the thread loop
                                 }
                         }
 
                         $state = $this->setStateNextThread($state, $oldThreadId);
 
                         if ($timer->limitExceeded())
                         {
                                 break;
                         }
                 }
 
                 return $state->resumeIfNeeded();
         }
 
Btw, I think you can now remove $state->extra['postStartDate'] and turn $state->extra['postPosition'] into a simple local variable.

One more thing: vBulletin5.php is missing from hashes.json.
I don't think we can, because the posts loop could still be subject to the $timer->limitExceeded() check at the end of the $posts loop, so we need to know where to resume in the event that the timer does break the loop.
 
Back
Top Bottom