
Connectivity issues in replication process

1 vote
1 answer
2188 views
I've been trying to add a new node to a replica set, without success, due to these errors:

> 2019-08-12T06:48:15.031+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host mongo03:27017 due to bad connection status; 1 connections to that host remain open
> 2019-08-12T06:48:15.031+0200 I REPL [replication-224] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime (with hash): { ts: Timestamp(1565585283, 1317), t: 9 }. Restarts remaining: 10
> 2019-08-12T06:48:15.032+0200 I REPL [replication-224] Scheduled new oplog query Fetcher source: mongo03:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565585283, 1317) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts: Timestamp(1565585283, 1317), t: 9 } } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 2262327 -- target:mongo03:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565585283, 1317) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts: Timestamp(1565585283, 1317), t: 9 } } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
> 2019-08-12T06:48:17.637+0200 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to mongo03:27017
> 2019-08-12T06:48:17.645+0200 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to mongo03:27017, took 8ms (2 connections now open to mongo03:27017)
> 2019-08-12T06:49:36.961+0200 I - [repl writer worker 13] DB.CollectionName collection clone progress: 221390504/643329790 34% (documents copied)
> 2019-08-12T06:51:52.859+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host mongo03:27017 due to bad connection status; 1 connections to that host remain open
> 2019-08-12T06:51:52.859+0200 I REPL [replication-229] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime (with hash): { ts: Timestamp(1565585502, 204), t: 9 }. Restarts remaining: 10
> 2019-08-12T06:51:52.859+0200 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to mongo03:27017
> 2019-08-12T06:51:52.859+0200 I REPL [replication-229] Scheduled new oplog query Fetcher source: mongo03:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565585502, 204) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts: Timestamp(1565585502, 204), t: 9 } } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 2272962 -- target:mongo03:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565585502, 204) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts: Timestamp(1565585502, 204), t: 9 } } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms

I'm a bit lost about how to debug this. It's a 250 GB database, and the synchronization runs over a 100 Mb/s private network.
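For scale, here is a back-of-the-envelope check (my own arithmetic, not from the logs) of how long an ideal-case clone of that data set takes over that link. The takeaway is that the link stays saturated for hours, during which a 2-second `maxTimeMS` on the oplog `find` (visible in the queries above) is easy to trip:

```javascript
// Ideal-case initial-sync transfer time for a 250 GB data set over a
// 100 Mb/s link (the figures stated in the question).
const dataBytes = 250e9;        // 250 GB
const linkBitsPerSec = 100e6;   // 100 Mb/s
const seconds = (dataBytes * 8) / linkBitsPerSec;
const hours = seconds / 3600;
console.log(`${seconds} s ≈ ${hours.toFixed(1)} h`); // 20000 s ≈ 5.6 h
```

Real throughput will be lower than the line rate, so the actual clone runs even longer than ~5.6 hours, i.e. a long window in which any latency spike can exceed the fetcher's timeouts.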
There is an iptables firewall on all the servers, but traffic from the private network is allowed:

    -A INPUT -i eth1 -p tcp -m tcp --dport 27017 -m state --state NEW,ESTABLISHED -j ACCEPT

At some point the replication process fails:

> 2019-08-12T02:22:37.582+0200 I REPL [replication-143] Restarting oplog query due to error: ExceededTimeLimit: error in fetcher batch callback: Executor error during find command :: caused by :: errmsg: "operation exceeded time limit". Last fetched optime (with hash): { ts: Timestamp(1565569328, 710), t: 9 }. Restarts remaining: 1
> 2019-08-12T02:22:37.582+0200 I REPL [replication-143] Scheduled new oplog query Fetcher source: mongo03:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565569328, 710) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts: Timestamp(1565569328, 710), t: 9 } } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 1381614 -- target:mongo03:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1565569328, 710) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 9, readConcern: { afterOpTime: { ts: Timestamp(1565569328, 710), t: 9 } } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
> 2019-08-12T02:22:39.612+0200 I REPL [replication-142] Error returned from oplog query (no more query restarts left): ExceededTimeLimit: error in fetcher batch callback: Executor error during find command :: caused by :: errmsg: "operation exceeded time limit"
> 2019-08-12T02:22:39.615+0200 I REPL [replication-142] Finished fetching oplog during initial sync: ExceededTimeLimit: error in fetcher batch callback: Executor error during find command :: caused by :: errmsg: "operation exceeded time limit". Last fetched optime and hash: { ts: Timestamp(1565569328, 710), t: 9 }
> 2019-08-12T02:22:43.034+0200 I REPL [replication-142] CollectionCloner ns:DB.CollectionName finished cloning with status: IllegalOperation: AsyncResultsMerger killed
> 2019-08-12T02:22:44.551+0200 W REPL [replication-142] collection clone for 'DB.CollectionName' failed due to IllegalOperation: While cloning collection 'DB.CollectionName' there was an error 'AsyncResultsMerger killed'
> 2019-08-12T02:22:44.551+0200 I REPL [replication-142] CollectionCloner::start called, on ns:DB.CollectionName2
> 2019-08-12T02:22:44.551+0200 W REPL [replication-142] database 'dbName' (9 of 9) clone failed due to ShutdownInProgress: collection cloner completed
> 2019-08-12T02:22:44.551+0200 I REPL [replication-142] Finished cloning data: ShutdownInProgress: collection cloner completed. Beginning oplog replay.
> 2019-08-12T02:22:44.551+0200 I REPL [replication-142] Initial sync attempt finishing up.

And then the initial sync has to start over again. I don't know whether I can configure a longer timeout to avoid these disconnections. Thanks in advance.
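On the MongoDB side, one knob that may help is the number of oplog-query restarts an initial sync tolerates before giving up, which surfaces in the logs as the "Restarts remaining" counter. This is a sketch under the assumption that your version (3.4+) supports the `oplogFetcherInitialSyncMaxFetcherRestarts` server parameter; verify the name against your version's documentation before relying on it:

```yaml
# mongod.conf on the syncing node — tolerate more oplog-query restarts
# during initial sync before the whole attempt is abandoned.
# (Parameter name assumed from MongoDB 3.4+ server parameters.)
setParameter:
  oplogFetcherInitialSyncMaxFetcherRestarts: 30
```

It can also be changed at runtime with the `setParameter` admin command on the syncing node. This only makes the sync more tolerant of transient drops; it does not fix whatever is killing the connections.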
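One thing worth ruling out before touching MongoDB itself: because the iptables rule uses stateful matching (`-m state`), an idle replica-set connection can silently die when its conntrack entry expires. MongoDB's production notes recommend shortening the TCP keepalive so long-lived member connections keep their firewall state fresh; a sketch of that system-level tuning (Linux, the commonly suggested value):

```shell
# Check the current keepalive (the Linux default is 7200 s = 2 h):
sysctl net.ipv4.tcp_keepalive_time

# MongoDB's production notes suggest ~120 s, so that firewall/conntrack
# state for idle replica-set connections is refreshed before it expires:
sysctl -w net.ipv4.tcp_keepalive_time=120

# To persist across reboots, add to /etc/sysctl.conf:
#   net.ipv4.tcp_keepalive_time = 120
```

This needs to be applied on every member of the replica set, and mongod must be restarted (or connections re-established) for existing sockets to pick it up.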
Asked by brb (41 rep)
Aug 12, 2019, 05:12 AM
Last activity: Aug 9, 2025, 03:03 AM