Summary: You can look at the progress of a priority problem in more detail here.
Problem: [41624] Email Time-outs
Status: Closed
Problem description added on: 02/04/2007 @ 15:54

There is high load across the mail collection servers that is causing attempts to collect email to time-out for customers.

Comment updated on: 02/04/2007 @ 15:57

Problem is being assigned to a network engineer for investigation.

Comment updated on: 02/04/2007 @ 17:27

We have identified a number of mailboxes that have reached their maximum directory size. This is creating excessive load on the mail collection servers, which is causing intermittent time-outs. We are now addressing these mailboxes and expect service to be restored shortly.
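
As a rough illustration of how oversized mailboxes might be identified, the sketch below walks a mail spool and lists mailboxes holding more than a given number of messages. The spool layout (one Maildir-style directory per user, with messages under cur/ and new/), the function names and the threshold are all assumptions made for this example; it does not describe the tools actually used by the engineers.

```python
import os


def count_messages(maildir):
    """Count files in the cur/ and new/ subdirectories of one mailbox."""
    total = 0
    for sub in ("cur", "new"):
        path = os.path.join(maildir, sub)
        if not os.path.isdir(path):
            continue  # tolerate mailboxes with a missing subdirectory
        with os.scandir(path) as entries:
            total += sum(1 for entry in entries if entry.is_file())
    return total


def find_large_mailboxes(spool_root, max_messages):
    """Return (user, count) pairs above max_messages, largest first.

    spool_root is a hypothetical directory with one subdirectory per user.
    """
    large = []
    with os.scandir(spool_root) as users:
        for user in users:
            if not user.is_dir():
                continue
            count = count_messages(user.path)
            if count > max_messages:
                large.append((user.name, count))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Counting directory entries rather than reading message contents keeps the scan itself from adding much load to the storage it is inspecting.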

Comment updated on: 02/04/2007 @ 19:18

A problem has been identified that was causing excessive connection errors to our mail authentication database. We have since cleared these sessions and restarted the services on each of the mail collection servers, which has restored service.

Comment updated on: 02/04/2007 @ 21:38

We have received further scattered reports this evening of continuing timeouts, though this does appear to be very intermittent. Network engineers are working on this now.

Comment updated on: 02/04/2007 @ 23:34

The problem with email collection has been resolved and customers should now be able to collect mail as normal. Service status has been updated. We will keep this problem open for monitoring and post an update in the morning.

Comment updated on: 03/04/2007 @ 10:00

We are receiving further reports of mail timeouts from customers and the load across our mail collection servers is rising again. Our network engineers are investigating and details will be posted to service status.

Comment updated on: 03/04/2007 @ 12:25

We have restarted the courier service across the mail collection servers in a bid to restore service whilst we continue diagnosing the problem. Unfortunately, unlike yesterday, this has not restored access for customers.

We are now looking at renaming the directories of a number of customers who are thought to be saturating threads to the storage platform.

Comment updated on: 03/04/2007 @ 13:35

We have rolled back a recent courier upgrade that was applied to mailc01 as a precautionary measure.

A number of users on mail02 have had their directories renamed.

We will continue this process with mail01.

We have started to build 4 additional Linux mailc servers to provide extra load-sharing capacity should it be required.

Mailc logs are being checked over the affected period to identify whether particular users are placing abnormal load on the mailc platform.
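
A log check of this kind can be sketched as a simple count of logins per user over the period of interest. The example below is illustrative only: it assumes each login is logged on a line containing "LOGIN, user=<name>,", and the function name busiest_users is hypothetical. The real log format on the mailc servers may differ, in which case the pattern would need adjusting.

```python
import re
from collections import Counter

# Assumed log format: each login appears on a line containing
# "LOGIN, user=<name>,". This is an assumption, not a confirmed format.
LOGIN_PATTERN = re.compile(r"LOGIN, user=([^,\s]+)")


def busiest_users(log_lines, top_n=10):
    """Return the top_n (user, login_count) pairs from an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = LOGIN_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(top_n)
```

Because the function accepts any iterable of lines, an open log file can be passed in directly, and users whose login counts sit far above the rest become candidates for closer inspection.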

Comment updated on: 03/04/2007 @ 13:38

We are looking to disable POP/IMAP access separately for each vISP in the Alteon load balancers to help isolate the source of the problem.

Comment updated on: 03/04/2007 @ 15:45

At approximately 14:00 the load on the mailc servers started to decrease. It is now back to normal levels, although there is still a degree of doubt as to the primary cause of the problem.

Customers should now be able to access their email as normal.

We have chosen not to disable IMAP/POP access in the load balancers and our network engineers are continuing with their investigation.

We are continuing to proactively manage mailboxes with the potential to create load on the storage platform and hope to have the additional Linux mailc servers installed later today.

Following completion of this work we will also be looking to build and install two Sun Solaris servers to compare NFS performance between these machines and the existing mailc servers.

The mxcore delivery servers use the same attached storage as the mailc servers but are based on Solaris kit. No problems have been seen with the mxcore platform throughout the duration of this problem.

Service Status to be updated shortly.

Comment updated on: 03/04/2007 @ 17:23

An increase in load on the mail collection servers was seen at approximately 4:15pm. This may have resulted in further timeouts.

Load is increasing at the time of writing, which may present problems for customers attempting to collect their email.

Comment updated on: 03/04/2007 @ 21:06

Monitoring of the platform over the last few hours has shown no further increases in load. Our network operations team are continuing to work on improving performance and will continue to monitor the situation throughout the night.

Comment updated on: 03/04/2007 @ 23:29

Our monitoring has shown no increases in load since approximately 5:30pm this evening. Since then, all testing has shown that mail collection is working as normal.

Four additional mail servers are currently undergoing testing. A maintenance slot has been scheduled for midnight (service status has been posted) to make changes to the Sheffield load balancers in preparation for the introduction of the new servers.

Service status has been updated.

Comment updated on: 04/04/2007 @ 07:48

The four additional mail servers have been installed and are now operational. No spikes in load have been seen since approximately 5:30pm yesterday. We will be monitoring the platform today and continuing our investigations.

We have two Solaris servers on standby to further aid in our diagnostic analysis.

Comment updated on: 04/04/2007 @ 17:43

No further problems have been seen since yesterday evening. Tonight we will be installing a further 4 mail collection servers, which will almost double the number of mail collection servers in service compared with when this problem was originally opened.

We are also looking at further work to reduce levels of spam on the platform. It has been suggested that this could be achieved by dropping spam email from the mx last servers and offering the functionality to delete mail marked as spam before it is downloaded to customers' mailboxes.

Comment updated on: 05/04/2007 @ 08:41

The 4 additional mail collection servers have been installed successfully and are operating as expected.

Comment updated on: 05/04/2007 @ 16:48

Monitoring of the platform has shown no further instances of time-outs following the maintenance last night. Problem is now being downgraded for sign-off.
