Difference between revisions of "Recente Storingen"

From Cncz
([Disk server pile offline][Disk server pile offline])
([Recent Verholpen Storingen en Onderhoud][Recently Resolved Service Interruptions and Maintainance])
Line 28: Line 28:
  
 
<startFeed />
 
<startFeed />
 +
=== [Disk server stack offline][Disk server stack offline] ===
 +
<itemTags>medewerkers</itemTags>
 +
[nl]
 +
  Begin        : 20130408 08:55
 +
  Eind          : 20130408 09:30
 +
  Getroffen    : Gebruikers van diskruimte/netwerkschijven op de Stack.
 +
  Probleem      : Stack: Crash door uitgevallen disk
 +
  Oplossing    : Gebruik gemaakt van hot spare door disk uit te laten vallen
 +
[/nl]
 +
 +
[en]
 +
  Begin        : 20130408 08:55
 +
  End          : 20130408 09:30
 +
  Affected    : Users of disk volumes/network shares on file server Stack.
 +
  Problem      : Stack:  Crash due to failing disk
 +
  Solution    : Deactivated disk using hot spare
 +
[/en]
 +
 +
[[Gebruiker:visser|Erik Joost Visser]] 8 apr 2013 9:45 (CET)
 +
 
=== [Disk server pile offline][Disk server pile offline] ===
 
=== [Disk server pile offline][Disk server pile offline] ===
 
<itemTags>studenten,medewerkers,docenten</itemTags>
 
<itemTags>studenten,medewerkers,docenten</itemTags>
Line 33: Line 53:
 
   Begin        : 20130408 06:30
 
   Begin        : 20130408 06:30
 
   Eind          : 20130408 08:15
 
   Eind          : 20130408 08:15
   Getroffen    : Gebruikers van diskruimte op de Pile.
+
   Getroffen    : Gebruikers van diskruimte op de Pile (homedisks).
 
   Probleem      : Pile: Kernel panic tijdens wekelijkse reboot; stond te wachten in console
 
   Probleem      : Pile: Kernel panic tijdens wekelijkse reboot; stond te wachten in console
 
   Oplossing    : Power-cycle van het systeem
 
   Oplossing    : Power-cycle van het systeem
Line 41: Line 61:
 
   Begin        : 20130408 06:30
 
   Begin        : 20130408 06:30
 
   End          : 20130408 08:15
 
   End          : 20130408 08:15
   Affected    : Users of disk volumes on file server Pile.
+
   Affected    : Users of disk volumes on file server Pile (userhomes).
 
   Problem      : Pile: Did not shutdown properly during weekly reboot due to a kernel panic which was  
 
   Problem      : Pile: Did not shutdown properly during weekly reboot due to a kernel panic which was  
 
   Solution    : Executed power-cycle of the system
 
   Solution    : Executed power-cycle of the system

Revision as of 08:50, 8 April 2013

{{#customtitle:Recent Service Interruptions}}


Current Service Interruptions and Maintenance



Report a problem

Use this form to report less urgent problems. For urgent problems, call 56666 (helpdesk).

Recently Resolved Service Interruptions and Maintenance

Disk server stack offline

  Begin        : 20130408 08:55
  End          : 20130408 09:30
  Affected     : Users of disk volumes/network shares on file server Stack.
  Problem      : Stack:  Crash due to failing disk
  Solution     : Failed out the defective disk so that the hot spare took over

Erik Joost Visser 8 apr 2013 9:45 (CET)

Disk server pile offline

  Begin        : 20130408 06:30
  End          : 20130408 08:15
  Affected     : Users of disk volumes on file server Pile (userhomes).
  Problem      : Pile: Did not shut down properly during weekly reboot due to a kernel panic, which left it waiting at the console
  Solution     : Executed power-cycle of the system

Erik Joost Visser 8 apr 2013 9:30 (CET)

Mail problems after supplying password to phishers

 Begin         : 20130319 11:45
 End           : 20130319 12:14
 Affected      : Users of Science mail

Once again, a Science user supplied their Science password to phishers. We notice this because Internet criminals use these passwords to get into the Science mail servers (horde webmail, smtp) in order to send spam.

PLEASE: do not naively click on a link in an e-mail!

File server miii no private network

  Begin        : 20130315 17:30
  End          : 20130318 09:00
  Affected     : Users of network disks of the server miii.
  Problem      : While performing maintenance (removing a network card) a network cable was not inserted correctly.

Wim Janssen 18 mar 2013 11:11 (CET)

Disk server pile offline

  Begin        : 20130318 06:30
  End          : 20130318 07:36
  Affected     : Users of disk volumes on file server Pile.
  Problem      : Pile: Was waiting for interactive input after reporting a warning (^d)

Network problems due to installation of Matlab R2013a

Begin         : 20130313 13:00
End           : 20130313 13:40
Affected      : users of the network

Matlab R2013a was installed yesterday. Today at 13:00, many servers automatically started copying its 5.4 GB to their local disks. All this copying overloaded parts of the network, which made network access slow for many users. The distribution of this software will now be spread over a longer period, primarily outside working hours.

Peter van Campen 13 mar 2013 18:31 (CET)
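
The notice does not say how the distribution will be spread out. As an illustration only, one common approach is to give every server its own stable start offset inside an off-hours window, so that the 5.4 GB copies do not all begin at the same moment. The sketch below assumes hypothetical host names, a hypothetical window start and a hypothetical window length; none of these come from C&CZ.

```python
import hashlib
from datetime import datetime, timedelta


def start_offset_minutes(hostname: str, window_minutes: int) -> int:
    """Map a hostname to a stable offset (in minutes) inside the rollout window.

    Hashing the name gives a well-spread, repeatable value, so a server
    keeps the same slot every time the schedule is recomputed.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_minutes


def rollout_schedule(hostnames, window_start: datetime, window_minutes: int):
    """Return a dict mapping each hostname to the time its copy may start."""
    return {
        host: window_start + timedelta(minutes=start_offset_minutes(host, window_minutes))
        for host in hostnames
    }
```

For example, calling rollout_schedule(["server01", "server02"], datetime(2013, 3, 13, 19, 0), 600) spreads the start times over the ten hours after 19:00. Two servers can still land on the same minute; the point is only that the load no longer starts everywhere at once.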

More IP-numbers for ru-wlan and Science (wireless)

On Monday, March 4th 2013 at 18:00, the number of IP addresses available in the FNWI buildings for ru-wlan and Science will be doubled. Because ru-wlan moves to a new range, users of ru-wlan will lose connectivity for at most 15 minutes. There was already a plan to replace ru-wlan and Science within the FNWI buildings by the RU-wide Eduroam and ru-wlan, but wireless network usage has grown so fast that we cannot wait for this plan to be realized. Last week some wireless users at times could not even get an IP address, although the lease time had been brought down to 30 minutes. Therefore this temporary measure became necessary on such short notice.

Marcel Kuppens 4 mar 2013 12:20 (CET)

Short interval in wireless network service

On Monday, Feb 18 at 6:00 pm, there will be maintenance on the wireless network that will affect the following locations at Toernooiveld:

FNWI cellar A1
FEL
Huygens: Library of Science, terrace behind Huygens, canteen, room HG-1.132
KDV1 and KDV2
Linnaeus building
Logistic Centre
Mercator I
Mercator II, ground floor and 7th floor
Mercator III, 2nd floor
Transitorium FNWI (ACSW and FELIX)
UBC

We expect the service will be completely available again within 30 minutes.

Marcel Kuppens 18 feb 2013 10:53 (CET)

Mail problems after supplying password to phishers

 Begin         : 20130212
 End           : 20130214 (for now)
 Affected      : Users of Science mail, specifically of horde webmail

Over the last few days, three Science users have supplied their Science passwords to phishers. We notice this because Internet criminals use these passwords to get into the Science mail servers (horde webmail, smtp) in order to send spam. This time they even used a fake copy of the horde Science webmail website. The main differences from the real horde Science webmail website are:

  • the URL is not within science.ru.nl
  • the connection is not a secure https connection, there is no lock
  • the username and password do not arrive at C&CZ servers, but in the hands of Internet criminals.

PLEASE: do not naively click on a link in an e-mail!
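
The first two differences listed above (the URL is outside science.ru.nl, and the connection is not https) can be checked mechanically. The following is a minimal sketch of such a check, not a C&CZ tool: the function name is made up for illustration, and the host names in the usage note are hypothetical. Passing the check is necessary but not sufficient, so it is no substitute for care when clicking links.

```python
from urllib.parse import urlsplit

# Domain named in the notice above; real Science pages live within it.
TRUSTED_DOMAIN = "science.ru.nl"


def looks_like_science_login(url: str) -> bool:
    """Return True only if the URL uses https and its host is within science.ru.nl."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Accept the domain itself or a true subdomain. A plain substring test
    # would wrongly accept hosts such as "science.ru.nl.example.com".
    in_domain = host == TRUSTED_DOMAIN or host.endswith("." + TRUSTED_DOMAIN)
    return parts.scheme == "https" and in_domain
```

With this sketch, a hypothetical address such as https://webmail.science.ru.nl/ is accepted, while the same address over plain http, or a look-alike such as https://science.ru.nl.example.com/login, is rejected.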

New Radius server for ru-wlan and eduroam (wireless)

On Monday, January 28th 2013 at 8:00 am, one of the servers used by the RU wireless network will be replaced. This replacement will affect users of the wireless networks ru-wlan and eduroam: a new certificate will appear when connecting. You can simply accept it, after which the connection should work. If it does not, remove your old Eduroam or RU-WLAN settings first and then set up the connection again.

Specifically for iPhone / iPad users: We recommend that you first remove your old Eduroam or RU-WLAN profile before activating the new connection without a profile. If that unexpectedly fails, please review the information on www.ru.nl/wireless for iPhone/iPad. If necessary, you can also download a new profile from that site.

Marcel Kuppens 17 jan 2013 10:53 (CET)

Homeserver bundle crashed

  Begin         : 2013-01-16 ~ 13:30
  End           : 2013-01-16 ~ 14:00
  Affected      : all FNWI users with a home directory on file server bundle

Because the file server crashed, it had to be rebooted.

LDAP server upgraded

  Date         : 20121214
  Affected     : Users with a Fedora based desktop PC

Older Fedora desktop PCs may experience startup problems after an upgrade of one of our LDAP servers. A fix is available and has been applied. If you still encounter this problem, please contact C&CZ.

Mail problems after supplying password to phishers

  Begin        : 20121116 04:45
  End          : 20121117 ca 12:00
  Affected     : Users of horde webmail and users wanting to send mail to e.g. hotmail.com

Horde webmail again appeared to be misused for sending spam. This could happen because a naive user gave the Science password to phishers/spammers. After first stopping horde, early Friday morning we disabled the account of the naive user and restarted horde. Saturday morning it appeared that this short spam-outbreak had caused administrators of hotmail.com to add our mail server to their blacklist. Therefore we switched the IP-number of this mail server Saturday morning.

Homeserver bundle will be rebooted

  Begin         : 2012-10-24 ~ 12:45
  End           : 2012-10-24 ~ 13:00
  Affected      : all FNWI users with a home directory on file server bundle

Because the file server refuses to accept a spare disk, it needs a reboot.

Homeserver bundle unavailable

  Begin         : 2012-10-22 12:15
  End           : 2012-10-22 13:00
  Affected      : all FNWI users with a home directory on file server bundle

We are currently working on the problem.

Services unavailable due to power and network outage

  Begin         : 20121018 03:00
  End           : 20121018 10:00
  Affected      : all users until 09:30; afterwards: "bundle" home directories, wireless, "plus" network shares and several websites

During the night from Wednesday to Thursday, a power outage caused a network outage in the basement computing facilities. At about 09:15, power was restored to the network equipment through a bypass that circumvents the UPS. Further checks showed that most servers had not lost power, so most services became available again automatically. Network drivers on "bundle" had to be restarted to restore access to home directories for a large number of users. Furthermore, several websites had to be restarted, which made it possible for PCs to boot properly. During the day, an unrelated issue with the RAID storage of "plus" was fixed as well, restoring access to the following network shares: sofie, ams*, molchem, mb*, encapson, milkun4, snn, neuropi, digicd, carta, ... Because wireless devices were unable to acquire IP addresses, and thus could not access the network, a split-brain situation was diagnosed within the DHCP service; it was resolved around 13:00.

Announced downtime: home server "pile" down for reboot

  Begin        : 20121012 07:00
  End          : 20121012 09:00
  Affected     : Users with homedirectory server "pile" (as can be seen on http://DIY.science.ru.nl)

Next Friday morning, the home server "pile" will be rebooted. There are problems with the snapshots, which could make a reboot take more time. Therefore we schedule the reboot for early next Friday.

Peage top-up unit near Huygens restaurant in maintenance

In order to test new software, the Peage top-up unit near the Huygens restaurant was switched to maintenance mode. This unit is not used often yet, so this will not have caused problems. Students who wanted to top up their Peage account could only do so elsewhere on campus. See the Peage website (http://www.ru.nl/peage); locations are the halls of the Erasmus, Spinoza and Library buildings.

Eduroam incoming doesn't work for iPhone/iPad/iPod

  Begin         : spring 2012 (?)
  End           : 20121005
  Affected      : incoming Eduroam users with an iPhone/iPad/iPod

UCI network management reports that incoming Eduroam currently does not work on iPhone/iPad/iPod. A solution is being worked on. Incoming Eduroam means using the wireless network of a remote institute, with authentication (login/password) checked by RU or Science.

Horde webmail server down because of spam

  Begin        : 20120925 23:05
  End          : 20120926 10:20
  Affected     : Users of horde webmail

Yesterday evening, horde webmail appeared to be misused for sending spam. This could happen because a naive user gave the Science password to spammers. First we stopped horde. This morning we disabled the account of the naive user and restarted horde.

Disk server "Stack" offline

  Begin        : 20120924 06:30
  End          : 20120924 09:35
  Affected     : Users of disk volumes on file server Stack.

Disk server "Plenty" offline

  Begin        : 20120924 06:30
  End          : 20120924 09:00
  Affected     : Users of disk volumes on file server Plenty. The S and T disks that are used in the PC rooms.

During the weekly reboot (Monday mornings), the server got stuck in the BIOS.

Announced downtime: home server "pile" down for replacement

  Begin        : 20120724 07:00
  End          : 20120724 09:00 (ca)
  Affected     : Users with homedirectory server "pile" (as can be seen on http://DIY.science.ru.nl)

Next Tuesday morning, the home server "pile" will be replaced by a new, more powerful server. Because the data have been synchronized with the new server, there will not be much downtime.

Postponed: Announced downtime: home server "pile" down for replacement

The downtime below has been postponed, because we had a few questions on the new server, that could not be answered in time. To be continued...

  Begin        : 20120724 07:00
  End          : 20120724 09:00 (ca)
  Affected     : Users with homedirectory server "pile" (as can be seen on http://DIY.science.ru.nl)

Next Tuesday morning, the home server "pile" will be replaced by a new, more powerful server. Because the data have been synchronized with the new server, there will not be much downtime. The new server should be very dependable: hardware RAID-6, double processors and power supplies and a 5-year support contract from the supplier. The performance has improved, e.g. by using hardware RAID with a 1 GB write cache with battery backup.

Partly announced downtime for mailman + horde webmail server

  Begin        : 20120712 09:09
  End          : 20120712 14:00 (ca)
  Affected     : Users of horde webmail and/or mailman mailing lists

This morning, horde webmail appeared to be misused for sending spam. This could happen because naive users gave their Science password to spammers. After we found out who the users were and had them change their passwords, we decided to also replace a defective CPU fan. Therefore the Mailman mailing lists will also be down from 13:00 to 14:00.


SMTP server blacklisted by MS Live Hotmail

  Begin        : 20120711 03:08
  End          : 20120711 14:55
  Affected     : Science mail users trying to send mail to MS-domains: hotmail.com, live.com, ...

This morning, users reported that mail from smtp.science.ru.nl to hotmail users was being bounced by hotmail. We asked the hotmail administrators to resolve this quickly, but when that took too long, we changed the IP address of our SMTP server.

Planned service interruption: file server with problems

  Begin        : 20120622 17:03
  End          : 20120624 19:30
  Affected     : stack fileservices

A hardware failure of a boot disk of the fileserver stack was reported Friday morning June 22. We decided to repair this after working hours. Thus at approximately 17:00 the defective boot disk was removed from the machine and replaced by a spare one. Enabling the disk, making it bootable, restoring file systems and rebooting the machine (after removing all snapshots) took a lot of time. When this was resolved Friday evening, the NFS/SMB fileservice was not active on the mounted filesystems. It took a reboot Sunday evening to resolve all problems.

Tracelab server poly defective

  Begin       : 20120621 14:12
  End         : 20120621 17:15
  Affected    : Tracelab for users. For administrators also Prism&Deploy and the WDS-service

A hardware failure of the server poly was reported at 2012-06-21 14:12. After a restart of the machine, it stopped working again. No more recoveries were attempted and an identical spare machine was outfitted with the disks from the defective server. Disks had to be synchronized before making the machine available again.

Servers without electric power

  Begin       : 20120607 13:45
  End         : 20120607 15:30
  Affected    : e-mail and users of the fileservers bundle, heap and stack

A power failure in a rack in a server room brought some C&CZ servers down. After less than two hours all problems were dealt with. Affected systems were mainly: postvak (Science mail server), bundle (user homedisk), heap/stack (network disks), resser/kookpunt/brievenbus/rustug (mail transport smtp servers)

Planned Service: website-databases and maybe Linux clients

20 Apr 2012 17:00 - 17:15

A defective hard disc has been replaced in a server, but the server needs to be rebooted to ensure that this is reboot proof. The MySQL database of roughly 70 websites will therefore be down for a short time. Since this server also provides the Kerberos authentication for Linux clients, Linux clients might encounter service interruptions during a short period.

Windows server "plenty" with xpsoftware unavailable

On Thursday July 7, around 13:00, the server "plenty" could not be reached. Because this server serves the "xpsoftware" share for the Managed Windows PCs, all these PCs had a problem. After the server was restarted and the disks had been checked, it was available again at 14:26.


Downtime Science servers: Sunday July 3, 09:00 - 12:00 hours

In order to improve the cooling of a server room, we plan to move three racks of Science servers a few meters on Sunday morning, July 3. We will have to switch off a lot of servers temporarily. Therefore several services will be unavailable some time starting July 3, 09:00 hours. We expect the downtime will last until 10:00 hours for servers with a lot of different users. The cn compute cluster will probably be fully operational again at 12:00 hours.

The servers/services affected are:

fileservers: plenty/pile/bundle with shares like:
             amsbackup2 bbb-priv botany bsweet comsol exoarchief gi3 hfml-data ifl iris
             lambiek mestrelab mi1/2/3 molchem2 molphtec morph multimedia olsen pcb planthgl
             sdisk share snn2 spmdata1 tdisk tece temp wallpaper xpcursus xpsoftware
potkast: films via Blackboard
ts2: Windows Terminal Server
lilo1: Linux Login Server, alternative: lilo/lilo2
cn compute cluster
horde webmail
License server for: Comsol

With apologies for the inconvenience
C&CZ

Peter van Campen 22 jun 2011 09:57 (UTC)

Network outage June 22, 10:55-11:30

This morning, a UPS (battery power supply) in the network hub for Huygens South went down, which made a set of network switches lose power. Because of this, users in Huygens wing 1 and spin-off companies lost their connection to the network. After bypassing the UPS, everything was up and running again at 11:30. We are still searching for the exact origin of this outage.

New SSH keys for new login servers

The LInux LOgin server lilo has been replaced. The name now points to the new machine lilo2, because that one is faster than the other login server lilo1. Therefore it is normal to be asked once to accept the new SSH key.

Planned Service: Limited computer services

12 Feb 2011 7:00 - 11:00

A backup cooling system will be installed in our main computer room. Therefore the air conditioning system must be switched off, which means that most of the computer facilities in this room must be shut down. This includes the cluster nodes cn00 through cn53 and many of the web- and file- (network share) servers. It is advised to expect a very limited service level. We will try to keep all home directories and the mail system available. For detailed information about the impact please contact C&CZ.

Printer lp5

24 Jan 2011 - 11 Mar 2011

Printer lp5 has been moved to HG00.089. You can't use this printer at the moment because there is a problem with the power supply unit.

Fixed phone problem

7 Mar 2011

Certain fixed phones at the university cannot be reached right now. Mobile phones and Skype do work.

Mailserver blacklisted

4 Feb 2011 9:00 - 12:00

One of our mail servers has been sending loads of spam after a successful phishing attack. Since then, our server has been blacklisted on several domains. Currently this affects the delivery of email to @hotmail and @live addresses.