Difference between revisions of "Recente Storingen"

From Cncz
 
(62 intermediate revisions by 4 users not shown)
Line 14: Line 14:
 
<!-- [en]and of course it has an english part[/en] -->
 
<!-- [en]and of course it has an english part[/en] -->
 
<!-- ~ ~ ~ ~ -->
 
<!-- ~ ~ ~ ~ -->
</onlyinclude>
 
  
 
=== [Standaard RU ICT onderhoudsvensters][Standard RU IT maintenance windows] ===
 
=== [Standaard RU ICT onderhoudsvensters][Standard RU IT maintenance windows] ===
<itemTags>medewerkers,studenten</itemTags>
 
 
[nl]
 
[nl]
 
Het ISC maakt ruim vooraf [http://www.ru.nl/systeem-meldingen/?id=26&lang=nl&tg=0&f=0 de ICT onderhoudsvensters voor het huidige studiejaar] bekend.
 
Het ISC maakt ruim vooraf [http://www.ru.nl/systeem-meldingen/?id=26&lang=nl&tg=0&f=0 de ICT onderhoudsvensters voor het huidige studiejaar] bekend.
Line 35: Line 33:
 
== [Actuele storingen en gepland onderhoud][Current Service Interruptions and Planned Maintenance] ==
 
== [Actuele storingen en gepland onderhoud][Current Service Interruptions and Planned Maintenance] ==
 
<onlyinclude>
 
<onlyinclude>
 +
=== [Zaterdag 14 mei belendende gebouwen (Mercator, Proeftuin, Logistiek) 5 minuten zonder netwerk][Saturday May 14 adjacent buildings (Mercator, Proeftuin, Logistiek) 5 minutes without network] ===
 +
[nl]
 +
  Begin        : 2022-05-14 09:00
 +
  Eind          : 2022-05-14 10:00
 +
  Getroffen    : alle netwerkaansluitingen in Mercator, Proeftuin en Logistiek Centrum zullen max. 5 minuten onderbroken zijn
 +
 +
RU/ILS netwerkbeheer zet nieuwe apparatuur in. Dit zal maximaal 5 minuten tot een netwerkonderbreking leiden.
 +
[/nl]
 +
[en]
 +
  Begin        : 2022-05-14 09:00
 +
  End          : 2022-05-14 10:00
 +
  Affected      : all network outlets in Mercator, Proeftuin and Logistiek Centrum will be down for max. 5 minutes
 +
 +
RU/ILS network management will switch to new hardware. This will lead to a network interruption of at most 5 minutes.
 +
[/en]
 +
 +
<startFeed />
 +
<endFeed />
 +
</onlyinclude>
 +
 +
== [Recent Verholpen Storingen en Onderhoud][Recently Resolved Service Interruptions and Maintenance] ==
 +
 +
[nl]Voor het snel ge&iuml;nformeerd worden over storingen kan men zich abonneren op de [/nl]
 +
[en]To be quickly informed about service interruptions one can subscribe to the [/en]
 +
[http://mailman.science.ru.nl/mailman/listinfo/CPK CPK mailinglist].
  
 
<startFeed />
 
<startFeed />
=== [RU netwerkonderhoud zaterdag 27 februari 08:00-20:00][RU network maintenance Saturday Feb. 27 08:00-20:00] ===
+
=== [Coma, coma01 en coma46 netwerkprobleem][Coma, coma01 and coma46 network problem] ===
 +
[nl]
 +
  Begin        : 2022-05-03 13:47
 +
  Eind          : 2022-05-03 14:55
 +
  Getroffen    : Gebruikers van coma, coma01 en coma46
 +
 
 +
Vanmiddag verloren drie coma-nodes hun netwerk vanwege een verkeerde netwerkconfiguratie. Ze bleken al langer af en toe slecht bereikbaar te zijn. Het duurde even om uit te zoeken wat dit netwerkprobleem veroorzaakte, maar toen dat gevonden was, was het snel gerepareerd.
 +
[/nl]
 +
[en]
 +
  Begin        : 2022-05-03 13:47
 +
  End          : 2022-05-03 14:55
 +
  Affected      : Users of coma, coma01 and coma46
 +
 
 +
This afternoon three coma nodes lost their network connection because of an incorrect network configuration. They turned out to have been intermittently hard to reach for some time. It took us a while to find out what caused this network problem, but once found, it was quickly fixed.
 +
[/en]
 +
 
 +
=== [Astro.ru.nl DNS(SEC) service down][Astro.ru.nl DNS(SEC) service down] ===
 +
[nl]
 +
  Begin        : 2022-04-28 12:02
 +
  Eind          : 2022-05-13 13:00
 +
  Getroffen    : Gebruikers die met of vanaf astro.ru.nl wilden communiceren 
 +
 
 +
Bij de gebruikelijke vervanging van de [https://nl.wikipedia.org/wiki/DNSSEC DNSSEC]-sleutels, die het DNS-verkeer beveiligen, is in de externe DNS van ru.nl een foute sleutel voor astro.ru.nl ingevoerd. Daardoor verdween astro.ru.nl vanaf extern gezien van het internet. Dat is op 2022-05-02 om ca. 14:00 met een automatische procedure hersteld, maar het automatische proces gebruikte een verkeerde versleuteling. Het duurde tot 2022-05-12 voordat ILS dit met de hand gecorrigeerd had nadat we dat laat geconstateerd hadden. Omdat men het DNS-antwoord dat astro.ru.nl niet bestaat, 24 uur mocht gebruiken, is de overlast pas 2022-05-13 13:00 compleet verdwenen. De voornaamste overlast was waarschijnlijk dat mail vanaf een @astro.ru.nl adres geweigerd werd.
 +
[/nl]
 +
[en]
 +
  Begin        : 2022-04-28 12:02
 +
  End          : 2022-05-13 13:00
 +
  Affected      : Users wanting to communicate with or from astro.ru.nl
 +
 
 +
During the regular change of [https://en.wikipedia.org/wiki/Domain_Name_System_Security_Extensions DNSSEC] keys that secure DNS traffic, an incorrect key was introduced in the external DNS of ru.nl for astro.ru.nl. This made astro.ru.nl disappear from the internet as seen from outside. The error was partly corrected by an automatic procedure on 2022-05-02 at ca. 14:00, but that procedure used an encryption that was not accepted. After we eventually noticed the error, it took ILS until 2022-05-12 to correct it by hand. Because the DNS answer that astro.ru.nl doesn't exist may be cached for 24 hours, the problem was not completely over until 2022-05-13 13:00. The main problem for users was probably that mail from an @astro.ru.nl address was rejected.
 +
[/en]
 +
 
 +
=== [SUSE Linux 15.3 Eduroam werkt niet met U- of S-nummer, wel met Science account][SUSE Linux 15.3 Eduroam doesn't work with U- or s-number, but does with Science account] ===
 +
<itemTags>medewerkers,studenten</itemTags>
 +
[nl]
 +
  Begin        : 2022-02-14
 +
  Eind          : 2022-03-09 14:47
 +
  Getroffen    : Eduroam-gebruikers met SUSE Linux 15.3
 +
 
 +
Op 14 februari 2022 zijn antieke versies van [https://nl.wikipedia.org/wiki/Transport_Layer_Security TLS] (1.0 en 1.1) uitgeschakeld bij de [https://www.eduroam.nl/ Eduroam]-authenticatie op de [https://www.ru.nl/ils ILS] LDAP-servers. Daarna lukt het SUSE Linux 15.3 clients niet meer om met U- of s-nummer te authenticeren. Zij hebben alleen TLS1.2 en de ILS-servers bieden eerst TLS1.3 aan, daarna gaat het fout. Door te authenticeren met Science-account, waarbij de Science-servers TLS1.2 aanbieden, kunnen ze wel verbinding maken met Eduroam.
 +
[/nl]
 +
[en]
 +
  Begin        : 2022-02-14
 +
  End          : 2022-03-09 14:47
 +
  Affected      : Eduroam users with SUSE Linux 15.3
 +
 
 +
On February 14, 2022, ILS switched off obsolete versions of [https://en.wikipedia.org/wiki/Transport_Layer_Security TLS] (1.0 and 1.1) for the [https://www.eduroam.nl/en/ Eduroam] authentication on the [https://www.ru.nl/ict-uk/ ILS] LDAP servers. Since then, SUSE Linux 15.3 clients can't authenticate with a U- or s-number: they only support TLS 1.2, while the ILS servers offer TLS 1.3 first, after which an error occurs. By authenticating with a Science account, for which the Science servers offer TLS 1.2, these users can still connect to Eduroam.
 +
[/en]
 +
 
 +
=== [Netwerkswitch voor Astro Coma cluster down][Network switch of Astro Coma cluster down] ===
 +
[nl]
 +
  Begin        : 2022-02-22 13:10
 +
  Eind          : 2022-02-22 15:47
 +
  Getroffen    : Gebruikers van het Coma rekencluster
 +
 
 +
De netwerkswitch voor het Coma cluster lijkt defect, de aangesloten nodes zijn afgesloten van de rest van het netwerk. We zetten z.s.m. een vervangende switch in en zullen het probleem verder (laten) analyseren.
 +
[/nl]
 +
[en]
 +
  Begin        : 2022-02-22 13:10
 +
  End          : 2022-02-22 15:47
 +
  Affected      : Users of the Coma compute cluster
 +
 
 +
The network switch of the Coma cluster seems to be broken; all attached nodes are cut off from the rest of the network. We'll install a replacement switch a.s.a.p. and then analyze the problem further, or have it analyzed.
 +
[/en]
 +
 
 +
=== [Verbroken verbinding naar nieuwe datacenter switches][Interrupted link to new datacenter switches] ===
 +
[nl]
 +
  Begin        : 2021-12-15 12:45
 +
  Eind          : 2021-12-15 13:42
 +
  Getroffen    : alle 25 gigabit aangesloten machines (shares, websites, clusternodes)
 +
 
 +
Door een menselijke fout is de verbinding tussen onze nieuwe datacenter switches en de centrale router verbroken geweest.
 +
[/nl]
 +
[en]
 +
  Begin        : 2021-12-15 12:45
 +
  End          : 2021-12-15 13:42
 +
  Affected      : all 25 gigabit connected machines (shares, websites, clusternodes)
 +
 
 +
Due to human error, the connection between the new datacenter switches and the central router was interrupted.
 +
[/en]
 +
 
 +
=== [vmhost07 poweroff][vmhost07 poweroff] ===
 +
<itemTags>medewerkers</itemTags>
 +
[nl]
 +
  Begin        : 2021-12-02 13:10
 +
  Eind          : 2021-12-02 13:20
 +
  Getroffen    : Gebruikers van een van onderstaande services
 +
 
 +
Door een menselijke fout is vmhost07 kortstondig uitgezet.
 +
labservanttest
 +
neurotech2
 +
printvm
 +
msql01
 +
indicoimapp
 +
ldap2
 +
eftw
 +
jupytervm
 +
[/nl]
 +
[en]
 +
  Begin        : 2021-12-02 13:10
 +
  End          : 2021-12-02 13:20
 +
  Affected      : Users of one of the services listed below
 +
 
 +
Vmhost07 was accidentally shut down. Cause: human error.
 +
labservanttest
 +
neurotech2
 +
printvm
 +
msql01
 +
indicoimapp
 +
ldap2
 +
eftw
 +
jupytervm
 +
[/en]
 +
 
 +
=== [Ceph opslag uitbreiding veroorzaakte performance problemen][Ceph storage expansion caused performance issues] ===
 +
<itemTags>medewerkers</itemTags>
 +
[nl]
 +
  Begin        : 2021-11-16
 +
  Eind          : 2021-11-17
 +
  Getroffen    : gebruikers van Ceph filesystemen en websites op webvm01
 +
 
 +
Bij de uitbreiding van het Ceph storage cluster zijn er performance en beschikbaarheidsproblemen ontstaan. De problemen zijn in de loop van vanmorgen opgelost.[/nl]
 +
[en]
 +
  Begin        : 2021-11-16
 +
  End          : 2021-11-17
 +
  Affected      : users of Ceph filesystems and websites on webvm01
 +
 
 +
As a result of the expansion of the Ceph storage cluster, the cluster had performance and availability issues. The problems were resolved this morning.
 +
[/en]
 +
 
 +
===[Netwerkswitch van serverruimte stroomloos][Server room network switch powerless] ===
 +
<itemTags>medewerkers,studenten,docenten</itemTags>
 +
[nl]
 +
  Begin        : 2021-10-12 11:50
 +
  Eind          : 2021-10-12 12:05
 +
  Getroffen    : Gebruikers van een van de vele servers achter deze switch
 +
 
 +
Twee van de modules van een belangrijke switch in de belangrijkste C&CZ serverruimte werden stroomloos tijdens het voorbereiden van gepland onderhoud. Hierdoor raakte ca. 75% van de servers in deze ruimte hun netwerkverbinding kwijt. Door het omzetten naar nieuwe PDU's kon de storing tot ca. 15 minuten beperkt worden.
 +
[/nl]
 +
[en]
 +
  Begin        : 2021-10-12 11:50
 +
  End          : 2021-10-12 12:05
 +
  Affected      : Users of one of the many servers behind this switch
 +
 
 +
Two modules of an important switch in the main C&CZ server room lost power during the preparation of planned maintenance. This disconnected ca. 75% of the servers in the room from the network. Moving the modules to new PDUs limited the downtime to ca. 15 minutes.
 +
[/en]
 +
 
 +
=== [Licentieserver probleem][License server problem] ===
 +
<itemTags>medewerkers,docenten</itemTags>
 +
[nl]
 +
  Begin        : 2021-10-11 04:40
 +
  Eind          : 2021-10-11 08:26
 +
  Getroffen    : Gebruikers van een van de licenties van deze server
 +
 
 +
Een fout in de beheersoftware zorgde ervoor dat bij de herstart van de licentieserver geen enkel licentieproces goed opstartte. Pas na reparatie waren de licenties weer beschikbaar.
 +
[/nl]
 +
[en]
 +
  Begin        : 2021-10-11 04:40
 +
  End          : 2021-10-11 08:26
 +
  Affected      : Users of one of the licenses of this server
 +
 
 +
An error in the management software prevented all license processes from starting correctly at the reboot of the license server. After fixing this error, all licenses were available again.
 +
[/en]
 +
 
 +
=== [Fileserver 'flock' overbelast][Fileserver 'flock' overloaded] ===
 +
<itemTags>medewerkers,docenten</itemTags>
 +
[nl]
 +
  Begin        : 2021-09-17 14:30
 +
  Eind          : 2021-09-17 15:30
 +
  Getroffen    : Gebruikers van een van de ca. 100 netwerkschijven van deze server
 +
 
 +
Vooraf geteste cursussoftware veroorzaakte bij het gebruik door 100 studenten een te grote belasting op de fileserver. Alle gebruikers van deze fileserver hadden hier last van.
 +
[/nl]
 +
[en]
 +
  Begin        : 2021-09-17 14:30
 +
  End          : 2021-09-17 15:30
 +
  Affected      : Users of one of the ca. 100 network shares of this server
 +
 
 +
Course software that had been tested beforehand overloaded the fileserver when it was used by 100 students. The performance of the fileserver was impaired for all users of network shares on this server.
 +
[/en]
 +
 
 +
=== [VPN onbereikbaar][VPN server unreachable] ===
 +
<itemTags>medewerkers,docenten</itemTags>
 +
[nl]
 +
  Begin        : 2021-04-24
 +
  Eind          : 2021-04-26 09:35
 +
  Getroffen    : VPNsec gebruikers
 +
 
 +
Door een kapotte PDU is een switch uitgegaan en is de VPN server onbereikbaar (en nog meer dingen, waar gebruikers geen last van hebben).
 +
[/nl]
 +
[en]
 +
  Begin        : 2021-04-24
 +
  End          : 2021-04-26 09:35
 +
  Affected      : VPNsec users
 +
 
 +
A broken PDU took a switch offline, which made the VPN server unreachable (along with several other things that do not affect users).
 +
[/en]
 +
 
 +
=== Central E-mail/Calendar disruption (exchange) ===
 +
<itemTags>medewerkers,docenten,students</itemTags>
 +
  Begin        : 2021-04-14    09:30
 +
  End           : 2021-04-14    13:30
 +
  Affected      : All users of Exchange (e-mail and calendar)
 +
 
 +
Due to emergency maintenance, the central Microsoft Exchange server is unavailable for 4 hours. This may also affect systems that depend on Exchange.
 +
E-mail and calendar functionality is expected to be restored when the maintenance is finished, around 13:30 today.
 +
 
 +
=== [Ceph probleem][Ceph problem] ===
 +
<itemTags>medewerkers</itemTags>
 +
[nl]
 +
  Begin        : 2021-03-24 19:00
 +
  Eind          : 2021-03-24 21:00
 +
  Getroffen    : gebruikers van Ceph filesystemen
 +
 
 +
Bij een routine upgrade proces bleek dat er een bug in de laatste versie zit waardoor de ceph manager onbereikbaar werd. Het upgrade proces is afgebroken en met hulp van de ceph-users mailinglijst is alles weer bereikbaar door een work-around.[/nl]
 +
[en]
 +
  Begin        : 2021-03-24 19:00
 +
  End          : 2021-03-24 21:00
 +
  Affected      : users with ceph based filesystems
 +
 
 +
During a routine upgrade of ceph, a bug in the latest version manifested itself and made the ceph manager unreachable. After aborting the upgrade and with help from the ceph-users mailinglist, everything became available again using a workaround.[/en]
 +
 
 +
=== [Windows 7 computers disabled in B-FAC domain][Windows 7 computers disabled in B-FAC domain] ===
 +
<itemTags>medewerkers,docenten</itemTags>
 +
[nl]
 +
  Begin        : 2021-03-24
 +
  Eind          : na upgrade naar ander OS
 +
  Getroffen    : gebruikers van Windows 7 in het B-FAC domein
 +
 
 +
I.v.m. het verscherpen van de beveiliging worden de laatste Windows 7 machines per 24-03-2021 in het Active Directory Domain B-FAC gedisabled.
 +
Verzoek is al sinds lang om de betreffende machines naar een meer up-to-date OS te upgraden.
 +
Zie evt. eerdere aankondigingen over [https://wiki.cncz.science.ru.nl/Nieuws#.5BMicrosoft_Windows_10_upgrade.5D.5BMicrosoft_Windows_10_upgrade.5D Windows 10]
 +
en [https://wiki.cncz.science.ru.nl/Nieuws_archief#.5BWindows_7_stopt_januari_2020:_Upgrade_nu.21.5D.5BWindows_7_ends_January_2020:_Upgrade_now.21.5D het einde van Windows 7].
 +
[/nl]
 +
[en]
 +
  Begin        : 2021-03-24
 +
  End          : after upgrade to other OS
 +
  Affected      : users of Windows 7 in the B-FAC domain
 +
 
 +
For security reasons, the last remaining Windows 7 machines will be disabled as members of the Active Directory Domain B-FAC, effective 24-03-2021.
 +
Please upgrade these computers to a more up-to-date OS.
 +
See also previous announcements on [https://wiki.cncz.science.ru.nl/Nieuws#.5BMicrosoft_Windows_10_upgrade.5D.5BMicrosoft_Windows_10_upgrade.5D Windows 10]
 +
and
 +
[https://wiki.cncz.science.ru.nl/Nieuws_archief#.5BWindows_7_stopt_januari_2020:_Upgrade_nu.21.5D.5BWindows_7_ends_January_2020:_Upgrade_now.21.5D the end of Windows 7].
 +
[/en]
 +
 
 +
=== [Lilo7 herstart][Lilo7 restart] ===
 
<itemTags>medewerkers,studenten</itemTags>
 
<itemTags>medewerkers,studenten</itemTags>
 
[nl]
 
[nl]
   Begin        : 2021-02-27 08:00
+
   Begin        : 2021-03-17 21:00
   Eind          : 2021-02-27 20:00
+
   Eind          : 2021-03-17 21:15
   Getroffen    : gebruikers van RU-diensten
+
  Getroffen    : gebruikers van lilo
 +
 
 +
Om het netwerk van lilo7 aan te passen, is het helaas noodzakelijk om deze loginserver te herstarten. Wie gedurende deze onderhoudstijd een stabiele verbinding wil hebben met een loginserver, kan beter lilo6 of de binnenkort uitgefaseerde lilo5 gebruiken. Zie evt. [https://wiki.cncz.science.ru.nl/index.php?title=Hardware_servers&setlang=nl#Linux_.5Bloginservers.5D.5Blogin_servers.5D de pagina over de C&CZ loginservers].
 +
[/nl]
 +
[en]
 +
  Begin        : 2021-03-17 21:00
 +
  End          : 2021-03-17 21:15
 +
  Affected      : users of lilo
 +
 
 +
To change the network of lilo7, we need to reboot this loginserver. If you want a stable connection to a loginserver during this downtime, please use lilo6 or the soon to be taken down lilo5. For more info see [https://wiki.cncz.science.ru.nl/index.php?title=Hardware_servers&setlang=en#Linux_.5Bloginservers.5D.5Blogin_servers.5D the page on C&CZ loginservers].
 +
[/en]
 +
 
 +
=== [Server met diverse services (virtuele servers, waaronder Roundcube en websites) stuk][Host of several virtual servers broken: Roundcube, websites and others] ===
 +
<itemTags>medewerkers,studenten</itemTags>
 +
[nl]
 +
  Begin        : 2021-03-05 07:45
 +
  Eind          : 2021-03-05 09:40
 +
   Getroffen    : gebruikers van de virtuele servers: Roundcube, websites met een database op deze server, ...
  
ISC netwerkbeheer [https://www.ru.nl/systeem-meldingen/ kondigde aan] dat op a.s. zaterdag 27 februari gepland groot onderhoud aan het RU-netwerk uitgevoerd zal worden, waardoor alle RU-diensten diverse keren maximaal een uur lang niet bereikbaar zullen zijn. Dit gaat om alle RU-diensten: RU e-mail, VPN, wifi, BASS, OSIRIS, Brightspace, Syllabus+, Corsa, etc.
+
Gisteravond gaf de SSD-opstartschijf van deze VM-host al de eerste signalen van problemen, vanochtend stopten daardoor de virtuele servers die op deze VM-host draaien. Door de VM's te verhuizen naar een andere VM-host is het probleem opgelost. Er wordt nagedacht hoe we dit probleem in de toekomst het beste kunnen voorkomen of de impact ervan kunnen beperken.
 
[/nl]
 
[/nl]
 
[en]
 
[en]
   Begin        : 2021-02-27 08:00
+
   Begin        : 2021-03-05 07:45
   End          : 2021-02-27 20:00
+
   End          : 2021-03-05 09:40
   Affected      : users of RU services
+
   Affected      : users of the virtual servers: Roundcube, websites with databases on this server, ...
  
The ISC [https://www.ru.nl/systeem-meldingen/ announced] that Saturday February 27 08:00-20:00 major RU network maintenance work will be carried out. This will mean that all RU services will be unavailable several times for at most an hour. This concerns all RU services: RU e-mail, VPN, wifi, BASS, OSIRIS, Brightspace, Syllabus+, Corsa, etc.
+
Yesterday the SSD bootdisk of this VM host reported the first problems. This morning this had the effect of stopping all VMs running on this host. By moving the VMs to a different VM host, the problem has been solved. We will investigate how to best prevent this problem in the future or lessen its impact.
 
[/en]
 
[/en]
  
=== [EduroamCAT niet bruikbaar met Science accounts][EduroamCAT not working with Science accounts] ===
+
=== [Lilo6 stuk][Lilo6 down] ===
 
<itemTags>medewerkers,studenten</itemTags>
 
<itemTags>medewerkers,studenten</itemTags>
 
[nl]
 
[nl]
   Begin        : 2019-02-28 00:00
+
   Begin        : 2021-02-25 17:30
   Eind          : ?
+
   Eind          : 2021-03-04 16:45
   Getroffen    : EduroamCAT-gebruikers met Science accounts
+
   Getroffen    : gebruikers van lilo
  
[https://cat.eduroam.org/ EduroamCAT] is de Eduroam configuratie-assistent (Configuration Assistant Tool) voor [https://www.eduroam.org/configuration-assistant-tool-cat/ veel soorten devices], waarmee gebruikers eenvoudig verbinding kunnen maken met Eduroam. Dit is echter (nog) niet ingericht voor gebruik van Science accounts (loginnaam@science.ru.nl). C&CZ zoekt naar een oplossing. In de tussentijd kan men verbinding maken via handmatige instelling (zie [https://www.ru.nl/draadloos www.ru.nl/draadloos)] of U/S/E-nummer gebruiken.
+
Sinds donderdagmiddag is lilo6 door hardware problemen offline. Omdat dit de default linux login server was (lilo verwees naar lilo6) is dit voor veel gebruikers van lilo opvallend. De impact is beperkt, omdat er nog twee lilo's zijn, namelijk lilo5 en lilo7.
 +
Lilo7 is vervroegd de nieuwe lilo geworden, dus kun je een melding verwachten dat ssh een waarschuwing geeft over DNS SPOOFING, lilo7 heeft<br/> ECDSA <tt>SHA256:si3g2elo5m6TShx3PjX0+vF50pZ8NK/iXz/ESB+ZeP0</tt>
 
[/nl]
 
[/nl]
 
[en]
 
[en]
   Begin        : 2019-02-28 00:00
+
   Begin        : 2021-02-25 17:30
   End          : ?
+
   End          : 2021-03-04 16:45
   Affected      : EduroamCAT users with Science accounts
+
   Affected      : users of lilo
  
[https://cat.eduroam.org/ EduroamCAT] is the Eduroam Configuration Assistant Tool for [https://www.eduroam.org/configuration-assistant-tool-cat/ many different devices]. However, this hasn't (yet) been set up for the use of Science accounts (username@science.ru.nl). C&CZ is looking for a solution. In the meantime Eduroam connections have to be configured manually (please consult [https://www.ru.nl/wireless www.ru.nl/wireless)] or using the U/S/E number.
+
Since Thursday afternoon, lilo6 has been down due to hardware issues. Because lilo6 was the default Linux login server (lilo referred to lilo6), this affected many users of lilo. The impact is limited, because there are two alternative login servers, lilo5 and lilo7. As of March 1st, lilo refers to lilo7, so ssh will warn about DNS SPOOFING because lilo7 has a different host key:<br/> <tt>ECDSA SHA256:si3g2elo5m6TShx3PjX0+vF50pZ8NK/iXz/ESB+ZeP0</tt>
 
[/en]
 
[/en]
  
== [Recent Verholpen Storingen en Onderhoud][Recently Resolved Service Interruptions and Maintainance] ==
+
=== [Groot RU netwerkonderhoud zaterdag 27 februari 08:00-20:00][Major RU network maintenance Saturday Feb. 27 08:00-20:00] ===
 +
<itemTags>medewerkers,studenten</itemTags>
 +
[nl]
 +
  Begin        : 2021-02-27 08:00
 +
  Eind          : 2021-02-27 20:00
 +
  Getroffen    : gebruikers van het RU-netwerk of -diensten
  
[nl]Voor het snel ge&iuml;nformeerd worden over storingen kan men zich abonneren op de [/nl]
+
ISC netwerkbeheer [https://www.ru.nl/systeem-meldingen/ kondigde aan] dat a.s. zaterdag 27 februari gepland groot onderhoud aan het RU-netwerk uitgevoerd zal worden, waardoor alle RU-diensten diverse keren maximaal een uur lang niet bereikbaar zullen zijn. Dit gaat om alle RU-diensten, inclusief die van FNWI/C&CZ: e-mail, VPN, wifi, BASS, OSIRIS, Brightspace, Syllabus+, Corsa, etc.
[en]To be quickly informed about service interruptions one can subscribe to the [/en]
+
[/nl]
[http://mailman.science.ru.nl/mailman/listinfo/CPK CPK mailinglist].
+
[en]
 +
  Begin        : 2021-02-27 08:00
 +
  End          : 2021-02-27 20:00
 +
  Affected      : users of the RU network or services
  
<startFeed />
+
The ISC [https://www.ru.nl/systeem-meldingen/ announced] that Saturday February 27 08:00-20:00 major RU network maintenance work will be carried out. This will mean that all RU services will be unavailable several times for at most an hour. This concerns all RU services including those of FNWI/C&CZ: e-mail, VPN, wifi, BASS, OSIRIS, Brightspace, Syllabus+, Corsa, etc.
 +
[/en]
  
 
=== [DNS-problemen vanaf buiten met ru.nl][DNS problems from outside with ru.nl] ===
 
=== [DNS-problemen vanaf buiten met ru.nl][DNS problems from outside with ru.nl] ===

Latest revision as of 12:39, 13 May 2022


Standard RU IT maintenance windows

The ISC announces the IT maintenance windows for the current academic year well in advance.

Report a problem

Use this form to report less urgent problems. For urgent problems, call 20000 (helpdesk).

Current Service Interruptions and Planned Maintenance

Saturday May 14 adjacent buildings (Mercator, Proeftuin, Logistiek) 5 minutes without network

 Begin         : 2022-05-14 09:00
 End           : 2022-05-14 10:00
 Affected      : all network outlets in Mercator, Proeftuin and Logistiek Centrum will be down for max. 5 minutes

RU/ILS network management will switch to new hardware. This will lead to a network interruption of at most 5 minutes.


Recently Resolved Service Interruptions and Maintenance

To be quickly informed about service interruptions one can subscribe to the CPK mailinglist.

Coma, coma01 and coma46 network problem

 Begin         : 2022-05-03 13:47
 End           : 2022-05-03 14:55
 Affected      : Users of coma, coma01 and coma46

This afternoon three coma nodes lost their network connection because of an incorrect network configuration. They turned out to have been intermittently hard to reach for some time. It took us a while to find out what caused this network problem, but once found, it was quickly fixed.

Astro.ru.nl DNS(SEC) service down

 Begin         : 2022-04-28 12:02
 End           : 2022-05-13 13:00
 Affected      : Users wanting to communicate with or from astro.ru.nl

During the regular change of DNSSEC keys that secure DNS traffic, an incorrect key was introduced in the external DNS of ru.nl for astro.ru.nl. This made astro.ru.nl disappear from the internet as seen from outside. The error was partly corrected by an automatic procedure on 2022-05-02 at ca. 14:00, but that procedure used an encryption that was not accepted. After we eventually noticed the error, it took ILS until 2022-05-12 to correct it by hand. Because the DNS answer that astro.ru.nl doesn't exist may be cached for 24 hours, the problem was not completely over until 2022-05-13 13:00. The main problem for users was probably that mail from an @astro.ru.nl address was rejected.
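The 24-hour tail of this incident is plain date arithmetic. The sketch below is a minimal illustration, not a description of how ILS or C&CZ tracked the incident: the function name is hypothetical, and the time of the last wrong answer (13:00 on 2022-05-12) is an assumption chosen to match the stated end of the incident; only the 24-hour caching period comes from the text.

```python
from datetime import datetime, timedelta


def last_stale_answer_expiry(last_negative_answer: datetime,
                             negative_cache_period: timedelta) -> datetime:
    """Latest moment at which a resolver may still serve a cached
    'this name does not exist' answer."""
    return last_negative_answer + negative_cache_period


# Assumed time at which the last wrong answer was handed out (hypothetical).
last_wrong_answer = datetime(2022, 5, 12, 13, 0)

# The 24-hour period is the caching time mentioned in the incident report.
all_clear = last_stale_answer_expiry(last_wrong_answer, timedelta(hours=24))
print(all_clear)  # 2022-05-13 13:00:00
```

Until that moment, any resolver that cached the negative answer just before the fix could still report that astro.ru.nl does not exist.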

SUSE Linux 15.3 Eduroam doesn't work with U- or s-number, but does with Science account

 Begin         : 2022-02-14
 End           : 2022-03-09 14:47
 Affected      : Eduroam users with SUSE Linux 15.3

On February 14, 2022, ILS switched off obsolete versions of TLS (1.0 and 1.1) for the Eduroam authentication on the ILS LDAP servers. Since then, SUSE Linux 15.3 clients can't authenticate with a U- or s-number: they only support TLS 1.2, while the ILS servers offer TLS 1.3 first, after which an error occurs. By authenticating with a Science account, for which the Science servers offer TLS 1.2, these users can still connect to Eduroam.
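To reproduce the behaviour of a client that only speaks TLS 1.2, one could pin a TLS context to that single version. The sketch below uses Python's standard `ssl` module and is only an illustration: Eduroam authentication runs over EAP rather than a plain TLS socket, and the function name is hypothetical, so this is not the stack that SUSE Linux 15.3 or the ILS servers actually use.

```python
import ssl


def tls12_only_client_context() -> ssl.SSLContext:
    """Client context that offers TLS 1.2 and nothing else."""
    ctx = ssl.create_default_context()
    # Pin both ends of the allowed range to TLS 1.2, so TLS 1.3 is never offered.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = tls12_only_client_context()
print(ctx.minimum_version.name, ctx.maximum_version.name)
```

Connecting with such a context to a test server shows whether that server still completes a handshake with a TLS 1.2-only client.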

Network switch of Astro Coma cluster down

 Begin         : 2022-02-22 13:10
 End           : 2022-02-22 15:47
 Affected      : Users of the Coma compute cluster

The network switch of the Coma cluster seems to be broken; all attached nodes are cut off from the rest of the network. We'll install a replacement switch a.s.a.p. and then analyze the problem further, or have it analyzed.

Interrupted link to new datacenter switches

 Begin         : 2021-12-15 12:45
 End           : 2021-12-15 13:42
 Affected      : all 25 gigabit connected machines (shares, websites, clusternodes)

Due to human error, the connection between the new datacenter switches and the central router was interrupted.

vmhost07 poweroff

 Begin         : 2021-12-02 13:10
 End           : 2021-12-02 13:20
 Affected      : Users of one of the services listed below

Vmhost07 was accidentally shut down. Cause: human error.

labservanttest
neurotech2
printvm
msql01
indicoimapp
ldap2
eftw
jupytervm

Ceph storage expansion caused performance issues

 Begin         : 2021-11-16
 End           : 2021-11-17
 Affected      : users of Ceph filesystems and websites on webvm01

As a result of the expansion of the Ceph storage cluster, the cluster had performance and availability issues. The problems were resolved this morning.

Server room network switch powerless

 Begin         : 2021-10-12 11:50
 End           : 2021-10-12 12:05
 Affected      : Users of one of the many servers behind this switch

Two modules of an important switch in the main C&CZ server room lost power during the preparation of planned maintenance. This disconnected ca. 75% of the servers in the room from the network. Moving the modules to new PDUs limited the downtime to ca. 15 minutes.

License server problem

 Begin         : 2021-10-11 04:40
 End           : 2021-10-11 08:26
 Affected      : Users of one of the licenses of this server

An error in the management software prevented all license processes from starting correctly at the reboot of the license server. After fixing this error, all licenses were available again.

Fileserver 'flock' overloaded

 Begin         : 2021-09-17 14:30
 End           : 2021-09-17 15:30
 Affected      : Users of one of the ca. 100 network shares of this server

Course software that had been tested beforehand overloaded the fileserver when it was used by 100 students. The performance of the fileserver was impaired for all users of network shares on this server.

VPN server unreachable

 Begin         : 2021-04-24
 End           : 2021-04-26 09:35
 Affected      : VPNsec users

A broken PDU took a switch offline, which made the VPN server unreachable (along with several other things that do not affect users).

Central E-mail/Calendar disruption (exchange)

 Begin         : 2021-04-14    09:30
 End           : 2021-04-14    13:30
 Affected      : All users of Exchange (e-mail and calendar)

Due to emergency maintenance, the central Microsoft Exchange server is unavailable for 4 hours. This may also affect systems that depend on Exchange. E-mail and calendar functionality is expected to be restored when the maintenance is finished, around 13:30 today.

Ceph problem

 Begin         : 2021-03-24 19:00
 End           : 2021-03-24 21:00
 Affected      : users with ceph based filesystems

During a routine upgrade of ceph, a bug in the latest version manifested itself and made the ceph manager unreachable. After aborting the upgrade and with help from the ceph-users mailinglist, everything became available again using a workaround.

Windows 7 computers disabled in B-FAC domain

 Begin         : 2021-03-24
 End           : after upgrade to other OS
 Affected      : users of Windows 7 in the B-FAC domain

For security reasons, the last remaining Windows 7 machines will be disabled as members of the Active Directory Domain B-FAC, effective 24-03-2021. Please upgrade these computers to a more up-to-date OS. See also previous announcements on Windows 10 and the end of Windows 7.

Lilo7 restart

 Begin         : 2021-03-17 21:00
 End           : 2021-03-17 21:15
 Affected      : users of lilo

To change the network of lilo7, we need to reboot this loginserver. If you want a stable connection to a loginserver during this downtime, please use lilo6 or the soon to be taken down lilo5. For more info see the page on C&CZ loginservers.

Host of several virtual servers broken: Roundcube, websites and others

 Begin         : 2021-03-05 07:45
 End           : 2021-03-05 09:40
 Affected      : users of the virtual servers: Roundcube, websites with databases on this server, ...

Yesterday the SSD bootdisk of this VM host reported the first problems. This morning this had the effect of stopping all VMs running on this host. By moving the VMs to a different VM host, the problem has been solved. We will investigate how to best prevent this problem in the future or lessen its impact.

Lilo6 down

 Begin         : 2021-02-25 17:30
 End           : 2021-03-04 16:45
 Affected      : users of lilo

As of Thursday afternoon, lilo6 is down due to hardware issues. Because lilo6 was the default Linux login server (lilo referred to lilo6), this affected many users of lilo. The impact is limited, because we have alternative login servers, namely lilo5 and lilo7. As of March 1st, lilo refers to lilo7. ssh will warn about DNS SPOOFING, because lilo7 has a different host key than lilo6. The host key fingerprint for lilo7 is:
ECDSA SHA256:si3g2elo5m6TShx3PjX0+vF50pZ8NK/iXz/ESB+ZeP0
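
Before accepting the new host key, the fingerprint that ssh shows can be compared with the one published above. The following is a minimal sketch of how such a SHA256 fingerprint is derived from a public key line, assuming the usual OpenSSH form (base64 of the SHA-256 digest of the key blob, without padding). The function names are hypothetical, and the key line has to come from the server's public key file, which is not part of this announcement.

```python
import base64
import hashlib


def sha256_fingerprint(public_key_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint of a public key line.

    The line is expected in the form "<key-type> <base64-blob> [comment]".
    """
    blob = base64.b64decode(public_key_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest base64-encoded without '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def matches_published(public_key_line: str, published: str) -> bool:
    """True if the key line hashes to the published fingerprint."""
    return sha256_fingerprint(public_key_line) == published
```

If the computed fingerprint differs from the published one, do not accept the key and contact C&CZ.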

Major RU network maintenance Saturday Feb. 27 08:00-20:00

 Begin         : 2021-02-27 08:00
 End           : 2021-02-27 20:00
 Affected      : users of the RU network or services

The ISC announced that major RU network maintenance work will be carried out on Saturday February 27, 08:00-20:00. As a result, all RU services will be unavailable several times, for at most an hour each time. This concerns all RU services including those of FNWI/C&CZ: e-mail, VPN, wifi, BASS, OSIRIS, Brightspace, Syllabus+, Corsa, etc.

DNS problems from outside with ru.nl

 Begin         : 2021-02-21 07:10
 End           : 2021-02-23 14:30 (?)
 Affected      : everyone trying to access something in ru.nl from off-campus.

The central DNS servers that answer external requests for ru.nl had problems because they received too many requests. As a result, science.ru.nl and other names were not found: DNS names within ru.nl then do not resolve to an IP address. We increased some TTLs (Time-To-Live values) to try to lessen the problem. The small TTLs were meant to make it possible to move a service to a new server quickly in case of problems, but now they just make the problem bigger. After starting VPN you won't notice this problem, because the internal DNS servers that you then use are not affected. Changes to the RU DNS servers hopefully lessened or removed the problems as of 2021-02-23 14:30.
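
Why small TTLs increase the load can be illustrated with a rough estimate. This is only a sketch under simplifying assumptions: each caching resolver asks again as soon as a record expires, and the number of resolvers and the TTL values below are hypothetical, not measurements from the RU DNS servers.

```python
def authoritative_queries_per_hour(ttl_seconds: int, resolvers: int) -> float:
    """Estimate hourly queries for one DNS name at the authoritative servers.

    Assumes every caching resolver re-queries the name right after the
    cached record expires, so each resolver sends 3600 / TTL queries per hour.
    """
    return resolvers * 3600 / ttl_seconds


# Hypothetical example: 1000 busy resolvers, TTL of 5 minutes versus 1 hour.
short_ttl_load = authoritative_queries_per_hour(300, 1000)
long_ttl_load = authoritative_queries_per_hour(3600, 1000)
```

Under these assumptions, raising the TTL from 5 minutes to 1 hour reduces the query load for that name by a factor of 12, at the cost of slower propagation when a service moves to a new server.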

DNS broken for subdomains of ru.nl

 Begin         : 2021-02-11 ~11:15
 End           : 2021-02-11 ~13:00
 Affected      : everyone trying to resolve *.science.ru.nl *.astro.ru.nl etc.

The DNS servers for ru.nl did not serve information about subdomains such as science.ru.nl. As a result, no DNS name at FNWI resolved to an IP address. A workaround is to use 131.174.224.4 and 8.8.8.8 as DNS servers. If you try to connect to a service for the first time after ca. 11:15, you'll get an error like "No such domain" or "Cannot resolve". Restarting the RU DNS servers at 12:45 may have fixed the problem. Without a real explanation, the problem went away after a few hours.

Gitlab upgrade

  Begin         : 2021-02-07  04:00
  End           : 2021-02-07  12:50
  Affected      : GitLab and Mattermost users

Services will not be available for a while because of a GitLab and Mattermost upgrade.

Science VPNsec disruption

 Begin         : 2021-02-03 13:00
 End           : 2021-02-03 14:02 (for Apple macOS/iOS last fix on February 10)
 Affected      : Users of Science VPN

The expiration date of the certificate of our VPNsec service was apparently not yet checked regularly, which made it possible for the certificate to expire unnoticed. We put a new certificate in place within an hour. Of course we will check this certificate regularly from now on. For Apple/Mac we needed to construct a new mobileconfig, which took some time because in the meantime RU had moved to a different Certificate Authority. For Apple macOS this was ready at the end of Feb. 4, with a new installation procedure. For Apple iOS (iPhone/iPad) the old profile has to be deleted and a new mobileconfig has to be installed.
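
A regular expiry check could look roughly like the sketch below. It is not the check C&CZ actually runs: the 30-day warning threshold and the function names are assumptions, and the date string is expected in the format that Python's `ssl.SSLSocket.getpeercert()` reports under `notAfter`.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical threshold, not taken from the announcement


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter date.

    `not_after` uses the certificate time format, e.g.
    "Feb  3 12:00:00 2021 GMT"; `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True if the certificate expires within `warn_days` (or already has)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Run daily, a check like `needs_renewal(not_after, datetime.now(timezone.utc))` would raise a warning weeks before the certificate expires instead of after users notice.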

DIY temporarily not usable

  Begin         : 2021-01-25 07:15
  End           : 2021-01-25 07:45
  Affected      : Users wanting to manage their science account

Due to a management operation (planned around this time), the DIY website was unusable. Since the time was very early, it's expected nobody was inconvenienced by this temporary unavailability.

Science smtp service temporarily not usable

  Begin         : 2021-01-22 10:00
  End           : 2021-01-22 10:30
  Affected      : Science mail users wanting to send mail

A configuration change unintentionally made the smtp service unusable. When we noticed this, it was repaired immediately.

Very long mail aliases temporarily not usable

  Begin         : 2021-01-21 15:52
  End           : 2021-01-22 09:55
  Affected      : Science mail aliases of more than 1024 characters

A configuration change had the unintended effect of making all very long mail aliases disappear. When this was reported the next morning, it was repaired immediately.

Switch crash; gitlab+mattermost, licenses and DHZ

  Begin         : 2021-01-07 ~14:30
  End           : 2021-01-07 ~15:00
  Affected      : GitLab and Mattermost users, Licenses, DHZ (diy)

Due to a simple management command, the switch (as-ak008-04) crashed and had to be reset manually. The switch sits between the network and the servers for gitlab+mattermost, licenses and the database for DHZ (diy).

Gitlab upgrade

  Begin         : 2020-11-27  04:00
  End           : 2020-11-27 ~08:00
  Affected      : GitLab and Mattermost users (including PEP)

Services will not be available for a while because of a GitLab and Mattermost upgrade.

Eduroam problem on campus

 Begin         : 2020-07-10 evening
 End           : 2020-07-10 evening
 Affected      : Eduroam users on campus

The ISC announced: For security reasons, the certificate of the wifi server will be replaced in the evening of Friday, July 10. This has consequences for connecting your mobile device to Eduroam when you’re on campus:

• If you get the message that you have to accept the new certificate to use eduroam, choose 'yes'. You can then use eduroam again;

• If you don't get this message and can't connect to Eduroam, choose the wireless network 'eduroam-config'. Accept the terms and conditions. Follow the instructions to reinstall Eduroam.

More information can also be found at www.ru.nl/ict-uk/eduroam (you will need an internet connection for this).

If you have any questions, please contact the ICT Helpdesk (024 - 36 22222).

RU mail erroneously in Spam folder

 Begin         : 2020-03-25 17:52
 End           : 2020-07-07 13:13
 Affected      : FNWI employees with Science mail

On March 25, a rule "2020 Radboud Universiteit" was added to the Science spamfilter. Recently, this rule matched RU-central mailings. As a result, RU-wide mailings from e.g. the RU Board and Radboud Recharge have erroneously been delivered to the Spam folder of Science employees. The Science spamfilter tries to fight spam and phishing. This is partly manual work, in which errors can't be excluded. C&CZ apologizes for the inconvenience this has caused.

Webserver 'havik' offline

 Begin         : 2020-06-18 15:45
 End           : 2020-06-18 16:25
 Affected      : Users of various websites.

Several parts have been replaced. We assume the problem, that occurred twice, is now resolved. For dual-boot pcs, the boot menu was served by an alternative method during the repair.

Science radius disruption

 Begin         : 2020-06-17 11:11
 End           : 2020-06-17 11:56
 Affected      : Users of Science VPN and Eduroam based on science account

The certificate of the LDAP servers was replaced this morning, which also changed the certificate chain. The radius server uses LDAP as its authentication backend, so the certificate chain in the radius configuration had to be changed too. This was initially overlooked. Radius is the authentication mechanism used by all VPN servers and Eduroam.

Webserver 'havik' offline

 Begin         : 2020-06-17 03:38
 End           : 2020-06-17 08:52
 Affected      : Users of dual boot PC's (the dual-boot menu is served by a website) and various websites.

The server went down in the same way as the previous time (3rd of June 2020). The cause is most likely a system board problem. This part will be replaced tomorrow by a support engineer.

Webserver 'havik' offline

 Begin         : 2020-06-03 06:30
 End           : 2020-06-03 10:12
 Affected      : Users of dual boot PC's (the dual-boot menu is served by a website) and various websites.

The server couldn't be reached after the scheduled weekly reboot, not even on its management interface. Because C&CZ employees also work from home and the interruption wasn't given enough urgency quickly enough, the interruption lasted too long. Apologies for that. The support partner has been contacted and the server has been updated, but the origin of the problem is still unclear. We will also look at making these services more redundant or more easily movable to a different server.

CN00 Slurm master ubuntu 16.04 down

 Begin         : 2020-05-18 09:50
 End           : 2020-05-19 12:15
 Affected      : slurm on ubuntu 16.04 (cn07)

Due to a failed BIOS upgrade, the hardware of the database server appears to be bricked. We transferred the disks to another machine (cn00) and all database services are now up again, at the cost of not having cn00 running. When the hardware is working well again, we will swap it all back and restore the original situation.

Sperwer database server outage

 Begin         : 2020-05-18 06:30
 End           : 2020-05-18 10:00
 Affected      : various websites and slurm

Due to a failed BIOS upgrade, the hardware of the database server appears to be bricked. We transferred the disks to another machine (cn00) and all database services are now up again, at the cost of not having cn00 running. When the hardware is working well again, we will swap it all back and restore the original situation.

Update May 19th, 12:15 : hardware fixed, situation back to the original state.

Science VPN disruption

 Begin         : 2020-05-06 05:00
 End           : 2020-05-06 08:00
 Affected      : Users of Science VPN

Unexplained crashes starting around 5am on the host system. System offline at around 6am. After a hard reset around 08:00, all seems to be all right again.

Science datacenter network problem

 Begin         : 2020-04-30 12:08
 End           : 2020-04-30 21:44
 Affected      : Users of Ceph storage and a few new compute clusternodes

A broken transceiver caused flapping of a 100 Gb/s connection between two C&CZ datacenters. Hours later the flapping increased, which took down the complete redundant new connection between the two server rooms. When this was noticed, a workaround was found quickly: shutting down the interface with the broken transceiver restored the connection. The broken transceiver has been replaced thanks to swift action from our supplier. We now have these spare parts ready to use. We asked the supplier whether a configuration change can make the connection more redundant, so that a single broken transceiver will not take down the connection.

Jitsi.science.ru.nl not working properly

 Begin         : 2020-04-19 15:00
 End           : 2020-04-20 11:40
 Affected      : Users of jitsi.science.ru.nl

Because performance tuning went wrong, the jitsi.science.ru.nl conference rooms cannot be joined by more than one person at the moment. Solved by reinstalling the server.

Mailserver certificate problem

 Begin         : 2020-04-13 14:00
 End           : 2020-04-13 14:35
 Affected      : Users Science mail

The new certificate of the Science mailserver had not yet been installed in the right location. The expiration of the old certificate caused a problem for Science mail users, which was resolved by replacing the old certificate.

Problems with a virtual host

 Begin         : 2020-02-18 05:30
 End           : 2020-02-18 09:08
 Affected      : Users of mx3, smtp3, crestron, gitlab (PEP), goudsmit, msql01 and labservanttestvm.

The virtual machine host 'oscar' could not boot. Again, a broken LVM snapshot caused the problem.

Archived service interruptions can be found in the service interruptions archive.