Velankani’s Blog

Data Center Process Automation with Auto Remediation

In a typical data center Global Network Operations Center (GNOC), NOC operators monitor the network and applications being managed by a service provider. The monitoring data is provided by network and application software that sends fault and outage notifications to the NOC console.

Managing information from multiple sources manually is challenging and makes it very difficult to meet strict Service Level Agreements (SLAs). Figure 1 illustrates a typical manual process.

Figure 1: Typical trouble ticket lifecycle. NOCVue’s QuickChassis can be used to identify and verify the issue manually.

The tasks performed by the GNOC, along with some of the operations and troubleshooting done by technical support personnel, are well suited to automation. In this blog post, we discuss the automation that can be achieved in critical areas of NOC operations and trouble-ticket management to achieve consistent SLA adherence through troubleshooting and auto remediation.

There are five key areas to automate:

  • Trouble ticket creation
  • Escalation process
  • Non-intrusive troubleshooting
  • Ticket update
  • Remediation

The NOC process automation flow, shown in Figure 2, is made up of components that work together to automate data monitoring.

Figure 2: Architecture of data center process automation with auto remediation

Automating trouble ticket creation.  Creating trouble tickets can be automated by plugging a common adaptor component into the NOC process flow. This component converts network fault and application failure messages into a common format. The format can be chosen from any of those widely available, but XML is used most often. The converted messages are fed into the trouble ticketing system, which automatically generates a trouble ticket and passes it to the escalation engine and the customer-specific flow handler.
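As a rough illustration of the common adaptor idea, the sketch below maps two differently shaped notifications onto one set of fields and serializes the result as XML. The field names (`source`, `category`, `severity`, `message`), the shape of the input dictionaries and the helper names are all assumptions made for this example; they are not taken from any particular monitoring or ticketing product.

```python
import xml.etree.ElementTree as ET

# Hypothetical field set for the common fault format.
COMMON_FIELDS = ("source", "category", "severity", "message")


def normalize_network_trap(trap: dict) -> dict:
    """Map an assumed network-trap dictionary onto the common fields."""
    return {
        "source": trap["agent_addr"],
        "category": "network",
        "severity": trap.get("severity", "unknown"),
        "message": trap["description"],
    }


def normalize_app_alert(alert: dict) -> dict:
    """Map an assumed application-alert dictionary onto the common fields."""
    return {
        "source": alert["host"],
        "category": "application",
        "severity": alert.get("level", "unknown"),
        "message": alert["text"],
    }


def to_common_xml(fault: dict) -> str:
    """Serialize a normalized fault as a <fault> XML element."""
    root = ET.Element("fault")
    for name in COMMON_FIELDS:
        child = ET.SubElement(root, name)
        child.text = str(fault[name])
    return ET.tostring(root, encoding="unicode")
```

With this shape, the ticketing system only ever has to parse one record layout, regardless of whether the notification came from a network device or an application.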

Automatic escalation process.  The function of the escalation engine is to modify the trouble ticket according to the escalation rules defined in the database. A pool of scheduled tasks is run against the escalation rules to identify trouble tickets and move them forward. The category and level of each network fault are matched to the appropriate support personnel without manual intervention. The escalation engine also initiates schedules to perform periodic checks, identify problems and send trouble tickets to the support personnel by both SMS and email.
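One pass of such a scheduled escalation task could look like the sketch below. It assumes a rule is keyed by fault category and escalation level and carries a time limit and a support group; the class names, the `max_age` field and the group names are hypothetical, and the SMS and email notification is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class EscalationRule:
    category: str        # e.g. "network" or "application"
    level: int           # escalation level the rule applies to
    max_age: timedelta   # how long a ticket may stay at this level
    assignee: str        # hypothetical support group for this level


@dataclass
class Ticket:
    ticket_id: str
    category: str
    level: int
    level_since: datetime
    assignee: str = ""


def find_rule(rules: List[EscalationRule], category: str, level: int) -> Optional[EscalationRule]:
    """Return the rule for a category and level, or None if none is defined."""
    for rule in rules:
        if rule.category == category and rule.level == level:
            return rule
    return None


def escalate_due_tickets(tickets: List[Ticket], rules: List[EscalationRule], now: datetime) -> List[Ticket]:
    """Move every ticket that has overstayed its level to the next level."""
    escalated = []
    for ticket in tickets:
        rule = find_rule(rules, ticket.category, ticket.level)
        if rule is None or now - ticket.level_since < rule.max_age:
            continue  # no rule for this ticket, or still within its time limit
        next_rule = find_rule(rules, ticket.category, ticket.level + 1)
        if next_rule is None:
            continue  # already at the highest defined level
        ticket.level = next_rule.level
        ticket.level_since = now
        ticket.assignee = next_rule.assignee
        escalated.append(ticket)
    return escalated
```

A scheduler would call `escalate_due_tickets` periodically and notify the new assignee of each returned ticket.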

Non-intrusive troubleshooting.  In the manual process, when technical support personnel receive a trouble ticket, the next step is to review the ticket and troubleshoot the problem by logging on to the network device or application and validating its status and configuration.

To automate this, the troubleshooting steps are configured into a data dictionary based on the network or application fault category. The categorization for network faults can be both vendor and device based (for example, Cisco Router ASR, Juniper Switch, etc.). This flexibility allows support personnel to add troubleshooting steps in detail. Application troubleshooting steps can be identified by type of application (for example, separate steps for Oracle DB server 11.x, Tomcat Server 7.3, etc.) and then added to the data dictionary.

Non-intrusive troubleshooting includes only those commands that gather information pertaining to the fault, allowing support personnel to more quickly identify the problem.

The data dictionary can be maintained as files or in a database. The implementation described here maintains the intelligence in database tables.
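To make the data dictionary more concrete, here is a minimal sketch that keeps troubleshooting steps in a database table keyed by vendor, device type and fault category. The table layout, the `link_down` category and the rule that only commands starting with `show ` count as non-intrusive are assumptions for this example, not a description of the actual schema.

```python
import sqlite3

# Assumption: read-only commands can be recognized by prefix.
READ_ONLY_PREFIXES = ("show ",)


def create_dictionary(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) troubleshooting-steps table."""
    conn.execute(
        """CREATE TABLE troubleshooting_steps (
               vendor TEXT NOT NULL,
               device_type TEXT NOT NULL,
               fault_category TEXT NOT NULL,
               step_order INTEGER NOT NULL,
               command TEXT NOT NULL
           )"""
    )


def add_step(conn, vendor, device_type, fault_category, step_order, command):
    """Store one step, rejecting anything that is not an information-gathering command."""
    if not command.startswith(READ_ONLY_PREFIXES):
        raise ValueError("not a non-intrusive command: " + command)
    conn.execute(
        "INSERT INTO troubleshooting_steps VALUES (?, ?, ?, ?, ?)",
        (vendor, device_type, fault_category, step_order, command),
    )


def load_steps(conn, vendor, device_type, fault_category) -> list:
    """Return the commands for one fault category in their configured order."""
    rows = conn.execute(
        "SELECT command FROM troubleshooting_steps "
        "WHERE vendor = ? AND device_type = ? AND fault_category = ? "
        "ORDER BY step_order",
        (vendor, device_type, fault_category),
    ).fetchall()
    return [row[0] for row in rows]
```

For example, after `add_step(conn, "Cisco", "Router ASR", "link_down", 1, "show interfaces")`, a call to `load_steps(conn, "Cisco", "Router ASR", "link_down")` returns that command, while an attempt to store a command such as `reload` is rejected.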

Automating trouble ticket update.  The troubleshooting module connects to the router, or to the server managing the application, via SSH or an application-monitoring console to perform the necessary actions. The results are added to the trouble ticket via API calls to the ITIL sub-system. The different APIs allow the trouble ticket to be updated at various levels as actions are performed against it. This provides the support personnel with a complete trouble ticket history and enables the support team to take further action to identify and rectify the reported issue.
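The update loop can be sketched as follows. Because the SSH session and the ITIL API are specific to each deployment, both are passed in as plain callables here: `run_command` and `update_ticket` are hypothetical stand-ins, and the work-log entry fields are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Callable, List


def troubleshoot_and_update(
    ticket_id: str,
    commands: List[str],
    run_command: Callable[[str], str],
    update_ticket: Callable[[str, dict], None],
) -> int:
    """Run each command and attach its result to the ticket as a work-log entry.

    run_command would wrap the SSH or monitoring-console session, and
    update_ticket would wrap the ticketing API call; both are injected so the
    sketch stays independent of any particular library.
    Returns the number of entries written.
    """
    written = 0
    for command in commands:
        try:
            output = run_command(command)
            status = "ok"
        except Exception as exc:  # record the failure instead of losing the history
            output = str(exc)
            status = "error"
        update_ticket(
            ticket_id,
            {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "command": command,
                "status": status,
                "output": output,
            },
        )
        written += 1
    return written
```

Recording failed commands as entries, rather than stopping at the first error, keeps the ticket history complete for whoever picks up the ticket next.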

Auto remediation.  The automation system remediates a trouble ticket automatically by referencing the data dictionary. The steps to resolve a problem are defined based on network and application categories. These steps are maintained in the database. The remediation system goes through the knowledge repository and executes the steps according to the rules defined in the database. Auto remediation is recommended only for problems which do not affect customer service.
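A simple way to enforce the recommendation that only non-service-affecting problems are remediated automatically is to flag each remediation plan, as in the sketch below. The `service_affecting` flag, the result strings and the `execute_step` callable are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RemediationPlan:
    category: str            # fault category the plan resolves
    steps: List[str]         # ordered remediation steps
    service_affecting: bool  # True if running the steps could impact customers


def remediate(
    fault_category: str,
    plans: Dict[str, RemediationPlan],
    execute_step: Callable[[str], bool],
) -> str:
    """Run the plan for a fault category unless it could affect customer service.

    execute_step is a stand-in for whatever actually applies a step and
    reports success or failure.
    """
    plan = plans.get(fault_category)
    if plan is None:
        return "no_plan"
    if plan.service_affecting:
        return "needs_manual_approval"  # leave it to support personnel
    for step in plan.steps:
        if not execute_step(step):
            return "failed"  # stop at the first step that does not succeed
    return "remediated"
```

The returned status can then be written back to the trouble ticket so that the ticket history shows whether the fault was fixed automatically or handed to a person.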

Conclusion.  Automating the flow from fault detection through ticket creation, escalation and remediation leads to consistent adherence to the SLAs specified by customers. Automation also benefits the GNOC provider by making it possible for a limited GNOC staff to support a large volume of customers. Automated ticket creation and escalation prevent the errors and SLA slippages caused by manual intervention.

Velankani Communications Technologies, Inc. has provided solutions for telecommunications equipment manufacturers and service providers for over 25 years. Velankani delivers carrier-grade solutions that are deployed in large networks and then upgraded through multiple releases. We understand real network behaviors and possess the subject matter expertise needed to make the appropriate technology, design and tool choices.

For more information, contact Rekha Poosala.

Written by
Venugopal Gangadharan
Network Technical Manager, SME – ITIL,  
Network Management and Solution Architect 
Velankani Communications Technologies Inc.

Copyright © 2013 | Privacy Policy | Sitemap | Velankani Communications Technologies Inc. All rights reserved.