"MacGyver-ising" Application Management
Imagine the beginning of a semester at a university with 20,000 students. It is a time when a couple of areas are under huge stress: the Registrar’s office and the IT department. Students are adding and dropping courses (thousands of transactions per second), paying their bills, meeting with their advisors to confirm that the courses they are taking will help them graduate in four years, updating or checking their financial aid status, and much more. All of this happens on one of the mission-critical systems in a university IT ecosystem. During this period, we spend a lot of time making sure the Student and Finance system of our ERP can handle the fluctuating volume of transactions. We add more servers to the pool, add more memory to systems, and make a number of similar changes based on the health of the system at that point in time. Virtualizing these systems has helped us stay on top of this reactive work: when something fails or trends toward the negative, we jump in and quickly make adjustments so that the app performs better.
Application management (AM) has never been at the forefront when large, complex applications are rolled out. The only piece of AM that we all do well is the maintenance, or day-to-day operation, of an application. However, AM also includes managing the performance, versioning and upgrading of the application; in short, handling its whole lifecycle. AM becomes quite important as applications grow more complex, as the volume of data collected grows exponentially (especially with a number of sensors collecting information), and as users expect to access that data at any time to make an important decision. As users’ expectations grow, especially for business-critical and mission-critical applications, it is important to feel the pulse of these applications and, more importantly, to have an early warning system that tells us when something is going to fail.
Before getting into the specifics of AM solutions, I would like to lay out my vision for an app management system. In a human body, the hands and legs will not work if the brain does not send them the right signals. Likewise, an application cannot work independently of the hardware it is running on. Both work synergistically so that the application does what it needs to do. Around 2004-2005, I implemented an application management solution that was way ahead of its time. It came from a small company out of Boston that used artificial intelligence (AI) in its product. They installed “probes” at each tier of the application (database, core app, etc.) to collect log data, along with logs from the hardware. When an anomaly occurred, the AI would kick in to figure out the root cause of the issue and raise an alert about it, suppressing all the alerts from the different parts of the application upstream from it. This helped us address the issue much faster and more efficiently.
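To make the idea of alert suppression concrete, here is a minimal sketch. It is not how that product worked (it used AI over log data); this is a plain dependency walk standing in for it, and the tier names and the `depends_on` map are hypothetical.

```python
def root_causes(alerting, depends_on):
    """Return only the alerting components that have no alerting dependency.

    alerting:   iterable of component names currently raising alerts
    depends_on: dict mapping a component to the components it relies on
                (hypothetical input; assumed to contain no cycles)
    """
    alerting = set(alerting)

    def has_alerting_dependency(component):
        # Walk direct and indirect dependencies of this component.
        stack = list(depends_on.get(component, ()))
        seen = set()
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting:
                return True
            stack.extend(depends_on.get(dep, ()))
        return False

    # A component whose dependency is also alerting is treated as a symptom
    # and suppressed; what remains are the likely root causes.
    return sorted(c for c in alerting if not has_alerting_dependency(c))
```

For example, if the web tier depends on the core app, which depends on the database, and both the web tier and the database are alerting, only the database alert would be passed on.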
My vision for an AM solution is one that is entirely proactive. The first part of AM, updating and versioning, falls under governance along with the business stakeholders. The second part, the tactical operation (performance and maintenance) of the application, should be proactive: the system should know that something is going to go wrong before it happens and identify the root cause. The root cause finding should be holistic rather than just capturing the application failure, so the system would look at information across the application and the hardware and find what could be causing the issue when it happens. The other element of the vision is the ability to send easy-to-understand information to two dashboards, one for the helpdesk and the other for the level 3 group (developers/engineers) who manage and develop the application.
There are some good solutions that cater to certain requirements. However, when it comes to alerts and information, what they offer is a bunch of cryptic dials and sentences after something breaks. This information is also discrete, with each part of the application represented by its own dial. Our need and goal to be proactive led me to start building an application management system the way we wanted, based on the sets of tools we had. We did not want to build something from scratch; we wanted to look at what we already had and use it to build a system that would proactively inform the necessary teams in a user-friendly way.
A number of tools are used to discretely manage applications in our environment today, such as Oracle Enterprise Manager (OEM), SolarWinds with probes into applications and systems, MRTG, Tableau (for visualization) and other solutions. Each of these manages a number of discrete points in each application or system, creating discrete sets of information. We took these discrete data points and aggregated them by application into two different types of dashboards. The first dashboard helps the client services team respond quickly to an end user who calls, or raise the flag to a level 3 developer. This dashboard (figure 1) is very high level: it does not provide any specific detail on the issue, only that there is an issue in one or more of the sub-parts that make up that application.
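A rough sketch of that roll-up, under stated assumptions: the `Check` record and its fields are hypothetical, and in practice each tool's data would first have to be exported into a common shape like this one.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One discrete data point reported by a monitoring tool (hypothetical shape)."""
    application: str  # application the data point belongs to
    component: str    # sub-part of the application, e.g. "database"
    source: str       # tool that reported it, e.g. "OEM"
    healthy: bool


def summarize(checks):
    """Aggregate discrete checks into one high-level status per application."""
    applications = set()
    failing = defaultdict(set)
    for check in checks:
        applications.add(check.application)
        if not check.healthy:
            failing[check.application].add(check.component)
    # The helpdesk view only says whether there is an issue and in which
    # sub-parts; the details stay on the developer dashboard.
    return {
        app: {
            "status": "issue" if failing[app] else "ok",
            "affected": sorted(failing[app]),
        }
        for app in sorted(applications)
    }
```

The point of the design is that the client services team sees one line per application, no matter how many tools and probes feed into it.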
The other, more detailed dashboard shows the health of the application from all aspects. It shows time-based information for usage, growth and other vectors. When there is an issue, that application's information moves to the top and is highlighted in red, with all the pertinent details a developer needs to get it fixed. In addition, we have built some predictive models (figure 2) based on the data, which in many cases is non-linear, to estimate when an issue that could degrade performance or cause a failure might occur. This helps us take proactive measures to fix the issue before it becomes one. Going back to the start-of-semester scenario, with these dashboards available we are now able to fix issues proactively, and the users have a pleasant experience.
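As an illustration of this kind of estimate, here is a minimal sketch that assumes a metric growing exponentially over time. This is only one possible non-linear form and not necessarily the model behind figure 2; the function names and the sample numbers are made up.

```python
import math


def fit_exponential(times, values):
    """Fit value = a * exp(b * t) by least squares on log(value).

    Assumes all values are positive and at least two distinct times.
    """
    logs = [math.log(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_t)
    return a, b


def time_to_threshold(times, values, threshold):
    """Estimate the time at which the fitted curve reaches the threshold.

    Returns None when the fitted trend is flat or falling.
    """
    a, b = fit_exponential(times, values)
    if b <= 0:
        return None
    return math.log(threshold / a) / b


# Made-up hourly samples of a metric that doubles every hour.
hours = [0, 1, 2, 3]
load = [100, 200, 400, 800]
estimate = time_to_threshold(hours, load, threshold=3200)
```

With these sample numbers the estimate comes out at roughly hour 5, which is the kind of lead time that lets a team add capacity before users notice anything.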
The next step would be to take this one level up by introducing machine learning that starts at the problem, understands the symptoms and finds the root cause. Better yet, when symptoms occur, the AM solution would find the root cause behind those symptoms and alert the developers, enabling them to fix the issue well before it becomes a problem. As AI becomes more prevalent, you will see these types of management systems arriving in the near future. We like to be MacGyver in our approach: use what you have to solve a need.