Case Study: A Slowdown in Software
One of my clients in London operates a taxi firm and uses a proprietary system developed quite a few years ago to accept bookings and dispatch taxi drivers. This system also includes accounting and management features. The business depends on it, and it has evolved in its scope and feature set (usually in response to regulations and requirements). The system uses a MariaDB database running on a separate server. (MariaDB is a fork of MySQL created by Monty Widenius, the original creator of MySQL.) There are a dozen instances of the system: six on locally networked PCs in the main office, and six more on PCs at satellite offices located around East London, each connected via a VPN to the main office. Though it's the same software on all PCs, one is configured as a dispatcher; in addition, all taxi driver apps (running on Android phones) communicate with the dispatcher using short messages sent over UDP. The rest of the PCs are for entering job details received over the phone, or from customers at the satellite offices, which are located near tube stations.
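The article doesn't show the driver-app protocol, but conceptually each app just fires off a small datagram at the dispatcher. A minimal sketch of what such a status message might look like; the field names, address, and JSON encoding are all assumptions for illustration:

```python
# Hedged sketch of a driver-app status message sent to the dispatcher over UDP.
# The real message format isn't described in the article; the field names,
# address, and JSON encoding here are hypothetical.
import json
import socket

DISPATCHER_ADDR = ("203.0.113.10", 5000)  # placeholder IP/port for the dispatcher

def send_status(driver_id: int, status: str, lat: float, lon: float) -> None:
    """Send a short, fire-and-forget status datagram to the dispatcher."""
    payload = json.dumps({
        "driver_id": driver_id,
        "status": status,      # e.g. "dropped_off", "accept_job"
        "lat": lat,
        "lon": lon,
    }).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, DISPATCHER_ADDR)

# Example: a driver reports a drop-off, which also queues the car in its area.
send_status(driver_id=42, status="dropped_off", lat=51.54, lon=0.01)
```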

The Problem

The firm’s night staff reported that occasionally the dispatch screen would start running very slowly; soon enough, the day staff also started reporting the same thing. It would run so slowly that it stopped auto-dispatching jobs. Auto-dispatch is a software setting that offers jobs to the first empty car in each area without operator intervention. If that taxi doesn't accept the job, it gets offered to the next empty car in the area or closest nearby area, and so on. Cars are queued up automatically in an area when the driver's app sends a message saying that the passenger has been dropped off. All cars report their GPS location so the dispatcher can track which of the 30 or so areas they are in. At busy times, such as during the day when schools are letting out, or in the evening when bars are closing, it's quite a burden for the dispatch operator to manually dispatch jobs… but that's what it took to clear the backlog. The company had a problem.
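The auto-dispatch rule described above is essentially a per-area queue of empty cars, with each job offered down the queue and then to nearby areas until a driver accepts. Here is a minimal sketch of that flow; the area names, the adjacency data, and the driver_accepts stub are all assumptions for illustration, not the real system:

```python
# Hedged sketch of the auto-dispatch flow: offer each job to the first empty
# car queued in its area, then to the next, then to cars in nearby areas.
# All names and data structures here are illustrative.
from collections import deque

# Empty cars queued per area (cars join when their driver reports a drop-off).
area_queues: dict[str, deque[str]] = {
    "A1": deque(["car_12", "car_07"]),
    "A2": deque(["car_33"]),
}
# Which areas count as "nearby", in order of preference (assumed data).
nearby_areas = {"A1": ["A2"], "A2": ["A1"]}

def driver_accepts(car: str, job: str) -> bool:
    """Placeholder for the real accept/decline reply from the driver app."""
    return True

def auto_dispatch(job: str, area: str) -> str | None:
    """Offer a job to each empty car in the area, then in nearby areas."""
    for candidate_area in [area] + nearby_areas.get(area, []):
        queue = area_queues.get(candidate_area, deque())
        for car in list(queue):        # snapshot: declining cars stay queued
            if driver_accepts(car, job):
                queue.remove(car)      # the accepting car leaves the queue
                return car
    return None  # nobody accepted; the operator dispatches manually

print(auto_dispatch("job_101", "A1"))  # -> "car_12" in this toy example
```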

Preliminary Investigation

The dispatcher is connected to the internet via a 10 Mbps leased line. That doesn't sound very fast, but normal operating bandwidth is about 200 Kbps in and 400 Kbps out, so it's hardly swamped. I logged into the router's admin website, which displays live traffic volume. When the slowdown was occurring, it showed 7-9 Mbps, or roughly 15x the normal level of traffic. Clearly something was amiss. Were we under some kind of intermittent Denial of Service (DoS) attack?

As I'm a developer and not a network expert, I don't have much knowledge of how DoS attacks work. Nonetheless, I started looking into the issue. The router's administrative webpages showed a list of connected IP addresses; by manually copying and pasting IPs into geolocation websites, I found that some of the connections were from outside the UK, mainly the United States and Saudi Arabia. Disconnecting them reduced the traffic level slightly (by about 0.5-1 Mbps), but within 15 minutes the level rose again. My guess was that they were from scripts probing IP addresses. There were typically 150-200 connections at any time, but with up to 100 car drivers, that's not an unreasonable number. Eventually I recognized the IP ranges of UK mobile operators, so it became apparent that the number of 'attacks' was actually minimal. It wasn't a DoS situation, just the usual minor probes you expect on any computer connected to the internet. Yet the slowdown problem persisted.

The original programmer, now involved in the troubleshooting, made changes so we could get a better idea of timings. It turned out that a couple of the offices were taking up to four seconds to refresh their job and car lists. (The dispatch screen shows all cars plotted in all areas, along with a list of current jobs; the data is retrieved from the database by a single query. The dispatcher sends a message to each of the other terminals telling them to refresh their job and car lists; each then runs a query against the database.)

I logged into the MariaDB database, which runs on a Linux server, and ran the iftop utility. This let me see exactly where the traffic was coming from, and that's how I realized it was coming from the satellite office PCs. We examined the code and found that the database query was returning the entire table's data: some 500 rows by 100 columns' worth, of which 90 percent wasn't needed. We told all the satellite offices to close their programs down, and internet traffic dropped to about 100 Kbps. After restarting the software in the satellite offices, data use climbed back up: proof that the high traffic levels were due to the list data fetched by the satellite offices. The local network PCs also fetched this data, but as they're on a gigabit network, there was no effect.
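The query-side fix is conceptually simple: select only the columns and rows each terminal actually displays, rather than pulling back the whole table. One possible shape of that narrowed query, sketched with Python's mariadb connector; the table and column names are invented, since the real schema isn't shown in the article:

```python
# Sketch of narrowing the refresh query so each terminal fetches only what it
# displays, instead of the whole 500-row x 100-column table. Table and column
# names are invented; the real schema isn't described in the article.
import mariadb

conn = mariadb.connect(
    host="db.example.local", user="dispatch", password="...", database="taxi"
)
cur = conn.cursor()

# Before (roughly what the original code did): fetch everything.
# cur.execute("SELECT * FROM jobs")

# After: only the handful of columns the job list actually shows,
# and only jobs that are still live.
cur.execute(
    "SELECT job_id, pickup_address, dropoff_address, area, status "
    "FROM jobs WHERE status IN ('waiting', 'dispatched')"
)
jobs = cur.fetchall()
conn.close()
```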

The Problem Identified

The dispatcher software uses the Winsock control to send data to the other PCs. Winsock runs in blocking mode by default, and for some PCs it could take up to four seconds to send the message telling one PC to refresh its job list. Ideally, that process should take less than a tenth of a second. At busy times, with 100 taxis on the road, the job and car lists were refreshing maybe once a second; combined with a four-second delay, that was what caused the slowdown. The dispatcher would cycle through the list of PCs, sending an update-job-list message to each one, but with each cycle taking so long, it had to start the next one immediately. When I checked one of the slow PCs, I found it was running very slowly, with RAM between 70 and 90 percent full. Once memory use gets that high, Windows starts paging to disk via the swap file, and everything slows down. I removed a couple of programs, ran an optimization program, and reduced RAM use to 43 percent. The developer is now putting in a software fix that returns just the 10 percent of the job/car list data that's actually needed. He is also considering using one Winsock control for each PC, as the time needed to switch a single control from PC to PC may be adding to the delay.
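The real system uses the Winsock control, but the underlying trade-off (one blocking send loop versus one connection per terminal) can be sketched in Python. Everything below is illustrative: the addresses, port, and "REFRESH" message are assumptions, not the real protocol.

```python
# Hedged sketch of the concurrency fix: notify every terminal in parallel so
# one slow satellite PC can't hold up the whole refresh cycle. Addresses, the
# port, and the REFRESH message are assumptions, not the real protocol.
import socket
from concurrent.futures import ThreadPoolExecutor

TERMINALS = ["10.0.0.11", "10.0.0.12", "192.168.50.21"]  # placeholder addresses
PORT = 6000
REFRESH_MSG = b"REFRESH_JOBS_AND_CARS\n"

def notify(ip: str) -> None:
    """Send one refresh message; a timeout stops a slow PC blocking forever."""
    try:
        with socket.create_connection((ip, PORT), timeout=0.5) as sock:
            sock.sendall(REFRESH_MSG)
    except OSError:
        pass  # unreachable terminal: skip it this cycle rather than stall

# One worker per terminal, so the slowest PC no longer sets the pace.
with ThreadPoolExecutor(max_workers=len(TERMINALS)) as pool:
    pool.map(notify, TERMINALS)
```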

Conclusions

No matter what you think the problem is, find evidence to back it up. The root of our elevated traffic wasn't a DoS attack, as it was tempting to assume at first, but a query that fetched far more data than anyone needed. Taking the time to correctly evaluate a software problem can save you a lot of money and effort later. Resist pressure to enact a solution until you're absolutely sure it will solve that problem.