Faster calculation results without fear of errors | MIT News

Researchers have developed a technique that can dramatically speed up certain types of computer programs automatically, while keeping program results accurate.

Their system increases the speeds of programs running in the Unix shell, a ubiquitous programming environment created 50 years ago and still widely used today. Their method parallelizes these programs, meaning it splits program components into pieces that can be run on multiple computer processors simultaneously.

This allows programs to perform tasks such as web indexing, natural language processing, or analyzing data in a fraction of their original runtime.

“There are so many people who use these kinds of programs, such as data scientists, biologists, engineers and economists. Now they can automatically speed up their programs without fear of getting incorrect results,” said Nikos Vasilakis, a research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT.

The system also benefits programmers who develop the tools that data scientists, biologists, engineers, and others use. Those developers don’t have to specially modify their program commands to enable this automatic, error-free parallelization, adds Vasilakis, who chairs a committee of researchers from around the world that has been working on the system for nearly two years.

Vasilakis is senior author of the group’s latest research paper, which includes MIT co-author and CSAIL graduate student Tammam Mustafa and will be presented at the USENIX Symposium on Operating Systems Design and Implementation. Co-authors include lead author Konstantinos Kallas, a graduate student at the University of Pennsylvania; Jan Bielak, a student at Staszic High School in Warsaw; Dimitris Karnikis, a software engineer at Aarno Labs; Thurston H.Y. Dang, a former MIT postdoc who is now a software engineer at Google; and Michael Greenberg, assistant professor of computer science at the Stevens Institute of Technology.

A decades-old problem

This new system, known as PaSh, targets programs or scripts that run in the Unix shell. A script is a series of commands that instruct a computer to perform a calculation. Correct and automatic parallelization of shell scripts is a thorny problem that researchers have struggled with for decades.

The Unix shell remains popular in part because it is the only programming environment that allows one script to be created from functions written in multiple programming languages. Different programming languages are better suited to specific tasks or types of data; if a developer uses the right language, solving a problem can be much easier.
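For instance, a single shell pipeline often chains tools built on different languages. The sketch below (an illustrative example, not one from the paper) filters log lines with grep and totals a numeric field with awk, which embeds its own small programming language:

```shell
#!/bin/sh
# Illustrative multi-language pipeline: grep filters, awk aggregates.
printf 'error 3\ninfo 1\nerror 7\n' |
  grep '^error' |                        # keep only the error lines
  awk '{ sum += $2 } END { print sum }'  # sum the second field
# prints 10
```

Each stage runs as a separate process connected by pipes, which is exactly the structure PaSh exploits when looking for parallelism.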

“People also like to develop in different programming languages, so putting all these components together in one program is something that happens very often,” Vasilakis adds.

While the Unix shell allows for multilingual scripts, its flexible and dynamic structure makes it difficult to parallelize these scripts with traditional methods.

Parallelizing a program is usually tricky because some parts of the program depend on others. This determines the order in which components should run; get the order wrong and the program fails.
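A minimal shell example of such a dependency (hypothetical, for illustration): the second command reads a file the first one writes, so the two cannot safely run at the same time.

```shell
#!/bin/sh
# Step 2 depends on the file produced by step 1, which fixes their order;
# running them simultaneously could read an empty or partial file.
seq 1 5 > nums.txt     # step 1: produce data
wc -l < nums.txt       # step 2: consume it (prints the line count, 5)
rm nums.txt            # clean up
```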

When a program is written in a single language, developers have explicit information about its functions and behavior that helps them determine which components can be parallelized. No comparable tools exist for scripts in the Unix shell: users cannot easily see what is happening inside the components or extract information that would aid parallelization.

A just-in-time solution

To solve this problem, PaSh uses a preprocessing step that inserts simple annotations into program components that it believes can run in parallel. Then PaSh tries to parallelize those parts of the script as the program runs, at the exact moment it reaches each part.

This avoids another problem with shell programming: it is impossible to predict the behavior of a program in advance.

By parallelizing program components “just in time,” the system avoids this problem. It can effectively speed up many more components than traditional methods that try to perform parallelization in advance.
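To see what such a transformation looks like in spirit, here is a hand-rolled sketch (an assumption-laden illustration, not PaSh’s actual output; it requires GNU split for the `-n l/2` option): a sequential grep over a file is replaced by two background jobs that each filter half of the input, with `wait` joining them before the results are merged.

```shell
#!/bin/sh
# Manual data parallelism, the kind of rewrite PaSh automates:
seq 1 1000 > input.txt
split -n l/2 input.txt chunk_    # split input into two line-based chunks (GNU split)
grep '7' chunk_aa > out_aa &     # filter each chunk in a background job
grep '7' chunk_ab > out_ab &
wait                             # join both jobs before merging
cat out_aa out_ab | wc -l        # same count as the sequential version: 271
rm input.txt chunk_a? out_a?
```

The merged output matches `seq 1 1000 | grep '7' | wc -l` because grep processes each line independently; commands with cross-line state (such as `sort` or `uniq`) need a different merge step, which is part of what makes automatic parallelization hard.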

Just-in-time parallelization also ensures that the accelerated program still produces accurate results. If PaSh arrives at a program component that cannot be parallelized (perhaps it depends on a component that has not yet been executed), it simply runs the original version and avoids causing an error.

“Regardless of the performance benefits — if you promise to run something in a second instead of a year — if there’s a chance that incorrect results will be returned, no one will use your method,” Vasilakis says.

Users do not need to make any changes to use PaSh; they can just add the tool to their existing Unix shell and tell their scripts to use it.

Acceleration and accuracy

The researchers tested PaSh on hundreds of scripts, from classic to modern programs, and it didn’t break any of them. The system ran programs six times faster, on average, than unparallelized scripts, and achieved a maximum speedup of nearly 34 times.

It also increased the speeds of scripts that other approaches couldn’t parallelize.

“Our system is the first to show this kind of completely correct transformation, but there is also an indirect benefit. The way our system is designed allows other researchers and users in the industry to build on this work,” said Vasilakis.

He’s excited to get additional feedback from users and see how they’re improving the system. The open-source project joined the Linux Foundation last year, making it widely available to users in industry and academia.

In the future, Vasilakis wants to use PaSh to tackle distribution: running a program on many computers instead of many processors within one computer. He also wants to improve the annotation scheme so that it is more user-friendly and can better describe complex program components.

“Unix shell scripts play a key role in data analysis and software engineering tasks. These scripts could run faster by ensuring that the various programs they call use the multiple processing units available in modern CPUs. However, the dynamic nature of the shell makes it difficult to devise parallel implementation plans in advance,” said Diomidis Spinellis, a professor of software engineering at Athens University of Economics and Business and professor of software analytics at Delft University of Technology, who was not involved in this research. “Through just-in-time analysis, PaSh-JIT manages to overcome the dynamic complexity of the shell, reducing script execution times while preserving the correctness of the associated results.”

“As a replacement for a regular shell that orchestrates steps but doesn’t rearrange or split them, PaSh provides a hassle-free way to improve the performance of big data processing tasks,” adds Douglas McIlroy, adjunct professor in the Department of Computer Science at Dartmouth College, who previously headed the Computing Techniques Research division at Bell Laboratories (the birthplace of the Unix operating system). “Hand-optimization to take advantage of parallelism has to be done at a level for which common programming languages (including shells) do not provide clean abstractions. The resulting code mixes matters of logic and efficiency. It is difficult to read and difficult to maintain in light of changing requirements. PaSh cleverly intervenes at this level, preserving the original logic on the surface while achieving efficiency when the program is executed.”

This work was supported, in part, by the Defense Advanced Research Projects Agency and the National Science Foundation.
