Category Archives: Computer Programming

Side effects – Xtreme End-to-End thinking!

I think that one of the most important habits of good design and engineering is end-to-end thinking. Solving problems is fine, but they need to be the right problems, and that means the solutions need to reach all the way to the actual ends of the system.

Pat Helland writes this month about an even funkier, more extreme version of this principle: side-effect thinking.

He focuses on digital transactions, which are the backbone of the digital world.

The basic observation is that digital systems are composed of many pieces and layers which interact through APIs that are specifically designed to hide the TMI. These systems are also connected to real world systems, which may act in response to a transaction.

The main point is, of course,

“One system’s side effect is another’s meat and potatoes.” ([1], p. 39)

The whole idea of APIs is to hide how the system works. By design, we have no idea exactly what happens when we place an order, or change an order, or whatever. Lots of stuff goes on behind the scenes, and we neither know nor need to know about it.

Helland points out that a side effect of this design strategy is that a transaction may have knock-on effects much wider than we might guess. He describes a scenario of reserving a hotel room. This may trigger the hotel to increase staff for that period, order additional supplies, reschedule maintenance, and so on. Suppliers may respond with new orders and reserve capacity. And so on. All I asked for was a room!

It becomes more interesting when you cancel the room request. In one sense, the transaction is reversed, and all effects undone—in the reservation systems. But the side effects are not, and some cannot be undone.

“Cancelled” transactions don’t undo side effects.

He also points out that “idempotent” operations are idempotent only from the viewpoint of the transaction system. Each execution may produce the same result, but each will have side effects. If nothing else, the traffic will be logged (many times and at many levels of the system).

(Historic note: in the early days of the web, it wasn’t uncommon for web servers to swamp their logging and other ‘hidden’ systems, even when they served only a few pages that never changed. The information delivered was trivial; all the work was side effects.)
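To make the logging point concrete, here is a minimal Python sketch (hypothetical names, not from the article) of an operation that is idempotent from the transaction system's point of view, yet still produces a side effect, a new log entry, on every call:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("reservations")

# Hypothetical in-memory "reservation system", used only for illustration.
reservations = {}

def reserve_room(reservation_id, guest, date):
    """Idempotent from the caller's viewpoint: calling it twice with the
    same arguments leaves the reservation table in the same state."""
    reservations[reservation_id] = {"guest": guest, "date": date}
    # ...but every call still produces side effects outside the transaction:
    log.info("reserve_room called: %s %s %s", reservation_id, guest, date)
    return reservations[reservation_id]

# Two identical calls, one resulting reservation -- and two log entries.
reserve_room("r-42", "Alice", "2017-08-01")
reserve_room("r-42", "Alice", "2017-08-01")
```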

As I said, all this stems from the completely sound and reasonable design principle of information hiding. It would be impossible to build distributed systems any other way. On the other hand, designers implementing the system would do well to think about side effects that the callers might cause.

I would note that side-effects like those described by Helland have been the source of break-ins and denial of service attacks. Any of the scenarios in the article could, if repeated many times, become a denial of service attack. Some of them would be weird and unprecedented attacks, such as creating local petrol shortages by spamming hotel reservation systems.

Side effects and their ramifications will be even more problematic as the Internet of Things deploys. I don’t really know what the IoT will actually be, but most visions of it imagine that everyday actions will trigger waves of automatic transactions among the “Things”, which will also cause real world actions. TMI to the TMth power!

Following the spirit of Helland’s scenarios, imagine that when you cancel your hotel reservation (triggering mischief in France and memory fragmentation in some database), your home infers that you will be home that day instead of away. The “smart home” had planned to stand down many systems and conduct maintenance, but now, anticipating what you will need, it proactively purchases extra electricity and utilities, cancels a maintenance visit, and orders food. These orders in turn trigger a wave of transactions and just-in-time orders by the services. And so on.

If this sounds like a chaotic system, it might very well be.

Remind me again why the IoT is a good idea?


  1. Pat Helland, Side effects, front and center. Communications of the ACM, 60 (7):36-39, 2017. http://queue.acm.org/detail.cfm?id=3099561

Orchestrating Internet of Things Services

Zhenyu Wen and colleagues write in IEEE Internet Computing about “Fog Orchestration for Internet of Things Services” [1].

Don’t you think “Fog Orchestra” is a great name for a band?

After laughing at the unintentionally funny title, I felt obliged to read the article.

The basic topic is the “Internet of Things”, meaning “sensors, devices, and compute resources within fog computing infrastructures” ([1], p. 16). As Arieff quipped, this might be called “The Internet of Too Many Things”.

Whether this is a distinct or new technology or architecture is debatable, but the current term of art, “fog computing”, is, for once, apt. It’s kind of like Cloud Computing, only more dispersed and less organized.

Wen and colleagues are interested in how to coordinate this decentralized fog, especially how to get things done by combining lots of these little pieces of mist. Their approach is to create a virtual (i.e., imaginary) centralized control, and use it to indirectly control pieces of the fog. Basically, the fog and its challenges are hidden by their system, giving people and applications a simpler view and straightforward ways to make things happen. Ideally, this gives the best of both worlds: the flexibility and adaptability of fog, and the pragmatic usability of a monolithic application.

(Pedantic aside: almost anything that is called “virtual” something, such as “virtual memory” or a “virtual machine” or a “virtual private network”, is usually solving this general problem. The “virtual” something is creating a simpler, apparently centralized, view for programmers and people, a view that hides the messy complexity of the underlying system.

Pedantic aside aside: An exception to this rule is “Virtual Reality”, which is “virtual” in a totally different way.)

The authors summarize the key challenges, which include:

  1. scale and complexity
  2. security
  3. dynamicity
  4. fault detection and handling

This list is pretty much the list of engineering challenges for all computing systems, but solving them in “the fog” is especially challenging because it is loosely connected and decentralized. I.e., it’s so darn foggy.

On the other hand, the fog has some interesting properties. The components of the system can be sprinkled around wherever you want them, and interconnected in many ways. In fact, the configuration can change and adapt, to optimize or recover from problems. The trick, of course, is to be able to effectively use this flexibility.

The researchers refer to this process as “orchestration”, which uses feedback on performance to optimize the placement and communication of components. They envision various forms of machine learning to automatically optimize the huge numbers of variables and to advise human operators. This isn’t trivial, because the system is running and the world is changing even as the optimization is computed.
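As a toy illustration (my own sketch, not the authors' algorithm), an orchestration loop might look something like the following: measure the performance of each component on each candidate node, re-place components accordingly, and repeat, all while the system keeps running underneath you.

```python
import random

# Toy fog: a few nodes, a few components, and an initial random placement.
nodes = ["gateway-1", "gateway-2", "edge-server"]
components = ["sensor-aggregator", "anomaly-detector", "dashboard-feed"]
placement = {c: random.choice(nodes) for c in components}

def measure_latency(component, node):
    # Stand-in for real monitoring data gathered from the running system.
    return random.uniform(5, 50)

def orchestrate_once(placement):
    """One iteration of the feedback loop: move each component to the node
    that currently shows the lowest observed latency."""
    for component in placement:
        latencies = {node: measure_latency(component, node) for node in nodes}
        placement[component] = min(latencies, key=latencies.get)
    return placement

# The hard part in practice: the world keeps changing while we optimize.
for _ in range(3):
    placement = orchestrate_once(placement)
print(placement)
```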

I note that this general approach has been applied to optimizing large scale systems for a long time. Designing networks and chips, optimizing large databases, and scheduling multiprocessors use these kinds of optimization. The “fog” brings the additional challenges of a leap in scale, and a need for continuous optimization of a running system.

This is a useful article, and has a great title!


  1. Zhenyu Wen, Renyu Yang, Peter Garraghan, Tao Lin, Jie Xu, and Michael Rovatsos, Fog Orchestration for Internet of Things Services. IEEE Internet Computing, 21 (2):16-24, 2017. https://www.computer.org/internet-computing/2017/05/05/fog-orchestration-for-internet-of-things-services/

Yet More On Reproducibility of Computational Results

I have commented before about the difficult problem of understanding and reproducing computations. This is a deep philosophical problem for computational thinking in general, and a critical concern for practical science. When a measurement, finding, or argument depends on a complicated computational workflow, how can we make it possible to accurately reconstruct and reproduce the computation in order to test or modify it?

This month Cedric Notredame and colleagues from the Centre for Genomic Regulation (CRG) in Barcelona (http://www.crg.eu/en/) published a description of a system that addresses many aspects of this problem, at least for common genomic analyses [1]. (These computations need to be highly reproducible for both safety and business purposes, as well as scientific validation.)

One of the key problems they tackle is the subtle issue of numerical stability of the computations. Running the same program on different computers, or with slightly different support libraries, or slightly different settings, can result in accumulations of rounding errors that may change the final result. This issue gives undergraduate Computer Science majors migraines, and most of us try to forget about it as much as possible. Dealing with this requires considerable expert knowledge and a certain amount of practical art.
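A tiny example of why this matters: floating-point addition is not associative, so the same numbers summed in a different order (say, by a different library, a compiler optimization, or a parallel reduction) can give a different answer.

```python
# Floating-point addition is not associative: rearranging a computation
# changes how rounding errors accumulate, and so can change the result.
print(sum([1e16, 1.0, -1e16]))   # 0.0 -- the 1.0 is absorbed into 1e16 and lost
print(sum([1e16, -1e16, 1.0]))   # 1.0 -- the same numbers, a different order

# In a long pipeline, thousands of such tiny discrepancies can accumulate
# and nudge a final result, which is why pinning the exact environment matters.
```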

The Nextflow software essentially captures this knowledge and makes it easier to get the same results every time. These are not necessarily the correct results; they are repeatable results.

Repeatability is necessary, if not sufficient, for seriously testing their validity.

It is very clear why this product was created, and what advantages it offers investigators.

Looking at the technology, this is built upon Docker container technology, which I’m not familiar with, but which looks really cool and useful. It encapsulates the dependencies of a particular software component. (You youngsters don’t appreciate how lucky you are. Over my career, I probably spent a cumulative decade or more dealing by hand with these fiddly details of just keeping the darn software running.)

This kind of reproducibility is very good for keeping track of what was done, reusing work, and cross-checking results (a la Sensei Carole Goble and others).

There is a problem, though. When you successfully reproduce the results, you depend on a massive amount of software and settings. Most of those dependencies are supposed to be irrelevant to the conceptual result. By analogy, if you get the result only when you use one specific microscope, or even one model of microscope, then the instrument is confounded with the result.

This result is fragile, because it depends on the exact method that produced it. For robust understanding, it is always important to reproduce results using alternative methods. I.e., you should get the same result with any microscope. Ideally, we want to recreate the same results even with different software components.

A danger with these complex workflows is that it is very difficult to understand the dependencies hidden inside. The great thing about software like Nextflow (and Docker and everything it depends on!) is that it makes it easy to reuse components without knowing what is inside. It is all too easy to be blind to potential confounding factors hidden in the software or even in the hardware.

I think the next step will be capabilities that explain a component, and that reason about the workflow to explain what it is built on top of and what it “assumes” about the data and computation. For example, if a computation uses a specific library to sift through a database, it implicitly assumes that this library works correctly with this database.

A second improvement will be tools for comparing two workflows, to identify the differences. Some differences should be unimportant (we assume), others are obviously significant. This tool would also help create alternative workflows that use meaningfully different software to do “the same” computation. These can give convergent validity to both results and software.
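As a purely hypothetical sketch of that "workflow diff" idea (the manifests and tool names here are made up for illustration), the simplest version is just a comparison of the pinned dependencies of two runs:

```python
# Hypothetical pinned-dependency manifests captured from two workflow runs.
manifest_a = {"samtools": "1.3.1", "bwa": "0.7.15", "python": "2.7.12"}
manifest_b = {"samtools": "1.4",   "bwa": "0.7.15", "python": "2.7.12"}

def diff_manifests(a, b):
    """Return {tool: (version_in_a, version_in_b)} for every tool that differs."""
    return {tool: (a.get(tool), b.get(tool))
            for tool in sorted(set(a) | set(b))
            if a.get(tool) != b.get(tool)}

print(diff_manifests(manifest_a, manifest_b))   # {'samtools': ('1.3.1', '1.4')}
```

The interesting (and much harder) part is deciding which of the reported differences could plausibly matter to the scientific result.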

These ideas are scarcely new.  But Nextflow and other contemporary systems have advanced to the point where it is realistic to really do these things, and do them well.

This stuff is really hard, but really, really important.

  1. Paolo Di Tommaso, Maria Chatzou, Evan W. Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame, Nextflow enables reproducible computational workflows. Nature Biotechnology, 35 (4):316-319, 2017. http://dx.doi.org/10.1038/nbt.3820

 

More on Software Obfuscation

In these days of abundant computing cycles, we see some old ideas come to fruition, simply because they are now feasible. For instance, automatic translation and other language processing are becoming common, though the ideas behind them aren’t new. What is new is the amount of computation and data that can be managed at a reasonable cost (and cheaper than paying humans to do it).

 

Evidently, one of the old ideas that is becoming new again is “obfuscation”, which potentially includes many different approaches, with different goals, including cloaking the identity or location of whistleblowers [1].

A more technical form, “software obfuscation”, has been around in one form or another for a long time (e.g., license keys that unlock “obfuscated” code), but it appears to be enjoying a renaissance. In some cases, this computing power has made possible some amazing techniques, but increased computation may also lower the barrier to deploying techniques that are actually fundamentally flawed.

Hui Xu and Michael Lyu of the Chinese University of Hong Kong published a useful summary of why general software obfuscation is difficult [2]. They explain that automatically obfuscating code requires significant levels of analysis of the code, and the ability to modify the code in ways that preserve the correctness of the output (i.e., the scrambled program has to still work).
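As a toy illustration of that constraint (my own example, not from the paper), here is a trivial license check alongside a hand-"obfuscated" version with meaningless names and an opaque predicate; the transformation is only acceptable because the two behave identically on every input.

```python
def check_license(key):            # original, readable version
    return len(key) == 16 and key.startswith("AB")

def _f(_a):                        # "obfuscated" version: same input/output behavior
    _b = (7 * 7 - 48 == 1)         # opaque predicate: always True, just noise
    if _b:
        return (len(_a) - 16 == 0) and (_a[:2] == "AB")
    return False

# The whole point: the scrambled program still has to work.
assert check_license("AB34567890123456") == _f("AB34567890123456")
assert check_license("nope") == _f("nope")
```

Of course, as Xu and Lyu point out, a practical adversary may not need to understand either version; it can be enough to find the call site and skip the check entirely.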

Furthermore, they point out that some of the academic literature is rather flawed, focusing on limited problems and failing to consider end-to-end issues in the real world. For example, they say that common formulations of cryptographic obfuscation “assume less powerful adversaries” than is realistic. They point to the case of a licensing mechanism, where studies of cryptographic obfuscation consider that “a successful cracking implies key leakage…while practical adversaries might only need to locate the code that bypasses the license verification” (p. 82). In other words, the techniques might “succeed” but not accomplish the real-world goal at all. (Such studies are literally “academic”.)

Xu and Lyu suggest that the problem needs to be restated, in the form of “possibly attainable security properties that are meaningful for practical software obfuscation techniques”. Their general idea is surely a more end-to-end approach, as well as practical defenses against “specific deobfuscation techniques”.


  1. Finn Brunton and Helen Nissenbaum, Obfuscation: A User’s Guide for Privacy and Protest, Cambridge, The MIT Press, 2015.
  2. Hui Xu and Michael R. Lyu, Assessing the Security Properties of Software Obfuscation. IEEE Security and Privacy, 14 (5):80-83, 2016. http://doi.ieeecomputersociety.org/10.1109/MSP.2016.112

 

(PS.  Wouldn’t “Specific Deobfuscation Techniques” be a good name for a band?)

Hackerrank ranks schools that generate “coders”

This month “Hackerrank” released a list of the universities with the “best coders” in the world.

My initial glance gives the list at least some face validity, because the University of Illinois at Urbana-Champaign is ranked 14th in the world and third among US schools. I would expect no less! (And where, indeed, are MIT, Stanford, and CMU in the list??? Probably busy starting companies and schmoozing with venture capitalists.)

Naturally the list is dominated by China and India, and their #1 is a school in Moscow, which is extremely plausible, too.

We can be sure that many of the entrants at UIUC and other US schools are from overseas, as well, definitely including India and China.


While these results tickle my school pride (and certainly bear out the proud history of UIUC), what the heck is this list based on, and what does it mean, if anything?

Hackerrank appears to be a sort of talent search and placement company that entices young programmers to tackle programming problems as a competitive game. (Just like we did when I was in school….) These tests seem to be about the ability to write software and solve problems related to programming. As they say, this is an assessment of “coders”.

While I have written plenty of code in my life, my career actually depended on a lot of skills other than coding, including working in groups, understanding user needs, explaining ideas, and thinking about long term sustainability. And the fact is that only the tiniest fraction of code is written from scratch. Most coding is modifying (and fixing) existing code, which is kind of a different skill.

My point is that this kind of coding contest isn’t a measure of all the things a successful programmer needs to know.

That said, anyone who does well at these games has a lot of practical knowledge under their command, which is a good foundation.

It may be even more significant that these high scoring institutions have lots of strong coders, and a culture steeped in knowledge of software and problem solving. That is generally a key to the development of cool stuff. (Such as Mosaic.)

The “Hour of Code” Coming Everywhere

This week is the start of the annual “Hour of Code” (TM) festivities, featuring thousands of events around the world. The main idea is to offer a tutorial which is “a one-hour introduction to computer science designed to demystify coding for students and encourage them to pursue technology careers.”

“The ‘Hour of Code™’ is a nationwide initiative by Computer Science Education Week[csedweek.org] and Code.org[code.org] to introduce millions of students to one hour of computer science and computer programming.”
These tutorials are held in 40 languages, pretty much everywhere.

Obviously, this is a great thing. Everyone should try a little coding—it really isn’t that mystifying, and most people can do it if they can tolerate the tedium and the obsessive-compulsive attention to details. Trust me, computers are stupid, you are smarter than them.

For that matter, creating software, like making anything, is intrinsically motivating: there is a thrill to make something that actually works, and say to yourself, I made that. And software is the pure white powder, uncut: your software can be a whole new world, you can make it do whatever you want it to. Physics, smizhics. Gravity, smavity. Virtual worlds can be anything you can imagine.

Come on in, the coded water is fine.


I do have a bit of a problem with the grander claims made for this exercise. For one thing, basic coding is scarcely the heart of “computer science”. I’ve met far too many “coders” who know almost nothing about computers, or computer science. Still, if you like coding, you’ll want to learn more about computers and how to make them do what you want, not what they want.

I have an even bigger problem with the idea that a one hour tutorial, any one hour tutorial, amounts to “learning to code”. For instance, I cringed when I saw the headline “100 million students worldwide will learn to code this week for Hour of Code”. Yoiks!

Look, you wouldn’t say that a student “learned to write English” based on taking a one hour tutorial on ABCs and handwriting. It takes hundreds of hours over years to become competent in reading and writing any language, and computer coding is just the same. Moreover, to be good at it you have to know a lot of stuff besides coding. As the HoC says, you need logic and problem-solving skills, and also an understanding of the astonishingly complex world of software out there. A healthy dose of paranoia (what can go wrong, will go wrong) and the ability to work with others is a plus, too, IMO.

So, let’s agree on why we want to do this, without going overboard about the alleged benefits of the one hour tutorial:

Every student should have the opportunity to learn computer science. It helps nurture problem-solving skills, logic and creativity. By starting early, students will have a foundation for success in any 21st-century career path.

 

Randomized System Code

This month David Williams-King at Columbia U. and colleagues demonstrated something that I’ve dreamt of for many years: “Shuffler: Fast and Deployable Continuous Code Re-Randomization” [1].

That’s right, the operating system continuously shuffles its code around as it runs, making it much more difficult to hack. I imagined doing this long, long ago, back when I was building operating systems, but it was too mind-bendingly hard for my tiny little brain. How can I not be impressed with this work!

The basic idea is to defeat security threats that exploit bugs in code by overwriting memory to force the code to jump to the wrong address, where it executes the attacker’s code. Some systems use obfuscation or encryption to scramble the code, which makes it more difficult to understand the code in memory. But any static obfuscation is vulnerable to “cut and paste” attacks: if I know a particular chunk of code is vulnerable, I can use it even if I can’t read it clearly.

For better security, there should be “dynamic” defenses as well, to try to assure that the code is unmodified as it is executed, and randomization so the code is never exactly the same twice. Shuffler is a pure form of the latter approach.

With Shuffler, you never execute the same code twice.

Cool!

The basic idea is to analyze the program to determine all the blocks of code, and create tables of addresses. A thread runs continuously, rewriting the code several times a second, “creating new copies of code, fixing up instruction displacements, updating pointers in the code table, etc.” (If this sounds obsessive-compulsive and fiddly, it surely is. But it is all automated, so the computer does all this crazy repeated rewriting.)
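As a very loose conceptual sketch (in Python, nothing like the real binary-level machinery of Shuffler), the idea is that callers reach code only through a table of addresses, and a background thread keeps relocating the code and patching the table, so legitimate calls keep working while any address an attacker has learned goes stale almost immediately.

```python
import random
import threading
import time

def greet():  return "hello"
def depart(): return "goodbye"

FUNCTIONS = {"greet": greet, "depart": depart}
NUM_SLOTS = 16
slots = [None] * NUM_SLOTS          # pretend "memory" holding the code
address_of = {}                     # name -> current slot index (the "code table")
lock = threading.Lock()

def relocate():
    """Move every function to a fresh random slot and fix up the table."""
    with lock:
        for i in range(NUM_SLOTS):
            slots[i] = None
        for name, pos in zip(FUNCTIONS, random.sample(range(NUM_SLOTS), len(FUNCTIONS))):
            slots[pos] = FUNCTIONS[name]
            address_of[name] = pos

def call(name):
    """Callers go through the table, so calls still work after every shuffle."""
    with lock:
        return slots[address_of[name]]()

def shuffle_forever(period=0.2):
    while True:
        time.sleep(period)
        relocate()

relocate()                                              # initial placement
threading.Thread(target=shuffle_forever, daemon=True).start()
print(call("greet"), call("depart"))                    # correct, even as addresses keep moving
```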

The effect is that, even if an attacker finds a memory corruption bug, it is impossible to know what values to write into memory to suborn the code.

The researchers have implemented a demonstration (not yet available), and the paper provides data indicating that the performance is not too bad.

Awesome!


  1. David Williams-King, Graham Gobieski, Kent Williams-King, James P. Blake, Xinhao Yuan, Patrick Colp, Vasileios P. Kemerlis, Junfeng Yang, and William Aiello, Shuffler: Fast and Deployable Continuous Code Re-Randomization, in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, Savannah. http://www.cs.columbia.edu/~junfeng/papers/shuffler-osdi16.pdf