Technology at Avik’s Ruminations

Archive for the ‘Technology’ Category

Grisu

From the department of problems you did not think you had

As you probably know, real numbers are usually represented in computer languages as IEEE floating point numbers, and called floats or doubles depending on precision. Floating point numbers are stored as combination of an exponent (the 2’s power) and a mantissa (or the significand). A double typically takes 64 bits of memory, and consists of 1 bit of sign, 11 bits of exponent, and 52 bits of significand.

When you display a floating point number to screen, or store it in a log, or in any way convert it to a string, you need to convert the floating point value into a decimal representation. Since a floating point value cannot represent all possible decimal numbers exactly, there may be rounding errors in the conversion. Further, the result may need to be rounded to display a sensible value — if you input 0.3, you dont want to display, for example, 25 digits of 0.2999999999…… All of this is hard to get right. There are a couple of properties that you want your conversion to follow.

First, the conversion has to be correct, ie, the decimal value, when read back into a floating point value, must match the original:

read ( show (d) ) == d

Second, you’d want the printed representation of the number to be the smallest possible string that solves the correctness condition.

The seminal work in this area is a 1990 paper by Steel and White : “How to print floating point numbers accurately“. Prior to this paper, there was a wide variation in behaviour of this functionality in different languages and implementations. Steele and White proposed an algorithm called Dragon4, that quickly became the settled standard in this matter. Once printf was standardised to be accurate, and with the existance of the venerable dtoa.c, this remained the state of the art for 20 years.

The trouble with this algorithm however is that it uses arbitrary precision arithmetic (also known as bignums) to achieve accuracy. This made the calculations relatively slow. It also made the implementations complex. dtoa.c, for example, contains its own arbitrary precision maths implementation.

Finally in 2010, Florian Loitsch published a major advance in this area in paper titled “Printing floating-point numbers quickly and accurately with integers“. He proposed an algorithm called Grisu, that accurately printed floating point numbers using only 64 bit integers. Without the use of bignums, this is about four times faster than previous implementations. The only catch is that the algorithm fails on about 0.5% of numbers, and needs to drop down to a slower algorithm for those numbers.

Grisu is now implemented as the standard floating-point-to-decimal conversion routine in the V8 javascript engine (and hence in the Chrome browser) as well as in Firefox. The implementation from V8 has been extracted by Florian into a standalone, BSD licensed project called double-conversion. Of course, Julia, with its emphasis on mathematical correctness, uses double-conversion to implement Grisu in order to display or print a float.

So there you have it, printing a floating point value is more hard work than it seems.

[Update] Since this was written, Julia no longer users double-conversion. Jacob Quinn wrote a pure Julia implementation of Grisu.

Written by Avik Sengupta

July 24th, 2012 at 5:55 pm

Posted in Technology

Planning Fallacy and the planning game.

without comments

There is some solid research around how humans are prone to underestimate the time it takes to do any task. This effect is called the planning fallacy, and the good review paper is by Buehler, Griffin and Ross This is, obviously, of great interest and concern to any programmer or project manager.

There has also been research on how to mitigate the planning fallacy. Buehler suggests thinking back to when one performed similar task in the past dramatically increases the accuracy of an estimate. Justin Kruger and Matt Evans show, in a paper titled “If you don’t want to be late, enumerate” that breaking down a task into its component parts is a very good strategy for accurate estimates.

All this provides good support for the typical agile planning process. Focussing on retrospectives and task breakdowns are indeed what we should do, and this is now supported by science!

However, another piece of research by Buehler, Messervey and Griffin show that people working in groups are much more likely to have unrealistic expectations on project times.

I think this aligns well my anecdotal observations of groups of developers in a planning game trying to estimate stories. Coming up with estimates in a room full of other developers is not a very productive exercise. Developers should instead always consider their stories for the iteration, and prepare a task breakdown and an estimate, before they come into a planning game. The planning game should then be about working with the business owners around prioritisation and planning.

PS: I should take this opportunity to recommend :59 Seconds, by Richard Wiseman for a very accessible look at modern psychology research, and how it affects our everyday lives.

Written by Avik Sengupta

June 15th, 2012 at 2:30 pm

Posted in Technology

Software engineering evidence database

without comments

My regular reader (yes, you!) will know that one of my favourite rants is on the lack of evidence based thinking in software development. Our profession seems to be particularly susceptible to anecdotes and fads. Therefore, I was excited to come across SEED: The Software Engineering Evidence Database. It is meant to be a repository of Database of Empirical Software Engineering Publications.

Unfortunately, it does not seem to be very well maintained, the last updates I can see were in 2009. But this is exactly the kind of resource that will elevate the level of discourse in our field. There are some interesting papers in there, do browse.

Written by Avik Sengupta

June 15th, 2012 at 1:10 pm

Posted in Technology

Birds and lazy evaluation

without comments

I’ve been reading Raymond Smullyan’s amazing book of puzzles, To Mock a Mockingbird. The book takes us on a journey around combinatorial logic, using the metaphor of birds. Obviously, my first reaction was that it would be interesting to follow along by implementing the combinators in Ruby. So I’ve recently started doing just that. (The code will be on my github)

After writing some basic combinators, I wanted to use them to demonstrate some of the solutions to the puzzles in the book. However, many of the proofs have an element of self-reference in them, and that quickly results in a stack overflow error in a procedural language without lazy evaluation.

For example, here is how Smullyan proves that given the existence of a composition operator $latex Cx = A(Bx)&s=-2$, and a Mockingbird $latex Mx = xx&s=-2$, there will exist a fixed point (or, in terms of the allegory of the book, a bird A is fond of bird B, when, if you call out B to A, A will call the same bird B back to you).

Take any bird A. Then there is a bird C that composes A with the Mockingbird M, since there exists a composition operator for all birds. Thus, for any bird x, Cx = A(Mx), or its equivalent, A(Mx) = Cx. Since this holds for all birds x, we can substitute C for x, thus getting A(MC) = CC. But MC = CC , by the definition of the Mockingbird M. Substituting, A(CC) = CC, which means A is fond of bird CC. QED.

Now if we convert this series of logical steps into Ruby, the point where we introduce self-reference, by applying C to itself, blows up with a stack overflow.

#Composition
Compose = lambda {| x, y|
    return lambda {|z|
    x.call(y.call(z))}
}

#The Mocking bird
M = lambda { |x| return x.call(x)}

#For any random lambda x
x=lambda { |x| return x }

c = Compose.call( b , M)
cc = c.call(c)  #Self reference, and stack overflow!
## Show b is fond of CC
assert ( b.call(cc) == cc )

Here is a situation where lazy evaluation would have made things work the way we would have expected it to. I suppose this could be fixed by wrapping things with another lambda, but we wouldn’t have to jump through hoops to follow a series of logical steps.

Taking this thought further, in the next chapter, Smullyan defines a fixed point combinator, calling it an oracle bird: $latex x(\Theta x) = \Theta x&s=-1$. A naïve translation to ruby once again, blows up with a stack overflow. (See here for a working Y combinator in ruby)

Y = lambda{ |x|
	x.call(Y.call(x))
}

However, the definition in Haskell is satisfyingly elegant.

y f = f (y f)

Written by Avik Sengupta

March 27th, 2012 at 2:11 pm

Posted in Technology

Julia

without comments

I saw Julia, a new programming language geared towards scientific computing, a week or so ago, and have played with it since. Its been a while that a new language has seemed this interesting on first glance. I just started a BigInt implementation using GMP, and its been a breeze.

A few features that stand out for me are

Multiple method dispatch
Type inference and parametric types
Fast linear algebra using BLAS/LAPACK

The language also includes macros, union types, and an explicit type coercing construct. All wrapped up in an LLLVM based JIT, so it’s quite fast. Overall, an exciting new language even at this early stage. Of course, a lot remains to be done, not the least a module/namespace system.

Hopefully the occasional R or Octave script I write will be a thing of the past.

Written by Avik Sengupta

February 28th, 2012 at 12:03 am

Posted in Technology

On Hiring

without comments

The process of hiring a great programmer is an oft discussed issue in technology circles. A lot of such discussion can however benefit from references to the published literature in this field.

Particularly important is Schmidt & Hunter’s 1998 review paper, The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. A useful summary is here. In summary, they find GMA (General Mental Ability) to be the best predictor of job performance:

“GMA can be considered the primary personnel measure for hiring decisions and one can consider the remaining personnel measures as supplements to GMA measures”

They also find that supplementing GMA tests with structured interviews provides best combination of predictors.

Note that the useful process is that of structured interviews. The usual practice of unstructured interviews has a relatively poor predictive validity. It gets worse when you consider that its been consistently found that people make up their minds about their interlocutors in the first few seconds of interaction (Ambady & Rosenthal, 1993) , and that non verbal behaviour is a crucial determinant of job interview results (Levine and Feldman, 2002).

Written by Avik Sengupta

January 6th, 2012 at 6:50 pm

Posted in Technology

SWIG is complex

without comments

Came accross this post by David Beazley, the author of Swig, on why and how Swig is so complex — primarily since it has a nearly complete C++ preprocessor that understands the type system. Interesting read. However, it reminded me of when I’d compiled a java program to a native .so using gcj, and then wrapped the result using Swig. Talk about complexity. Scary, now that I think about it.

Written by Avik Sengupta

August 29th, 2011 at 9:14 pm

Posted in Technology

Cross Site Request Forgery

without comments

So Twitter was hit by a Cross Site Request Forgery (CSRF) attack earlier today. Together with a Cross Site Scripting (XSS) attack earlier in the week (the hover-and-you’re-infected worm), this hasn’t been a good few days for twitter, or indeed for web security.

I’m surprised at the simple and basic nature of these attacks. For a high traffic site such as twitter, one would expect more due diligence. Every penetration tester I know would have found these in a few minutes. It also goes to show how oblivious most web programmers are to attacks of this nature. While the the basics of XSS attacks are better known among developers these days (though that didn’t prevent the XSS at twitter a few days ago), the nature of CSRF vulnerabilities does not seem to be very well understood.

The attack today was very simple. You clicked a link, which loaded a few lines of javascript in your browser. The script then created two iframes, and set the src attribute of the iframes to an url of http://twitter.com/share/update?status=[message goes here]. It did this twice; once with the message being a link to the page hosting the javascript, and again with a rude text. If you’re logged in to twitter in your browser (which you’re likely to be if you’re clicking the link while reading twitter), then, as the iframe loads, your tweet gets published. Every one of your followers then sees these tweets in their timelines, and when they click the link, the worm propagates through their followers.

Note that this vulnerability wouldn’t necessarily have been prevented if twitter’s publishing url had been a POST rather than a GET. Another suggested solution is to verify the referrer, but that is a very weak form of protection. In particular, it will not work over HTTPS. The canonical method of preventing a CSRF is to add a hidden variable with a secure random token when a form is generated, and then checking that token when the form is submitted. This verifies the intent of the user to submit the form.

In a rich internet (flex/silverlight/ajax based) application however, the server does not generate a form for each submission. Rather, the token must be sent and saved a the client at the beginning of a session, and then sent back for each critical security sensitive operation. It’ll also be obvious that the protection against CSRF implies keeping state on the server between requests, which means there are scalability questions. This needs to be considered for any distributed deployment of the web servers in a cluster.

As you can see, sometimes a CSRF attack may be simpler to implement than an XSS. Also, there’s an enduring myth that if a website is protected against XSS, it does not need to worry about CSRF. That is clearly not the case, as was evident today. An XSS vulnerability is not required in order to exploit a CSRF issue.

There are many other websites that continue to be affected by CSRF issues. If you write web apps for a living, please take a minute to learn about CSRF attacks, and consider your application in light of these issues. Ruby on Rails can add automatic CSRF protection to your app. For java apps, the OWASP CSRF Guard can help you implement this protection.

Written by Avik Sengupta

September 26th, 2010 at 11:27 pm

Posted in Technology

Tagged with CSRF, security

Entropic Gravity?

with one comment

The physics world is abuzz with excitement over Eric Verlinde’s Entropic Gravity Paper. It suggests that differences in entropy between parts of the universe generates a force that redistributes matter in a way that maximises entropy. This is the force we call gravity. Many other papers are now coming out with similar ideas, including one that relates gravity to quantum information for the first time. A brief summary here and a more detailed description of how entropic forces arise here

So is gravity a side effect of information? I think its a very intriguing idea, a new way to look at our fundamental forces. It’ll be interesting to watch as this gets critiqued and verified over the years.

Written by Avik Sengupta

March 28th, 2010 at 12:32 am

Posted in Technology

How do we know it?

with 3 comments

It has been one of my long term sources of frustration that developers and their managers vigorously debate various positions on how to create software better without any shred of evidence or measurement. Usually, anecdotes ad opinion masquerades as facts, and blog posts are cited as evidence. It’s as if the scientific method completely bypassed the software industry. So, when I found this presentation titled Bits of Evidence (What we actually know about software development and why be believe its true) by Greg Wilson, I was constantly nodding in agreement.

Contrasting the situation in medical research, where randomized double blind studies are the norm, he writes:

… even the best of us aren’t doing what we expect the makers of acne creams to do.

While there may be a growing emphasis on empirical studies in academic publications, that emphasis has not been transmitted to the practitioners in the field. Without a wider understanding of the emperical evidence measured in a scientific method, the industry’s state of the art will fall woefully inadequate.

I’ve written earlier about the Lutz Prechelt paper ; why would you want to get into a debate about language productivity without considering the evidence it offers? Similarly, when you estimate, or ask others for estimates, shouldn’t you consider Aranda and Easterbrook’s paper on the effects of anchoring? Or, when structuring software teams, consider Microsoft Research’s Nagappan’s papers showing bug counts does not depend on physical distrance within teams, but does depend on organizational distance.

As Wilson asks, shouldn’t our development practices be built around these facts? There is a better way. Should we call it “evidence based software development?”

Written by Avik Sengupta

October 26th, 2009 at 1:57 am

Posted in Technology

Tagged with scientific method

Avik’s Ruminations

Archive for the ‘Technology’ Category

Grisu

Planning Fallacy and the planning game.

Software engineering evidence database

Birds and lazy evaluation

Julia

On Hiring

SWIG is complex

Cross Site Request Forgery

Entropic Gravity?

How do we know it?

More...

Recently

Tools