Monday, October 28, 2013

Book review: Introduction to Computer Science Using Python (by Charles Dierbach)

After much back and forth I received a nice new Python book in the mail. The book's full title is "Introduction to Computer Science Using Python: A Computational Problem-Solving Focus", and its author is a very experienced educator, Charles Dierbach.

This is not your average Python book -- it is a college text intended for first-semester CS courses that happens to use Python. As such, it assumes absolutely no previous programming experience, and it looks like even previous computer experience is optional. Not only that, but the book starts with a step-by-step introduction to the art of computational problem solving. This is an idea that goes well beyond hacking together a website!

The book is incredibly thorough: there are exercises throughout the text (not just at the end of each chapter), and it includes a plethora of examples, screenshots, tables, charts, diagrams, and photos. (Yes, my picture is in there -- so are Alan Turing, JFK, and K&R. :-)

The author is not afraid of taking a stance; for example, he omits the 'break' and 'continue' statements because they do not fit within the paradigm of structured programming. This actually fits with the general goal of the book, which is to give an overview of many areas of computer science without getting too deep into the minutiae of any topic. I love the final chapter, which is an overview of the history of computing, starting with Charles Babbage and Ada Lovelace.

At the same time, the book gives plenty of useful practical information, such as instructions for using IDLE and an extensive explanation of turtle graphics, culminating in a horse race simulation. (The author's Baltimore roots seem to show through here. :-)

All in all, I think this book is a great text for anyone teaching CS1 or interested in familiarizing themselves with computer science through serious self-study.

Thursday, August 25, 2011

Compare-And-Set in Memcache

With the most recent release (1.5.3, last week) App Engine's Python API for Memcache has added a new feature, Compare-And-Set. This feature (with a different API) was already available in Java; it has also been available in the non-App-Engine pure-Python memcache client. In fact, I designed the App Engine Python API for this feature to be compatible with the latter, since most of the rest of the App Engine Python API also strives to be at least a superset of that package.

But what is it? There seems to be little information on how to use Compare-And-Set with memcache. It is also sometimes (incorrectly) referred to as Compare-And-Swap -- incorrect, because the cas() operation does not actually "swap" anything. The first response when we closed the bug requesting this feature was "Some examples of usage are appreciated." So here goes.

The basic use case for Compare-And-Set is when multiple requests that are being handled concurrently need to update the same memcache key in an atomic fashion. Let's assume you are managing a counter in memcache. (Actually, you could use the incr() and decr() operations to update 64-bit integer counters atomically, but just for argument's sake assume you cannot use those -- there are other data types for which the memcache service does not have built-in support.)

The naive code to update a counter would be something like this:

def init_counter(key):
    memcache.set(key, 0)

def bump_counter(key):
    counter = memcache.get(key)
    assert counter is not None, 'Uninitialized counter'
    memcache.set(key, counter+1)

(Aside: The assert is kind of naive; in practice you'll have to somehow deal with counter initialization. You should also implement a backup for your counter using the App Engine datastore, so that it can survive eviction by the memcache service. However interesting these details are on their own, I leave them for another time.)

Hopefully you can spot the problem in this version of bump_counter(): if two requests execute concurrently (on different instances of the same app), the sequence of operations might be as follows, labeling the two requests as A and B:

A: counter = memcache.get(key) # Reads 42
B: counter = memcache.get(key) # Reads 42
A: memcache.set(key, counter+1) # Writes 43
B: memcache.set(key, counter+1) # Writes 43

So even though two requests were executed, the counter only gets incremented by one. This is called a race condition. Various interleavings of these lines can have the same effect; a race condition occurs whenever B reads the counter before A has written it (or vice versa).

You could try to guard against this by reading the counter value back and checking that it was incremented by one; however, this solution still has a race condition (see if you can figure it out for yourself). There are other solutions possible involving a separate "lock" variable, managed using the add() and delete() operations. However, these generally require more server roundtrips, and it is pretty hard to manufacture a decent lock out of the basic memcache operations (try it for yourself -- think about what would happen if your request was somehow aborted after acquiring the lock, without having a chance to release it).

Using the Compare-And-Set operation, writing a reliable bump_counter() function is a cinch:

def bump_counter(key):
    client = memcache.Client()
    while True:  # Retry loop
        counter = client.gets(key)
        assert counter is not None, 'Uninitialized counter'
        if client.cas(key, counter+1):
            break

There are several essential differences from the previous version:
The Client object is required because the gets() operation actually squirrels away some hidden information that is used by the subsequent cas() operation. Because the memcache functions are stateless (meaning they don't alter any global values), these operations are only available as methods on the Client object, not as functions in the memcache module. (Apart from these two, the methods on the Client object are exactly the same as the functions in the module, as you can tell by comparing the documentation.)

The retry loop is necessary because this code doesn't actually avoid race conditions -- it just detects them! The memcache service guarantees that when used in the pattern shown here (i.e. using gets() instead of get() and cas() instead of set()), if two (or more) different client instances happen to be involved in a race condition like the one I showed earlier, only the first one to execute the cas() operation will succeed (return True), while the second one (and later ones) will fail (return False). Let's spell out the events that happen when a race condition occurs:

A: counter = client.gets(key) # Reads 42
B: counter = client.gets(key) # Reads 42
A: client.cas(key, counter+1) # Writes 43, returns True
B: client.cas(key, counter+1) # Returns False
B: counter = client.gets(key) # Reads 43
B: client.cas(key, counter+1) # Writes 44, returns True

Another refinement I've left out here for brevity is to set a limit on the number of retries, to avoid an infinite loop in worst-case scenarios where there is a lot of contention for the same counter (meaning more requests are trying to update the counter than the memcache service can process in real time). You can figure out how to code this for yourself. (UPDATE: see comments #1 and #2 for a note about busy-waiting.)
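One possible shape for such a bounded loop, as a sketch (the retry count and the failure behavior are arbitrary choices here, and per the busy-waiting note you may want to sleep between attempts):

from google.appengine.api import memcache

def bump_counter(key, retries=10):
    client = memcache.Client()
    for _ in range(retries):  # Bounded retry loop
        counter = client.gets(key)
        assert counter is not None, 'Uninitialized counter'
        if client.cas(key, counter + 1):
            return True
    return False  # Persistent contention; let the caller decide what to do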

Now let me explain roughly how this actually works. For some people, knowing the underlying mechanism helps in understanding how to use it; that is often the case for me when I am trying to understand a new concept.

The gets() operation internally receives two values from the memcache service: the value stored for the key (in our example the counter value), and a timestamp (also known as the cas_id). The timestamp is an opaque number; only the memcache service knows what it means. The important thing is that each time the value associated with a memcache key is updated, the associated timestamp is changed. The gets() operation stores this timestamp in a Python dict on the Client object, using the key passed to gets() as the dict key.

The cas() operation internally adds the timestamp to the request it sends to the memcache service. The service then compares the timestamp received with a cas() operation to the timestamp currently associated with the key. If they match, it updates the value and the timestamp, and returns success. If they don't match, it leaves the value and timestamp alone, and returns failure. (By the way, it does not send the new timestamp back with a successful response. The only way to retrieve the timestamp is to call gets().)

Of course, there's one more important ingredient: the App Engine memcache service itself behaves atomically. That is, when two concurrent requests (for the same app id) use memcache, they will go to the same memcache service instance (for historic reasons called a shard), and the memcache service has enough internal locking so that concurrent requests for the same key are properly serialized. In particular this means that two cas() requests for the same key do not actually run in parallel -- the service handles the first request that came in until completion (i.e., updating the value and timestamp) before it starts handling the second request.

And that's Compare-And-Set in a nutshell. If you have questions please don't hesitate to ask!

(UPDATE: The memcache API defines batch versions of most of its APIs. For example, to get multiple keys in a single call, there is get_multi(); to set multiple keys, there is set_multi(). Corresponding to cas(), there is cas_multi(). But there is no gets_multi(): instead, you can use get_multi(keys, for_cas=True). Finally, there's cas_reset(), which clears the dict used to store timestamps. But I haven't figured out what to do with it yet. :-)
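For example, here's a sketch of bumping a whole batch of counters at once (assuming, as with set_multi(), that cas_multi() returns the list of keys whose values were not updated; keys missing from memcache are silently dropped in this sketch):

from google.appengine.api import memcache

def bump_counters(keys):
    client = memcache.Client()
    while keys:
        counters = client.get_multi(keys, for_cas=True)  # Batch gets()
        updated = dict((key, value + 1) for key, value in counters.items())
        keys = client.cas_multi(updated)  # Keys that failed; retry just those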

Monday, July 25, 2011

Before Python

This morning I had a chat with the students at Google's CAPE program. Since I wrote up what I wanted to say I figured I might as well blog it here. Warning: this is pretty unedited (or else it would never be published :-). I'm posting it in my "personal" blog instead of the "Python history" blog because it mostly touches on my career before Python. Here goes.

Have you ever written a computer program? Using which language?
  • HTML
  • Javascript
  • Java
  • Python
  • C++
  • C
  • Other - which?
[It turned out the students had used a mixture of Scratch, App Inventor, and Processing. A few students had also used Python or Java.]

Have you ever invented a programming language? :-)

If you have programmed, you know some of the problems with programming languages. Have you ever thought about why programming isn't easier? Would it help if you could just talk to your computer? Have you tried speech recognition software? I have. It doesn't work very well yet. :-)

How do you think programmers will write software 10 years from now? Or 30? 50?

Do you know how programmers worked 30 years ago?

I do.

I was born in Holland in 1956. Things were different.

I didn't know what a computer was until I was 18. However, I tinkered with electronics. I built a digital clock. My dream was to build my own calculator.

Then I went to university in Amsterdam to study mathematics and they had a computer that was free for students to use! (Not unlimited though. We were allowed to use something like one second of CPU time per day. :-)

I had to learn how to use punch cards. There were machines to create them that had a keyboard. The machines were as big as a desk and made a terrible noise when you hit a key: a small hole was punched in the card with a huge force and great precision. If you made a mistake you had to start over.

I didn't get to see the actual computer for several more years. What we had in the basement of the math department was just an end point for a network that ran across the city. There were card readers and line printers and operators who controlled them. But the actual computer was elsewhere.

It was a huge, busy place, where programmers got together and discussed their problems, and I loved to hang out there. In fact, I loved it so much I nearly dropped out of university. But eventually I graduated.

Aside: Punch cards weren't invented for computers; they were invented for sorting census data and the like before WW2. [UPDATE: actually much earlier, though the IBM 80-column format I used did originate in 1928.] There were large mechanical machines for sorting stacks of cards. But punch cards are the reason that some software still limits you (or just defaults) to 80 characters per line.

My first program was a kind of "hello world" program written in Algol-60. That language was only popular in Europe, I believe. After another student gave me a few hints I learned the rest of the language straight from the official definition of the language, the "Revised Report on the Algorithmic Language Algol-60." That was not an easy report to read! The language was a bit cumbersome, but I didn't mind, I learned the basics of programming anyway: variables, expressions, functions, input/output.

Then a professor mentioned that there was a new programming language named Pascal. There was a Pascal compiler on our mainframe so I decided to learn it. I borrowed the book on Pascal from the departmental library (there was only one book, and only one copy, and I couldn't afford my own). After skimming it, I decided that the only thing I really needed were the "railroad diagrams" at the end of the book that summarized the language's syntax. I made photocopies of those and returned the book to the library.

Aside: Pascal really had only one new feature compared to Algol-60, pointers. These baffled me for the longest time. Eventually I learned assembly programming, which explained the memory model of a computer for the first time. I realized that a pointer was just an address. Then I finally understood them.

I guess this is how I got interested in programming languages. I learned the other languages of the day along the way: Fortran, Lisp, Basic, Cobol. With all this knowledge of programming, I managed to get a plum part-time job at the data center maintaining the mainframe's operating system. It was the most coveted job among programmers. It gave me access to unlimited computer time, the fastest terminals (still 80 x 24 though :-), and most important, a stimulating environment where I got to learn from other programmers. I also got access to a Unix system, learned C and shell programming, and at some point we had an Apple II (mostly remembered for hours of playing space invaders). I even got to implement a new (but very crummy) programming language!

All this time, programming was one of the most fun things in my life. I thought of ideas for new programs to write all the time. But interestingly, I wasn't very interested in using computers for practical stuff! Nor even to solve mathematical puzzles (except that I invented a clever way of programming Conway's Game of Life that came from my understanding of using logic gates to build a binary addition circuit).

What I liked most though was writing programs to make the life of programmers better. One of my early creations was a text editor that was better than the system's standard text editor (which wasn't very hard :-). I also wrote an archive program that helped conserve disk space; it was so popular and useful that the data center offered it to all its customers. I liked sharing programs, and my own principles for sharing were very similar to what later would become Open Source (except I didn't care about licenses -- still don't :-).

As a term project I wrote a static analyzer for Pascal programs with another student. Looking back I think it was a horrible program, but our professor thought it was brilliant and we both got an A+. That's where I learned about parsers and such, and that you can do more with a parser than write a compiler.

I combined pleasure with a good cause when I helped out a small left-wing political party in Holland automate their membership database. This was until then maintained by hand as a collection of metal plates into which letters were stamped using an antiquated machine not unlike a steam hammer :-). In the end the project was not a great success, but my contributions (including an emulation of Unix's venerable "ed" editor program written in Cobol) piqued the interest of another volunteer, whose day job was as a computer science researcher at the Mathematical Center. (Now CWI.)

This was Lambert Meertens. It so happened that he was designing his own programming language, named B (later ABC), and when I graduated he offered me a job on his team of programmers who were implementing an interpreter for the language (what we would now call a virtual machine).

The rest I have written up earlier in my Python history blog.

Friday, June 3, 2011

The depth and breadth of Python

As of late I'm noticing a trend: I'm spending more time having in-person in-depth conversations, and less time coding. While I regret the latter, I really enjoy the former. Certainly more than weekly meetings, code reviews, or bikeshedding email threads. (I'm not all that excited about blogging either, as you may have guessed; but some things just don't fit in 140 characters.)

Two conversations with visitors I particularly enjoyed this week were both with very happy Python users, and yet they couldn't be more different. This to me is a confirmation of Python's enduring depth and breadth: it is as far from a one-trick language as you can imagine.

My first visitor was Annie Liu, a professor of computer science (with a tendency to theory :-) at Stony Brook University in New York State. During an animated conversation that lasted nearly three hours (and still she had more to say :-) she explained to me the gist of her research, which appears to be writing small Python programs that implement fundamental algorithms using set comprehensions, and then optimizing the heck out of them using an automated approach she summarized as the three I's: Iterate, incrementalize, and implement. While her academic colleagues laugh at her for choosing such a non-theoretical language as Python, her students love it, and she seems to be having the last laugh, obtaining publication-worthy results that don't require advanced LaTeX skills, nor writing in a dead language like SETL (of which she is also a great fan, and which, via ABC, had some influence on Python -- see also below).

Annie told me an amusing anecdote about an inscrutable security standard produced by NIST a decade ago, with a fifty-page specification written in Z. She took a 12-page portion of it and translated it into a 120-line Python program, which was much more readable than the original, and in the process she uncovered some bugs in the spec!

Another anecdote she recounted had reached me before, but somehow I had forgotten about it until she reminded me. It concerns the origins of Python's use of indentation. The anecdote takes place long before Python was created. At an IFIP working group meeting in a hotel, one night the delegates could not agree about the best delimiters to use for code blocks. On the table were the venerable BEGIN ... END, the newcomers { ... }, and some oddities like IF ... FI and indentation. In desperation someone said the decision had to be made by a non-programmer. The only person available was apparently Robert Dewar's wife, who in those days traveled with her husband to these events. Despite the late hour, she was called down from her hotel room and asked for her independent judgement. Immediately she decided that structuring by pure indentation was the winner. Now, I've probably got all the details wrong here, but apparently Lambert Meertens was present, who went on to design Python's predecessor, ABC, though at the time he called it B (the italics meant that B was not the name of the language, but the name of the variable containing the name of the language). I checked my personal archives, and the first time I heard this was from Prof. Paul Hilfinger at Berkeley, who recounted a similar story. In his version, it was just Lambert Meertens and Robert Dewar, and Robert Dewar's wife chose indentation because she wanted to go to bed. Either way it is a charming and powerful story. (UPDATE: indeed the real story was quite different.)

Of course Annie had some requests as well. I'll probably go over these in more detail on python-ideas, but here's a quick rundown (of what I could remember):
  • Quantifiers. She is really longing for the "SOME x IN xs HAS pred" notation from ABC (and its sibling "EACH x IN xs HAS pred"), which superficially resemble Python's any() and all() functions, but have the added semantics of making x available in the scope executed when the test succeeds (or fails, in the case of EACH -- then x represents a counterexample). (See the sketch just below this list.)
  • Type declarations. (Though I think she would be happy with Python 3 function annotations, possibly augmented with the attribute declarations seen in e.g. Django and App Engine's model classes.)
  • Pattern matching, a la Erlang. I have been eyeing these myself from time to time; it is hard to find a syntax that really shines, but it seems to be a useful feature.
  • Something she calls labels or yield points. It seems somewhat similar to yield statements in generators, but not quite.
  • She has only recently begun to look at distributed algorithms (she had some Leslie Lamport anecdotes as well) and might prefer sets to be immutable after all. Though that isn't so clear; her work so far has actually benefited from mutating sets to maintain some algorithmic invariant. (The "incrementalize" of the three I's actually refers to a form of "differentiation" of expressions that produce a new set for each input.)
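Here's a rough sketch of what the quantifiers give you beyond any() and all() -- the witness (or counterexample) comes back with the answer. The function names and the pair-returning convention are mine, not ABC's:

def some(xs, pred):
    # Like any(pred(x) for x in xs), but also returns the witness x.
    for x in xs:
        if pred(x):
            return True, x
    return False, None

def each(xs, pred):
    # Like all(pred(x) for x in xs), but also returns a counterexample x.
    for x in xs:
        if not pred(x):
            return False, x
    return True, None

found, x = some(range(100), lambda x: x % 7 == 5)
if found:
    print('witness:', x)  # Prints: witness: 5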
The contrast with my visitor the next day couldn't be greater. Through a former colleague I got an introduction to Drew Houston, co-founder and CEO of the vastly successful start-up company Dropbox. Dropbox currently has 25 million users, stores petabytes of data on Amazon S3, is profitable, and is not for sale. Drew is an easygoing MIT graduate who is equally comfortable discussing custom memory allocators, the world of venture capitalism, and how to keep engineers happy; he likes hard problems and winning.

Python plays an important role in Dropbox's success: the Dropbox client, which runs on Windows, Mac and Linux (!), is written in Python. This is key to the portability: everything except the UI is cross-platform. (The UI uses a Python-ObjC bridge on Mac, and wxPython on the other platforms.) Performance has never been a problem -- with the understanding that a small number of critical pieces were written in C, including a custom memory allocator used for a certain type of objects whose pattern of allocation involves allocating 100,000s of them and then releasing all but a few. Before you jump in to open up the Dropbox distro and learn all about how it works, beware that the source code is not included and the bytecode is obfuscated. Drew's no fool. And he laughs at the poor competitors who are using Java.

Next Monday I'm having lunch with another high-tech entrepreneur, a Y Combinator start-up founder using (and contributing to) App Engine. Maybe I should just cancel all weekly meetings and sign off from all mailing lists and focus on two things: meeting Python users and coding. That's the life!

Monday, January 24, 2011

Asynchronous RPC in App Engine Today

While I was laying the groundwork for a new datastore client library with support for asynchronous requests, I added some low-level support for asynchronous RPCs that you can use today. The only App Engine API with documented support for asynchronous RPCs is urlfetch, and the new primitive happens to be quite useful with it.

Suppose you want to fetch some data from a remote service. The remote service has two instances, both of which are slightly flaky. What you want to do is send off requests to both servers simultaneously (this is the easy part) and then wait for the first one to give you a result. The latter uses the new API that I'm about to describe here.

from google.appengine.api import urlfetch, apiproxy_stub_map

urls = ['http://service1.com', 'http://service2.com'] # Etc.

rpcs = []
for url in urls:
    rpc = urlfetch.create_rpc(deadline=1.0)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append(rpc)

rpc = apiproxy_stub_map.UserRPC.wait_any(rpcs)
# Now rpc is the first rpc that returned a result. Have at it!

That's all! If you're interested in learning more about this handy class method, just check out its docstring in the App Engine SDK. Note that technically you should loop until it doesn't return None.

You can also repeatedly call wait_any() to get subsequent results. Make sure to remove the rpc it returns (if any) from the list, since otherwise it will return the same rpc over and over again: the specification of wait_any() says it returns the first rpc in the given list that completes, regardless of whether you have seen it before.
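Putting that together with the None caveat above, a loop that drains all the RPCs from the earlier example might look like this (a sketch):

results = []
while rpcs:
    rpc = apiproxy_stub_map.UserRPC.wait_any(rpcs)
    if rpc is None:
        continue  # Nothing completed yet; wait again
    rpcs.remove(rpc)  # So wait_any() won't return it again
    results.append(rpc.get_result())  # The urlfetch response object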

Also note that there currently is no way to cancel the other RPCs, which is why I passed a low deadline to the create_rpc() call. The problem is that even if you completely ignore the other RPCs, the App Engine runtime still waits for them to finish or timeout.

Finally, there is also a similar class method UserRPC.wait_all(), which waits until all RPCs in the list you pass it are complete. (It doesn't return anything.)

PS. Don't look too closely at the implementation of these methods. It may change as we think of a better way to do it. But we're committed to the API.

Friday, January 7, 2011

A new App Engine datastore API

This post is primarily intended for App Engine users (and of those, only Python users :-).

Over the past months I've been working on a new design for the Python datastore API, under the code name Datastore Plus. The new design is very ambitious, and changes a lot of things:
  • New, cleaner implementations of Key, Model, Property and Query classes
  • High-level asynchronous API using Python generators as coroutines (PEP 342)
The design is meant to eventually replace the existing db package in the App Engine runtime library, but for now, it is just an open source project which you have to download and copy into your application.
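To give a flavor of the new classes and the coroutine style (illustrative only -- I'm making up the exact spelling here; the documentation linked below is authoritative), a model definition plus an asynchronous get might look something like:

from ndb import model, tasklets  # Module names may change

class Greeting(model.Model):
    text = model.StringProperty()

@tasklets.tasklet
def get_text(key):
    entity = yield key.get_async()  # Suspends until the datastore RPC completes
    raise tasklets.Return(entity.text)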

I am not at all finished with this design, but I believe in listening to users, so I am making a preliminary version of the new API available for review. Please send me your thoughts, either in this blog, or via private mail to guido (at) google.com. Note that the implementation works, but I cannot guarantee that it won't change.

Documentation is here: http://goo.gl/D6Onw

The project to check out is here: http://goo.gl/GapXI

You must use Mercurial to check out the project, but it's fine to check it out anonymously -- I don't require anybody to log in to look or comment. (If using Mercurial is too much of a burden, there's also a zipfile on the site, but I don't plan to update it frequently.)

I'm interested in receiving any kind of feedback at all. It would help me if you could clarify whether your feedback is about an issue with the documentation, an issue with the implementation, or an issue with the API design -- though I realize you can't always tell the difference. :-)

(You can also comment on the thread in the google-appengine-python group here: https://groups.google.com/group/google-appengine-python/browse_thread/thread/454cb81d49e759f2.)

Thursday, November 5, 2009

Python in the Scientific World

Yesterday I attended a biweekly meeting of an informal UC Berkeley group devoted to Python in science (Py4Science), organized by Fernando Perez. The format (in honor of my visit) was a series of 4-minute lightning talks about various projects using Python in the scientific world (at Berkeley and elsewhere) followed by an hour-long Q&A session. This meant I didn't have to do a presentation and still got to interact with the audience for an hour -- my ideal format.

I was blown away by the wide variety of Python use for scientific work. It looks like Python (with extensions like numpy) is becoming a standard tool for many sciences that need to process large amounts of data, from neuroimaging to astronomy.

Here is a list of the topics presented (though not in the order presented). All of these describe Python software; I've added names and affiliations insofar as I managed to get them. (Thanks to Jarrod Millman for providing me with a complete list.) Most projects are easily found by Googling for them, so I have not included hyperlinks except in some cases where the slides emphasized them. (See also the blog comments.)
  • Fernando gave an overview of the core Python software used throughout scientific computing: NumPy, Matplotlib, IPython (by Fernando), Mayavi, Sympy (about which more later), Cython, and lots more.
  • On behalf of Andrew Straw (Caltech), Fernando showed a video of an experimental setup where a firefly is tracked in real time by 8 cameras spewing 100 images per second, using Python software.
  • Nitime, a time-series analysis tool for neuroimaging, by Ariel Rokem (UCB).
  • A comparative genomics tool by Brent Pedersen of the Freeling Lab / Plant Biology (UCB).
  • Copperhead: Data-Parallel Python, by Bryan Catanzaro (working with Armando Fox) and others.
  • Nipype: Neuroimaging analysis pipeline and interfaces in Python, by Chris Burns (http://nipy.sourceforge.net/nipype/).
  • SymPy -- a library for symbolic mathematics in Pure Python, by Ondrej Certik (runs on Google App Engine: http://live.sympy.org).
  • Enthought Python Distribution -- a Python distro with scientific batteries included (some proprietary, many open source), supporting Windows, Mac, Linux and Solaris. (Travis Oliphant and Eric Jones, of Enthought.)
  • PySKI, by Erin Carson (working with Armando Fox) and others -- a tool for auto-tuning computational kernels on sparse matrices.
  • Rapid classification of astronomical time-series data, by Josh Bloom, UCB Astronomy Dept. One of the many tools using Python is GroupThink, which lets random people on the web help classify galaxies (more fun than watching porn :-).
  • The Hubble Space Telescope team in Baltimore has used Python for 10 years. They showed a tool for removing noise generated by cosmic rays from photos of galaxies. The future James Webb Space Telescope will also be using Python. (Perry Greenfield and Michael Droettboom, of STScI.)
  • A $1B commitment by the Indian government to improve education in India includes a project by Prabhu Ramachandran of the Department of Aerospace Engineering at IIT Bombay for Python in Science and Engineering Education in India (see http://fossee.in/).
  • Wim Lavrijsen (LBL) presented work on Python usage in High Energy Physics.
  • William Stein (University of Washington) presented SAGE, a viable free open source alternative to Magma, Maple, Mathematica and Matlab.
All in all, the impression I got was of an incredible wealth of software, written and maintained by dedicated volunteers all over the scientific community.

During the Q&A session, we touched upon the usual topics, like the Python 3 transition, the GIL (there was considerable interest in Antoine Pitrou's newgil work, which unfortunately I could not summarize adequately because I haven't studied it enough yet), Unladen Swallow, and the situation with distutils, setuptools and the future 'distribute' package (for which I unfortunately had to defer to the distutils-sig).

The folks maintaining NumPy have thought about Python 3 a lot, but haven't started planning the work. Like many other projects faced with the Python 3 porting task, they don't have enough people who actually know the code base well enough to embark upon such a project. They do have a plan for arriving at PEP 3118 compliance within the next 6 months.

Since NumPy is at the root of the dependency graph for many of the software packages presented here, getting NumPy ported to Python 3 is pretty important. We briefly discussed a possible way to obtain NumPy support for Python 3 sooner and with less effort: a smaller "core" of NumPy could be ported first, which would give the NumPy maintainers a manageable task, combined with the goal of selecting a smaller "core" which would give them the opportunity for a clean-up at the same time. (I presume this would mostly be a selection of subpackages to be ported, not an API-by-API cleanup; the latter would be a bad thing to do simultaneously with a big port.)

After the meeting, Fernando showed me a little about how NumPy is maintained. They have elaborate docstrings that are marked up with a (very light) variant of Sphinx, and they let the user community edit the docstrings through a structured wiki-like setup. Such changes are then presented to the developers for review, and can be incorporated into the code base with minimal effort.

An important aspect of this approach is that the users who edit the docstrings are often scientists who understand the computation being carried out in its scientific context, and who share their knowledge about the code and its background and limitations with other scientists who might be using the same code. This process, together with the facilities in IPython for quickly calling up the docstring for any object, really improves the value of the docstrings for the community. Maybe we could use something like this for the Python standard library; it might be a way that would allow non-programmers to help contribute to the Python project (one of the ideas also mentioned in the diversity discussions).

Tuesday, July 21, 2009

Progressive vs. Conservative

[Warning: loose thoughts ahead!]

Microsoft's Erik Meijer gave a talk at Google yesterday, and afterwards I had lunch with him. One of his remarks was (I paraphrase) that Microsoft users want to be told what to do, while the Java community is more vocal or argumentative. (He didn't discuss the Python community but in my experience it falls in the latter category.)

Now, while lying sick in bed with a hacking cough, I am reading George Lakoff's "The Political Mind". This book tries to model the distinction between conservative and progressive politics on the differences between two ideal family models: the strict father (from which most conservative moral virtues flow, according to Lakoff), and the nurturing family, from which the progressive moral virtues derive.

The parallel with Microsoft users vs. Java users seems to be all too obvious: Microsoft as the strict father: If you are loyal you will be rewarded, but if you stray you will be punished; whereas in the Java (or Python) community benefits and moral goodness flow from helping each other (which includes sharing open source software, and, apparently, bikeshedding :-).

What about other companies and communities? I can't help thinking of Oracle as the ultimate strict-father company, which makes me worry about the Sun takeover. Are Linus Torvalds and Richard Stallman strict fathers?

Monday, June 15, 2009

New App Engine Book

At Google I/O I received a copy of Using Google App Engine by Charles Severance, published by O'Reilly. I haven't kept track, but this appears to be one of the first App Engine books to actually hit the stores -- an Amazon search for App Engine turned up one other book (Developing with Google App Engine by Eugene Ciurana, published by Apress) and many titles available for pre-order (including additional titles from the same publishers).

Severance's book is a quick read if you're already familiar with the basic premises of web programming. I think it would do well in an introductory course about the topic. (The author teaches at the University of Michigan so this is likely how he developed the material in the first place.) In fact, quite a bit of the book could well have come from a pre-existing course: the chapters on HTML, CSS, Python and JavaScript barely mention App Engine.

Don't get me wrong, I think that's a good approach: in my experience quite a few App Engine users are new to web programming in general, or could at least use a refresher course. If you don't fall in this category, don't feel offended: you just probably aren't the intended audience for this book. On the other hand, if you've developed for the web but haven't used Python before, you could probably just skip the HTML/CSS chapter and dive right into Python and App Engine.

If you're a blank sheet when it comes to programming, don't expect to come out an experienced Python developer: the book only covers enough of the language so you can get started with App Engine without feeling you're just copying and pasting text. The same is actually true for any topic covered -- in many cases the book actually recommends that you study a topic more in-depth using other resources. But in each case the book's coverage is enough to get you started with the creation of dynamic web sites, and that's the important part. After all, you didn't learn your mother tongue by studying the rules of grammar either: you learned a few nouns, a few verbs, a few adjectives, and a few grammatical forms ("Daddy throw toy again") and you were on your way to communicating with others.

Actually, if you read this book from cover to cover, you might not be ready to create the Great American Website, but you'll be well past the "Daddy throw toy" level. For example, you'll be creating App Engine datastore models with ease, tying them together with forms, and you'll even be able to use simple AJAX patterns. You will also have learned about the importance of caching, and you'll have more than a fleeting experience debugging problems using tracebacks and logs.

I also enjoyed some of the history bits that Severance presents (it makes me feel old to see 1990 referred to as ancient history :-). A downside is that sometimes the exercises given at the end of each chapter seem to be focused more on assessing that you were awake during class than on whether you actually learned a useful skill (e.g. "Give a brief history of the major phases of the internet"). Teachers considering using this book in the classroom might appreciate such questions; but for self-study, I would focus on the difference between the class= and id= attributes in HTML...

What's missing? The book doesn't touch Django (except for the templating facility built into App Engine's webapp package, which is based on Django). If our customer support traffic is any indication, Django is very popular with professional App Engine developers. The book also doesn't describe the various APIs offered by App Engine for things like sending mail, fetching other web resources by URL, or image processing. But arguably you can learn those directly from the App Engine docs. Oh, and the book doesn't touch on App Engine's Java support. I expect other books will fill that void.

Tuesday, May 26, 2009

So you want to learn Python?

There's never a lack of books to use for learning Python. I occasionally receive books for review, but I don't have a particularly good yardstick to judge such books by: I find that they all contain some factual errors and some oddities of presentation, but I have no idea whether those matter for the readers. Even Knuth's books are full of errors: for example the errata for Vol. 1 (2nd ed.) are a staggering 80 pages, but I doubt anybody besides Knuth himself is bothered by this knowledge.

Recently I got a review copy of "Hello World", and a colleague kindly lent me his copy of "Practical Programming". I think it's interesting to compare the two a bit, since they both claim to be teaching Python programming to people who haven't programmed before. And yet their audiences are totally different!

"Hello World", published by Manning, is written by Warren Sande and his son Carter. The subtitle is "Computer programming for kids and other beginners", but I think if you're not a kid any more you might get annoyed by the rather popular writing style. If you are a kid, well, you will probably enjoy a book written with you in mind, and you will learn plenty. The only prerequisites are reading and typing skills, a computer that wasn't built in the stone age, and a desire to learn more about what goes on inside that computer. The book uses short chapters with lots of illustrations, often cartoons and jokes. There are lots of opportunities to try out the material and learn that way. Each chapter ends with a review section, some tests, and more experiments to try. The book pays plenty of attention to typical "gotchas", so that if you get stuck at some point, there probably is help nearby to get you unstuck.

"Practical Programming" is written by Jennifer Campbell, Paul Gries, Jason Montojo, and Greg Wilson. This a team composed of three university professors and a former student of theirs. Their purported goal is to teach Computer Science (with Capital Letters), and Python is merely a teaching vehicle. But they spend about half of the book on Python itself, covering roughly the same material as any introduction to Python, including "Hello World". Their intended audience is clearly more mature than that of the Sandes, and I would think that Carter Sande and his friends would have a hard time staying focused on the material as presented by Campbell et al. -- their illustrations and diagrams are more functional but a lot less fun.

Both books present a number of projects and running examples. Again, the difference in audience makes it likely that if you love one, you'll hate the other, and vice versa. "Hello World" uses examples from computer games. The games are extremely simple though: modern computer games are some of the most complex systems around, and you can't expect to approach them using PyGame and a couple hundred lines of Python. "Practical Programming" takes its examples from scientific data processing with an environmental touch: for example, a numerical series is presented as whale sightings over the years and 2-dimensional data is taken from deforestation data. No doubt this is done in an attempt to appeal to a certain kind of student, though the number of potential applications is so large that some students might just as well be turned off by the specific set of choices.

In the end, "Hello World" will leave the reader with a fair amount of practical Python experience, enough to get them started on the long road to becoming a programmer if they are so inclined, or at least enough to give them some idea of what it is that programmers do. "Practical Programming" tries to go further: it presents some well-known algorithms (there's even a discussion of MergeSort), and it has introductory chapters on topics like object-oriented programming and databases. The overall focus is still on being able to use all this new knowledge in one's professional life, and I hesitate to agree with the authors' apparent view that it teaches "Computer Science". Calling it "Computer Use" would cover the contents better, I think, and that's more in line with the series title as well ("The Pragmatic Programmers", also the publisher).

So, how do you learn about Computer Science? Some would no doubt recommend "Structure and Interpretation of Computer Programs" by Abelson and Sussman here. (Someone sent me a review copy of that book too.) But really, SICP (as it is often referred to) has its own agenda: convincing the reader that the most important thing computers can do is interpreting computer programs. This agenda has arguably caused the proliferation of Scheme implementations and indoctrinated many young minds with certain ideas about how to design and implement programming languages. But personally, I recommend you go straight to the source. After all these years, there is still no substitute for Knuth.

[UPDATE: fixed book titles as commenters pointed out my typos.]

Monday, April 27, 2009

Final Words on Tail Calls

A lot of people remarked that in my post on Tail Recursion Elimination I confused tail self-recursion with other tail calls, which proper Tail Call Optimization (TCO) also eliminates. I now feel more educated: tail calls are not just about loops. I started my blog post when someone pointed out several recent posts by Pythonistas playing around with implementing tail self-recursion through decorators or bytecode hacks. In the eyes of the TCO proponents those were all amateurs, and perhaps that's so.

The one issue on which TCO advocates seem to agree with me is that TCO is a feature, not an optimization. (Even though in some compiled languages it really is provided by a compiler optimization.) We can argue over whether it is a desirable feature. Personally, I think it is a fine feature for some languages, but I don't think it fits Python: The elimination of stack traces for some calls but not others would certainly confuse many users, who have not been raised with tail call religion but might have learned about call semantics by tracing through a few calls in a debugger.

The main issue here is that I expect that in many cases tail calls are not of a recursive nature (neither direct nor indirect), so the elimination of stack frames doesn't do anything for the algorithmic complexity of the code, but it does make debugging harder. For example, if you have a function ending in something like this:
if x > y:
    return some_call(z)
else:
    return 42
and you end up in the debugger inside some_call() whereas you expected to have taken the other branch, with TCO as a feature your debugger can't tell you the value of x and y, because the stack frame has been eliminated.

(I'm sure at this point someone will bring up that the debugger should be smarter. Sure. I'm expecting your patch for CPython any minute now.)

The most interesting use case brought up for TCO is the implementation of algorithms involving state machines. The proponents of TCO claim that the only alternative to TCO is a loop with lots of state, which they consider ugly. Now, apart from the observation that since TCO essentially is a GOTO, you write spaghetti code using TCO just as easily, Ian Bicking gave a solution that is as simple as it is elegant. (I saw it in a comment to someone's blog that I can't find right now; I'll add a link if someone adds it in a comment here.) Instead of this tail call:
return foo(args)
you write this:
return foo, (args,)
which doesn't call foo() but just returns it and an argument tuple, and embed everything in a "driver" loop like this:
func, args = ...initial func/args pair...
while True:
    func, args = func(*args)
If you need an exit condition you can use an exception, or you could invent some other protocol to signal the end of the loop (like returning None).
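Here's a complete toy version of the pattern, with two mutually "tail-calling" functions that would blow the stack if written as plain recursion; I use None as the exit signal (any protocol would do):

def even(n):
    if n == 0:
        return None, (True,)  # Done: 0 is even
    return odd, (n - 1,)  # "Tail call" to odd()

def odd(n):
    if n == 0:
        return None, (False,)  # Done: 0 is not odd
    return even, (n - 1,)  # "Tail call" to even()

func, args = even, (1000001,)
while func is not None:  # Driver loop; runs in constant stack space
    func, args = func(*args)
print(args[0])  # Prints False: 1000001 is not even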

And here it ends. One other thing I learned is that some in the academic world scornfully refer to Python as "the Basic of the future". Personally, I rather see that as a badge of honor, and it gives me an opportunity to plug a book of interviews with language designers to which I contributed, side by side with the creators of Basic, C++, Perl, Java, and other academically scorned languages -- as well as those of ML and Haskell, I hasten to add. (Apparently the creators of Scheme were too busy arguing whether to say "tail call optimization" or "proper tail recursion." :-)

Tuesday, April 7, 2009

Italia Here I Come!

That is: PyCon Italia, here I come! It's still a month away (May 8-10), but I'm already looking forward to my vacation in the historic city of Florence and on the beautiful west coast of Italy. The organizers of PyCon Italia kindly invited me to give a keynote at their annual conference, making me an offer I couldn't refuse.

Kidding aside, it looks like it will be a very exciting conference, with Googlers Fredrik Lundh and Alex Martelli also coming to speak, as well as Python core developer Raymond Hettinger. And even though much of the program will be in Italian, real-time translations will be available for the main track. Personally, I'm most looking forward to a mysterious late-night event labeled PyBirra, where the locals will try to drink me under the table. Salute!

Friday, March 6, 2009

Capabilities for Python?

I received an email recently from Mark Miller, quoting a post from Zooko to the cap-talk mailing list (which I do not read). Mark asked me to clarify my position about capabilities (in Python, presumably). Since the last thing I need is another mailing list subscription, I'm posting my clarification here. I'm sure that through the magic of search engines it will find its way to the relevant places.

In his post, Zooko seems to believe that I am hostile to the very idea of capabilities, and seems to draw a link between this assumed attitude and my experience with the use of password-based capabilities in Amoeba. This is odd for several reasons. First, the way I remember it, Amoeba's capabilities weren't based on passwords, but on one-way functions and random numbers (and secure Ethernet wall-sockets, which is perhaps why the idea didn't catch on :-). Second, I don't believe my experience with capabilities in Amoeba made a difference in how I think about capabilities being offered by some modern programming languages like E, or about the various proposals over the years to add capabilities to Python, perhaps starting with an old proposal by Ka-Ping Yee and Ben Laurie. (It would be better to think of this as a subtraction rather than an addition, since such proposals invariably end up limiting the user to a substantially reduced subset of Python. More about that below.)

But the biggest surprise to me is that people are reading so much in my words. I'm not the Pope! I'm a hacker who likes to think aloud about design problems. Often enough I get it wrong. If you think you disagree with me, or have a question about what I said, just respond in the forum where I post (e.g. python-dev or python-ideas, or this blog), but please don't go forwarding my messages to lists I don't read and speculate about them.

With that off my mind, and with the caveat that this entire post is thinking aloud, let me try to expose some of my current thoughts about capabilities and Python.

Note that I'm trying to limit myself to Python. Languages specifically written to support capabilities exist (e.g. E) and may well become successful, though I expect they will have a hard time gaining popularity until they also sprout some other highly attractive features: most developers see security as a necessary evil.

This attitude, of course, is the reason why the idea of adding security features to an existing language keeps coming back: it's assumed to be much more likely to convince the "unwashed masses" to switch to a slightly different version of a language they already know, than to get them to even try (let alone adopt) a wholly new language. This argument is not limited to security zealots of course. The same reasoning is common in the larger world of "language marketing": C++ made compatibility with C a principle overruling all others, Java chose to resemble C or C++ for ease of adoption, and it is well known that Larry Wall picked many of Perl's syntactic quirks because the initial target audience was already using sed and sh.

I'll be the first to admit that I wasn't completely free of this attitude in Python's design, although I didn't do it with the intent of gaining popularity: whenever I borrowed from another language, I did so either because I recognized a good idea, or because I didn't think I had anything to add to current practice, but not because I was concerned about market share. (If I had been, I wouldn't have used indentation for grouping. :-)

Anyway, regardless of the merits of this idea, it keeps coming back. A recent incarnation is Mark Seaborn's CapPython. Skimming through this wiki page it seems that Mark is well aware of the limitations: the section labeled "problem areas" takes up more than half of the page. And the most recent discussion (which also triggered Zooko's post I believe) started with a blog post by Tav where he proposes (with my encouragement) some modest additions to CPython's existing restricted execution mode and challenges the world to break into it. In a follow-up post, Tav provides a better history of this topic than I could provide myself.

And yet, I remain extremely skeptical of this whole area. The various attacks on Tav's supervisor code show how incredibly subtle it is to write a secure supervisor. CPython's restricted execution model lets sandboxed (= untrusted) code call into the supervisor, where the supervisor's Python code runs with full permissions. In Tav's version, the sandbox is given access to the supervisor only through a small collection of function objects which the supervisor passes into the sandbox. Tav's proposed changes remove some introspection attributes from function and class objects that would otherwise give the sandboxed code access to data or functions that the supervisor is trying to hide from the sandbox. This basic idea works well and nobody has yet found a way to break out of the sandbox directly -- so far it looks like no other attributes need to be removed in order to secure the sandbox.

However, several attacks found non-obvious weaknesses in Tav's supervisor code itself: it is deceptively easy to trick the supervisor into calling seemingly safe built-in functions with arguments carefully crafted by the code inside the sandbox so as to make it reveal a secret. This uses an approach that was devised years ago by Samuele Pedroni to dispel doubt that restricted execution was unsafe in Python 2.2 and beyond.

Samuele's approach combines two properties of (C)Python: built-ins invoked by the supervisor run with the supervisor's permissions, and there are many places in Python where implicit conversions attempt to call various specially-named attributes on objects given to them. The sandboxed exploit defines a class with one of these "magic" attributes set to some built-in, and voila, the built-in is called with the supervisor's permissions. It takes some added cleverness to pass an interesting argument to the built-in and to get the result back, but it can be done: for details, see Tav's blog.

My worry about this approach is that a supervisor that provides a reasonably large subset of Python will have to implement some pretty complex functionality: for example, you'll have to support a secure way to import modules. My confidence in the security of the supervisor goes down exponentially as its complexity goes up. In other words, while Tav may be able to evolve the toy supervisor in "safelite.py" into an impenetrable bastion after enough iterations of exploit-and-patch, I don't think this approach will converge in a realistic timeframe (e.g. decades) for a more fully-featured supervisor.

This lets me segue into another, perhaps more generic, concern with the idea of providing a secure subset of Python, whether it's based on restricted execution, capabilities, or restricting attribute references (like CapPython and Zope's RestrictedPython). Python's claim to fame comes largely from its standard library. People's proficiency with the language is not just measured by how well they can construct efficient algorithm implementations using lists and dicts: to a large extent it depends on how much of the standard library they master as well. Python's standard library is large compared to many other languages. Only Java seems to have more stuff that's assumed to be "always there" (except in certain embedded environments).

For a "secure" version of Python to succeed, it will need to support most of the standard library APIs. I'm distinguishing between the implementations and APIs here, for it is likely that many standard library modules use features of the language that aren't available by the secure subset under consideration. This doesn't have to be a show-stopper as long as an alternate implementation can be provided that uses only the secure subset.

Unfortunately, I expect that, due to a combination of factors, it will be impractical to provide a sufficiently large subset of the standard library for a sufficiently secure subset of Python. One problem is that Python, being a highly dynamic language, supports introspection at many levels, including some implementation-specific ones, like access to bytecode in CPython, which has no equivalent in Jython, IronPython or other implementations. Because of the language's dynamic and introspective features, there is often no real distinction between a module's API and its implementation. While this is an occasional source of frustration for Python users (see e.g. the recent discussion about asyncore on python-dev), in most cases it works quite well, and often APIs can be simpler because of certain dynamic features of the language. For example, there are several ways that dynamic attribute lookup can enhance an API: automatic delegation is just one of the common patterns that it enables (see the sketch below); command dispatch is another. All this leads me to think that a secure version of Python is unlikely to become complete enough to attract enough users to become viable. I'd be happy to be proven wrong, but it seems that the people most attracted to the idea are hoping that adding capabilities to Python will somehow provide a shortcut to success. Unfortunately, I don't think it's a shortcut at all.
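To illustrate, automatic delegation rests on exactly this dynamism -- a wrapper can forward an entire unknown API with a few lines of __getattr__. A minimal sketch (the class name and the logging are mine):

class AuditingWrapper(object):
    """Forward every attribute access to the wrapped object, logging it."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):  # Only invoked when normal lookup fails
        print('access: %s' % name)
        return getattr(self._wrapped, name)

f = AuditingWrapper(open('/etc/hostname'))
data = f.read()  # Logs 'access: read', then delegates to the real file

The flip side is that a security analysis can't easily tell from the wrapper's code which attributes will ever be reached -- which is part of why attribute-restricting approaches have such a hard time.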

I should mention that I have some experience in this area: Google's App Engine (to which I currently contribute most of my time) provides a "secure" variant of Python that supports a subset of the standard library. I'm putting "secure" in scare quotes here, because App Engine's security needs are a bit different than those typically proposed by the capability community: an entire Python application is a single security domain, and security is provided by successively harder barriers at the C/Python boundary, the user/kernel boundary, and the virtual machine boundary. There is no support for secure communication between mutually distrusting processes, and the supervisor is implemented in C++ (crucial parts of it live in a different process).

In the App Engine case, the dialect of the Python language supported is completely identical to that implemented by CPython. The only differences are at the library level: you cannot write to the filesystem, you cannot create sockets or pipes, you cannot create threads or processes, and certain built-in modules that would support backdoors have been disabled (in a few cases, only the insecure APIs of a module have been disabled, retaining some useful APIs that are deemed safe). All these are eminently reasonable constraints given the goal of App Engine. And yet almost every one of these restrictions has caused severe pain for some of our users.

Securing App Engine has required a significant investment of internal resources, and yet the result is still quite limiting. Now consider that App Engine's security model is much simpler than that preferred by capability enthusiasts: it's an all-or-nothing model that pretty much only protects Google from being attacked by rogue developers (though it also helps to prevent developers from attacking each other). Extrapolating, I expect that a serious capability-based Python would require much more effort to secure, and yet would place many more constraints on developers. It would have to have a very attractive "killer feature" to make developers want to use it...

Thursday, January 29, 2009

Detecting Cycles in a Directed Graph

I needed an algorithm for detecting cycles in a directed graph, and came up with the following. It's probably something straight from a textbook, but I couldn't find a textbook that had one, so I worked it out myself. I like the simplicity. I also like that there's a well-defined point in the algorithm where you can do any additional processing on each node once you find that it is not part of a cycle.

The function makes few assumptions about the representation of the graph; instead of a graph object, it takes in two function arguments that are called to describe the graph:
  • def NODES(): an iterable returning all nodes
  • def EDGES(node): an iterable returning all nodes reached via node's outgoing edges
In addition it takes a third function argument which is called once for each node:
  • def READY(node): called when we know node is not part of any cycles
The function returns None upon success, or a list containing the members of the first cycle found otherwise. Here's the algorithm:
def find_cycle(NODES, EDGES, READY):
    todo = set(NODES())
    while todo:
        node = todo.pop()
        stack = [node]
        while stack:
            top = stack[-1]
            for node in EDGES(top):
                if node in stack:
                    # Found a cycle; return its members, in order.
                    return stack[stack.index(node):]
                if node in todo:
                    stack.append(node)
                    todo.remove(node)
                    break
            else:
                # No outgoing edge leads to an unvisited or on-stack
                # node, so top cannot be part of any cycle.
                node = stack.pop()
                READY(node)
    return None
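
For concreteness, here's one way to call it, assuming the graph is stored as a dict mapping each node to a list of its successors (the example graph is made up):

def ready(node):
    print('%s is not part of any cycle' % node)

graph = {1: [2], 2: [3], 3: [1], 4: []}

print(find_cycle(NODES=lambda: list(graph),
                 EDGES=lambda node: graph[node],
                 READY=ready))
# Prints the members of the first cycle found, e.g. [1, 2, 3]
# (starting at an arbitrary point on the cycle).
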
Discussion: The EDGES() function may be called multiple times for the same node, and the for loop does some duplicate work in that case. A straightforward fix for this inefficiency is to maintain a parallel stack of iterators that is pushed and popped at the same times as the main stack, and at all times contains iter(EDGES(node)) for the corresponding node on the main stack. I'll leave that version as an exercise (but see the sketch below).
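
Here's roughly what that version could look like (an untested sketch; it is meant to behave identically to find_cycle above while calling EDGES only once per node):

def find_cycle2(NODES, EDGES, READY):
    todo = set(NODES())
    while todo:
        node = todo.pop()
        stack = [node]
        iters = [iter(EDGES(node))]  # parallel stack of edge iterators
        while stack:
            for node in iters[-1]:
                if node in stack:
                    return stack[stack.index(node):]
                if node in todo:
                    stack.append(node)
                    iters.append(iter(EDGES(node)))
                    todo.remove(node)
                    break
            else:
                # Top's iterator is exhausted: it is not part of any cycle.
                iters.pop()
                READY(stack.pop())
    return None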

Update: Fixed a typo in the algorithm (EDGES(top)) and renamed 'all' to 'todo'.

Tuesday, January 13, 2009

The History of Python - Introduction

Python is 19 years old now. I started the design and implementation of the language on a cold Christmas break in Amsterdam, in late December 1989. It started out as a typical hobby project. Little did I know where it would all lead.

With Python's coming of age, I am going to look back on the history of the language, from its conception as a personal tool, through the early years of community building ("If Guido was hit by a bus?"), all the way through the release of Python 3000, almost 19 years later. It's been quite an adventure, for myself as well as for the users of the language.

This won't be an ordinary blog post -- it'll be an open-ended series. I may invite guest writers. I'll be touching upon many aspects of the language's history and evolution, both technical and social.

I'll start with the gradual publication of material I wrote a few years ago, when I was invited to contribute an article on Python to HOPL-III, the third installment of ACM's prestigious History of Programming Languages conference, held roughly every ten years. Unfortunately, the demands of the rather academically inclined reviewers were too much for my poor hacker's brain. Once I realized that with every round of review the amount of writing left to do seemed to increase rather than decrease, I withdrew my draft. Bless those who persevered, but I don't believe that the resulting collection of papers gives a representative overview of the developments in programming languages of the past decade.

The next destination of the draft was a book on Python to be published by Addison-Wesley. Again, the mountain of raw material that I had collected was too large and at the same time too incomplete to serve as a major section of the book, despite the editing help I received from David Beazley, a much better writer than I am.

As they tell prospective Ph.D. students, the best way to eat an elephant is one meal at a time. So today I am publishing the first bit of the elephant, perhaps still somewhat uncooked, but at least it's out there. Hopefully others who were there at the time can help clear up the inevitable omissions and mistakes. I have many more chapters, each still requiring some editing, and I expect this to be a long-running series. Therefore I am starting a separate blog for this, unimaginatively called The History of Python. Follow the link and enjoy!

Friday, November 14, 2008

Overheard

"All you can do with a shell script is make it worse. But since this is Python, you can make it better."

Thursday, November 6, 2008

Cisco Developer Contest

This hasn't had enough attention yet: Cisco is inviting application developers who "think outside the box" to innovate and promote the concept of the network as a platform. This is your opportunity to build exciting Linux-based applications on the Cisco Application Extension Platform (AXP), and win a share of the total prize pool valued at US $100,000.

Read more at Cisco's contest site.

Wednesday, October 29, 2008

What makes me feel good

Whenever I feel down, I look at the TIOBE programming community index, and it makes me feel better. :-)

Monday, October 27, 2008

Questions Answered

I have now answered the top 20 questions in my section of "Ask a Google Engineer". Many of the remaining ones sound inappropriate or unanswerable, so I don't expect I'll be answering them, unless the popular vote really brings some of them to the top.

Sunday, October 26, 2008

Why explicit self has to stay

Bruce Eckel has blogged about a proposal to remove 'self' from the formal parameter list of methods. I'm going to explain why this proposal can't fly.

Bruce's Proposal

Bruce understands that we still need a way to distinguish references to instance variables from references to other variables, so he proposes to make 'self' a keyword instead. Consider a typical class with one method, for example:
class C:
    def meth(self, arg):
        self.val = arg
        return self.val
Under Bruce's proposal this would become:
class C:
    def meth(arg):  # Look ma, no self!
        self.val = arg
        return self.val
That's a saving of 6 characters per method. However, I don't believe Bruce proposes this so that he has to type less. I think he's more concerned about the time wasted by programmers (presumably coming from other languages) where the 'self' parameter doesn't need to be specified, and who occasionally forget it (even though they know better -- habit is a powerful force). It's true that omitting 'self' from the parameter list tends to lead to more obscure error messages than forgetting to type 'self.' in front of an instance variable or method reference. Perhaps even worse (as Bruce mentions) is the error message you get when the method is declared correctly but the call has the wrong number of arguments, like in this example given by Bruce:
Traceback (most recent call last):
  File "classes.py", line 9, in <module>
    obj.m2(1)
TypeError: m2() takes exactly 3 arguments (2 given)
I agree that this is confusing, but I would rather fix this error message without changing the language.

Why Bruce's Proposal Can't Work

Let me first bring up a few typical arguments that are raised against Bruce's proposal.

There's a pretty good argument to make that requiring explicit 'self' in the parameter list reinforces the theoretical equivalency between these two ways of calling a method, given that 'foo' is an instance of 'C':
foo.meth(arg) == C.meth(foo, arg)


Another argument for keeping explicit 'self' in the parameter list is the ability to dynamically modify a class by poking a function into it, which creates a corresponding method. For example, we could create a class that is completely equivalent to 'C' above as follows:
# Define an empty class:
class C:
    pass

# Define a global function:
def meth(myself, arg):
    myself.val = arg
    return myself.val

# Poke the method into the class:
C.meth = meth
Note that I renamed the 'self' parameter to 'myself' to emphasize that (syntactically) we're not defining a method here. Now instances of C have a method named 'meth' that takes one argument and works exactly as before. It even works for instances of C that were created before the method was poked into the class.

I suppose that Bruce doesn't particularly care about the former equivalency. I agree that it's mostly of theoretical importance. The only exception I can think of is the old idiom for calling a super method, shown below. However, this idiom is pretty error-prone (exactly due to the requirement to explicitly pass 'self'), and that's why in Python 3000 I'm recommending the use of 'super()' in all cases.
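
For those who don't remember it, the old idiom looks something like this (a sketch with made-up class names):

class Base:
    def meth(self, arg):
        return arg

class Derived(Base):
    def meth(self, arg):
        # Old idiom: repeat the base class name and pass 'self' by hand;
        # in Python 3000 this becomes: return super().meth(arg) + 1
        return Base.meth(self, arg) + 1

Note how the explicit 'self' shows up again in the call; that's exactly the part people get wrong.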

Bruce can probably think of a way to make the second equivalency work -- there are some use cases where this is really important. I don't know how much time Bruce spent thinking about how to implement his proposal, but I suppose he is thinking along the lines of automatically adding an extra formal parameter named 'self' to all methods defined directly inside a class (I have to add 'directly' so that functions nested inside methods are exempt from this automatic behavior). This way the first equivalency can still be made to hold.

However, there's one situation that I don't think Bruce can fix without adding some kind of ESP to the compiler: decorators. This I believe is the ultimate downfall of Bruce's proposal.

When a method definition is decorated, we don't know whether to automatically give it a 'self' parameter or not: the decorator could turn the function into a static method (which has no 'self'), or a class method (which has a funny kind of self that refers to a class instead of an instance), or it could do something completely different (it's trivial to write a decorator that implements '@classmethod' or '@staticmethod' in pure Python). Without knowing what the decorator does, there is no way to decide whether to endow the method being defined with an implicit 'self' argument or not.
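
To see how little the compiler has to go on, here's a rough pure-Python stand-in for '@staticmethod' built on the descriptor protocol (the name 'my_staticmethod' is made up; this is a sketch, not the real implementation):

class my_staticmethod(object):
    def __init__(self, func):
        self.func = func
    def __get__(self, obj, objtype=None):
        # Ignore the instance and the class entirely: hand back the
        # bare function, so no 'self' is ever passed to it.
        return self.func

class C(object):
    @my_staticmethod
    def double(x):
        return x * 2

print(C().double(21))  # prints 42; no 'self' involved

At the point of the 'def', 'my_staticmethod' is just a name; the compiler can't distinguish it from a decorator that preserves the normal instance-method calling convention.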

I reject hacks like special-casing '@classmethod' and '@staticmethod'. I also don't think it would be a good idea to automagically decide whether something is supposed to be a class method, instance method, or static method from inspection of the body alone (as someone proposed in the comments on Bruce's proposal): this makes it harder to tell how it should be called from the 'def' heading alone.

In the comments I saw some pretty extreme proposals to save Bruce's proposal, but generally at the cost of making the rules harder to follow, or requiring deeper changes elsewhere to the language -- making it infinitely harder to accept the proposal as something we could do in Python 3.1. For 3.1, by the way, the rule will be once again that new features are only acceptable if they remain backwards compatible.

The one proposal that has something going for it (and which can trivially be made backwards compatible) is to simply accept
def self.foo(arg): ...

inside a class as syntactic sugar for
def foo(self, arg): ...

I see no reason with this proposal to make 'self' a reserved word or to require that the prefix name be exactly 'self'. It would be easy enough to allow this for class methods as well:
@classmethod
def cls.foo(arg): ...
Now, I'm not saying that I like this better than the status quo. But I like it a lot better than Bruce's proposal or the more extreme proposals brought up in the comments to his blog, and it has the great advantage that it is backward compatible, and can be evolved into a PEP with a reference implementation without too much effort. (I think Bruce would have found out the flaws in his own proposal if he had actually gone through the effort of writing a solid PEP for it or trying to implement it.)

I could go on more, but it's a nice sunny Sunday morning, and I have other plans... :-)