Nicely written article! - Slightly off topic: I love seeing and reading Python code. I used to see many flaws in the language (things like `.append` changing the object, returning `None` instead of creating a copy and returning that), but after ten years of working with Python I really appreciate its versatility, its ubiquitous availability, the large number of libraries and the community. There's nothing I can't solve with it. It's the Swiss army knife in my pocket, available whenever I need to solve something by coding.
> (things like `.append` changing the object, returning `None` instead of creating a copy and returning that)
This would be horrendously inefficient without immutable data structures like Clojure's. Very few languages have that, so it's a strange assumption to make, especially for a language as old as Python.
It's also worth noting that Clojure sacrificed quite a bit to make this as efficient as possible.
“Persistent vectors” are certainly an interesting data structure: they strike a compromise between fast indexing and being able to create, relatively quickly, a copy in which only one element changes, but indexing is made slower to allow for the latter. They also take up more memory on their own, though they are allowed to share memory with their copies.
I will say that my ideal language contains them in the standard library alongside standard vectors that index in constant time.
Further, it should be noted that much of the performance talk rests on the assumption that memory access is truly random access; with CPU caches that assumption is not entirely accurate, and accessing contiguous rather than scattered memory is considerably cheaper in practice, so one also pays a price for the data being scattered around in memory.
Random access into a Clojure vector is going to need more memory lookups than a conventional contiguous array (I don't recall the constants used in the implementation; I think it's either 4 or 8 lookups).
But when you're indexing into the vector sequentially, the memory layout plays rather well with memory caching behavior, and most lookups are going to be in L1 cache, just like they would be in a conventional array.
So lookups are a bit more expensive, but not as much more expensive as one might imagine.
The actual data of a persistent vector is not in contiguous memory but scattered however the JVM wills it, and on top of that, to find which address to retrieve, an algorithm that runs in logarithmic time with respect to the length of the vector must be used, as opposed to a constant-time one.
How can most lookups end up in L1 cache if an element that is 32 indices removed is statistically likely to be arbitrarily far removed in memory?
Of course, all of that is not that material to begin with, given that most elements will be pointers anyway, so the actual objects they point to will already be arbitrarily scattered and this simply adds one more pointer indirection; but for unboxed types such as integers it does play a factor.
So I don't recall what size of chunks Clojure's implementation uses, but I'll assume it uses 64-bit indices with 16-word chunks, because I want to use numbers.
Assuming no part of my-vector is cached at first, the first iteration needs to make a full 16 round trips to main memory — quite bad. But on the next iteration, all of that is cached, and we don't hit main memory at all, and the same until i=16, which requires one round trip. Then when i=16², we need to hit main memory twice, etc.
No doubt this is quite a bit worse than having everything nicely laid out sequentially in memory, but it's not as bad as you're describing.
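To make the shape of the lookup concrete, here is a toy Python sketch of a chunked trie with 16-element nodes, matching the assumption above; it is not Clojure's actual implementation (the real PersistentVector branches 32 ways and also keeps a tail buffer), just an illustration of why indexing costs one pointer hop per level:

CHUNK = 16

def build(items):
    # pack items into leaf chunks, then chunks into parent nodes, and so on
    # (assumes a non-empty input)
    level = [items[i:i + CHUNK] for i in range(0, len(items), CHUNK)]
    depth = 1
    while len(level) > 1:
        level = [level[i:i + CHUNK] for i in range(0, len(level), CHUNK)]
        depth += 1
    return level[0], depth  # root node and number of levels

def lookup(root, depth, i):
    # walk one node per level, consuming 4 bits of the index per step
    node = root
    for shift in range(4 * (depth - 1), -1, -4):
        node = node[(i >> shift) & 0xF]
    return node

root, depth = build(list(range(1000)))
assert lookup(root, depth, 123) == 123

In this sketch the tree is only as deep as the data needs, but the access pattern is the same: a few pointer hops per lookup, with the upper levels staying hot in cache during sequential iteration.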
> Of course, all of that is not that material to begin with given that most elements will be pointers to begin with
I guess this is sort of true. If you're doing random lookups, then using a persistent vector instead of an array list means 17 trips to main memory instead of just 1, so it's not totally inconsequential.
But I think (hope?) that modern JVMs can optimize collections of small immutable objects so that they're not represented as pointers to the heap. Surely ArrayList<Integer> x gets represented as int x[], and not int *x[], at least with the most optimizing JIT level.
I think you are talking about arbitrary stride sequential access, and the comment you were responding to meant stride-1 access. But conventional arrays also fare poorly with arbitrary stride access, though probably they have an advantage with small strides close to the Clojure persistent vector chunk size.
Things like `list.append` modifying in-place might feel like a flaw to some, but I think Python is really consistent when it comes to its behaviour. If you ask a person who comes from an object-oriented world, they'll say it only makes sense for a method on an object to modify that object's data directly.
There are always ways to do things the other way; for example, you can use `x = [*x, item]` to append and create a new copy, while being quite a bit more explicit that a new list is being created.
One related area where Python is not consistent is operators like +=.
In pretty much all other languages that have them, the expected behavior of `A += B` is exactly the same as `A = A + B`, except that A is only evaluated once. Now let's look at lists in Python:
xs = [1, 2]
ys = xs
ys = ys + [3]
print(xs, ys)
This prints [1, 2] [1, 2, 3], because the third line created a new list, and made ys reference that. On the other hand, this:
xs = [1, 2]
ys = xs
ys += [3]
print(xs, ys)
prints [1, 2, 3] [1, 2, 3], because += changes the list itself, and both xs and ys refer to that same list.
(Note that this is not the same as C++, because in the latter, the variables store values directly, while in Python, all variables are references to values.)
The worst part of it is that Python isn't even self-consistent here. If you only define `__add__` in your custom class, you can use both `+` and `+=` with its instances, with the latter behaving normally. But if you define `__iadd__`, as list does, then you can do whatever you want - and the idiomatic behavior is to modify the instance!
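For example, a minimal sketch with a hypothetical Bag class:

class Bag:
    def __init__(self, items):
        self.items = list(items)
    def __add__(self, other):
        # + builds and returns a new Bag
        return Bag(self.items + other.items)
    def __iadd__(self, other):
        # += mutates self and returns it, just like list does
        self.items.extend(other.items)
        return self

a = Bag([1, 2])
b = a
b += Bag([3])
print(a.items)  # [1, 2, 3] -- the += on b reached through to a as well

If Bag defined only `__add__`, the `b += Bag([3])` line would quietly fall back to `b = b + Bag([3])` and a would stay [1, 2].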
For comparison, C# lets you overload + but not +=, and automatically synthesizes the latter from the former to enforce the correct behavior.
You'll get an exception saying that the local variable xs was used before it was assigned - precisely because the assignment hidden inside `+=` makes xs a local variable inside foo.
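The original snippet isn't quoted here, but a hypothetical foo along these lines reproduces it:

xs = [1, 2]

def foo():
    # the assignment makes xs local to foo, so this line raises UnboundLocalError
    xs += [3]

foo()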
But that's good. A string just needs to know about iterables to perform that operation, whereas every iterable would need to implement its own join if you had it the other way around.
It may be necessary in Python, but in general, a language could allow you to define a join method on the iterable superclass/trait/interface that iterates over the elements, converting each to a string and inserting the separator between each of them.
Yet Ruby and JS manage to do it somehow. To me it seems natural that join should be a method on the iterable, and I always have to pause to remember Python is different.
I don't think it should be a method at all. It's just a function: join(iterable, separator). It can also be implemented with reduce naturally: `reduce(lambda x, y: x + separator + y, iterable)`.
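Spelled out as a runnable sketch (note that this repeated concatenation is quadratic, unlike str.join's single pass, and reduce needs an import and raises on an empty iterable unless you pass an initial value):

from functools import reduce

def join(iterable, separator):
    # fold the separator in between successive elements
    return reduce(lambda x, y: x + separator + y, iterable)

print(join(["a", "b", "c"], ", "))  # a, b, c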
Oh yeah, it's horrendous, my point was just that it's functionally equivalent and makes more sense as a function than a method on either object. You can actually call it like this if you want, though: `str.join(separator, iterable)`.
The way it's managed in JS, digging the function out of the prototype to apply it, can be done in Python as well. But unlike JS you won't normally have to, thanks to the method not being defined only on one specific type of iterable.
Ruby does it by having a mixin (Enumerable) that anything meeting a basic contract (roughly equivalent to the Python iterable protocol) can include to get an enormous block of functionality; Python doesn't have (or at least doesn't idiomatically use as freely; ISTR that there is a way to do it) mixins like Ruby does.
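For what it's worth, a rough Python sketch of that Enumerable-style approach is possible with an ordinary mixin class (hypothetical names, nothing from the stdlib):

class JoinableMixin:
    # anything that is iterable can mix this in
    def join(self, separator=""):
        return separator.join(str(x) for x in self)

class MySet(set, JoinableMixin):
    pass

print(MySet(["a", "b", "c"]).join(", "))  # e.g. "a, b, c" (set order is arbitrary)

It's just not idiomatic in Python to bolt such mixins onto the built-in collection types the way Ruby's core classes include Enumerable.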
' '.join() makes more sense to me, and it's more universal too if done right: you accept anything resembling a "sequence" (which Python does), and the individual objects of the sequence just need a sensible str(). And as language maintainer, you only have to maintain one such implementation, not one per collection type.
Javascript, on the other hand, kinda does it worst, at least of the languages I regularly use... .join() is an instance method on Arrays and TypedArrays. But they forgot to add any kind of join for Sets, for example.
(["a", "b", "c"]).join("")
"abc" # alright
(new Set(["a", "b", "c"])).join("")
Uncaught TypeError: (intermediate value).join is not a function
([...new Set(["a", "b", "c"])]).join("")
"abc" # grmpf, have to materialize it into an array first.
That illustrates the drawback: if you make it a method on the concrete sequence types you've got, you'd better not forget any, and you have to make sure the different APIs are consistent, too. If Javascript had a String.join(sep, <anything implementing the iterator protocol>), this wouldn't have been an issue.
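For contrast, Python's spelling of the same Set example works unchanged, because str.join takes any iterable of strings:

print("".join(["a", "b", "c"]))           # abc
print("".join({"a", "b", "c"}))           # abc in some order (sets are unordered)
print("".join(c for c in "abc"))          # abc, generators work too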
Python isn't alone either, by the way. C# has the static string.Join(...) that accepts "enumerables" (IEnumerable<T>), but no array.Join() or list.Join() or dictionary.Join(). Combined with LINQ, especially .Select, that becomes quite handy. There have been plenty of times I've done print-debugging by adding a one-liner along those lines.
In the cases you give, the original list is not being mutated; a new object (a string, not a list) is being created. So it does make sense not to have it be a method call on the list.
Huh, I never even thought we would need to create a copy of an object when adding a new item to it (like a new item to a list, for example). Is there any drawback to doing that in the standard Pythonic way? I actually learned to program using Python and it was my first language. Since then I've only used JS. In both I like using functions a lot and rarely dabble in OOP, since functions are more convenient to me.
You often lose performance in traditional imperative languages when aiming for persistence.
When you have immutability guarantees (like in many functional programming languages like ML or Haskell) you can avoid making copies by sharing the parts of the data structure that don't change.
If this kind of thing interests you, you should check out Chris Okasaki's book "Purely Functional Data Structures".
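As a minimal illustration of the sharing idea in plain Python (a toy persistent linked list, not any particular library): "adding" an element builds one new node and reuses the existing tail instead of copying it.

from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class Node:
    value: Any
    rest: Optional["Node"] = None

def prepend(lst, value):
    # build one new node; the existing list becomes the shared tail
    return Node(value, lst)

xs = prepend(prepend(None, 2), 1)  # the list 1 -> 2
ys = prepend(xs, 0)                # the list 0 -> 1 -> 2
assert ys.rest is xs               # ys shares xs's nodes rather than copying them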
Whether mutating data is better than creating a new copy for everything is a really long debate about immutability and functional programming, with good points on either side, but that's really beside the point here.
In my opinion, you should use whichever method makes your code easy to read and understand for your usecase.
> things like `.append` changing the object, returning `None` instead of creating a copy and returning that
The obvious question is why it can't return a reference to the list instead of returning None. I feel like if I've been using the language on an almost daily basis for ten years now and I still get burned by that all the time, then it's just a poorly designed feature.
The advantage of mutating operations always returning None is that you can easily tell whether a mutation is happening by looking at the code. If you see y = f(x) that means x is unchanged, whereas if you see just f(x) on a line that means something stateful is happening.
Agreed. JavaScript's Array.sort is an example of this. Most of JavaScript's other array methods return a new array and people get used to chaining them, but sort mutates the array and also returns a reference to it. You can actually get pretty far before being bitten by this so long as you're sorting already-copied arrays. But then one day you hit a bizarre bug caused by behavior that's been sneaking past your radar the whole time.
I'll just point out that this is originally from Scheme (I think... Maybe Scheme got it from a previous Lisp) but borrowed by Ruby. Neither Scheme nor Ruby do a perfect job with sticking to the naming convention, at least if we include popular libraries, but it is very handy and intuitive.
Python has a naming convention as well: `sorted(arr)` returns a sorted copy and `arr.sort()` works in-place (and returns None). However, I've always thought it's a bit odd that one is a function and the other is a method.
The ! convention is ok but I don't think it's optimal because, in the presence of higher-order functions and related concepts, it's often not clear if a function should be marked as !.
For example if I have a map function that applies a function f to a sequence, should I call it map! because I might pass in a function f that mutates the input? If so then it seems like any function that takes a function as input, or any function that might call a method on an object, should get marked with ! just in case. But if I don't mark it that way then the ! marking is not as informative: I might end up with a line consisting only of non-! functions which still mutates the input.
Note that “!” in Ruby doesn’t conventionally mean “modifies the receiver”, it means “does something that is similar but less likely to be safe than a method with the same name without ‘!’”.
A very common case of this is mutating versions of non-mutating methods, but (1) mutating methods (in stdlib or other idiomatic code bases) that have no non-mutating equivalent are not named with a “!”, and (2) methods are sometimes named with “!” because they do dangerous things compared to a base method that are not mutating the receiver.
map! would mean a function that performs a map in-place on the array by replacing the values. So it would depend on if the callback was encouraged to mutate the array or discouraged from doing so.
Array#map! actually exists in Ruby, too. It mutates the array item by item. Enumerable#each doesn't have a bang, because it doesn't change the enumerable itself, even though it can mutate the objects contained in the enumerable. This is overall consistent -- there's a distinction between mutating the receiving object and mutating objects contained in or referred to by the receiving object.
random.shuffle() has bitten me that way a few times too:
array = random.shuffle(array)
because I expected it to return a copy or a reference, but instead it made my array None.
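The working spellings, for the record, are either shuffling in place without reassigning, or asking for a shuffled copy explicitly:

import random

array = [3, 1, 2]
random.shuffle(array)                               # shuffles in place, returns None
shuffled_copy = random.sample(array, k=len(array))  # new shuffled list, original untouched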
It would also enable chaining operations:
array = array.append(A).append(B).sort()
In-place vs immutable copy is a language design choice with tradeoffs on both sides, but there's no reason that I can see to not return a reference to the list.
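To illustrate, here's a hypothetical list subclass whose mutating methods return self (which is not how the builtin list behaves):

class ChainList(list):
    def append(self, item):
        super().append(item)
        return self
    def sort(self, **kwargs):
        super().sort(**kwargs)
        return self

array = ChainList([3, 1]).append(2).append(0).sort()
print(array)  # [0, 1, 2, 3]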
Perhaps recognizing this is really the job of an external linter. Sometimes I wonder if the future of enforcing canonical formatting on save like "gofmt" or "black" will extend to auto-correcting certain goofy errors on each save.
mypy would yell at you about this, but afaik type-checked python still isn't the norm.
In Python, a function that makes and returns a copy would be idiomatically named shuffled(). Consider sorting:
xs.sort() # in-place
ys = sorted(xs) # copy
As for functions returning the object - I think it's a hack around the absence of direct support for such repetition in the language itself. E.g. in Object Pascal, you'd write:
with array do
begin
append(A);
append(B);
sort;
end;
> Used to see many flaws in the language (things like `.append` changing the object, returning `None` instead of creating a copy and returning that)
I think it's pretty off-base to call this a "flaw". Immutable structures have their place and can be very helpful where appropriate, but making its core primitives work this way is far outside the scope or the philosophy of Python. If you want otherwise, you're really wanting an entirely different language. And there's nothing wrong with that! But I think it would be a "flaw" for Python to make these operations immutable, even though I love immutability personally.
And also like a Swiss army knife, it's not particularly great at anything, can be awkward even when functional, and there's always a better tool for any specific job.