This is more of a "Can you fake being a robot?" test.


Doesn't pretty much every language nowadays have foreach loops that don't require keeping an integer index around when iterating over elements? That seems like a way better idea.
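For example, in Python (a minimal sketch of what I mean):

  items = ["a", "b", "c"]

  # No index bookkeeping needed
  for item in items:
      print(item)

  # And when you genuinely want the index, enumerate() provides it
  for i, item in enumerate(items):
      print(i, item)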


Oh god, T9 text input is back. Would be interesting to see how it can work with modern autosuggest, though.


How is this different from PIX for Windows (https://blogs.msdn.microsoft.com/manders/2006/12/15/a-painle...)? Is it just an update for DX12?


> How is this different from PIX for Windows

That "died" with the DirectX SDK (which last recieved an update June 2010) when the DirectX SDK release cycle was merged into the general windows one AFAIK. It's had several lingering bugs - to wit, OOM issues, D3D11 support breaking on Windows 7?, and issues with managed code (64-bit specific perhaps? It's been awhile...)

Visual Studio gained support for graphics debugging as a "replacement", in what appears to have at least been a complete rewrite of the UI at bare minimum - with some new features and some missing ones, at least for awhile there.

Aside from "DXSDK PIX" and "VS PIX", there's also been "360 PIX" and "XB1 PIX" (referring to the consoles) - all effectively different pieces of software with their own quirks, bugs, feature sets, user interfaces, limitations... there's probably some common DLLs in the mix, but they're so varied in even 'core' functionality like capture behavior - that it wouldn't surprise me if there was minimal code shared between them.

This latest standalone version looks visually similar to "XB1 PIX", but that could handle D3D11 and this apparently can't? Should I call this "12 PIX"?


I used PIX back in 2012 for porting an engine from D3D9 to D3D11, and I'm happy to see that it finally got an update to bring it back up to scratch.

Old PIX used to be a powerhouse, but riddled with inconsistencies with new D3D11 features, and unstable to boot. It was one of those tools you loved when it worked, and hated if it didn't.


Looks like it's an update for DX12 and also a much-improved UI.

PIX is a great tool that has never gotten its time in the sun and I'm glad to see it is still being maintained, at least for now.


Agreed! PIX is really great. I first used it while working on the X360 and I was blown away by the ability to debug a pixel and replay a scene and step through draw calls, GPU state, etc.


d3 Technologies | London (software) or NYC (sales) | Full-time, onsite http://www.d3technologies.com/contact/jobs

d3 Technologies (unrelated to d3.js) develops an integrated visual video production suite and custom hardware for running high-end and complex events, shows and installations.

We're looking for a Windows generalist (Visual Studio, C++, Python, DirectX, OpenCV, ZeroMQ) interested in high-performance, LAN distributed, soft real-time systems.

We're growing fast, with two new offices opened in the last year (NYC and HK), and are looking to branch out into more cloudy technologies too. The team includes an interesting mix of creatives, networking/linux nutters (ie. me) and gamedevs.

The interview process includes a simple code test.


I probably should've added "potentially" as my personal sample size is one ;)


I manage a small fleet of six T460s and we're sitting at 50% of the units hitting a BSOD during the update and then being stuck in a perpetual reboot loop.

I've found that if I roll Windows back using Win10 installation media to a pre-Anniversary state, the computers that BSOD will (be automatically forced to) update and not BSOD on the second go around.

The "your computer has to update" mantra is a real bummer in a situation like this since if you connect the computer to the internet, it will download updates and attempt to upgrade/break again. I'm not sure if I've gotten lucky with the three I've rolled back and then had re-update, or if some poor soul is an eternal restore -> update -> BSOD loop.


* Jordan Inkeles, Altera's director of product marketing for high end FPGAs

Speaking in 2012, Danny Biran – then Altera’s senior VP for corporate strategy – said he saw a time when the company would be offering ‘standard products’ – devices featuring an FPGA, with different dice integrated in the package. “It’s also possible these devices may integrate customer specific circuits if the business case is good enough,” he noted.

There was a lot going on behind the scenes then; Altera was already talking with Intel about using its foundry service to build ‘Generation 10’ devices, and the company was eventually acquired by Intel in 2015.

Now the first fruit of that work has appeared in the form of Stratix 10 MX. Designed to meet the needs of those developing high end communications systems, the device integrates stacked memory dice alongside an FPGA die, providing users with a memory bandwidth of up to 1Tbyte/s.

“A few years ago,” said Jordan Inkeles, director of product marketing for high end FPGAs, “we partnered with Intel for lithography and were very excited. We also looked at Intel’s packaging technology and asked ‘can we use that?’. The answer was ‘yes’. The combination has allowed us to do things we thought were not possible.”

The concept is based on what Altera – now Intel’s Programmable Systems Group (PSG) – calls ‘tiles’. Essentially, these are the dice which sit alongside the FPGA. Tiles are connected to the FPGA using Intel’s EMIB – embedded multi-die interconnect bridge – technology. “It’s not a traditional silicon interposer,” Inkeles explained. “It’s a little bridge chip which is used where you need to connect two pieces of silicon.”

* Stratix 10 MX is said to combine the programmability and flexibility of Stratix 10 FPGAs with integrated 3D stacked high bandwidth memory devices

Stratix 10 MX devices are designed to help engineers solve demanding memory bandwidth challenges which can’t be addressed using conventional memory solutions. The parts integrate four stacks of HBM2 DRAM, each with up to four memory dice. PSG says the parts are suitable for use where bandwidth is paramount. Apart from providing 10 times more memory bandwidth than conventional solutions, Stratix 10 MX devices are said to be smaller and to use less power.

“This idea of integrated chips opens up things,” Inkeles said. “FPGAs are trying to be everything to everyone. They have to support wireless, wired, networking, radar and high performance computing, amongst others. We saw divergence in what was possible.”

PSG started thinking about transceivers. “If we had transceivers in separate tiles, we could come out with devices for different markets,” Inkeles continued. “It also makes sense for analogue, which doesn’t move at the same pace as digital, and for design reuse. So we could use a tile that meets today’s needs – say a 28G transceiver – then come out in the future with a 56G PAM4 tile and a 28G NRZ tile. In the same process node time frame, we can deliver two very different types of product.”

This is the concept underpinning the MX. “Parallel memory is becoming a huge challenge,” Inkeles observed. “You can continue to use parallel interfaces, but with the memory right next to the FPGA to maintain signal integrity and reduce power. But, while Hybrid Memory Cube (HMC) is a good solution, it has to be serial,” he continued, “as you can’t get signal integrity on a 72bit wide datapath. Or you can put memory in the package.

“By providing up to four stacks of four DRAM dice, we’re providing a memory bandwidth never seen before. Each stack can run to 256Gbyte/s, so four stacks give 1Tbyte/s. That’s unprecedented and can’t be achieved with HMC.

“Power consumption is reduced because the memory is right next to the FPGA and drive strength is much smaller – only pJ/bit – because you’re not driving signals to a memory that could be 6in away.”

There is a downside, however; it’s an expensive solution. “You’re paying for bandwidth,” Inkeles admitted. “But customers complain about the effort it takes to do board layout and to get the DDR chips right. We’ve solved that without using any I/O or transceivers. And if 16Gbyte of DRAM in package isn’t enough, you still have transceivers and I/O available for use with external components.”

Inkeles pointed to three broad application areas for the MX device. “There’s high performance computing (HPC), cloud computing and data centres, but they all look for different things.

“HPC says ‘give me everything’, while cloud says it’s worried about the cost per bit. Data centres can build algorithms in logic, which is quicker than a GPU, but need the memory bandwidth to ‘feed the beast’.”

Apart from imaging applications, such as medical and radar, Inkeles says there are applications in wireline communications. “Gone are the days of just routing traffic,” he said. “Everyone is now looking to differentiate their products, for example, by providing statistics on the data being handled. So they need to hold a piece of traffic for a moment to analyse what it is, then send it onwards. This couldn’t be done before because there wasn’t the bandwidth.”

MX is the first implementation of PSG’s strategy and the interesting thing is ‘what comes next?’. It’s quite possible that optical functionality might appear at some point in Intel PSG’s Stratix 10 parts.

Five years ago, Altera announced plans to integrate optical interfaces into its FPGAs as a way to cope with increasing communications bandwidth. Despite demonstrating the technology later in 2011, the idea remained on the shelf. Inkeles said: “We have continued to evolve the technology, but haven’t gone public with the developments.”

Inkeles noted: “Although PAM4 offers a way to stay in the electrical domain, we will, at some point, run out of capability and we’ve been preparing for that transition. Now we have transceivers on tiles, we can take out one tile and replace it with an optical interface.

“We’ve been working behind the scenes,” Inkeles continued, “but the right time to put a product into the market will depend on the economics.”

Altera’s acquisition by Intel also gives it access to silicon photonics technology. “We have exciting capabilities,” Inkeles added.

* Heterogeneous 3D system-in-package integration could enable a new class of FPGA-based devices

Another potential step is integrating such components as analogue, ASICs and CPUs alongside an FPGA. Intel PSG says EMIB offers a simpler manufacturing flow by eliminating the use of through silicon vias and specialised interposers. The result, it claims, will be integrated systems in package that offer higher performance, less complexity and better signal and power integrity.

Inkeles sees this as potentially a new market. “ASICs have become smaller and faster, but not cheaper. Unless you’re going to sell millions, you will have a tough time,” he said. “ASSPs are going away, unless you can find more customers or more volume.”

Is it possible that Biran’s vision of ‘standard products’ might be close to reality and could that even include custom versions of a Stratix 10? “Will we do custom?” Inkeles wondered. “It’s within our ability. It’s not something we’re promoting, but we are engaging with customers.

“We have a range of options. Now we’re part of Intel, the ‘sky’s the limit’. As Altera, we developed HardCopy and had an ASIC team, but it wasn’t our core competence. But Intel Foundry can do ASIC,” he concluded.


From the reddit discussion[0], this appears to be a misreading of the law and it is actually talking about online ad publishers blocking competitors' ads through malicious software.

[0] https://www.reddit.com/r/worldnews/comments/4u2jd4/china_wil...


The fine article explicitly says that's the intent of the law, but that local lawyers suggest it also could be used against adblockers.


Database died. Google cache: http://webcache.googleusercontent.com/search?q=cache:5V0TMa0...

The gist of it is:

* Python spends almost all of its time in the C runtime

This means that it doesn't really matter how quickly you execute the "Python" part of Python. Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them. Another analogy I give is that executing Python is more similar to rendering HTML than it is to executing JS -- it's more of a description of what the runtime should do rather than an explicit step-by-step account of how to do it.

Pyston's performance improvements come from speeding up the C code, not the Python code. When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question is whether that technique would help speed up C code. I think this is the most fundamental misconception about Python performance: we spend our energy trying to JIT C code, not Python code. This is also why I am not very interested in running Python on pre-existing VMs, since that will only exacerbate the problem in order to fix something that isn't really broken.


> This means that it doesn't really matter how quickly you execute the "Python" part of Python. Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them.

That doesn't really explain why Python is slow. You're just explaining how Python works. Why should C code be slow? Usually it is fast. Just saying the opcodes are complex doesn't really help, because if a complex opcode takes a long time, it is usually because it is doing a great deal.

Java used to have the opposite problem. It was doing too much at the "Java bytecode" level, such as string manipulation - so they added more "complex" opcodes written in C/C++ to speed things up, significantly.

What you really need to explain is why Python is inefficient. Bloated data structures and pointer hopping for simple things like adding two numbers may be a big reason. I know Perl had many efficiencies built in, and was considered quite fast at some point (90s?).


> What you really need to explain is why Python is inefficient.

Python is extremely dynamic and this makes things hard for someone who wants to build a JIT.

The powerful bits of Python metaprogramming make it really impossible for a JIT to say with some certainty, across all running threads, that what it is doing is right.

Inlining a simple call like a.x() is rather hard when everything underneath can move around - I am not saying that it always does, but implementing a python variant which is nearly the same isn't very useful.

Compare this to PHP, which has fixed method calls (unless you use runkit, which you shouldn't) - a->x() will always be the same method as long as there was an a->x which was valid.

The method will never change once it has been validated.

However unlike Java, both languages end up not knowing exactly what type "a" will be when the method is being called.

Java also doesn't quite know, but only when the invoke is via an interface. But the engine at least knows exactly how many impls of that interface have been loaded so far (and the bi-morphic case is commonly 1 real impl and 1 mock impl).

But both in the case of PHP and Python, the whole idea of "which object do I have to look up ::x() for?" is an unknown. In PHP's case, you have to look it up once per class encountered, and in Python's case, you have to verify that someone hasn't replaced it at runtime.

There are very nice functional ways around this problem at the bottom end of numeric loops for Python, which makes it great for numeric processing interleaved with generic control flow.

numpy + numba is a great way of limiting all this and getting performance out of a simple loop. And I'd rather use numpy + a python script doing regexes rather than a C program + LAPACK.
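For example, a rough sketch of the kind of loop I mean (assuming numpy and numba are installed; sum_of_squares is just an illustrative name):

  import numpy as np
  from numba import njit

  @njit  # compile this loop to machine code; the dynamic lookups disappear inside it
  def sum_of_squares(a):
      total = 0.0
      for i in range(a.shape[0]):
          total += a[i] * a[i]
      return total

  data = np.random.rand(1_000_000)
  print(sum_of_squares(data))  # first call pays the compile cost, later calls are fast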

But that performance doesn't translate over when you have class/object oriented structures or in general, just multi-threaded web/rpc style code.


JavaScript has the same problems, and that hasn't stopped every major JS engine from building a JIT.


JS is single-threaded which makes an enormous difference to actually squeezing performance out of your JIT.

Just building a JIT for a language generally isn't the hard part. Building a JIT that is substantially faster than a bytecode-compiled implementation of the language is what's hard, and how hard that is depends intimately on the semantics of the source language. When I say intimately, I mean every single detail of the language's semantics matter.


This article is a follow up to an earlier post (https://blog.pyston.org/2016/06/30/baseline-jit-and-inline-c...) which provides additional context. In short, python opcodes are inefficient because they must support a significant amount of dynamicity, even as opposed to some other language peers like JS, PHP, Lua, etc. The original post explains some of the cool stuff they're doing with Pyston (specifically baseline jit and inline caching) to combat it.


This is also why TensorFlow is such a resource hog compared to Torch. My MacBook Pro has Intel Iris graphics, so no CUDA, but I was able to get usable results with Torch; I'm still struggling to get TensorFlow to produce anything useful. Compared to Lua, Python is very hungry. Apple is also going to need a decent GPU option in its new MacBook Pros to keep people who work on AI projects from bailing for hackintoshes.


I don't know a ton about low level language implementation, so excuse me if this comment is misguided, but...

Does type hinting in Python 3.x (PEP 484) change this (i.e., reduce the overhead due to dynamicness)? I know it doesn't include runtime type checking, so perhaps it's moot but maybe someone has written an interpreter that does hard type checking either at runtime or through a compilation step.


> Does type hinting in Python 3.x (PEP 484) change this

No. The type hints are only meant for tooling (type checking, refactoring, documentation, etc.).
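For instance (a minimal sketch), CPython stores the annotations but never enforces them:

  def double(x: int) -> int:
      return x * 2

  print(double("ab"))            # "abab" -- no runtime type error
  print(double.__annotations__)  # {'x': <class 'int'>, 'return': <class 'int'>}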


> Java used to have the opposite problem. It was doing too much at the "Java bytecode" level, such as string manipulation - so they added more "complex" opcodes written in C/C++ to speed things up, significantly.

Where'd you get that idea? It's because of advanced JITs that Java got a lot faster.

- former JVM hacker


This. The article does not explain at all where exactly all the processor cycles are going. It's basically just saying "it's not the Python language's fault!" but fails to name a specific culprit.

It says it's spending the cycles in the "C Runtime" but what exactly does it (have to) do in the C Runtime that eats up performance?


> It's basically just saying "it's not the Python languages' fault!"

The article is actually saying the exact opposite. It claims Python the Language is slow because the opcodes need to do a lot of work according to the language specification. Python is not slow because the core team has done a poor job implementing the opcode interpreter and runtime.

When you have a language with thin opcodes that map closely to processor instructions then compiler improvements lead to smarter opcode generation which translates to efficient machine code after jitting. When you have fat opcodes you're SOL.

Consider this: an instruction like STORE_ATTR (which implements obj.name = value) has to check whether the top of the stack refers to an object, then whether writing to the object's attributes is allowed. Then "name" has to be checked to see whether it's a string. Perhaps additional checking is needed to normalize the string or do other unicode testing. Then the assignment has to happen, which is a dictionary update (internally a hash map insertion that may trigger a resize). This is just the tip of the iceberg. A lot more stuff is happening, and the correct exceptions have to be thrown when the instruction is used incorrectly (which leads to code bloat that hurts the instruction cache).
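You can see how fat that opcode is with the dis module (a minimal sketch; the exact bytecode varies between CPython versions):

  import dis

  def assign(obj, value):
      obj.name = value

  dis.dis(assign)
  # The interesting line is the single STORE_ATTR instruction, which hides all of
  # the attribute lookup, dict update and error checking described above.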

A thin bytecode instruction for STORE_ATTR could actually reduce the store to a single LEA machine code instruction (Load Effective Address).

The downside of a language with a thin instruction set is that the individual instructions can't validate their input. They have to trust that the compiler did its job correctly, a segfault or memory corruption will happen otherwise. One of Guido's goals when creating Python was that the runtime should never ever crash, even on nonsensical input. This pretty much rules out a thin instruction set right from the start. Simple concurrency is also a lot easier with a fat instruction set, because the interpreter can just yield in between instructions (Global interpreter lock). With a thin instruction set there is no clear delineation between instructions where all memory is guaranteed to be in a stable state. So a different locking model is needed for multi-threading, which adds even more complexity to the compiler and runtime.


All the problems you're describing are solved with a powerful JIT. And the core team do seem to be opposed to doing the work needed for that.


Python's philosophy chooses simplicity over everything else. Simple grammar. Simple bytecode instruction set. Simple CPython implementation. Simple threading model (GIL). Simple data structures. Adding a highly sophisticated and complicated JIT on top of that makes little sense.

It's not so difficult to create a high performance language that's much like Python. It's just not possible to make Python fast without abandoning some of its founding principles.


Why is a simple CPython implementation such an important requirement?

Portability? Make the JIT optional.

Ease of maintenance? Get a small team of experts to maintain it on behalf of everyone else.

Openness to beginners? That would be nice if possible as well, but CPython's job is to run programs rather than to educate.

A JIT needn't make the grammar, bytecode or threading model more complex. It would make data structures and the implementation more complex, but do you not think that's worth it if Python could be twice as fast?


> CPython's job is to run programs rather than to educate.

CPython's 'job' is to be the reference implementation of Python.


But that's just not the case in reality is it? In reality it's the main production implementation and its inefficiency costs the world wasted resources every day.

If readability and being the reference implementation is more important than performance, why is Python implemented in C rather than a higher level language?


> In reality it's the main production implementation and its inefficiency costs the world wasted resources every day.

Sure, but the inefficiencies in every part of the stack from the physical CPU right up to the executing program also cause waste.

> If readability and being the reference implementation is more important than performance, why is Python implemented in C rather than a higher level language?

Because like most projects it grew organically. Guido didn't sit down and write the first version of Python thinking "hey, this is going to be the reference implementation for Python so let's write it in pure pseudocode so it's easy to read"; he bashed out a version in C and it gained momentum over time. At the point where it became the reference implementation rather than the only implementation, it would be suicide to chuck it out and re-write it in some high-level language.


To be fair, the GIL wasn't included because it was a simple threading model (AFAIK). It was included because it was simple to implement and it was/is fast(er) (than removing it)[1][2].

If the Gilectomy [2] project succeeds, Guido has mentioned he would consider it for Python3.6+ [3].

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s


Hence why I'd rather support Julia and leave Python for shell-scripting-like tasks.


*cough* sufficiently smart compiler *cough*.


No, we have compilers that can do these things today.


A small nit: LEA does the calculation but doesn't read or write from that address. In times of old, this instruction used the memory addressing port to do the calculation, but these days it's just a normal arithmetic instruction with slight difference that it doesn't set the flags based on the result. Instead, the ideal would be for the bytecode to reduce to a single MOV. In addition to loading and storing, MOV itself supports several forms of indexed and indirect address calculation which execute on dedicated ports without adding latency.


Which complex bytecodes did Java introduce?

Since HotSpot was introduced, the amount of C++ in the reference JDK has been incrementally reduced between releases, with HotSpot improving its heuristics.

To the point that Graal is a pure Java JIT.

Also in the 80's and early 90's C compilers generated awfully slow code.

C developers have to thank almost 40 years of research in C optimizers and misuse of UB optimizations for the current state of C compilers quality in code generation.


The author means the Hotspot JVM has added intrinsics for memset, autovectorization, etc.

I don't agree with the main point of the article though. PyPy, ZipPy, etc. have shown that there are real gains from running the actual Python code faster.


>>I know Perl had many efficiencies built in, and was considered quite fast at some point (90s?).

There are a lot of threads in Perlmonks that talk in detail about speeding up Perl, related project et al.

To summarize: languages like Perl and Python are slow because they do a lot of work out of the box that languages like C don't. Therefore, when you talk of translating Python to C, or Perl to C, what you are essentially talking about is translating all that extra work back into C, which will run about as fast as Perl or Python itself.

The easier you make things for the compiler or interpreter, the faster your code can run, and vice versa.

Python is slow for the very reason it's famous: it's easy for the programmer.


Lisp- and Smalltalk-like languages run circles around Python and Perl in performance.

They enjoy the same powerful features, and have JIT and AOT compilers to native code.

It all boils down to how much the language designers care about performance.


And also how much the language designers care about proper language design.


I'm not sure if this really answers the question though. There are plenty of languages that do more work than C does that are not as slow as python. Do they do less work than python? Maybe, but they certainly do more work than C. The question is, even if they do less work than python, is the extra work python doing valuable to you?


Lua is almost as dynamic and flexible as Python and is very easy for the programmer. LuaJIT's performance is close to that of C.


I think part of the reason for this, though, is that Lua is a very "thin" language. It purposefully doesn't have a lot of fancy features (I mean, come on, it literally has one composite datatype, and you're supposed to use them [tables] for everything from arrays to maps to full objects). By removing a lot of assumptions that make Python "easy", Lua has made it much easier to write optimizing JITers and compilers since runtime behavior is (perhaps a little paradoxically) easier to predict. And it doesn't make Lua any less "easy" a language, it's just a different set of rules to follow. Guido's hunt for the simplest programming language possible is noble, but sometimes it causes poor decisions to be made imho.


Well, I just want to give one example here of why Python is harder to speed up than JS.

In Python, essentially every object is a dictionary, and behavior like "x.y = 'z'" is legal, meaning pretty much anything can be put into object x without any declaration ahead of time. While (I'm not very sure about this) JS doesn't allow quite the same behavior: you can still do the assignment, but accessing will yield undefined.

The above behavior being the default (you can use __slots__, sure, but the compiler can't assume that) is pretty problematic for optimization: the compiler simply won't know where to locate a specific variable ahead of time, so it cannot substitute the variable access with a pointer offset. It has to go through a hash lookup instead, which, compared to the offset alternative, is an order of magnitude slower.
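A small sketch of the difference (class names are just for illustration):

  class Dynamic:
      pass

  class Slotted:
      __slots__ = ("y",)   # fixed layout: attributes live at known offsets

  d = Dynamic()
  d.y = "z"          # fine: stored in d.__dict__, a per-instance hash map
  d.anything = 1     # also fine, no declaration needed

  s = Slotted()
  s.y = "z"          # fine
  # s.anything = 1   # AttributeError: 'Slotted' object has no attribute 'anything'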


Except that LuaJIT does something close to just a pointer offset. The trick it uses is to use a hash function that is known by the compiler so when you have a constant key ("y" in this case) the hash table set is compiled down to 3 instructions (imagine hash("y") = 0xa7):

1. Load whatever is necessary to check if x[0xa7] is occupied
2. Branch to the slow path if occupied
3. Store x[0xa7] = 'z'

And a modern super-scalar CPU will even do some of this in parallel.


You can generally do that in JS as well. (Added properties are sometimes called "expandos".)


True. Just found out. But Chrome's console yields undefined to me... Weird.

Edit: found the reason. It is because I am trying to assign a property to an int. Seems JS won't allow this for a primitive type?

Edit: Python won't allow assignment to a primitive-type object either. The handling is the same.


Are you the OP author, or working on Pyston? I have basically two questions/curiosities — I'm not asking adversarially: 1) For which code is the C runtime most expensive? Typical Python code tries to leave heavy-lifting in libraries, but what if you write your inner loop in Python? Enabling that is (arguably) one goal of JIT compilation, so that you don't need to write code in C. 2) What about using Python ports of performance-sensitive libraries?

In more detail: I arrived at https://lwn.net/Articles/691243/, but I'm not sure I'm convinced. Or rather: with a JIT compiler you probably want to rewrite (parts of) C runtime code into Python so you can JIT it with the rest (PyPy has already replaced C code in their implementation, so maybe there's work to reuse). For instance, an optimizing compiler should ideally remove abstractions from here:

  import itertools
  sum(itertools.repeat(1.0, 100000000))
Optimizing that code is not so easy, especially if that involves inlining C code (I wouldn't try, if possible), but an easier step is to optimize the same code written as a plain while loop. Does Pyston achieve that? I guess the question applies to the LLVM-based tier, not otherwise.
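The plain-loop version I have in mind would be something like:

  # Hand-written equivalent of sum(itertools.repeat(1.0, 100000000))
  total = 0.0
  i = 0
  while i < 100000000:
      total += 1.0
      i += 1
  print(total)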

Yes, Python semantics allow for lots of introspection, and that's expensive — but so did Smalltalk to a large extent. Yet people managed, for instance, to not allocate stack frames on the heap unless needed (I'm pointing vaguely in the direction of JIT compilers for Smalltalk and Self, though by now I forgot those details).


> Optimizing that code is not so easy, especially if that involves inlining C code (I wouldn't try, if possible)

Inlining through C code probably isn't really an option[0], but the optimisation itself shouldn't be that much of an issue; the Rust version and the equivalent imperative loop compile to the exact same code: https://godbolt.org/g/OJHIwc[1]

[0] unless you interpret — and can JIT — the C code with the same underlying machinery as Truffle does

[1] used iter_arith for sum(), but you can replace sum() by an explicit fold for no difference: https://godbolt.org/g/R1BgQQ


> [0] unless you interpret — and can JIT — the C code with the same underlying machinery as Truffle does

Pyston can easily JIT the C code because it uses LLVM for its main JIT tier.


Nope, I have nothing to do with the blog or Pyston. Sorry if I gave that impression.


You may be interested in an optimizing Python compiler called Pythran.[1][2]

[1] - https://github.com/serge-sans-paille/pythran

[2] - https://www.youtube.com/watch?v=Af8B30mXZ7E


I'm not sure your assumptions support your conclusions here, especially your belief that a JIT compiler wouldn't make inroads on the slow C code being executed.

The biggest problem of Python is that it lacks the experts that could write those fast runtimes, and it fails to attract them after the Python leaders declared the GIL to be a non-issue.


There's PyPy. I think the problem with python is the community is especially resistant to breaking changes.


There are lots of Python people who were very happily breaking backwards compatibility in 3.0. I guess the majority of the community was not in that boat, but a pretty big portion was anyway.


So Python runs in very low memory because of no JIT. You can run, say, 10 copies of your Python web server in the space of 1 Java runtime. Will the Java runtime still win? Sure, the benchmarks prove it. But there are advantages to running in low memory. Really cheap hosting for low-traffic stuff comes to mind... Java on a cheap host tends to be a disaster.


Not because of the lack of a JIT, but because of the refcounting GC. Also, the JVM probably loads (or at least is prepared to load - http://www.azulsystems.com/blog/wp-content/uploads/2011/03/2... ) too much crap. (And probably Jigsaw and the other modularity-related OpenJDK projects will help with that.)

Furthermore, Python is not really memory prudent either. PHP is much better in that regard (very fast startup time, fast page render times, but a rather different approach).


If you're running a low traffic site the hosting is a rounding error. We're talking about $5-10 here. Is that really a justification for anything? If money is THAT important you just write the app with Nginx/SQLite/Lua in the first place.


The BDFL (Benevolent Dictator For Life) Guido van Rossum himself put forth the idea that he would consider a patch removing the GIL in 2007 [1]:

> "... I'd welcome a set of patches into Py3k only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease."

Unfortunately, experiments thus far have not succeeded to meet these requirements.

There is some work being done by the Gilectomy project to try and meet this bar as well as some other requirements currently though [2]. But it is currently grappling with the afore-discovered performance issues that come with removing the GIL.

Also at PyCon 2016, Guido himself mentions the Gilectomy project and its potential consideration (if it works) for Python 3.6+ [3].

So when you say Python leaders declared the GIL a "non-issue", I think you are oversimplifying the actual reality of what removing the GIL means and why leaders (like Guido) have been reluctant to invest resources in pursuing it.

[1] http://www.artima.com/weblogs/viewpost.jsp?thread=214235

[2] https://www.youtube.com/watch?v=P3AyI_u66Bw

[3] https://youtu.be/YgtL4S7Hrwo?t=10m59s


That's a silly explanation. A smart JIT would be able to take successive Python opcodes and optimise them together as one operation, so how much time it spends interpreting individual instructions has nothing to do with the "speed", in the sense of "current speed" vs "potential speed".

If you want to talk about speed in any meaningful sense, you have to talk about potential for optimisation. It's possible that Python (and its opcodes) are designed in such a way that there is little potential for optimisation. This has to do with the semantics of the language, not how much time it spends in one part of the code or the other.

I don't know ultimately how much potential for optimisation Python has, but clearly it's a very difficult problem, so we can say with some certainty that Python is "slow", in the concrete sense that there are no low-hanging fruit left for speeding it up.

Edit: I say this as an avid Python user by the way. Especially combined with ctypes, I find the Python interpreter to be an absolutely excellent way to "organize" a bunch of faster code written in C/C++. I actually don't have any problem with Python itself being slow; I kind of like it that way personally. It's easy to understand its execution model if the interpreter is kept fairly simple, and this makes it easy to reason with. But then again I am not writing web-scale backends with it, I am just, more or less, using it to batch calls to scientific C functions. So it really depends on your use case. While I've spent plenty of time tuning every little performance gain out of a tight computational loop in C, I can't think of a single time where I've struggled to figure out how to speed up my Python code -- I just am not using it in ways where that would be necessary.
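As a sketch of that workflow (libm is just a stand-in here for whatever compiled library you actually care about):

  import ctypes
  import ctypes.util

  # Load a C library and call into it directly; Python only orchestrates.
  libm = ctypes.CDLL(ctypes.util.find_library("m"))
  libm.cos.restype = ctypes.c_double
  libm.cos.argtypes = [ctypes.c_double]

  print(libm.cos(0.0))  # 1.0 -- the heavy lifting stays in C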


> That's a silly explanation. A smart JIT would be able to take successive Python opcodes and optimise them together as one operation, so how much time it spends interpreting individual instructions has nothing to do with the "speed", in the sense of "current speed" vs "potential speed".

I think you might actually be agreeing with the explanation. The point about big opcodes means that the opportunities to look at a sequence of opcodes (i.e. the Python part) are reduced because you're doing a lot of computation over a relatively small number of opcodes. So the challenge involves optimizing the "guts" of the opcodes and the sequence of "guts" across relatively few opcodes. Their approach to solving this is discussed in the original blog post (https://blog.pyston.org/2016/06/30/baseline-jit-and-inline-c...). This complication happens to make optimizing Python via JIT compilation a tough problem.


> When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question is whether that technique would help speed up C code.

That's a silly question. JITs have knowledge of how instructions relate to each other. The "C code" you're talking about is not opaque to them; it has meaning that can be optimized in relation to other instructions. When you have a "* 2" bytecode + argument, it doesn't just dispatch to a C function that multiplies the input by 2. The compiler knows the semantics of that and can convert it to a shift if appropriate.

It's not a JIT's responsibility to "speed up the C code". The C code is part of the interpreter, JITs (generally) don't invoke interpreter functions.

Or put differently, if you're spending too much time in the C runtime then maybe more code needs to be ported into JITed language itself so that code can also benefit from JIT optimizations.


Better-formatted link:

https://www.pastery.net/chjxpt/


I think this post misses the point. Having to enter the runtime in the first place is the problem (and making runtime code marginally faster is not the solution). Fast VMs for dynamically typed languages (Chakra, V8) design the object model and the runtime around being able to spend as much time in JITted code as possible - the "fast path".


Also, Python has a global interpreter lock so it has no parallelism.


See Larry Hastings' talk "Removing Python's GIL: The Gilectomy" from PyCon 2016.

https://www.youtube.com/watch?v=P3AyI_u66Bw

As well as his earlier talk "Python's Infamous GIL".

https://www.youtube.com/watch?v=4zeHStBowEk

The GIL is a simple design that provides serious benefit to Python (end user) developers.

There are plenty of options for working around this for parallelism, such as the built-in module multiprocessing.
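For CPU-bound work, that looks roughly like this (a minimal sketch):

  from multiprocessing import Pool

  def cpu_bound(n):
      # Each worker process has its own interpreter and its own GIL
      return sum(i * i for i in range(n))

  if __name__ == "__main__":
      with Pool(processes=4) as pool:
          results = pool.map(cpu_bound, [10**6] * 8)
      print(sum(results))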


I liked the talk, but when he talks about the GIL keeping Python "fast" I have to laugh. It's so slow already, who cares? He is talking about a factor of 1.3 to 2X to remove the GIL in a language that is already 30-100X slower than C/C++.


Not none, poor...


Some experiments I did during my coursework for my M.Sc. indicated that it was actually worse-than-useless at the time (2009?). Here's what would happen:

- When you've got one CPU core, the threads basically just act like a multiplexer. One thread runs for a while, releases the GIL, and the next thread runs for a while. Not a big deal.

- When you've got multiple CPU cores, you've got a thundering herd. When the lock is released, the threads waiting on all of the other cores all try to acquire the lock at the same time. Then after one thread has run on, say, core 3, it's gone and invalidated the cache on the other cores (mark & sweep hurts caches pretty badly). The thundering herd stampedes again and the process continues.

- To make matters even worse, each core runs at low utilization (e.g. a quad core machine, each core runs at ~25%). If you've got CPU throttling turned on (which my laptop, where I started the experiments, did), then the system detects that the CPU load is low and scales down the clock speed. Normally, this would result in increased CPU utilization, which would speed the CPUs back up again. Unfortunately, the per-core utilization stays pegged at 25% and things never speed back up again. The system looks at it and says "huh! only 25%! I guess we've got the CPU speed set properly!"

Maybe it's gotten better since then? I haven't checked recently.

Edit: I wish I had the results handy. The basic conclusion was that you got something like a 1.5x slowdown per additional CPU core. That's not how it's supposed to work! Using taskset to limit a multi-threaded Python process to a single core resulted in significant speedups in the use cases I tried.
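The kind of experiment I mean looked roughly like this (a sketch, not my original code):

  import threading, time

  def burn(n):
      # Pure-Python CPU-bound work: the GIL serialises this across threads
      x = 0
      for i in range(n):
          x += i
      return x

  N, THREADS = 5 * 10**6, 4

  start = time.time()
  threads = [threading.Thread(target=burn, args=(N,)) for _ in range(THREADS)]
  for t in threads: t.start()
  for t in threads: t.join()
  print("threaded:  ", time.time() - start)

  start = time.time()
  for _ in range(THREADS):
      burn(N)
  print("sequential:", time.time() - start)

  # On the old (pre-3.2) GIL with multiple cores, the threaded run could come out
  # slower than the sequential one; pinning with `taskset -c 0` made it recover.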


This is a known issue in py2. On py2, when running on a multi-core machine, it'll run ~1.8x slower (depending on what you are doing) than it runs on a single-core machine. Python 3.2 ships a new GIL[0] fixing the problem.

[0] http://www.dabeaz.com/python/NewGIL.pdf


Dave Beazley! Yes, it was some of his work that inspired my research. Thanks for the reminder!

Edit: That's a beautiful solution to the problem, too. You're still not going to get a performance boost from multiple cores, but you're not going to have it fall flat on its face either.


Sounds interesting. I'd be glad to see a blog post with the actual use cases and a few graphs with different numbers of cores, and maybe the sources so people can go further.


I'll try to dig it up. I suspect it's sitting in an SVN repo somewhere...

If I recall, I took a stock Python interpreter and instrumented it with RDTSC instructions to do lightweight timestamps on GIL acquisitions and releases.


I don't think the GIL is ever released while running Python code is it? So there is no parallelism between Python threads.


You're forgetting multiprocessing. That manages to get round this by running multiple Python interpreters. The big problem is that passing objects is dog slow, as everything has to be pickled in either direction.


No true multi-threading, but multi-processing is not affected by the GIL.

Also, outside of parallelism, Python has good concurrency support with Gevent and now AsyncIO in Python 3.


Sorry for off topic, but I just tried getting the Google cache as well and it just doesn't work. Both in Firefox and Chromium, when I type "cache:<Ctrl+V><Enter>" in the google.com search bar, nothing happens. I checked the developer console, no network requests are made. It says in grey below the search box "press enter to search", but neither enter nor clicking the blue search icon does anything whatsoever.

In the past one could manually type /search?q=cache:something in the address bar and it would force the search, but these days it doesn't seem to work anymore. The search request is done in javascript and the /search?q=x link just prefills the search box, leaving it to javascript to fire the actual query (which then fails to do so).

Edit: found a way: disable Javascript. This forces Google to search immediately. No plugin necessary: in the Firefox developer console, use the cog wheel on the right top, then in the right column somewhere near the center you can tick "Disable Javascript". Loading the Google home- and search page is also noticeably faster, by the way.


I love this adaptation of it too: https://youtu.be/7tScAyNaRdQ


The video is pretty good. Thank you for that.

