Hacker News | staticfloat's comments

Small correction; it’s DisplayLink, not DisplayPort. You have to use DisplayLink because it does part of the work in software as part of its driver (which you must download and install), whereas the native Apple output has an internal hardware limitation.

I run three 1080p monitors off of my M1 Pro, and I don’t notice any heavy CPU load from whatever work the driver is doing in software.


DisplayLink is lossy software compression of your video stream, and when I last explored using it with Macs in a professional setting ~3 years ago, the software was borderline unsupported by the vendor on the Mac platform and absolutely unreliable.


> Small correction; it’s DisplayLink, not DisplayPort.

Thanks, staticfloat and ksala_!


We actually use this in our CI system to limit write access outside of the build environment’s build folder.

You can see some Julia code that generates the sandbox config rules here: https://github.com/JuliaCI/sandboxed-buildkite-agent/blob/ma...


This linear scaling per core doesn't match my experience with the AMX co-processor. In my experience, on the M1 and the M1 Pro, there is a limited number of AMX co-processors, independent of the number of cores within the chip. I wrote an SO answer exploring some of the performance implications of this [0], and since then another, more knowledgeable poster has added more information [1]. One of the key takeaways is that there appears to be one AMX co-processor per "complex", leading us to hypothesize that the M1 Pro contains 2 AMX co-processors.

This is supported by taking the code found in the gist [2] linked from my SO answer and running it on my M1 Pro. Compiling it, we get `dgesv_accelerate`, which uses Accelerate to solve a medium-size linear algebra problem; it typically takes ~8s to finish on my M1 Pro. While running, `htop` reports that the process is pegging two cores (analogous to the result in my original SO answer on the M1, where it pegged one core; this supports the idea that the M1 Pro contains two AMX co-processors). If we run two `dgesv_accelerate` processes in parallel, they take ~15 seconds to finish, so there is a speedup, but it's very small. And if we run four processes in parallel, they take ~32 seconds to finish.
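For anyone who wants to reproduce the scaling test, here is a rough sketch of the harness (plain Python; `./dgesv_accelerate` is the binary compiled from the gist linked below, so that path is an assumption on your build setup). If there really were one AMX unit per core, N copies would finish in roughly constant time; instead, wall time grows with N:

```python
import subprocess
import time

def time_parallel(cmd, n):
    """Launch n copies of cmd concurrently and return total wall-clock time."""
    start = time.monotonic()
    procs = [subprocess.Popen(cmd) for _ in range(n)]
    for p in procs:
        p.wait()
    return time.monotonic() - start

# Hypothetical usage, assuming you've compiled `dgesv_accelerate` from the gist:
# for n in (1, 2, 4):
#     print(n, "copies:", round(time_parallel(["./dgesv_accelerate"], n), 1), "s")
```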

All in all, the kind of linear scaling shown in the article doesn't map well to the limited number of AMX co-processors available in Apple hardware, as we would expect the M1 Max to contain maybe 8 co-processors at most. This means we should see parallelism step up in 8 steps, rather than 20 steps as was shown in the graph.

Everything I just said assumes that a single process running well-optimized code can completely saturate an AMX co-processor. That is consistent with the tests I've run, and I'm assuming that the CFD solver he's running is well-written and making good use of the hardware (it does seem to be, judging from the shape of his graphs!). If this were not the case, one could argue that increasing the number of threads could allow multiple threads to share the underlying AMX co-processor more effectively, and we could get the kind of scaling seen in the article. However, in my experiments, I have found that Accelerate very nicely saturates the AMX resources, leaving none left over for further sharing (as shown in the dgesv examples).

Finally, as a last note on performance, we have found that using OpenBLAS to run numerical workloads directly on the Performance cores (and not using the AMX instructions at all) is competitive on larger linear algebra workloads. So it's not too crazy to assume that these results are independent of the AMX's abilities!

[0] https://stackoverflow.com/a/67590869/230778 [1] https://stackoverflow.com/a/69459361/230778 [2] https://gist.github.com/staticfloat/2ca67593a92f77b1568c03ea...


My guess is that it’s due to git looking at its own base name to figure out which command it’s supposed to run as, kind of like busybox does.
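A minimal sketch of that busybox-style trick (illustrative Python, not git's actual C logic; the `git-` prefix convention is the assumption here):

```python
import os

def command_from_argv0(argv0):
    """Pick a subcommand from the program's own base name, busybox-style.

    A link named `git-status` dispatches straight to `status`; a plain
    `git` falls through to reading the subcommand from argv[1] instead."""
    name = os.path.basename(argv0)
    if name.startswith("git-"):
        return name[len("git-"):]
    return None
```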


Fun anecdote about that: at work we added a wrapper around nvcc to point it at the right compiler, and renamed the original "nvcc" binary to ".nvcc-wrapped". But nvcc looks at argv[0] to print out its name in the --version output, and it truncates anything after a period (presumably to handle things like "nvcc.exe"?). And CMake's CUDA detection looks at nvcc --version. So CMake went down a really weird path where it knew that nvcc existed but didn't really believe it was nvcc, which was extremely confusing until I looked at some log output and went "wait, why isn't nvcc printing its own name".
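To make the failure mode concrete, here's a sketch (in Python, assuming nvcc's argv[0] handling is as described above) of why the rename backfired: the leading dot on `.nvcc-wrapped` means *everything* sits after the first period, so the tool reports an empty name:

```python
import os

def displayed_name(argv0):
    """Mimic argv[0] handling that drops everything after the first
    period (presumably intended to strip suffixes like ".exe")."""
    return os.path.basename(argv0).split(".", 1)[0]

# "nvcc.exe"      -> "nvcc"   (the intended case)
# ".nvcc-wrapped" -> ""       (the confusing case)
```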


Yep. Source of that error message. https://github.com/git/git/blob/master/git.c#L879


In the Julia world, we make redistributable binaries for all sorts of things; you can find lots of packages here [0], and for LLVM in particular (which Julia uses to do its codegen) you can find _just_ libLLVM.so (plus a few supporting files) here [1]. If you want a more fully-featured, batteries-included build of LLVM, check out this package [2].

When you use these JLL packages from Julia, it will automatically download and load their dependencies, but if you're using them from some other system, you'll probably need to manually check the `Project.toml` file to see what other JLL packages are listed as dependencies. As an example, `LLVM_full_jll` requires `Zlib_jll` [3], since we build with support for compressed ELF sections. As you may have guessed, you can get `Zlib_jll` from [4], and it thankfully does not have any transitive dependencies.

In the Julia world, we're typically concerned with dynamic linking (we `dlopen()` and `dlsym()` our way into all our binary dependencies), so this may not meet all your needs, but I figured I'd give it a shout-out as it is one of the easier ways to get some binaries; just `curl -L $url | tar -zxv` and you're done. Some larger packages like GTK need environment variables set to get them to work from strange locations like the user's home directory. We set those in Julia code when the package is loaded [5], so if you use a dependency like that, you're on your own to set whatever environment variables/configuration options are needed to make it work at an unusual location on disk. Luckily, LLVM (at least the way we use it, via `libLLVM.so`) doesn't require any such shenanigans.
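If you'd rather script the download than shell out to curl, a Python equivalent of `curl -L $url | tar -zxv` (where the URL is whichever release tarball you picked from the repos above) might look like:

```python
import io
import tarfile
import urllib.request

def fetch_tarball(url, dest):
    """Download a .tar.gz (following redirects, as `curl -L` does)
    and unpack it into dest."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        tar.extractall(dest)
```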

[0] https://github.com/JuliaBinaryWrappers/ [1] https://github.com/JuliaBinaryWrappers/libLLVM_jll.jl/releas... [2] https://github.com/JuliaBinaryWrappers/LLVM_full_jll.jl/rele... [3] https://github.com/JuliaBinaryWrappers/LLVM_full_jll.jl/blob... [4] https://github.com/JuliaBinaryWrappers/Zlib_jll.jl/releases [5] https://github.com/JuliaGraphics/Gtk.jl/blob/0ff744723c32c3f...


I’ll take advantage of this comment to ask a tangential question: where can I learn how LLVM “compilation” works in Julia?

I know code is only supposed to be JIT’ed and then executed by the runtime (that’s why PackageCompiler exists), but still I’d like to know more about how it works.

Like, if I write a simple pure function in Julia and call code_llvm on it… how “standalone” is the LLVM code (if that is even a thing)? When does GC get called? How exactly does the generated code depend on the runtime?

Is there any good explanation of this?


To add to Keno's sibling comment: Julia, as a JIT compiler, essentially creates large chunks of standalone, "static" code and runs those as much as it can, breaking out into the "dynamic" runtime when it has reached the limits of type inference or otherwise needs the runtime, e.g. to perform dynamic dispatch. In those instances, we break out of the standalone code and use the runtime to determine things like where to jump next (or whether to compile another chunk of static code and jump to that). Note that these chunks of static code can be both smaller and larger than a function; it all depends on what Julia can compile in one go without needing to break out into the dynamic environment.


> Like, if i write a simple pure function in Julia and call code_llvm on it… How “standalone” is the llvm code (if that is even a thing)? When does GC get called? How exactly does the generated code depend on the runtime?

It's standalone unless there are explicit calls to the runtime in it. The most common runtime support is probably heap allocation, so if you see a `jl_alloc_obj` in there, that gets lowered to runtime calls eventually. GC gets called during allocation if the runtime thinks enough garbage has been generated to make a collection worth it.


Aha! A chance to plug one of my favorite CGP Grey videos that explores this very question: https://www.youtube.com/watch?v=JEYh5WACqEk


That's a super interesting video. Thanks!


Imagine that instead of getting a grid of pixels once every 30th of a second, you get one pixel’s value, along with its location and the time stamp at which the pixel’s change was noticed. Event cameras can have very fine time stamp resolution (orders of magnitude better than 1/30th of a second), so a bright moving pixel can be tracked very accurately.
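A toy model of the difference (Python; the threshold and event fields here are illustrative, not any particular sensor's spec — real event cameras generate events in hardware, not by diffing frames):

```python
from dataclasses import dataclass

@dataclass
class Event:
    x: int         # pixel column
    y: int         # pixel row
    polarity: int  # +1 got brighter, -1 got darker
    t_us: int      # microsecond timestamp -- far finer than 1/30 s

def events_between_frames(prev, curr, t_us, threshold=10):
    """Emit one event per pixel whose brightness changed by more than
    `threshold`, instead of shipping the whole frame."""
    events = []
    for y, (row_p, row_c) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_p, row_c)):
            if abs(c - p) > threshold:
                events.append(Event(x, y, 1 if c > p else -1, t_us))
    return events
```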


When you explain it like that, existing camera technology looks like the stupidest, most inefficient way to do things possible.

This seems like one of those "obvious in hindsight" discoveries, which are always the best ones.


Well, digital camera technology for video is a pretty cheap extension of pre-existing digital camera technology for still photography. And capturing whole individual frames (or successive scanlines) is also a good fit for most display technologies.

It's only really in the context of computer vision and object tracking that the brute force whole-frames model starts to seem less than convenient.


And VR?


But with enough depth (bits), almost every pixel will be changing at high frequency due to small light variations.


Which probably indicates the chip has some algorithmic way of determining true, non-lighting-based change.

Yet this also means some changes may be missed, or taken as lighting changes instead of object changes.

Was that a cloud, or a large shape close to the lens?

Hmm. Gonna have to read on this.


Note that the restriction on external monitors can be worked around by using a DisplayLink compatible dongle/dock, since it uses a custom driver (I assume it does something in software that would otherwise be limited in hardware).

I use the Dell D6000 and I run three (1080p) external monitors in addition to the built in monitor.


I've been trying to find a DisplayLink dock that can output video via USB-C or Thunderbolt. Everybody's shared solutions always involve HDMI connections, never USB-C.

I have two Thunderbolt/USB-C monitors that I was hoping to daisy chain with one wire from my Mac. Alas it's not possible.

My hope is power into a dock. Thunderbolt from dock to laptop to power laptop. Thunderbolt/USB-C from dock into first monitor. Second Thunderbolt/USB-C from dock using the DisplayLink tech to second monitor.


You won’t find a DisplayLink adapter that supports Thunderbolt monitors. It just won’t work, for technical reasons.


Three 1080p displays add up to 3/4 the bandwidth of one 4k display, at the same framerate.
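The arithmetic, for the curious (bandwidth scales linearly with pixel count at a fixed framerate and bit depth):

```python
# Pixel counts per frame
three_1080p = 3 * 1920 * 1080   # 6,220,800 pixels
one_4k = 3840 * 2160            # 8,294,400 pixels
assert three_1080p / one_4k == 0.75
```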


In case there's anyone out there with an M1 MBP that desperately wants multiple monitors, if you use a DisplayLink compatible external dock, something about the alternate driver that such a dock uses allows for multiple monitors. I personally use the Dell D6000 to run three external displays in addition to the built-in display off of my M1 MBP and it works quite well. You can see a list of DisplayLink docking stations here [0].

[0] https://www.displaylink.com/products/universal-docking-stati...


Any chance that you know of any good dock station with 3 HDMI ports? Since DP on my monitors is used by desktop GPU.


Good luck. Even with my old Intel-based Mac I struggled to find a decent dock. Lots of docks have DisplayPorts and HDMI, or, if you’re unlucky like myself, you are forced to use USB-C-to-HDMI dongles.

I haven’t bought a DisplayLink and probably won’t. Once an Arm based Mac comes out with dual monitor support I’m giving my wife the M1 Air and buying a new one.


I guess I'll just go with hardware splitter then.


Is there any adapter that will let me run my 2 LG 5k monitors on a M1 MacBook Pro?


Hah! I did the exact same thing when I discovered the “net send” command. Only my friend and I were playing around, so we sent each other messages like “I know where you live”... The school tech was cool with me, so I didn’t get punished, but quite a few admins were freaked out by these strange messages appearing on their computers.

