Not a single mention of the word ‘Apple’ in the original post, instead ‘the manufacturer’ and ‘the big corporation’. Curious whether that is deliberate and, if so, what the reasoning is (legal?)
Presumably, because this isn't about Apple. They don't care about Apple, merely about getting the M1/M2 arch to run Linux properly. It could have been Microsoft, Amazon, or Google; the treatment would have been the same.
Since they accept a JSON schema for the function calls, it is likely they are using token biasing based on that schema (some kind of state machine that follows along with the generated tokens and only allows the next token if it is valid under the grammar/schema). I have successfully implemented this for a limited subset of JSON Schema on llama.cpp. See also e.g. this implementation: https://github.com/1rgs/jsonformer
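To illustrate the general idea (not OpenAI's or llama.cpp's actual implementation): at each decoding step you mask out every token whose text would make the partial output invalid under the schema. The `is_valid_prefix` checker below is a hypothetical stand-in for the schema-driven state machine:

```rust
/// Hypothetical stand-in for the schema-driven state machine / incremental
/// parser: returns true if `text` can still be extended into JSON that
/// conforms to the schema.
fn is_valid_prefix(text: &str) -> bool {
    // ...walk the schema/grammar state machine over `text`...
    !text.is_empty() // placeholder
}

/// Token biasing: disallow (by setting the logit to -inf) every token whose
/// text would take the output outside the set of valid prefixes.
fn constrain_logits(generated_so_far: &str, vocab: &[String], logits: &mut [f32]) {
    for (token_id, token_text) in vocab.iter().enumerate() {
        let candidate = format!("{generated_so_far}{token_text}");
        if !is_valid_prefix(&candidate) {
            logits[token_id] = f32::NEG_INFINITY;
        }
    }
}
```

A naive check like this is O(vocab size) per step; real implementations keep per-token automaton state around so they don't re-parse the whole prefix for every candidate.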
As someone also building constrained decoders against JSON [1], I was hopeful to see the same but I note the following from their documentation:
The model can choose to call a function; if so, the content will be a stringified JSON object adhering to your custom schema (note: the model may generate invalid JSON or hallucinate parameters).
So sadly, it is just fine-tuning; there's no hard biasing applied :(. So close, yet so far, OpenAI!
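Practically, that means the returned arguments still have to be treated as untrusted input. A minimal sketch (Rust, using serde_json; the retry strategy is up to the caller):

```rust
use serde_json::Value;

/// `arguments` is the stringified JSON from `message.function_call.arguments`.
/// Since no hard biasing is applied, it may fail to parse or violate the
/// declared schema, so parse defensively and validate before acting on it.
fn parse_function_arguments(arguments: &str) -> Result<Value, serde_json::Error> {
    serde_json::from_str(arguments)
}
```

On an Err (or a schema violation after parsing) you are back to re-prompting or retrying, which is exactly what hard constrained decoding would have avoided.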
Good point. Backtracking is certainly possible but it is probably tricky to parallelize at scale if you're trying to coalesce and slam through a bunch of concurrent (unrelated) requests with minimal pre-emption.
Also, WONNX can be used both from native apps (through wgpu, which targets Vulkan, DX or Metal) and on the web (using WebGPU, with WONNX compiled to WebAssembly).
In my opinion, ONNX is more complex than necessary. Therefore, I opted to convert it to an intermediate representation (IR) first, which is then used to generate source code. A key advantage of this approach is the ease of merging nodes into corresponding operations, since ONNX and Burn don't share the same set of operators.
Actually, WONNX also transforms to an IR first (early versions did not and simply translated the graph 1:1 to GPU shader invocations in topologically sorted order of the graph). In WONNX the IR nodes are (initially) simply (copy-on-write references to) the ONNX nodes. This IR is then optimized in various ways, including the fusion of ONNX ops (e.g. Conv+ReLU -> ConvReLU). The newly inserted node still embeds an ONNX node structure to describe it but uses an internal operator.
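As a toy illustration of that kind of fusion pass (the names are made up for the example, not the actual WONNX types):

```rust
#[derive(Clone, Debug, PartialEq)]
enum Op {
    Conv,
    Relu,
    ConvRelu, // internal fused operator, not part of ONNX
    Other(String),
}

#[derive(Clone, Debug)]
struct IrNode {
    op: Op,
    inputs: Vec<String>,
    outputs: Vec<String>,
}

/// Walk the topologically sorted node list and fuse Conv -> ReLU pairs where
/// the ReLU consumes exactly the Conv's output. (A real pass would also check
/// that the Conv output has no other consumers.)
fn fuse_conv_relu(nodes: Vec<IrNode>) -> Vec<IrNode> {
    let mut out: Vec<IrNode> = Vec::with_capacity(nodes.len());
    for node in nodes {
        if node.op == Op::Relu {
            if let Some(prev) = out.last_mut() {
                if prev.op == Op::Conv && node.inputs == prev.outputs {
                    prev.op = Op::ConvRelu;      // fuse in place
                    prev.outputs = node.outputs; // downstream reads the fused output
                    continue;
                }
            }
        }
        out.push(node);
    }
    out
}
```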
This makes running larger machine learning models in the browser feasible - see e.g. https://github.com/webonnx/wonnx (I believe Microsoft's ONNXRuntime.js will also soon gain a WebGPU back-end).
You can indeed perform inference using WebGPU (see e.g. [1] for GPU-accelerated inference of ONNX models on WebGPU; I am one of the authors).
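For reference, this is roughly what the native (Rust) side looks like, going from memory of the wonnx examples, so treat exact names and tensor types as approximate:

```rust
use std::collections::HashMap;

async fn run() {
    // Compiles the ONNX graph into WebGPU (wgpu) compute shaders.
    let session = wonnx::Session::from_path("model.onnx")
        .await
        .expect("failed to load model");

    // Inputs are bound by the names they carry in the ONNX graph.
    let data = vec![0.0f32; 3 * 224 * 224];
    let mut inputs = HashMap::new();
    inputs.insert("data".to_string(), data.as_slice().into());

    // Outputs come back keyed by output name.
    let outputs = session.run(&inputs).await.expect("inference failed");
    println!("{:?}", outputs.keys());
}
```

On the web, the crate is compiled to WebAssembly and uses the browser's WebGPU implementation instead.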
The point made above is that WebGPU can only be used for GPUs and not really for other types of 'neural accelerators' (such as the ANE on Apple devices).
The ANE is only accessible via Core ML and internal Apple frameworks, so I would assume it won't be using the ANE, but maybe some neural accelerators in Intel/AMD/Nvidia processors and GPUs.
Accelerators inside the GPU (like Tensor Cores) seem like a much better deal, as you can easily utilize them without four abstraction layers, each with operator support that is unknown to us mortals. (And my god I hope Apple will allow the ANE to be programmed directly, or at least put that API inside the Metal framework, because right now working with Core ML for anything new is a nightmare, and even some old models are broken on new versions of coremltools.)
A replacement for the ONNX IR, perhaps, but as far as I can see there is not (yet?) a file format for StableHLO (ONNX has a standardized on-disk format, specified in Protobuf).
StableHLO has a serialization format which is based on MLIR bytecode. https://github.com/openxla/stablehlo/blob/main/docs/bytecode... goes into details of reading/writing portable artifacts for StableHLO programs and associated compatibility guarantees.
I'd also like to comment on our (StableHLO's) relationship with related work. StableHLO was a natural choice for the OpenXLA project, because a very similar operation set called HLO powers many of its key components. However, I would also like to give a shout out to related opsets in the ML community, including MIL, ONNX, TFLite, TOSA and WebNN.
Bootstrapping from HLO made a lot of sense to get things going, but that's just a starting point. There are many great ideas out there, and we're looking to evolve StableHLO beyond its roots. For example, we want to provide functionality to represent dynamism, quantization and sparsity, and there's so much to learn from related work.
We'd love to collaborate, and from the StableHLO side we can offer production-grade lowerings from TensorFlow, JAX and PyTorch, as well as compatibility with OpenXLA. Some of these connections in the ML ecosystem have already started growing organically, and we're super excited about that.
+1 to what Eugene said about the evolutionary aspects. The proposals for stability of the format as well as the opset can be followed on the respective project forums (Discourse & GitHub issues/RFCs) as they are discussed and refined to meet community needs.
Some APs (e.g. Ubiquiti) can actually steer clients from one band to the other based on minimum RSSI and other parameters (including device compatibility; you can also exclude or force a band for individual devices), which prevents this from happening.