> Someone else in the world reported the problem back in September, and aside from some random person asking a totally useless question, nothing had happened on the thread.
It's a special kind of horror to find, after hours of high-end googling, the one thread where someone reports the exact problem you are experiencing, and it's just the question, plus one other person asking whether it was ever solved because they are having the same problem.
The one thing that is worse is when the OP then makes another post that simply says "Solved it! =D", without any explanation of how they solved it.
Or when you find a thread where someone is asking the question and the only response is some wise guy telling them to "search". And of course, searching keeps pointing you back to the same thread.
On the plus side, the last time this happened to me, I discovered that someone had posted a really useful answer to my question which helped me solve my problem.
The person who had posted the answer was, of course, also me.
In my experience, the person who comes back to my thread is a coworker. Then they tell me they can't get away from me: even when they try to find something on their own, it turns out I'm the one who helped them fix it. Quite amusing.
At least half of the enterprise software I have used suffers from catastrophic link rot. And when you look at the URLs that do still map or redirect to something marginally on-topic, you just know they're a single API change away from redirecting to "we don't know what you're looking for, try searching here", with hundreds of completely off-topic (and occasionally broken) links.
No, I am not bitter. Learning where their FTP server is and how to navigate it, well, that was gold.
Hmm, there must be some extension that detects 404 errors and/or unresponsive servers and prompts whether you want to check Google's cache or the Internet Archive. That browsers haven't already integrated this (with a configurable list of archiving services, like they do for search engines) is actually rather surprising to me.
There are many such extensions, but yes, we're working on getting this directly integrated into Firefox, Chrome, and Edge, and if anyone has a contact at Safari (or one of the other browsers), my email address is in my profile. Thanks.
Yes, but it is only temporary. As long as you have a robots.txt file excluding some URLs, those URLs will: 1) not be crawled by the Internet Archive crawler, 2) not be shown in the Wayback Machine. Any already-crawled pages will, however, invisibly remain in the archive, and will reappear once they are not in the robots.txt anymore.
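For example, with a robots.txt like this in place (the /private/ path is just an illustration), those URLs are neither crawled nor played back; drop the rule and the already-archived copies become visible again:

    # /private/ is a placeholder path; anything disallowed here is skipped by
    # the crawler and hidden from Wayback Machine playback while the rule exists
    User-agent: *
    Disallow: /private/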
Mmmh, that has never happened to me. That is just plain mean.
Now that I think of it, I have seen that happen, but the reply was inevitably "I have already googled the question to death, without finding any useful results."
Someone should reply "tc qdisc add dev eth0 root netem delay 100ms" in the original thread [ https://communities.vmware.com/thread/519888 ] and link to the blog post. Ideally that person would know enough about tc to suggest how to do it only for outbound tcp6 handshakes too.
The worst is when somebody reports "solved it", then you spend four hours figuring out why it didn't work only to learn the kernel changed behavior (this happened to me recently) and the problem can't be fixed.
Well, considering that VMWare fired their entire dev team in January [1], it's not surprising that this isn't fixed... I'd expect more of these kinds of issues to crop up without traction in the future.
Funky! This feels like the connect(2) is returning before it has actually done its work, async-style.
Rachel, could you write a small sneaky program (using eg libpcap) to see if the TCP handshake has completed by the time connect(2) returns control to your program, before your first write(2)?
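A lower-effort way to get most of that signal without a libpcap program, assuming a Linux guest: line up tcpdump's kernel-side timestamps against strace's syscall timestamps. Here eth0, port 80 and "theclient" are placeholders.

    # terminal 1: kernel-side view of the v6 handshake, with timestamps
    tcpdump -i eth0 -n -tt 'ip6 and tcp port 80'
    # terminal 2: syscall-side view; "theclient" stands in for whatever program
    # shows the stall (a client using non-blocking connect will legitimately
    # return from connect(2) before the handshake finishes, so that's not the bug)
    strace -f -tt -e trace=connect,write theclient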
An issue a little bit like this that I've seen is overzealous admins who block ICMPv6, creating PMTU black holes. Short web pages load, and long pages hang. Too bad I discovered this during tax season a couple years ago, and the affected site was eftps.gov.
It does feel like an MTU issue. I'd grab a tcpdump at all 3 points (server, host, vm) and see what is getting dropped.
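Something like the following, run at each of the three points and compared afterwards, makes the diff fairly easy to spot (eth0 and port 80 are assumptions):

    # capture only the v6 traffic of interest and write it out for comparison
    tcpdump -i eth0 -n -w capture.pcap 'ip6 and tcp port 80'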
From what I know about tc, to delay only IPv6 traffic you've got to create a root qdisc that has multiple subclasses (like tc-prio [0]), attach tc-netem to one of them while passing the others straight through, and then classify packets between them, although I'd do the classification with iptables rather than figuring out any more of the tc workings than necessary.
[0] The default pfifo_fast has multiple bands, but from what I remember it's classless, so you can't attach child qdiscs to it?
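An untested sketch of that approach (eth0 is an assumption, and the classification is done with an ip6tables fw mark rather than pure tc):

    # root prio qdisc; its third band (1:3) gets the netem delay
    tc qdisc add dev eth0 root handle 1: prio
    tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 100ms
    # mark outbound IPv6 SYNs so a tc fw filter can steer them into band 3;
    # everything else keeps flowing through the undelayed bands
    ip6tables -t mangle -A OUTPUT -p tcp --syn -j MARK --set-mark 3
    tc filter add dev eth0 parent 1: protocol ipv6 handle 3 fw flowid 1:3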
"parts of the web are going IPv6-only", "Certain web servers have been going IPv6-only of late" - Really? Which parts of the web? Why would anyone configure their servers that way?
Inside a company, once you're out of RFC 1918 space, you'll end up here sooner or later. It's also a convenient forcing function to get people like me to stop being lazy and actually investigate dumb things like this.
Some budget hosting providers offer IPv6-only hosting for a lower price, since IPv4 addresses are getting harder to acquire en masse (and thus are more expensive).
€3 a year, not a month. I can accept being ripped off at that price. I think I found those guys on that same page you just linked, actually. But anyway, most stuff on https://lowendbox.com/ is quite a bit more expensive than that, and I don't feel like browsing the entire site.
Whoops, yeah, I misread that. In that case you're definitely not being ripped off, and given how much IPv4 addresses usually cost, that is probably something which can really only be a v6 deal. Interesting. First I've seen anything like it. Thanks.
I connect to Facebook, Google and YouTube over IPv6. It's automatic; I guess it's decided on the DNS side. I'm pretty sure they still have plenty of IPv4 interfaces. Going IPv6-only seems a little aggressive nowadays.
BTW, if you use Firefox there is an addon called FlushDNS that shows the IP address of the web server. Its main purpose is to remove an address from the DNS cache inside the browser, but it's actually more useful as an inspection tool.
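You can also check which records a site publishes straight from a shell, e.g.:

    # an AAAA record present generally means a v6-capable client will prefer it
    dig +short www.google.com AAAA
    dig +short www.google.com A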
I have some v6-only services that I only use myself, but if I run a VM I do expect to be able to do this. And even if I'd always be able to fall back to v4, that's no reason for this not to be a bug.
Some of my work VMs which are not intended to be generally accessible to the public are IPv6-only.
It's not that I need to restrict access to them, just that they don't need IPv4, since anyone accessing them usually already has IPv6, and dual-stack is extra work.
I have the same problem with a VirtualBox Linux VM; I was wondering what was going on, and then this post came up. I am not sure if it is the same cause. I tried:
It doesn't seem like VMware is the culprit here, mainly because it has nothing to do with anything above layer 3. Here are some points to look into and possible fixes.
[1] VMware's network driver does not handle TCP, or IP. It's just layer 2; it implements one of a couple kinds of network hardware, that's it.
[2] VMware Guest Tools does install a para-virtualized network card driver - vmxnet2/vmxnet3. It communicates with the physical network device by talking to the host OS directly, rather than going through an emulated network device. That potentially may do something wonky with something above layer 3, even though it really should not.
[3] VMware does have a virtual network switch, which forwards frames between the physical NIC and virtual NICs based on MAC address.
[4] VMware may handle moving frames from a virtual NIC to a physical NIC differently than moving them to another virtual NIC.
[5] VMware provides VMDirectPath I/O, which allows the guest to directly address the network hardware.
[6] TSO/LSO/LRO can have a negative impact on performance in Linux (though supposedly, LRO only works with the vmxnet3 driver, and only VM-to-VM, for Linux).
[7] Emulated network devices may not be able to process traffic fast enough, resulting in rx errors on the virtual switch.
[8] Promiscuous mode will make the guest OS receive network traffic from everything going across the virtual switch or on the same network segment (when using VLANs).
[1] You can try changing the VMware guest's emulated network card (vlance, e1000) and trying your thing again, but I doubt it will change much.
[2] Try installing or uninstalling VMware Guest Tools and corresponding drivers.
[3] Nothing to do here, really. If you have multiple guests sharing one physical NIC, try changing it to just one?
[4] Try your test again between two VMs on the same host.
[1] Your guest OS may have bugs. In its emulated network drivers, in its tcp/ip stack, in its applications, etc.
[2] An intermediary piece of software may be fucking with your network connection. IPtables firewall, router/firewall on your host OS, after the host OS/before your internet connection, at your destination host, etc.
[3] Sometimes, intermittent network traffic makes it look like there is a specific cause, when really the problem is hiding in the time it takes you to test.
[4] The Linux tcp/ip stack (and network drivers) collect statistics about erroneous network traffic.
[5] Network traffic will show missing packets, duplicate packets, unexpected terminations, etc.
[6] Your host OS or network hardware may be buggin'.
[1] Try a different guest OS.
[2] Make sure you have no firewall rules on the guest, host, internet gateway, etc. Try a different destination host.
[3] Run tests in bulk, collect lots of samples and look for patterns.
[4] Check for dropped packets, errors on the network interface, in tcp/ip stats.
[5] Tcpdump the connection to see what happens when it succeeds or fails.
[6] Try a different host for your VM.
edit one more idea: Look at the response headers for the request to the site. The content length is 1413 bytes. Add on the TCP and IPv6 header overhead (and the HTTP headers, etc.) and this is probably over 1500 bytes, the typical MTU maximum. Try requesting a "hello world" text file and run your test again.
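If you want to poke at the MTU theory directly from the guest, something like this should do (example.com stands in for the real server):

    # 1452 bytes of payload + 8 bytes ICMPv6 + 40 bytes IPv6 header = 1500
    ping6 -c 3 -M do -s 1452 example.com
    # and fetch something comfortably under one MTU to see if the stall goes away
    curl -6 -o /dev/null http://example.com/robots.txt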
I'm successfully running IPv6 on VMware (Fusion, ESX 5, ESX 6) with both clients (Debian 8, Ubuntu 12.04, FreeBSD 10.2, Windows 7, Windows 10) and servers (Debian 8, Ubuntu 12.04, FreeBSD 10.2, Windows 2008R2).
I have not seen the issue described here in any configuration - neither on clients nor on servers. I wonder whether this is an issue with VMWare running on a specific host?
edit: From the forum post in the linked article, I'm gathering they are using IPv6 NAT. So this might be a problem with the VMWare NAT interface - my configurations are all bridged.
Software is unreliable. Bugs happen. Always. There are bugs in avionics, in medical device firmware, in nuclear power plant monitoring software, in bank transfer backends, everywhere.
Once upon a time it was common to think that we could design software without bugs, or at least almost without any. That didn't work at all! What did work is redundant systems, invariant testing, and fail-fast with restarts. This is how reliable systems are written these days.
Bugs are common; we have to learn to work around them.
> Bugs are common; we have to learn to work around them.
Or we could, you know, fix them.
I wasn't asking for a justification. I was just asking why this is occurring. If you don't know, that's cool. I mean, one of the reasons I ask is because I'd like to know if VMWare are going to fix this bug.
So thank you for explaining that software has bugs. I'm sure I'll remember that the next time I fix a regression in LibreOffice, as I did with the issue with EMF dashed lines not displaying correctly or when I fixed the issue where JPEG exports didn't export the DPI value correctly...
Just for future reference, something like "Do we know exactly what the bug in VMWare is, and whether they're going to fix it?" would be way more effective at getting the answer you're looking for here. "Uh... but why?!?" sounds like cursing at the sky, and gets a response appropriate for that.
I'll bite. What about Erlang makes it so that a restarted process doesn't run into the same bug when it gets to the same point, and panic again in an infinite loop?
The only way I can imagine this working is if Erlang is so buggy and nondeterministic that it inserts crashes sometimes but not all of the time. But that's obviously absurd.
If it's some weird race condition crash, restarting (hopefully?) puts you in a known good state and you're unlikely to hit it again.
If it quickly repeats, you've isolated the failure to happening within a narrow scope.
This part isn't really Erlang magic, apache in pre-fork mode has a lot of the same properties. There may be some magic in supervision strategies, but I think the real magic is the amount of code you get to leave out by accepting the possibility of crashes and having concise ways to bail out on error cases.
For example, to do an mnesia write and continue if successful and crash if not, you can write
ok = mnesia:write(Record)
Similarly, when you're writing a case statement (like a switch/case in C), if you expect only certain cases, you can leave out a default case, and just crash if you get weird input.
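For instance (a made-up example; anything other than the two expected atoms raises a case_clause error and crashes the process):

    %% hypothetical: only the states we expect are handled, no catch-all clause
    case order_state(Order) of
        pending -> queue_order(Order);
        shipped -> notify_customer(Order)
    end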
I also find the catch Expression way of dealing with possible exceptions is often nicer than try/catch. It returns the exception so you can do something like
    case catch Expression of
        something_good -> ok;
        {'EXIT', badarg} -> not_so_great
    end
and handle the errors you care about in the same place as where you handle the successes.
Edited to add, re: failwhale, your HTTP entrypoints can usually be something like
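A minimal sketch of that shape (only real_work_and_output comes from this thread; the rest of the names are made up):

    %% hypothetical HTTP entrypoint: if the real work crashes, catch the exit
    %% and serve a canned failwhale page instead of propagating the error
    handle(Req) ->
        case catch real_work_and_output(Req) of
            {'EXIT', _Reason} -> failwhale_page();
            Response -> Response
        end.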
As long as the failure in real_work_and_output is quick enough, you'll get your failwhale. Of course, if the problem is that processing is too slow, you might want to set a global failwhale flag somewhere, but your ops team can hotload a patch if they need to fix the performance of the failwhale ;)
"It returns the exception so you can do something like
case catch Expression of"
Something to be aware of is the cost of a bare catch when an exception of type 'error' is thrown:
"[W]hen the exception type is 'error', the catch will build a result containing the symbolic stack trace, and this will then in the first case [1] be immediately discarded, or in the second case matched on and then possibly discarded later. Whereas if you use try/catch, you can ensure that no stack trace is constructed at all to begin with." [0]
Stack trace construction isn't free, so it makes sense to avoid it if you're not going to use it. I know that in either Erlang 17 or Erlang 18, parts of Mnesia were slightly refactored to move from bare catch to try/catch for this very reason.
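For comparison, the try/catch form of the snippet upthread looks roughly like this; with try, no stack trace is constructed unless a handler actually asks for one:

    %% same result as the bare catch version, without the cost described
    %% above for exceptions of type 'error'
    try Expression of
        something_good -> ok
    catch
        exit:badarg -> not_so_great
    end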
Well, okay, so your process crashes, you restart it, it crashes a few more times, then you kill it. What's the advantage there? How does this increase availability, beyond killing it the first time it crashes?
It seems actively worse to allow users to retry requests that are doomed to failure than to put up a fail-whale or similar while the ops team is being paged.
Because most production bugs are infrequent (otherwise they would be noticed by testing). They have to be logged and fixed, but not allowed to move the system into an inconsistent state. Restart first, fix later.
Are they? The bug discussed in this comment was extremely deterministic. There's a difference between infrequent in the sense that, across lots of users and lots of requests it happens rarely, and infrequent in the sense that, for one particular use, it only triggers sometimes.
Also, the bug discussed in this article wasn't causing crashes. What would you propose be crashed and restarted in this case?