
No data clearly indicates that there is no data.

A statement of "I don't know' clearly indicates a lack of knowledge.

A statement of "I have no opinion" clearly indicates that the speaker has not formed an opinion.

In each case, a spurious generated response:

1. Is generally accepted as prima facie evidence of what it purports to show.

2. Must be specifically analysed and assessed.

3. Is itself subject to repetition and/or amplification, with empirical evidence suggesting that falsehoods outcompete truths, particularly on large networks operating at flows which overload rational assessment.

4. Competes for attention with other information. The no-signal case specifically does very poorly against false claims, as it is literally nothing competing against an often very loud something.

Yes: bad data is much, much, much, much worse than no data.



Data that's had data censored from it is bad data.


False.

Outlier exclusion is standard practice.

It's useful to note what is excluded. But you exclude bad data from the analysis.
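To make that concrete, here's a minimal sketch of one standard approach, Tukey/IQR fences (the threshold k and the sample numbers are illustrative, not from any real analysis). The point is that the exclusions are returned and noted, not silently discarded:

    import numpy as np

    def exclude_outliers_iqr(values, k=1.5):
        """Flag points outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR).

        Returns (kept, excluded) so exclusions can be reported alongside
        the analysis rather than silently dropped.
        """
        values = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        mask = (values >= lo) & (values <= hi)
        return values[mask], values[~mask]

    kept, excluded = exclude_outliers_iqr([9.8, 10.1, 10.0, 9.9, 87.2])
    print(f"kept={kept}, excluded={excluded}")  # the 87.2 reading is set aside, not hidden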

Remember that what you're interested in is not the data but the ground truth that the data represent. This means that the full transmission chain must be reliable and its integrity assured: phenomenon, generated signal, transmission channel, receiver, sensor, interpretation, and recording.

Noise may enter at any point. And that noise has ... exceedingly little value.

Deliberately inserted noise is one of the most effective ways to thwart an accurate assessment of ground truths.


Defining terms here is important, so let's avoid the word "bad" for a moment because it can be applied in different ways.

1) You can have an empty dataset.

2) You can have an incomplete dataset.

3) You can have a dataset where the data is wrong.

All of these situations, in some sense, are "bad".

What I'm saying is that, going into a situation, my preference would be #2 > #1 > #3.

Because I always assume a dataset could be incomplete, that it didn't capture everything. I can plan for it, look for evidence that something is missing, and try to find it. If I suspect something is missing but can't find it, then I at least know that much, and maybe even the magnitude of uncertainty that adds to the situation. Either way, I can work around it, understanding the limits of what I'm doing, or, if there's too much missing, make a judgement call and say that nothing useful can be done with it.
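For what it's worth, that first step can be sketched in a few lines on a made-up time series (the column name and dates are hypothetical): quantify both the gaps inside the rows you have and the rows that are missing entirely, so you at least know the magnitude of what's absent.

    import pandas as pd
    import numpy as np

    # Hypothetical incomplete dataset: daily readings with some days never recorded.
    df = pd.DataFrame(
        {"value": [4.2, 3.9, np.nan, 4.4, np.nan, 4.1]},
        index=pd.to_datetime(
            ["2021-09-01", "2021-09-02", "2021-09-03",
             "2021-09-05", "2021-09-06", "2021-09-08"]
        ),
    )

    # Gaps within the rows we do have.
    explicit_gap = df["value"].isna().mean()

    # Rows that are absent entirely (missing days in the expected range).
    expected = pd.date_range(df.index.min(), df.index.max(), freq="D")
    implicit_gap = 1 - len(df) / len(expected)

    print(f"explicit missingness: {explicit_gap:.0%}, missing days: {implicit_gap:.0%}")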

If I have what appears to be a dataset I can work with, but the data is all incorrect, I may never even know it until things start to break, or, if I'm lucky, until I've wasted large amounts of time only to find that the results just don't make sense.

It's probably important to note that #2 and #3 are also not mutually exclusive. Getting out of the dry world of data analysis: if your job is propaganda and you're good at it, #2 and #3 combined is where you're at.


I'd argue Facebook's censorship leaves us with 2 and 3. They don't remove things because they're wrong; they remove them because they go against the current orthodoxy. Most things are wrong, so most things that go against the modern orthodoxy are wrong... but wrong things that go WITH the modern orthodoxy aren't removed.

It's like a scientist who removes the outliers that refute his ideas, but not the ones that support them.


Let's note that this thread has been shifting back and forth between information publicised through media, and data as it's used in research.

The two overlap, but they aren't the same thing: they have both similarities and differences.

Data in research is used to confirm or deny models, that is, understandings of the world.

Data in operations is used to determine and shape actions (including possibly inaction), interacting with an environment.

Information in media ... shares some of this, but is more complex in that it both creates (or disproves) models, and has a very extensive behavioural component involving both individual and group psychology and sociology.

Media platform moderation plays several roles. In part, it happens in a context where the platforms are already doing their own selection and amplification, and where there's now experimental evidence that, even in the absence of any induced bias, disinformation tends to spread, especially in large and active social networks.

(See "Information Overload Helps Fake News Spread, and Social Media Knows It". (https://www.scientificamerican.com/article/information-overl...), discussed here https://news.ycombinator.com/item?id=28495912 and https://news.ycombinator.com/item?id=25153716)

The situation is made worse when there's both intrinsic tooling of the system to boost sensationalism (a/k/a "high engagement" content), and deliberate introduction of false or provocative information.

TL;DR: moderation has to compensate for and overcome inherent biases toward misinformation, and take into consideration both causal and resultant behaviours and effects. At the same time, moderation itself is subject to many of the same biases as the information network as a whole (false and inflammatory reports tend to draw more reports and quicker actions), as well as spurious error rates (as I've described at length above).

All of which is to say that I don't find your own allegation of an intentional bias, offered without evidence or argument, credible.


An excellent distinction. In the world of data for research and operations, I only very rarely deal with data that is intentionally biased; I could count the cases on the fingers of one hand. Cherry-picked data is more common, but data that's intentionally wrong, meant to present things in a different light, is rare.

Well, it's rare as far as I know. The nature of things is that I might never know. But most people who don't work with data professionally also don't know how to create convincingly fake data, or even cherry-pick without leaving obvious holes. Saying "Yeah, so I actually need all of the data" isn't too uncommon. Most of the time it's not even deliberate; people just don't understand that their definition of "relevant data" isn't applicable, especially when I'm using it to diagnose a problem with their organization/department/etc.

Propaganda... Well, as you said, there's some overlap in the principles. Though I still stand by my preference of #2 > #1 > #3. And #3 over #2 and #3 together.


Does your research data include moderator actions? I imagine such data may be difficult to gather. On reddit it's easy since most groups are public and someone's already collected components for extracting such data [1].

I show some aggregated moderation history on reveddit.com e.g. r/worldnews [2]. Since moderators can remove things without users knowing [3], there is little oversight and bias naturally grows. I think there is less bias when users can more easily review the moderation. And, there is research that suggests if moderators provide removal explanations, it reduces the likelihood of that user having a post removed in the future [4]. Such research may have encouraged reddit to display post removal details [5] with some exceptions [6]. As far as I know, such research has not yet been published on comment removals.

[1] https://www.reddit.com/r/pushshift/

[2] https://www.reveddit.com/v/worldnews/history/

[3] https://www.reveddit.com/about/faq/#need

[4] https://www.reddit.com/r/science/comments/duwdco/should_mode...

[5] https://www.reddit.com/r/changelog/comments/e66fql/post_remo...

[6] https://www.reveddit.com/about/faq/#reddit-does-not-say-post...
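For anyone curious, the core of the archive-vs-live comparison that such tools rely on can be sketched in a few lines. This is only an illustration: the IDs and column names below are made up, though the "[removed]" (moderator) vs. "[deleted]" (user) convention does reflect how reddit displays such items.

    import pandas as pd

    # Hypothetical snapshots: one from a public archive taken shortly after
    # posting, one scraped from the live site later. Field names are
    # illustrative, not the actual archive/Reddit schema.
    archived = pd.DataFrame({
        "id": ["a1", "a2", "a3", "a4"],
        "body": ["text one", "text two", "text three", "text four"],
    })
    live = pd.DataFrame({
        "id": ["a1", "a2", "a3", "a4"],
        "body": ["text one", "[removed]", "text three", "[deleted]"],
    })

    merged = archived.merge(live, on="id", suffixes=("_archived", "_live"))

    # Items archived with content but now showing "[removed]" are candidate
    # moderator actions; "[deleted]" items were removed by the users themselves.
    mod_removed = merged[merged["body_live"] == "[removed]"]
    print(mod_removed[["id", "body_archived"]])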


Data reliability is highly dependent on the type of data you're working with, and the procedures, processes, and checks on that.

I've worked with scientific, engineering, survey, business, medical, financial, government, internet ("web traffic" and equivalents), and behavioural data (e.g., measured experiences / behaviour, not self-reported). Each has ... its interesting quirks.

Self-reported survey data is notoriously bad, and there's a huge set of tricks and assumptions that are used to scrub that. Those insisting on "uncensored" data would likely scream.

(TL;DR: multiple views on the same underlying phenomenon help a lot --- not necessarily from the same source. Some will lie, but they'll tend to lie differently and in somewhat predictable ways.)

Engineering and science data tend to suffer from pre-measurement assumptions, e.g., what you instrumented for vs. what you got. "Not great. Not terrible" from the series Chernobyl is a brilliant example of this: the instruments simply couldn't read the actual amount of radiation.
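A tiny sketch of what that failure mode looks like, with made-up readings (3.6 being the show's famous full-scale value): anything sitting at the instrument's ceiling is a censored observation, a lower bound rather than a measurement.

    import numpy as np

    # Hypothetical dosimeter log: the instrument tops out at 3.6 (its full-scale
    # reading), so a value at the ceiling means "at least 3.6", not "equal to 3.6".
    FULL_SCALE = 3.6
    readings = np.array([0.2, 1.1, 3.6, 3.6, 2.9, 3.6])

    saturated = readings >= FULL_SCALE
    print(f"{saturated.sum()} of {len(readings)} readings are censored at the "
          f"instrument ceiling; treat them as lower bounds, not point values.")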

In online data, distinguishing "authentic" from all other traffic (users vs. bots) is the challenge. And that involves numerous dark arts.

Financial data tends to have strong incentives to provide something, but also a strong incentive to game the system.

I've seen field data where the interests of the field reporters outweighed the subsequent interest of analysts, resulting in wonderfully-specified databases with very little useful data.

Experiential data are great, but you're limited, again, to what you can quantify and measure (as well as having major privacy and surveillance concerns, and often other ethical considerations).

Government data are often quite excellent, at least within competent organisations. For some flavour of just how widely standards can vary, though, look at reports of Covid cases, hospitalisations, recoveries, and deaths from different jurisdictions. Some measures (especially excess deaths) are far more robust, though they also lag considerably from direct experience. (Cost, lag, number of datapoints, sampling concerns, etc., all become considerations.)
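For illustration, excess deaths reduce to a simple difference against a baseline. The numbers below are invented, and real analyses model the expected baseline far more carefully (seasonality, trend, reporting lag), but the robustness comes from not depending on how each jurisdiction attributes cause of death.

    import numpy as np

    # Hypothetical weekly death counts: a pre-event baseline (e.g., average of
    # the same weeks in prior years) versus observed counts during the event.
    baseline = np.array([910, 905, 920, 915])    # expected deaths per week
    observed = np.array([930, 1050, 1240, 1180]) # reported deaths per week

    excess = observed - baseline
    print(f"weekly excess: {excess}, cumulative: {excess.sum()}")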

It's complicated.


I've worked with a decent variety as well, though nothing close to engineering.

>Self-reported survey data is notoriously bad

This is my least favorite type of data to work with. It can be incorrect either deliberately or through poor survey design. When I have to work with surveys, I insist that they tell me what they want to know, and I design it. Sometimes people come to me when they already have survey results, and sometimes I have to tell them there's nothing reliable I can do with it. When I'm involved from the beginning, I have final veto. Even then I don't like it. Even a well designed survey with proper phrasing, unbiased Likert scales, etc. can have issues. Many things don't collapse nicely to a one-dimensional scale. Then there is the selection bias inherent when, by definition, you only receive responses from people willing to fill out the survey. There are ways to deal with that, but they're far from perfect.
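One of those imperfect ways, sketched with invented numbers: post-stratification weighting, which reweights respondents to match known population shares. It partially corrects for who chose to answer, but says nothing about what non-respondents would have said.

    import pandas as pd

    # Hypothetical survey: young respondents are over-represented relative to
    # the known population breakdown by age group.
    responses = pd.DataFrame({
        "age_group": ["18-34"] * 60 + ["35-54"] * 30 + ["55+"] * 10,
        "score":     [4] * 60 +        [3] * 30 +        [2] * 10,
    })
    population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

    sample_share = responses["age_group"].value_counts(normalize=True)
    responses["weight"] = responses["age_group"].map(
        lambda g: population_share[g] / sample_share[g]
    )

    raw_mean = responses["score"].mean()
    weighted_mean = (responses["score"] * responses["weight"]).sum() / responses["weight"].sum()
    print(f"raw mean: {raw_mean:.2f}, reweighted mean: {weighted_mean:.2f}")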


Q: What's the most glaring sign of a failed survey analysis project?

A: "I've conducted a survey and need a statistician to analyse it for me."

(I've seen this many, many, many times. I've never seen it not be the sign of a completely flawed approach.)



