
Part of the problem with large internet platforms is that parts of 'my data' are inextricably linked to 'your data', even to the extent that 'my data' only exists on some platforms as data points in 'your data'. In that sense any opt-in choice given to another is yet another privacy breach on their 'contacts', for example.

I've seen the opinion expressed that part of the reason society allows this type of surveillance is that so many members of our society don't understand the details or scope. If true, whatever discussions we have about this should include the idea that we're proposing to increase the scope of the problem while researching it.



Yes, calling out linkage as a key challenge for data privacy is very important.

To dig in one level deeper... Have you looked into privacy-preserving record linkage (PPRL) or similar ideas? (I have not, but I'm interested.)

> The process of linking records without revealing any sensitive or confidential information about the entities represented by these records is known as privacy-preserving record linkage (PPRL).

Source: "Privacy-Preserving Record Linkage". DOI: https://doi.org/10.1007/978-3-319-63962-8_17-1

See also: "A taxonomy of privacy-preserving record linkage techniques" at https://www.sciencedirect.com/science/article/abs/pii/S03064...
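
For a feel of how one family of PPRL techniques works, here is a minimal sketch of Bloom-filter encoding of name q-grams, in the spirit of Schnell et al. The parameters, sample values, and shared key are illustrative assumptions, not taken from the papers above:

    import hashlib
    import hmac

    BITS = 1024                      # Bloom filter length (illustrative)
    NUM_HASHES = 10                  # hash functions per q-gram (illustrative)
    SECRET = b"shared-linkage-key"   # hypothetical key both parties agree on

    def qgrams(s, q=2):
        """Split a padded, normalized string into overlapping q-grams."""
        s = "_" + s.strip().lower() + "_"
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def encode(value):
        """Map each q-gram to NUM_HASHES bit positions via keyed hashing."""
        bits = set()
        for gram in qgrams(value):
            for i in range(NUM_HASHES):
                digest = hmac.new(SECRET, ("%d:%s" % (i, gram)).encode(), hashlib.sha256)
                bits.add(int.from_bytes(digest.digest()[:4], "big") % BITS)
        return bits

    def dice(a, b):
        """Dice coefficient over set bits approximates string similarity."""
        return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

    # Each party encodes locally; only the bit sets are shared and compared.
    print(dice(encode("Jonathan Smith"), encode("Jonothan Smith")))  # high
    print(dice(encode("Jonathan Smith"), encode("Maria Garcia")))    # low

Worth noting: Bloom-filter encodings have known frequency-based attacks of their own, which is part of why taxonomies like the one above keep growing.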


> In that sense any opt-in choice given to another is yet another privacy breach on their 'contacts' for example.

That is a non sequitur when we are discussing opting in to social science research.

Rally is not a social network platform. It is a social science platform. There is no reason for it to be directly, as a platform, concerned with your contacts.

Per their FAQ:

> We abide by a series of principles known as Lean Data Practices. Lean Data means that we collect just the data we need, and do everything we can to protect what we collect. Studies only collect data that is essential to creating a broader understanding of a research topic.
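
As a hedged sketch of what "collect just the data we need" can look like mechanically: an explicit per-study allowlist, applied before anything leaves the client. The field names below are invented for illustration and are not Rally's actual schema or code:

    ALLOWED_FIELDS = {"page_domain", "visit_duration_s", "study_id"}

    def minimize(event):
        """Keep only the fields the study design explicitly justified."""
        return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}

    raw = {
        "page_domain": "example.org",
        "visit_duration_s": 42,
        "study_id": "news-consumption-01",
        "full_url": "https://example.org/private/profile?id=123",  # dropped
        "referrer": "https://mail.example.com/inbox",              # dropped
    }
    assert set(minimize(raw)) == ALLOWED_FIELDS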

Institutional Review Boards, privacy policies, and the various contractual agreements between the parties operating and building the Rally research platform are all accountable to the scientific principle of treating participants humanely and ethically.

If an IRB deemed a study unethical because its design indicated that data could be obtained without informed consent, then that study could not be conducted. The research design would have to be modified to correct this, or that specific research methodology would be considered generally unethical by the wider scientific community, just as the community deems it unethical to perform genetic experiments on unwilling human subjects.


>> In that sense any opt-in choice given to another is yet another privacy breach on their 'contacts', for example.

> That is a non sequitur when we are discussing opting in to social science research.

As I understand it, the commenter's point does not rest on 'contact' linking being present. Their point is that any kind of data linking carries a reidentification risk.

Regarding the risk of data linkages, how confident are you that Mozilla and others with access to the data will manage it ...

1. ... up to the currently-accepted level of knowledge (including hopefully some theoretical guarantees, if possible, and if not, mitigations with known kinds of risk) and ...

2. ... that the current level is acceptable, given that the history of data privacy doesn't paint a rosy picture?

To be clear, I'm not interested in your confidence level per se, but rather in the reasoning behind your risk assessment. I want to weigh the various factors myself, in other words. For example, you appear to have more confidence in IRBs than I do.

Knowing the history of the "arms race" between deidentification and reidentification, I don't put a whole lot of trust in Institutional Review Boards. Many smart, well-meaning efforts have fallen prey to linkage attacks. They are insidious.
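
For anyone who hasn't seen one, here is a toy linkage attack in the spirit of Sweeney's classic ZIP + birth date + sex result. All records are fabricated; note that neither dataset looks sensitive on its own:

    # Neither dataset carries names next to sensitive data, yet joining on
    # shared quasi-identifiers re-identifies people.
    anonymized_health = [
        {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "..."},
        {"zip": "02139", "dob": "1962-01-15", "sex": "M", "diagnosis": "..."},
    ]
    public_voter_roll = [
        {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
        {"name": "R. Roe", "zip": "02144", "dob": "1980-03-02", "sex": "M"},
    ]

    QUASI_IDS = ("zip", "dob", "sex")

    def link(released, auxiliary):
        """Join two individually 'harmless' datasets on quasi-identifiers."""
        index = {tuple(r[k] for k in QUASI_IDS): r for r in auxiliary}
        matches = []
        for rec in released:
            key = tuple(rec[k] for k in QUASI_IDS)
            if key in index:
                matches.append((index[key]["name"], rec))
        return matches

    for name, record in link(anonymized_health, public_voter_roll):
        print(name, "->", record["diagnosis"])  # J. Doe re-identified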

P.S. In my view, using "non sequitur" here is a bit strong, perhaps even off-putting. It is only a "non sequitur" because you are making different logical assumptions than the commenter. Another approach would be to say "your conclusion only holds if...". This would make your point without being so pointed. It also helps show that you want to understand the other person's assumptions.


> As I understand it, the commenter's point does not rest on 'contact' linking being present. Their point is that any kind of data linking carries a reidentification risk.

It appears that the parent commenter revised their comment to indicate that the concern was indeed “your data getting mixed with my data, when browsing Facebook”, to paraphrase.

My response there was essentially: ethical review would have to determine if all data must be provided through informed consent of all the originating humans.

Held to the gold standard of ethics, an IRB would likely have to reject a research design that did not provide a way for every individual human involved to give informed consent. If any single individual in a data set indicated that they did not consent, that data set would need to be reshaped to exclude that individual. Failing that, the entire data set would have to be excluded from the study.
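
As a rough sketch of that reshaping rule (the consent registry and record shapes here are invented, not from any actual Rally design):

    consented = {"alice", "carol"}  # hypothetical consent registry

    dataset = [
        {"subjects": ["alice"], "obs": "..."},
        {"subjects": ["carol", "bob"], "obs": "..."},  # bob never consented
    ]

    def reshape(records):
        """Keep only records whose every involved subject gave consent."""
        return [r for r in records
                if all(s in consented for s in r["subjects"])]

    usable = reshape(dataset)
    if not usable:
        raise ValueError("data set excluded from study: nothing fully consented")
    print(usable)  # only alice's record survives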

Of course, that has some complex implications when it comes to broad categories of data sources for browser usage: social networking sites would be a minefield. Did the website author consent to their content being machine-analyzed for sentiment, if one really wants to get down to it? You’d have to consider each and every resource location. You can’t assume that all browser traffic is open web traffic: someone could have left their Rally extension running while navigating a corporate confidential network, content under complex copyrights, etc.

My understanding is that the US Supreme Court is about to decide whether “if you can read it, you can keep it” holds, as a consequence of Microsoft/LinkedIn vs. hiQ Labs, so don’t forget the “arms race” of justice, either.

> Many smart, well-meaning efforts have fallen prey to linkage attacks. They are insidious.

Indeed, even just basic double-blind medical studies are hard to defend when you consider operational security, let alone information security.


Thank you for your P.S. feedback; it is appreciated and will be incorporated.

My overall point is that if you don’t want to hand over data that may reveal information about your contacts, then don’t opt in to providing it.

Informed consent is the bedrock upon which social science ethics rests.


Sure, I understand your point. Have you dug into the problems of data linkage attacks? (see questions above)


Not yet! I’m only vaguely familiar with the basics, but I’m curious about the details and their relation to research design.

Will comment more fully as I find the energy.


In case it is of interest, here is a fairly short article with a historical look at data de-identification. If nothing else, it is one jumping-off point.

"Data De-identification: Possibilities, Progress, and Perils". 2019. https://forge.duke.edu/blog/data-de-identification-possibili...



