Maven Imported 1.12 Million Fediverse Posts

hedge@beehaw.org · 2 months ago

Freeman@lemmings.world · 2 months ago

They pulled DMs of two users of the same instance?! Quite concerning tbh

Skull giver@popplesburger.hilciferous.nl · edit-2 2 months ago

ActivityPub doesn’t do DMs per se. Many ActivityPub implementations will use AP messages that are not posted on any public list or timeline. Basically, a Tweet with visibility set to “only people mentioned in this thread”.

This design makes it quite easy for AP servers to misimplement DMs. When a server is asked for all messages of a particular user (to build their timeline), it is trivially easy to forget to filter out the messages that were not published globally.
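The filtering step described above can be sketched roughly like this. It is a minimal illustration, not how any particular server does it: the sample posts, the `example` hostnames, and the helper names (`as_list`, `is_public`, `public_timeline`) are all made up. The only piece taken from the ActivityPub spec is the special public collection, `https://www.w3.org/ns/activitystreams#Public`, and its compact spellings.

```python
# Identifiers for the ActivityPub "public" collection. The full IRI is the
# canonical form; the two short forms can show up after JSON-LD compaction.
PUBLIC_IDS = {
    "https://www.w3.org/ns/activitystreams#Public",
    "as:Public",
    "Public",
}


def as_list(value):
    """Addressing fields may be missing, a single string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def is_public(obj):
    """True if the object is addressed to the public collection via to or cc."""
    audience = as_list(obj.get("to")) + as_list(obj.get("cc"))
    return any(target in PUBLIC_IDS for target in audience)


def public_timeline(objects):
    """Keep only publicly addressed objects. This is the easy-to-forget step."""
    return [obj for obj in objects if is_public(obj)]


# Hypothetical sample data: one public post, one "DM", one unlisted-style post.
posts = [
    {
        "id": "https://a.example/notes/1",
        "to": ["https://www.w3.org/ns/activitystreams#Public"],
    },
    {
        "id": "https://a.example/notes/2",
        "to": ["https://b.example/users/bob"],
    },
    {
        "id": "https://a.example/notes/3",
        "to": "https://a.example/users/alice/followers",
        "cc": "https://www.w3.org/ns/activitystreams#Public",
    },
]

visible = public_timeline(posts)
print([post["id"] for post in visible])
```

Note that the "DM" (note 2) is just an ordinary object whose audience happens not to include the public collection. Leave out the `is_public` check and it ends up in the timeline like everything else.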

ActivityPub DMs are, in my opinion, not a good feature. This has come up before in Mastodon, where mentioning a third account in a DM adds that account to the thread and to the recipients of all future messages (and possibly authorises it to access past messages); one mention will give them full access to your “direct” messages.

I doubt this scraper did anything wrong here. I think it’s just a matter of a buggy server, or of users sending DMs that aren’t really DMs because of Fediverse software with GUI design flaws.

Edit: looks like it’s probably a Mastodon bug: https://hackers.town/@thegibson/112604700601089641

jherazob@beehaw.org · 2 months ago

I recall somebody’s working on actual E2EE Mastodon DMs, but I couldn’t give you details. I guess we’ll know it’s ready when people start using it.

Peter1986C@lemmings.world · edit-2 2 months ago

That would be Sup: https://github.com/theSupApp

By the same person who started Pixelfed.

jherazob@beehaw.org · 2 months ago

How the hell does he do so much? 😄

4am@lemm.ee · 2 months ago

Seems if the messages are sent in an inherently insecure fashion, all one would need to do is set up an instance that purposefully does not filter out all the things it’s supposed to be kind/competent enough to filter out, and boom it has everything.

Skull giver@popplesburger.hilciferous.nl · 2 months ago

Yes, just like on twitter, reddit, and most of the other platforms the Fediverse is trying to replace, server admins are free to read your messages. There’s no encryption. The Fediverse just adds more server admins to the mix.

I would not recommend using the DM function on most Fediverse platforms for things you’d like to keep private. While in most cases there are no privacy risks, there are also very few guardrails to ensure that.

You’re better off using a federated platform with encryption support like Matrix or XMPP. Neither of those is very safe if you don’t verify the other person’s keys (although the same goes for any other chat service, even Signal), but both are much safer than Fediverse DMs.

If it weren’t for the lack of shared credentials, I would’ve expected someone to add a minimal secure chat client to the Lemmy frontend already, especially on the servers that already host a Matrix server.

kevincox@lemmy.ml · 2 months ago

It’s not “inherently insecure”, at least not to that degree. (One could argue that the lack of E2EE is insecure.) If you stand up an unrelated instance, you shouldn’t be able to access private messages that don’t relate to an account on your instance. So only bugs in your instance, or your conversation partner’s instance, will be able to leak those messages.
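A rough sketch of that argument, under a deliberately simplified model: assume every actor ID is a URL on its home server, and that a non-public message is delivered only to the inboxes of the actors it is addressed to. The function name and the `example` hosts are hypothetical, and real delivery has more moving parts (shared inboxes, followers collections) that are ignored here.

```python
from urllib.parse import urlparse


def hosts_that_receive(sender, recipients):
    """Hosts that end up holding a copy: the sender's server plus the
    server of every addressed actor (simplified delivery model)."""
    return {urlparse(actor).hostname for actor in [sender, *recipients]}


# Alice on a.example sends a "DM" to Bob on b.example.
dm_hosts = hosts_that_receive(
    "https://a.example/users/alice",
    ["https://b.example/users/bob"],
)

# An unrelated instance, c.example, is never handed the message.
print("c.example" in dm_hosts)
```

In this model the unrelated instance simply never gets a copy, so it has nothing to leak. The exposure is limited to the servers in `dm_hosts`, their admins, and any bugs in their software.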

IllNess@infosec.pub · 2 months ago

If we hit these AI companies with targeted lawsuits, like how Scientology got its way with the IRS, maybe then they will listen and stop stealing our shit.

The MPAA and RIAA have created all these laws and used our own government against us. Maybe we can use these same laws and do the same.

sfera@beehaw.org · 2 months ago

I was confused for a minute, not understanding what (Apache) Maven has to do with social networks.

Pekka@feddit.nl · 2 months ago

Maybe we have some bias on this topic, but I had the same thought. Maven is such a well-known tool in IT that I’m surprised they just created a social network with the same name. Until they get a bit famous, this won’t be good for SEO.

darkphotonstudio@beehaw.org · 2 months ago

I wouldn’t have a problem with all this scraping, if these companies had to release their models trained on this data as open source.

esaru@beehaw.org · 2 months ago

That’s a great idea. Can we not apply a license to that social content that forces AI models trained on it to be open source?

renard_roux@beehaw.org · 1 month ago

That’s actually pretty good. And then they’re open to getting sued when caught.

I guess it could be done on an instance basis, although I’m not sure how happy fediverse users will be if their instance has an official policy of open-sourcing (or maybe it’s public-domaining?) all their content by default.

esaru@beehaw.org · edit-2 30 days ago

Well, such a license could just obligate them to open-source any AI model that has been trained on it. Whether the instance prohibits training of AI models or allows it would be a separate condition that’s up to the instance owner, and its users can decide whether or not they want to contribute under that condition.