Steem Sincerity - Improved Anti-Spam API

in #steemdev, 7 years ago (edited)



Steem Sincerity is a project aimed at helping to address the spam problem we have on Steem.

As I explained in my introductory post, there are three aspects to this. This post discusses the most important aspect in more detail.

Public API for Developers

This is a service hosted on my server(s), which can be queried by any front-end website or app to obtain information about Steem accounts. It uses a database which stores the last 7 days' worth of posts, comments and votes.

Periodically the software extracts meta-data (data about the data) from these accounts, and much of this can be easily accessed by application developers using the methods here. The meta-data for each account is also fed into a kind of artificial intelligence software which looks at how it compares to other known spamming and bot accounts, so it can 'classify' each active account.

What is classification?

In machine learning, classification is an approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. So from our perspective, we first 'train' the classifier by giving it three lists of Steem accounts that have been manually classified as either Human Content Creator, Spammer or Bot.

It is programmed to be able to extract the relevant meta-data - or what are called features in machine learning - from these accounts. Some of the many features used in the Steem Sincerity software are: number of comments, number of posts, average number of downvoted comments, average word length etc. It looks at how these features vary between the different classes of account, and makes rules for itself to use when deciding how to classify accounts that it hasn't seen before.
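To make the idea concrete, here is a minimal sketch of a nearest-neighbour classifier that turns features into per-class scores. It is only an illustration: the two features, the training values and the function name are invented, and the real Sincerity software uses many more features and its own implementation.

```python
import math
from collections import Counter

# Hypothetical training data: (features, label). Each feature vector is
# (comments per day, average words per comment); all values are invented.
TRAINING = [
    ((3.0, 40.0), "human"),
    ((5.0, 55.0), "human"),
    ((4.0, 35.0), "human"),
    ((80.0, 4.0), "spammer"),
    ((120.0, 3.0), "spammer"),
    ((95.0, 5.0), "spammer"),
    ((300.0, 20.0), "bot"),
    ((280.0, 22.0), "bot"),
    ((310.0, 18.0), "bot"),
]

CLASSES = ("human", "spammer", "bot")


def class_probabilities(features, training=TRAINING, k=3):
    """Return the share of the k nearest training accounts in each class."""
    nearest = sorted(training, key=lambda item: math.dist(features, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: counts.get(label, 0) / k for label in CLASSES}


# An unseen account that comments very often with very short comments.
probs = class_probabilities((90.0, 6.0))
print(probs)
```

In practice the features would normally be scaled to comparable ranges first, otherwise the feature with the largest numbers dominates the distance.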

The classifier has currently been trained using only around 30 accounts of each type and has a cross-validation accuracy of around 78%, with very little non-spam classified as spam. Cross-validation is a standard technique for evaluating the accuracy of a classifier, but of course what constitutes spam is highly personal, so inevitably my preferences will have introduced biases. A larger crowdsourced training set is planned in the near future to reduce this bias.
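For readers unfamiliar with cross-validation, the sketch below shows one simple variant, leave-one-out: each labelled account is held out in turn, the classifier is "trained" on the rest, and the prediction for the held-out account is checked. The data and the 1-nearest-neighbour predictor are assumptions made for illustration, not the procedure actually used to obtain the 78% figure.

```python
import math

# Hypothetical labelled accounts: (feature vector, label); values are invented.
LABELLED = [
    ((3.0, 40.0), "human"), ((5.0, 55.0), "human"), ((4.0, 35.0), "human"),
    ((80.0, 4.0), "spammer"), ((120.0, 3.0), "spammer"), ((95.0, 5.0), "spammer"),
]


def predict(features, training):
    """1-nearest-neighbour prediction: copy the label of the closest account."""
    return min(training, key=lambda item: math.dist(features, item[0]))[1]


def leave_one_out_accuracy(data):
    """Hold out each account in turn, train on the rest, score the prediction."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if predict(features, rest) == label:
            correct += 1
    return correct / len(data)


accuracy = leave_one_out_accuracy(LABELLED)
print(f"leave-one-out accuracy: {accuracy:.0%}")
```

With a toy data set this cleanly separated the score is unrealistically high; on real accounts the classes overlap, which is why a figure like 78% is more typical.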

Rather than making a direct prediction about whether an account belongs to a spammer, the API actually returns the probabilities of the account belonging to each of the three classes. For example, an account may show the following classification scores:

Human Content Creator: 45%
Spammer: 45%
Bot: 10%

Each front-end using the API can make its own decision about what should happen at different spam thresholds. For example, it could fade the comment if the spammer score is between 40-70% and hide it altogether if the score exceeds 70%. It could even leave this up to the user to decide.
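A front-end could implement the example thresholds above along these lines. This is a sketch only: the function name and the "show"/"fade"/"hide" return values are hypothetical, and scores are assumed to be fractions between 0 and 1.

```python
def comment_display(spammer_score, fade_at=0.40, hide_at=0.70):
    """Decide how to render a comment from its author's spammer score.

    Scores are fractions in [0, 1]. The defaults mirror the example above:
    fade between 40% and 70%, hide when the score exceeds 70%.
    """
    if spammer_score > hide_at:
        return "hide"
    if spammer_score >= fade_at:
        return "fade"
    return "show"


for score in (0.10, 0.45, 0.85):
    print(score, comment_display(score))
```

Keeping the thresholds as parameters makes it easy to expose them as a user setting, as suggested above.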



This is a very simple illustration of how accounts with comments containing certain combinations of features may be classified as spam. The red dots represent real spamming accounts, and the pink area shows the accounts which are classified as spammers. The accuracy is not perfect, but good enough to be useful. In practice the machine learning algorithm used by the Sincerity software uses far too many features to be able to show in a two-dimensional diagram.

API Specifications

If you are a developer, you can find the API specification here. There are currently 10 methods, and since the main intention is to help improve front-end user experiences, performance is prioritised over having larger amounts of historical data. Currently no API keys are required, and request rate limiting is fairly relaxed, but this may need to change depending on future demand.

Main Methods

/api/accounts-info/account1,account2,account3

This expects a comma separated list of accounts, and returns various useful meta-data about the accounts. This includes the probability that each account is a: Human Content Creator, Spammer or Bot. It also includes some metrics about the commenting and voting behaviour of the accounts. Note that only accounts which have commented in the last period will have records in the database. Because up to 100 accounts can be queried at a time, this is the most useful method for hiding or changing the appearance of spam in your application.
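Because of the 100-account limit, a client rendering a long comment thread will need to split its account list into batches. A minimal sketch of building the request URLs follows; the host name is a placeholder (the real one is in the API specification), and the helper name is hypothetical. Only the path format and the limit of 100 come from the description above.

```python
from urllib.parse import quote

# Placeholder host: substitute the real one from the API specification.
BASE_URL = "https://sincerity.example"
MAX_ACCOUNTS_PER_REQUEST = 100  # limit stated in the method description


def accounts_info_urls(accounts, base_url=BASE_URL):
    """Build one accounts-info URL per batch of up to 100 account names."""
    urls = []
    for start in range(0, len(accounts), MAX_ACCOUNTS_PER_REQUEST):
        batch = accounts[start:start + MAX_ACCOUNTS_PER_REQUEST]
        names = ",".join(quote(name) for name in batch)
        urls.append(f"{base_url}/api/accounts-info/{names}")
    return urls


# 250 made-up account names need three requests.
urls = accounts_info_urls([f"acct{i}" for i in range(250)])
print(len(urls))
print(urls[-1][:60])
```

Each URL can then be fetched with whatever HTTP client the application already uses. Remember that accounts with no recent comments will simply be missing from the response.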

/account-full-info/account1

This returns the complete analysis information that is held for the account specified. There are many fields, a few of which are unused. You may want to query this when an account profile is clicked, for example.

/account-comments/account1

Returns a time-sorted list of the comments made by the specified account in the last 7 days.

/account-outgoing-votes/account1

Returns a time-sorted list of the votes made by the specified account in the last 7 days.

/account-outgoing-downvotes/account1

Returns a time-sorted list of the flags given by the specified account in the last 7 days.

/account-apps-used/account1

Returns the list of apps the specified user has used to post and comment in the last 7 days.

/biggest-spammers/

Returns the 500 accounts most likely to be spamming accounts. This may be useful for stakeholders employing bots to clean up the platform.

There are a few other methods, and I will add more over time.


I'll be improving the Chrome Sincerity extension soon, to use some of these new methods.

If you have other requirements for a different API method or need to apply machine learning to different data, I'd be delighted to work for STEEM ;)


This post was funded/promoted by @DevFund using a budget of about 360.00 USD on voting bots.

100% of the money sent or earned via upvotes to this account will be powered down and used to give back via promotion bots to Steem ecosystem development initiatives like this one.

https://steemit.com/@devfund/comments

This is FUCKING SPECTACULAR! Thank you for putting this together.

BEWARE SPAMMERS, NOW THAT I CAN FIND YOU, I WILL COME AFTER YOU.

/biggest-spammers/ better run...

[Image: a dog hiding]

Nice demonstration, @drakos. Ha ha..

Thanks for the support! Let me know if you want different views of the data.

We definitely need to support this amazing project. As this community gets bigger and bigger, some users are abusing it. Well done @andybets

thanks @andybets this will make the steemit community a better place to be . :)

Hi, awesome work! Would you also like to have users input on this?
I am thinking about using this on SteemPlus extension (currently about 1600 active users) and could code something to report spammers / bots to your API if you want to take human feedback into account. You can contact me on Steem.chat/Discord if you're interested.
EDIT: self voting for visibility

That'd be excellent! I was thinking about the possibility of adding that to my very simple extension, but since yours is much better than I could do, and you have lots of active users, it makes great sense. I'll be in touch. :)

Great! Waiting for your message then.

This API sounds awesome!

Maybe MB will use this in the coming days to detect abuse ;)

Hey reggae, did you notice @art-universe made a painting of you?

here's the link to the original post if you wanna go check it out.

What are the use cases of these ? Is it like people can see and upvote or flag accordingly ? or is this meant for @steemcleaners?

It has many uses which app developers will decide, but one is that it can be used for re-rendering comment sections in front-ends to hide spam.

@steemreports will shortly have some tools to display this info for end-users.

One men one account ?

@andybets great idea, but for many people (like me) it could also mean less visibility. For some reason, I was human before but now I am identified as a spammer (which is pretty weird as I haven't been active in the last couple of days) and there's really not much you can do about it..

Sorry for this inaccuracy. It is clear to me you're not a spammer, so I've added your account name to the training data. When the next version is released your scores should improve.

Thank you so much! I was also wondering how the "personal voting option" for the steemplus extension plays a role in it? Like, how much is the voice of a personal voter weighted against the api?

The data from SteemPlus is used to help form the training data that informs the API what spamming and bot accounts look like, so it can make estimates about the thousands of other accounts that it isn't given a classification for. There are various other data sources as well as SteemPlus though.

Oh wow, great! Thanks for making that clear!

All I can say is: wow this is freakin cool! I am going to add this to my list of things to integrate into the post promoter voting bot software!

Great! Let me know if you would like any changes on my side.

Hi @andybets! Although I'm very excited about the API, as a frontend user my perspective is userish: I would rather prefer it "onload" than "onclicked".
chrome.browserAction.onClicked.addListener
If a user installs the extension, she wants it 2 b active by default. Correct me if i'm wrong.

Thanks for the input. I actually asked about this issue here (or maybe that's where you saw it?):
https://github.com/andybets/steem-sincerity-chrome-extension/issues/1

I think I now understand how this should work, and will start working on the next version of the Chrome extension soon. I think I may not even need the background page, but am very new to Chrome development.

Saw it now, sorry, wasn't aware of ur awareness:)
2 your concern of load on the API, i think load is the first indicator of success and worth thinking about. like some sort of incentive 4 users 2 share their comp's resources... but as i'm diving deeper, it bcomes clear 2me, that i'm trying 2 reinvent steem and that job is already done, pretty fucking well.
If i can help u by my old cpu/hd/bandwidth and even 4 redundancyz sake, i'll gladly do.

Hi I'm confused why it said I was 60% spam? All my post are encouragement and from the heart? Is there something I don't know?


Hi, your account @cliffpower is not classified as spam: http://steemreports.com/sincerity-accounts-info/?accounts=cliffpower

...do you have another that you are referring to?

What about the guy who doesn't pay, is there something I can do? I'm new at steem since January and still figuring this all out. Now we have spam police who just seem to steal your money. @buildawhale and @smartsteem did the same thing to me? do you have any advice :)

THANK YOU, I just want to be a good player :) I'm one man one account.

You are not on the @buildawhale blacklist, so I don't know how you can claim we stole money from you.


Why is it I don't get paid for a post from you and a post from @smartsteem. Maybe it's my mistake but I can't see what I'm doing wrong? Why didn't @smartsteem pay? That's 70 steem dollars I invested

Thank you for replying

You got an upvote worth $151 from SmartSteem.

I don't see when you used my bot (@buildawhale) but it is the same thing. We respond with an upvote, we don't give you cash back. If that was the case, we would just use it for ourselves to print unlimited money and open a theme park on the moon.

That was the first bid I did and he paid but the second one he did not pay. I was expecting a vote value of $365.87 to be sent to my blog post? I'm going to take some time and read how all this works. Even though I've received upvotes I don't fully understand it. Cheers

You are looking at an estimate if no one else bids after you (which is almost never the case).

We definitely need anti-spam tools as it's already an issue. I'd like to see ways to use the list of problem accounts to flag them and prevent them profiting. That should help discourage it. All the best with this project

I agree. The 'Biggest Spammers' list is only useful if a bot decides to use it. I'll also add a list of account links to steemreports soon, and after that, maybe even some kind of interface that uses SteemConnect to make it really quick and easy to flag spam manually.

Hi, i've just found Steem Sincerity in SteemPlus, I've been using it for 2 days now. This is a great tool i think. But i have a question. How can it be calculated? One of my friend is a newbie steemian, @zitus. She made only some posts but she is considered as a 38.14% human, 34.97% spammer and 26.89% bot. And me, as 58.40% human, 40.00% spammer(!!!)and 1.60% bot. Well, i frequently use the same phrases, like "dear Steemies, today is orange, TuesdayOrange" (and other colors for each days of the week)because it's more comfortable for me than formulating different English sentences. That's a hard effort for me because my English is not so good. And recently i made much more posts, 6-7 a day (but they were all good quality posts) Other question: does it count, that i use upvote bots, 3 times daily?

Hi, these scores are indications or probabilities which application developers can use in their interfaces for excluding or penalising accounts considered to be spammers. Many will not take any action until the spammer score is above 70%, so you don't need to worry about this. New accounts have baseline probabilities, which are 40% human, 30% spammer and 30% bot, and as you interact with the platform they are re-evaluated.

Here you can see your current scores:
http://steemreports.com/sincerity-accounts-info/?accounts=kalemandra%2C+zitus

Only accounts in the 'Spammer' triangle may be penalised by some app developers if they're using the Sincerity API.

let us hope that we all will be safe right now

This sounds really great! I get tired of some of the known spammers and bots out there.

What will really improve things, is when bloggers can do better than "mute" but can actively block certain people from commenting on their blog. Spammers need prey and the prey being able to better defend themselves would be great.

Until then, your efforts to reduce spam are greatly appreciated!

I agree with this. I totally understand why you can't stop people voting or downvoting where the reward pool is concerned, but see no reason that people should need to allow everyone to comment on their posts.

I've had a couple of unwanted bots commenting on my posts. One was that "catfacts" bot who puts useless trivia about cats in the comments of anyone who uses the word "cat" as a tag. (Mute #1 for me.)
The other is the "cheetah" bot which gets some respect from what I have seen, but in my case, all it could do is provide the link I'd already provided in my article! (Mute #2.) Another person I've ended up muting because the guy puts useless comments everywhere. He can still comment on my posts though, which is annoying. I know I'm "preaching to the choir", but I know you understand where I'm coming from.

I am now very cautious in commenting post and tapping any link. Just like this account @tomole444. Every time I saw his/her comment it scares me.

Are there any plans to release the source code under an open source license?

I am considering this, but haven't decided yet. If so, I would like to recover more of my development costs before it happens.

EDIT: I should also say that there is a cost to this in that spammers would then be better able to circumvent any measures that may arise from their adverse spam scores.

Sounds completely reasonable! I'm a representative from Utopian.io - sounds like you might just be looking for us if you want to go open source and get rewarded in steem.

Thanks. I've used Utopian for some software it's great. I'm just unsure about this project.

No problem, I felt kinda weird advertising us there, I just came to comment since I like the project. I just figured you might want to consider this as an option. Let me know if you do decide to go open source!

AWESOME! I applaud this so much!

but I am still sad my cuddle-bot (delivering tons of upvotes and barely ever leaving a comment without upvoting) has made it into the top 500 biggest spammers

I don't think there's many who interact with the kitten who actually see her as spam... but I understand how an algorithm may get to that impression.

P.S.: maybe some of this data could be incorporated into the rating to determine how spammy an account actually is?!

at the moment a very obnoxious spammer (@tomole444 for example) does not make it into the top 500 (despite ~5k comments and ~650 flags received) while my cuddle-delivery service does get caught with "only" 160 comments and zero flags!

I'd also like to have seen this account with a lower spam score, and higher bot score. I will add it to the bot training set. ;)

These factors you mention are included in the rating, but the accuracy and effectiveness is limited by the training set, which I'm in the process of expanding.

I see! Thanks for the feedback... expanding the training data by reasonable but not too biased examples will be the major challenge (as it seems to be with AI).

I'm curious to see how the detection will improve over time. Thanks a ton for the efforts you are making on this!

Wow nice development for Steemit :) Personally I do ML using randomforest for classification problems. Key issue is to have sufficient features yet not overfit my predictions. Luckily RF does output probability scores for each classification so it makes it easier to set different thresholds.

Depending on how much data you have in your training set, I guess I can take a look at your APIs. Not sure if I could contribute as I am also tired of those pesky spammers while doing nothing about it. Im sure your work can help create whitelists and blacklists or give out a “spam” rating for every user. Ratings should be kept below certain threshold. It could probably help mirror the reputation but focused on catching spam.

Have a good day and hope we can chat a bit more about this implementation.

Thanks. I'm not familiar with random forest, but I see it relates to nearest neighbours, which my implementation uses. My training set isn't really big enough for it to be highly robust/unbiased, but I hope to fix this soon.

So, are you considering me as spam ?

The software just provides spam probability scores. How apps decide to use those is up to them. That said, I think your account's spam score at 43% looks too high, so will add you to the training set for next time.

Same for me suddenly...

This is really awesome! What ML algorithm did you use?

Thanks. It's currently using k-nearest neighbors, but I'm still investigating what works best.

I was originally planning on building something like this for a global blacklist before funding was cut. When I am on a desktop I’ll check it out.

This is very nice post... Lol :D
Thank you for your work, tip!

This is a nice piece of work! Have you been able to get the training data sets you were looking for?

Not yet. I didn't get a lot of interest, so I'm devising the best way to crowdsource it.

This tool is amazing, thank you.
And the best part is that it is developed as an API. This way, many of us can use it in our tools.

I have created a new tool called Custom Feed. Where you can filter posts by reputation, resteems, payout, number of votes, comments, body length, tags, authors, among others.

This way, it will be easier for you to find the content that you want to read. Maybe you are interested in it. Details here.

Didn't finish 2 read yet, just browsed to the methods and was awed with an urge 2 thank u. Now back 2 the article :)
OK, the rest of the article was what i already saw, but the extension is a real candy!

Hi, this is a pretty cool initiative. Any steps or ideas to classify posts directly as spam instead of accounts?

By the way are you reachable on any discord server? I'm working with Machine Learning on Steemit Blockchain data, too. Mainly trying to find good content, rather than punishing bad actors :-D. I'm interested in exchanging ideas if you like.

Btw, I found a bot that tries to achieve a similar goal to your initiative (maybe you can get in touch, too):
https://steemit.com/introduceyourself/@duplibot/introducing-duplibot-reducing-rewards-on-comment-spam

cough.(By the way I did find this by using my own content search bot @hounddog ;-)

I decided that since, unlike email, the senders of messages can't be spoofed in Steem comments, and that account intentions would change slowly if at all, that accounts were a better level of granularity than comments for classification. I do see a lot of merit in an additional layer for scoring individual comments though, and these could in fact feed into an account classifier.

I'm in the steemdevs server in discord, but am not very familiar with it, and also steem.chat as @andybets. I'd be interested in what you're working on. :)

So people will be able to run bots that only look at the list created?

They can use any of the APIs listed and some allow them to check any account's classification scores. There will be more coming over time.

I would like to see more interesting facts like this, do not you?

Its reall to like awesome work

Wow I like your style. I'm a beginner your post just amazes me.

Nice job
Now all steemian may be secure

Hi @andybets! You have received 0.3 SBD tip from @cardboard!

Click here to learn more :)

You can now delegate SP / invest in @tipU for daily profit :)

You got a 59.83% upvote from @postpromoter courtesy of @devfund!

Want to promote your posts too? Check out the Steem Bot Tracker website for more info. If you would like to support the development of @postpromoter and the bot tracker please vote for @yabapmatt for witness!

Thank you for your tireless effort invested into this work to update us @andybets

Nice research..
We need to eliminate spammers from steemit
I'll be willing to join in searching for more methods.

Thanks for your nice sharing ,it is a very good post

Thanks @andybets for the education. This is helpful information.

that's really helpful information, thank you

good post...

Very helpful post for understanding how to use steemit account. I am thankful to you I will follow on your instructions. I understand that this platform is very helpful for those who are true accounts. Thanks again for sharing great information.

Wow spectacular

Nice !
Lol i jolie d'or sûre it's a real problem the community shouldn't be the center of the action dev should dix ans be more open to problem about this platform

Nice post i like it @andybest

nice logo... can use it for my blog?

nice good work


Congratulations @andybets! Your post was mentioned in the Steemit Hit Parade in the following category:

  • Pending payout - Ranked 5 with $ 430,67