Major topics.
Technical Info:
Each incoming message is analyzed for as many 'features' as can be found that have significance, then the pre generated feature processor rule file combinese those features to figure out the probability that the message is SPAM or not. If the message passes a minimum level then it is placed in the users 'spam' folder and the sender is notified by a url (with CAPTCHA) this allows a real sender to bypass the 'spam' folder and ensure delivery.
The user can also access their spam folder via IMAP or the SurgeWeb email interface if they suspect a message may have been miss classified
This system has several key benefits:
- No messages are blocked or dropped, failed deliveries due to filtering mistakes are enormously annoying, and no filter is ever accurate enough to allow this type of heavy handed mechanism.
- The user does not 'see' the filtered messages unless they wish to.
- A real sender can easily get their message delivered directly to the users inbox by proving they are a human.
- By identifying many different 'features' using different modules/methods the results are more robust than any single mechanism for identifying spam.
- System admins can add their own 'features' and the system will correctly analyze the 'value' of those features so the systems reliability will not be degraded by 'one' badly chosen rule/score...
- SPF Domain Keys Sender Verification - These systems allow the sender domain/ip address to be verified as genuine, although this doesn't stop all spam it can certainly stop a large bulk of spam and phishing by identifying the obvious forgeries.
- Friends - For each user we maintain a list of friends, messages from known friends are whitelisted automatically and delivered directly to the inbox, the list of friends is automatically updated by the server as it notices who you send email to, and what messages you move in/out of your spam folder.
- Automatically collected sample database - A sample database is automatically collected so that the filter rules can be tuned to match the types of spam and nonspam that are common on your server.
- Bayesian like content filter - Each day the database is used to re-tune the Bayesian like filter which works by statistically anlyzing the words and order of words. This filter does well with text content spam (phishing, scams etc)
- Phrase and Key Word Filter - This filter is manually created and regularly updated to target phrases common to spam, this filter is also good at catching common scams and any common text based spam messages.
- Web page probing - This filter carefully examines messages to find url links which are 'unknown' to it, it then probes the web pages in question to see if they contain commonly spammed products (drugs etc)
- SPF and Domain Keys sender verification - Sender verification methods are used to identify people forging email, this allows non genuine email to be quickly identified and blocked. Most spammers forge email identities in some way.
- Feature rule file - This rule file identifies many different features of email (like capital letters in the subject), these features are then fed into the self tuning feature processing module...
- Self tuning feature processing - This module takes all the information from the modules above and combines the results based on analysis of the sample database to figure out the probability of a message being spam based on all the features it exhibits. Each day the rules are recreated based on your local spam samples, and combinational rules are created automatically.
How sf_mfilter --> feature_gen.dat (note aspam_mfilter.txt is replaced with sf_mfilter.txt in this version)
The rule file sf_mfilter.txt now produces a list of significant features for any email message, the features are then analyzed using the rules in feature_gen.dat to come up with a 'score'. So the scores are not 'hard coded' into sf_mfilter.txt
The file feature_gen.dat is created by analyzing sample messages from your own server, so lets say we have a feature "blob" which on your server correlates 98% with spam, and on my server correlates 20% with spam (so in other words an email on your server with the feature 'blob' is a spam email 98 times out of a hundred, and on my server 80 times out of a hundred its not spam. Then on your server the score in x-spamdetect header will be something like "plus 10" for an email with 'blob' and on my server it will be 'minus 4'...
The feature 'blob' might relate to something like the length of the 'to' header, or weather or not the spf tests passed etc...
Then, in addition to simple rules the automatic process generates combinational rules based on your sample messages, so it might notice that a message which is from yahoo and has a long "To" header, is always spam. These 'combined' rules are also used to further increase accuracy.
SurgeMail now includes it's own RBL system (Realtime Blocking List) and Reputation system. This is a two level database, a local database based on each server, and a reporting system and DNS based query system to merge data between all SurgeMail servers in the world.
This system classifies all ip addresses into one of 5 colors
Unknown = 98% spam
Blue = less than 10 days old, nothing significant known (typically 70% spam)
Brown = 95% spam
Orange = 40% spam
Yellow = 20% spam
Black = 99% spam
White = Less than 4% spam
As most 'real' email comes from servers you talk to all the time this system quickly identifies the trusted mail servers that never send spam so that messages from those server will be very unlikely to be accidentally classified as spam.
The advantages this has over traditional RBL services (which should also be used of course)
It is free to use
It makes use of the many users clicking 'spam/not spam' on messages as they read them to help identify spam more accurately.
It provides both positive and negative indications for the spam filter, this is much more valuable than purely negative responses as given by many rbl services because the bulk of spam comes from 'unknown' transient ip addresses, so the significant information is the list of known mail servers that regularly send 'non' spam.
It is also a long term reputation rbl, so instead of automatically forgetting everything every 2 days like many rbl systems we try and store a long term record of stats for each ip address. This database can be searched here: http://reputation-email.com/reputation/index.htm
tellmail sf_train - Rebuild feature_gen.dat from sf_mfilter.txt using local data in 'train' subdirectories
tellmail sf_compare - Test feature_gen.dat on train sub directories.
tellmail friends_url - Show a sample URL for unblocking a message, use to test your web access/ports are set correctly.
g_myrbl_share "true" - Share IP reputation information with netwinsite.com (strongly recommended, this setting really helps contribute to the wide area rbl which all customers benefit from)
g_sf_generate "true" - Generate feature_gen.dat locally rather than using a standard generic one from NetWin. This is worth setting once you have a reasonable sample collected (surgemail automatically collects sample messages within a few days)g_friends_lang_auto "true" - Guess the users language(s) by observing messages from each users friends, then add a tag if the user receives a message which is primarily in a language that the user does not have listed. The users language settings are prefixed with the word 'Auto,' when this setting is used so users who have manually set their language(s) will not get adjusted.
Here is an example header:
*******: 7.8 sd=7.8 [194]99%13.1(!9,46) [126]10%-7.2(!33,108) [38]87%5.4(X-myrbl:unknown)"
This shows a score of 7.8, then a list of the rules that were applied seperated by spaces. There are two sorts of rules, simple rules and combination rules.
A combination rule looks like this: [rulenumber]percent%score([!]a,[!]b)
rulenumber = this rule number as listed in feature_gen.dat
percent = The percent of messages that have spam if this rule is true
score = The score which is generated using the percent. Anything over 50% generates a positive score, below 50% generatese a negative score.
(a,b) = The two rule which were true that made this combined rule true, ! signs are used to indicate 'not'.
A simple rule looks like this: [38]87%5.4(X-myrbl:unknown)
rulenumber = this rule number as listed in feature_gen.dat
percent = The percent of messages that have spam if this rule is true
score = The score which is generated using the percent. Anything over 50% generates a positive score, below 50% generatese a negative score.
(a:b) = The header and value that were 'matched' that made this rule true, if no header is specified then it's a feature as defined in spf_mfilter.txt
The total sd=7.8 is not a simple sum of rules, but rather an 'average' of the rules that matched. Offset by '4' to the right, e.g. sum(scores)/n+4
To disable the new Smart Filter mechanisms and return to the old behaviour!
Only use these settings if you really must :-)
g_myrbl_disable "true"
g_sf_disable "true"
g_friends_byemail "true"
g_spf_byemail "true"
You can tailor your own rules still with this new system however, we suggest you consider the following, try using the builtin rules first and see how they perform.
If you are going to use 'local' rules then you must first enable local training:
g_sf_generate "true"
When adding rules (e.g. converting an existing local.rul file) you will need to change the actions to choose from the various possibilities
- Add a manual score - call feature_manual(0.8, "Manual addition")
- Add a self tuning score based on your spam sample - call feature_add(1.4,"featurename")
In the second case the score '1.4' is ignored. So the recommended method is to convert all call spamdetect(x,y) statements to call feature_add(x,y)
In the first case the value 0.8 is NOT the value added to the spam score, it is the probability that such a message is a spam message, so a value of 0.99 might add 12 to the spam score. A value above 0.5 will add a positive value to the spam score, a value below 0.5 will decrease the spam score. Examples:
call feature_manual(0.95, "Probably spam")
call feature_manual(0.7, "may be spam")
call feature_manual(0.0, "almost never spam")
call feature_manual(0.1, "Probably not spam")
You should only use 'manual' rules when the feature is so 'rare' that your sample data does not give useful figures on it, and in that case, the rule is probably of little or no value, so we suggest you don't use it at all :-) But there are exceptions where the sample spam messages will tend to give the wrong result (as the sample is not entirely random) so a manual rule might make sense.
A good example of when to use manual rules is when using an RBL service etc so you know that the probability of it being spam is very high you can improve results by using the manual rule with a nice high value like 0.9
Then run the following commands:
tellmail sf_train
tellmail sf_compare
The first will generate a feature_gen.dat rule file and the second will use it to compare results with the sample spam folders.
If you examine 'feature_gen.dat' after the sf_train command you will be able to see what surgemail thought the feature was and how significant it was (sig = the number of messages with the feature), A feature with a probability near 0.5, or one that occurs less than 20 times in the sample is probably of little point... Near 0.0 means the feature implies the message is not spam, near 1.0 implies the feature correlates with spam...
We are always interested in new features you make up that prove useful. Be cautious that some features can give misleading results due to the nature of the sample messages.
Convert into automatic rules instructions (recommended)
- copy local.rul to sf_mfilter_local.txt
- Search/replace "spamdetect(" with "feature_add("
Convert into manual rules instructions:
- copy local.rul to sf_mfilter_local.txt
- Search/replace "spamdetect(" with "feature_manual("
- Change each spam score -x to +x into a probability instead (0.0 -> 1.0) of spam.
Never put your email address on a web page, instead use a service like this one: http://www.emailmeform.com/
If the scoring is completely blank or if you see this text in the headers:
X-SpamDetect: : 0.0 sd=0 feature_gen.net (or .dat) is blank or missing, update from netwinsite failed see netwinsite.com/surgemail/help/myrbl.htm for help
It might mean you are running a new build with the new spam handling mechanism, and most likely it's failed to pickup it's main rule file so it's not applying any rules at all.
It might fail if you don't have updates, or if you have a firewall blocking port 80 outgoing connections from your server. Once you fix the problem you can 'trigger' an update automatically by deleting aspam_update.done and restarting surgemail.
The two files you need are:
sf_mfilter.txt
feature_gen.net
They should automatically be fetched from netwinsite but that 'can' fail if
your firewall is blocking port 80 connections. In which case you could download
them manually then restart surgemail.
https://netwinsite.com/surgemail/sf_mfilter.txt
https://netwinsite.com/surgemail/feature_gen.net
Or You can disable the new system with this setting: g_sf_disable "true"