Mail.app: matching junk using regular expressions - my solution
Senior User
Join Date: Nov 2002
Location: US
Nov 14, 2003, 01:10 PM
 
(Edit: new thread
http://forums.macnn.com/showthread.p...hreadid=188887
points to the site for the tool described here)

Read this if you

1. are annoyed that spammers nowadays intentionally garble their messages to defeat Mail.app's junk filter (e.g., viagra becomes vi!agra)
2. are going nuts that some spams come in HTML in which keywords are encoded as HTML entities, such as '&#32;' for the space character


My solution involves an AppleScript script, a Python script, and 3 configuration (text) files containing the patterns you want to match (one for the subject line, one for the sender line, and one for the raw message source). Everything required is built into Panther.

I'll post the 5 files in the following messages, but here are the steps to follow:

1. Create a folder: /Users/<yourname>/Library/Scripts
2. Dump all 5 files you copy-n-pasted from my next 5 posts into that directory
3. In Mail.app, open Preferences > Rules
4. Add a rule to the end of your rule list and call it JunkMatcher (or whatever): let the rule match "every message", set "Perform the following actions:" to "Run AppleScript", and then choose /Users/<yourname>/Library/Scripts/junkMatcher.scpt.
5. Click OK to add the rule, and you should be all set.


What does this do?
Whenever a new message comes in, if its subject, sender, or raw content matches ANY of the patterns specified in junkSubj.txt, junkSender.txt, or junkContent.txt respectively, the message is moved to the junk folder. Note that this is a DISJUNCTION: a single pattern match is enough to make the message junk!
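To make the "any match means junk" rule concrete, here is a minimal sketch in Python 3 (the script in this thread targets the Python 2 that ships with Panther). The function name `is_junk` and the sample patterns and messages are made up for illustration; this is not the script from the following posts.

```python
import re

def is_junk(subject, sender, content, subj_pats, sender_pats, content_pats):
    """Return True if ANY pattern matches the field it belongs to."""
    checks = [
        (subject, subj_pats),
        (sender, sender_pats),
        (content, content_pats),
    ]
    # any() stops at the first hit, which is exactly the disjunction described above
    return any(
        re.search(pat, text) is not None
        for text, pats in checks
        for pat in pats
    )

# Hypothetical pattern lists, one per field (the sender list is empty)
subj_pats = [r"(?i)prescription"]
sender_pats = []
content_pats = [r"(?i)manhood"]

print(is_junk("Cheap PRESCRIPTION drugs", "a@example.com", "hello",
              subj_pats, sender_pats, content_pats))
print(is_junk("Lunch tomorrow?", "a@example.com", "see you at noon",
              subj_pats, sender_pats, content_pats))
```

The first call flags the message on the subject alone; the second finds no match in any field.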
(Last edited by fortepianissimo; Nov 15, 2003 at 09:12 PM. )
     
Senior User
Join Date: Nov 2002
Location: US
Nov 14, 2003, 01:11 PM
 
junkMatcher.scpt
Use Script Editor to create the file by copy-n-paste:

Code:
using terms from application "Mail"
	on perform mail action with messages theMessages for rule theRule
		tell application "Mail"
			repeat with theMsg in theMessages
				set theSubject to subject of theMsg
				set theSender to sender of theMsg
				set theContent to source of theMsg
				set result to do shell script "python ~/Library/Scripts/junkMatcher.py " & quoted form of theSubject & ¬
					" " & quoted form of theSender & " " & quoted form of theContent
				if result is equal to "yes" then
					move theMsg to the junk mailbox
				end if
			end repeat
		end tell
	end perform mail action with messages
end using terms from
(Last edited by fortepianissimo; Nov 14, 2003 at 01:46 PM. )
     
Senior User
Join Date: Nov 2002
Location: US
Nov 14, 2003, 01:15 PM
 
junkMatcher.py
use any text editor to create this file

Code:
#!/usr/bin/env python
import re,sys,os

ROOT=os.environ["HOME"]+'/Library/Scripts/'
entityPat=re.compile(r'&#\d+;')

# argv[1]=subject, argv[2]=sender, argv[3]=raw source (newlines become spaces)
content=sys.argv[3].replace('\n',' ')

# decode numeric HTML entities with codes 0-255; leave the others untouched
idx=0
while 1:
    m=entityPat.search(content,idx)
    if m is None: break
    code=int(content[m.start(0)+2:m.end(0)-1])
    if code<256 and code>=0:
        content=content[:m.start(0)]+chr(code)+content[m.end(0):]
        idx=m.start(0)+1
    else:
        idx=m.end(0)
#print content

# join every pattern in a file into one alternation; None if the file is empty
def makePat (fn):
    patStr='|'.join([line.strip()[1:-1] for line in open(ROOT+fn,'r').xreadlines()])
    if len(patStr)==0: return None
    else: return re.compile(patStr)

junkSubjPat=makePat('junkSubj.txt')
junkSenderPat=makePat('junkSender.txt')
junkContentPat=makePat('junkContent.txt')

# a match in any one of the three fields is enough
if ((junkSubjPat and junkSubjPat.search(sys.argv[1])) or
    (junkSenderPat and junkSenderPat.search(sys.argv[2])) or
    (junkContentPat and junkContentPat.search(content))):
    print 'yes'
else:
    print 'no'
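For anyone who wants to try the entity-decoding step by itself, here is a minimal sketch in Python 3 (the script above is Python 2). It mirrors the same idea, decoding only decimal entities with codes below 256, but uses `re.sub` with a callback instead of the index loop. The function name `decode_numeric_entities` is my own, not part of the original script.

```python
import re

ENTITY = re.compile(r"&#(\d+);")

def decode_numeric_entities(text):
    """Replace decimal entities such as &#105; with their character.

    Only codes 0-255 are decoded, matching the limit used in the thread's
    script; anything larger is left as-is.
    """
    def repl(match):
        code = int(match.group(1))
        return chr(code) if code < 256 else match.group(0)
    return ENTITY.sub(repl, text)

print(decode_numeric_entities("v&#105;agra&#32;now"))  # viagra now
```

In Python 3 the standard library's `html.unescape` also handles named and hexadecimal entities, which this sketch deliberately does not.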
(Last edited by fortepianissimo; Nov 14, 2003 at 01:26 PM. )
     
Senior User
Join Date: Nov 2002
Location: US
Nov 14, 2003, 01:18 PM
 
junkSubj.txt
Use any text editor to create this file. It is a list of regex patterns, one per line, each surrounded by a pair of double quotes ("). Patterns must use Python regex syntax (see, for example, http://www.amk.ca/python/howto/regex/ ; other references cover this too).

Code:
"(?i)v\W?i\W?a\W?g\W?r\W?a"
"(?i)p\W?e\W?n\W?i\W?s"
"(?i)prescription"
(Last edited by fortepianissimo; Nov 14, 2003 at 01:26 PM. )
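The letter-by-letter patterns in this file follow a mechanical recipe: put an optional `\W` (any non-word character) between every pair of letters. Below is a small Python 3 sketch that builds such a pattern from a keyword and strips the surrounding quotes from a config line. Both helper names (`obfuscation_tolerant`, `parse_pattern_line`) are hypothetical and only for illustration.

```python
import re

def obfuscation_tolerant(keyword):
    """Case-insensitive pattern allowing one non-word char between letters."""
    return "(?i)" + r"\W?".join(re.escape(ch) for ch in keyword)

def parse_pattern_line(line):
    """Strip whitespace, then drop the first and last character (the quotes)."""
    return line.strip()[1:-1]

pat = obfuscation_tolerant("viagra")
print(pat)                                          # (?i)v\W?i\W?a\W?g\W?r\W?a
print(bool(re.search(pat, "Buy Vi!agra today")))    # True
print(bool(re.search(pat, "v_iagra")))              # False
print(parse_pattern_line('"(?i)prescription"\n'))   # (?i)prescription
```

Note the limitation shown by the third line: an underscore counts as a word character, so `\W?` does not absorb it.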
     
Senior User
Join Date: Nov 2002
Location: US
Nov 14, 2003, 01:20 PM
 
junkSender.txt
Use any text editor to create this file. It is a list of regex patterns, one per line, each surrounded by a pair of double quotes ("). Patterns must use Python regex syntax (see, for example, http://www.amk.ca/python/howto/regex/ ; other references cover this too).

Code:
(that's right, the file is empty: so far I haven't tried to match against senders)
(Last edited by fortepianissimo; Nov 14, 2003 at 01:27 PM. )
     
Senior User
Join Date: Nov 2002
Location: US
Nov 14, 2003, 01:21 PM
 
junkContent.txt
Use any text editor to create this file. It is a list of regex patterns, one per line, each surrounded by a pair of double quotes ("). Patterns must use Python regex syntax (see, for example, http://www.amk.ca/python/howto/regex/ ; other references cover this too).

Code:
"(?i)v\W?i\W?a\W?g\W?r\W?a"
"(?i)p\W?e\W?n\W?i\W?s"
"(?i)prescription"
"(?i)manhood"
(Last edited by fortepianissimo; Nov 14, 2003 at 01:27 PM. )
     
Mac Elite
Join Date: Mar 2001
Location: Provo, UT
Nov 14, 2003, 02:08 PM
 
You should send this hint to OSXHints.
     
Senior User
Join Date: Nov 2002
Location: US
Nov 14, 2003, 08:55 PM
 
These are new patterns I added to junkContent.txt

Code:
"(?i)<\s*i(?:=\s*)?m(?:=\s*)?g(?:=\s*)?[^>]+(?:l(?:=\s*)?o(?:=\s*)?w(?:=\s*)?)?s(?:=\s*)?r(?:=\s*)?c(?:=\s*)?\s*(?:=|=3d)\s*(?:'|")\s*h(?:=\s*)?t(?:=\s*)?t(?:=\s*)?p(?:=\s*)?:"
"(?i)pill(?:s)?"
The first pattern will throw any message that references an image via http into the junk folder.

The 2nd is an obvious addition, and can be added to junkSubj.txt as well.
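To see what the image pattern does and does not catch, here is a quick Python 3 check. The pattern is copied from this post (without the surrounding config-file quotes); the sample tags and the name `IMG_HTTP` are made up for illustration. The `(?:=\s*)?` pieces and the `=3d` alternative appear to target quoted-printable encoding, where `=` is written as `=3D` and a line can be broken with a trailing `=`.

```python
import re

# Pattern from the post, split across three raw strings for readability
IMG_HTTP = re.compile(
    r"""(?i)<\s*i(?:=\s*)?m(?:=\s*)?g(?:=\s*)?[^>]+"""
    r"""(?:l(?:=\s*)?o(?:=\s*)?w(?:=\s*)?)?s(?:=\s*)?r(?:=\s*)?c(?:=\s*)?"""
    r"""\s*(?:=|=3d)\s*(?:'|")\s*h(?:=\s*)?t(?:=\s*)?t(?:=\s*)?p(?:=\s*)?:"""
)

samples = [
    '<img src="http://example.com/a.gif">',      # plain HTML
    '<IMG SRC=3D"http://example.com/a.gif">',    # quoted-printable "="
    '<im= g src=3D"http://example.com/a.gif">',  # soft line break inside the tag name
    '<img src="cid:logo">',                      # embedded image, no http
]

for tag in samples:
    print(bool(IMG_HTTP.search(tag)), tag)
```

The first three samples match and the last one does not, so images embedded in the message itself are left alone.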


Junk mails, die die die!

(edit: updated pattern for external image, also removed "microsoft" pattern - too aggressive)
(edit: updated img pattern - more coverage)
(Last edited by fortepianissimo; Nov 15, 2003 at 02:33 PM. )
     
Senior User
Join Date: Nov 2002
Location: US
Nov 14, 2003, 09:00 PM
 
Originally posted by clarkgoble:
You should send this hint to OSXHints.
Done that - thx for the suggestion.

My secret hope is that, once everyone is using this, those sons of b*tches will find their messages all going down the drain.

Of course, my 2nd secret hope is that everyone not using a Mac will still get loads of spam as before.
     
   