JunkMatcher Howto: Pattern Tests


What is a typical workflow to add a new pattern test?

  1. You load a piece of spam into the Analyzer window.
  2. You switch back and forth among the different message views to find the most distinctive text that makes this message "spammy".
  3. You start typing your experimental pattern into the rounded text field that's directly underneath the message content, and hit the "Match" button to the right to test. The matched portion of the text will be highlighted in red.
  4. When you're satisfied, hit the "Add" button and the pattern test targeting the currently selected message view will be appended to your list of tests in the lower half of the window.
  5. You may want to spawn the test to target additional views.
  6. You may want to drag to reorder some of these test instances to different positions in the list.

How do I re-target a pattern test to a different message view?

In the test table of the Analyzer window (the lower-half), every pattern test displayed in a row is given a little popup menu to the right side of the row, and you can use that menu to change the targeting view for the test, like this:

[Screenshot: Re-target views for a pattern test]

How do I spawn a pattern test to target additional message views?

Easy: first select a pattern test in the Analyzer window, then hit the "Spawn" button at the bottom of the window. This adds an almost identical pattern test to the end of the test table, with only one difference: its target is the first message view that has not yet been targeted by the same pattern. If that is not the view you want to target, you can always switch the test to a different view.

How do I activate certain pattern tests only on emails from a certain account?

The answer is almost identical to this one.

How do I activate certain pattern tests only on emails written in certain languages?

Bring up the Test Inspector window on the pattern (by double-clicking on it in the Analyzer window). Enter a pattern that matches the charset of the target language into the "Encoding Pattern" field:

[Screenshot: Test Inspector on a Pattern]

For example, to activate a pattern only on emails written in Traditional Chinese, enter "^big5" in its Encoding Pattern field.
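As a rough illustration of what "^big5" does, the sketch below tries the pattern against a few charset names using Python's re module. How JunkMatcher obtains and normalizes the charset string (for example, whether it lowercases it first) is not described here, so the lowercase inputs and the helper name encoding_matches are assumptions made for the example.

```python
import re

def encoding_matches(encoding_pattern, charset):
    """Hypothetical helper: True if the pattern matches the charset name."""
    return re.search(encoding_pattern, charset) is not None

print(encoding_matches(r"^big5", "big5"))        # True
print(encoding_matches(r"^big5", "big5-hkscs"))  # True
print(encoding_matches(r"^big5", "iso-8859-1"))  # False
```

Because the pattern is anchored with `^', it only accepts charset names that begin with "big5".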

What is a user pattern, and how do I change a non-user pattern into a user pattern?

The distinction between a user pattern vs. a non-user pattern is only meaningful when you:

The basic idea is that when you get new patterns from an outside source, a comparison is made between the set of new patterns and your set of non-user patterns; i.e., only non-user patterns are candidates for change: they could be removed or changed, or new non-user patterns might be added. In other words, non-user patterns are "managed" and are subject to any future updates.

Important corollary: if you want to make sure a certain pattern won't be touched, you should change it into a user pattern. Take a look at the screenshot here: see that little "user" checkbox? Ticking it will change a pattern into a user pattern.

How do I import patterns from a JunkMatcher Pattern Package (jpp) file?

To import patterns, first you need to already have a JunkMatcher Pattern Package file - this file has a file extension .jpp, and it can be created by exporting patterns. The jpp file contains both meta patterns and patterns, and its icon looks like this:

[Icon: JunkMatcher Pattern Package]

You can start the importing process by double-clicking on a jpp file, or drag the file and drop it on top of JunkMatcher.app. Alternatively, you can choose the menu item "Import Patterns" under the File menu, and then choose a jpp file to import.

After opening a jpp file, a dialog box will pop up to tell you some basic info about the package: who created it, when, etc.:

[Screenshot: Pattern Package Info]

Clicking on "Show the Pattern Deltas" will then initiate a comparison between the current set of non-user patterns you have and the patterns contained in the file. If the comparison finds no difference, you will be told so and the process ends. Otherwise, you will be presented with a Pattern Delta window showing all the changes contained in the jpp file:

[Screenshot: Pattern Delta window]

(That's right - it is the same window you see when installing the factory version of the patterns finds changes needed to revert your non-user patterns back to the factory set.)

In the Pattern Delta window you can select the changes you want to accept, so in effect you can selectively import patterns. Or you can hit "Accept All" and then "Proceed" to accept all changes - you will then have exactly the same set of non-user patterns contained in the jpp file.

One last word: any change won't be final until you hit the "File -> Save Tests" menu item.

How do I export patterns to a JunkMatcher Pattern Package (jpp) file?

To export patterns, choose the menu item "Export Patterns" under the File menu. You will be presented with this window:

[Screenshot: Exporting Patterns]

You have three options in terms of choosing what patterns to export: you can export only the non-user patterns (maybe you want to publish only these less personalized patterns), only the user patterns (for backup), or all of the patterns. In any case, the following will also happen:

  • All of your meta patterns will also be exported, regardless of whether they are user or non-user meta patterns; and
  • All of the patterns/meta patterns in the exported JunkMatcher Pattern Package (jpp file) will be marked as non-user (so others can import them to affect their non-user patterns).

The window above also allows you to enter some basic information that will be shown to a user when she imports the resulting jpp file - maybe indirectly from receiving a pattern update online.

If you intend to publish the resulting jpp file, read on here.

Click on the "Proceed" button, choose a filename to save, and you're all set.

How do I get new pattern updates online?

You can do it via the menu item:

Check for Pattern Updates

Or tell JunkMatcher to check for pattern updates every time you launch JunkMatcher.app.

If a new pattern update is found, the new pattern package will be downloaded and an importing process will then take place. The only difference compared to importing from a jpp file is that you don't actually see or open the file.

How do I publish my patterns online?

The first step of publishing your patterns is of course exporting your patterns. When the "Exporting Patterns" window shows up, make sure you tick the box "Generate a pattern news file for publishing". This will instruct JunkMatcher to generate a file PatternNews (no file extension), which contains the necessary information telling a user where to download your patterns, and what the MD5 checksum of the jpp file is. The URL you will enter there has to be a direct link leading to your jpp file, and it doesn't need to be the same URL where you will put your PatternNews file.

(If you don't know what an MD5 checksum is, it's basically a "fingerprint" of a file: we can compare the fingerprint with the actual file to see whether the file has been tampered with or corrupted.)
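To make the "fingerprint" idea concrete, here is a minimal sketch that computes MD5 checksums with Python's standard hashlib module. It is only an illustration of the concept: the helper names are made up for the example, and nothing here describes how JunkMatcher itself computes or stores the checksum in PatternNews.

```python
import hashlib

def md5_of_bytes(data):
    """Return the MD5 fingerprint of some bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

def md5_of_file(path, chunk_size=65536):
    """Return the MD5 fingerprint of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

original = b"example package contents"
published_fingerprint = md5_of_bytes(original)

# An unchanged copy has the same fingerprint; a modified one does not.
print(md5_of_bytes(b"example package contents") == published_fingerprint)   # True
print(md5_of_bytes(b"example package contents!") == published_fingerprint)  # False
```

If you want to double-check an exported package by hand, something like md5_of_file could be pointed at the jpp file and the result compared with the checksum you publish.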

After you get the exported PatternNews file and jpp file, you can then upload them to your server(s). When you announce the published patterns, tell everyone the URL where you place the PatternNews file so they can change their pattern update preference accordingly. Just remember that the URL actually leading to the jpp file has to match the URL you entered in the Exporting Patterns window.

Let me reiterate: a jpp file and its PatternNews file don't need to reside at the same location.

By targeting a pattern to both of the body view and the rendering view, aren't you double-penalizing a matched email?

Your reasoning is correct because the body view and the rendering view are just two different views of the same content.

That's exactly why there is a hidden provision: whenever a pattern matches either one of the body/rendering views, it will not try to match against the other.

However, to make building and testing patterns easier, when you click on the "Match All" button in JunkMatcher.app, it will still report matches from the same pattern in both views.

Teach me regular expressions!

In JunkMatcher you can write your own patterns using regular expressions: it is a fairly flexible and compact representation for textual patterns. In particular the kind of regular expressions used here are the ones used in the Python programming language - but you don't need to learn the language to start writing your own patterns! Don't you worry - it's really not hard. I'll explain some basics here. If you want to know more, you can read from here.

(To be more specific, the Python regular expression syntax used here is from Python version 2.3.x - this is the version Apple shipped in OS X 10.3.x Panther)

Let's take a look at some of the patterns designed to match a message's body view (each line below is one pattern):

(?i)v\W?i\W?a\W?g\W?r\W?a
(?i)p\W?e\W?n\W?i\W?s
(?i)<\s*img[^>]+(?:low)?src\s*=\s*(?:'|")\s*http:
(?i)http://\S*\.biz

The first pattern matches variations of that powerful word in most of the junk - viagra. The initial (?i) makes the pattern case-insensitive (the default is the opposite), so we'll match vIagra, viAGra, etc. The `\W' is a special sequence, representing a single non-alphanumeric character, such as punctuation marks (alphanumeric characters are things like a...z, A...Z, 0...9, etc.), so we'll be able to match vi!agra, etc. The trailing `?' makes the pattern preceding it optional, so `\W?' means the non-alphanumeric character is only optional, and we can match both viagra and vi!agra.
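If you have Python at hand, you can try the first pattern outside JunkMatcher. The sketch below uses the re module of a current Python 3 (the constructs involved behave the same way as in the 2.3 syntax); the sample strings are made up for illustration.

```python
import re

pattern = re.compile(r"(?i)v\W?i\W?a\W?g\W?r\W?a")

samples = ["viagra", "vIagra", "vi!agra", "v.i.a.g.r.a", "vi!!agra", "v_iagra"]
for text in samples:
    print(text, bool(pattern.search(text)))
```

The first four samples match. "vi!!agra" does not, because `\W?' allows at most one non-alphanumeric character between letters, and "v_iagra" does not either, because the underscore counts as an alphanumeric character as far as `\W' is concerned.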

With this much knowledge, you already know what the 2nd pattern is about. The 3rd pattern is by far the most complex one: it matches an HTML img tag that loads its image from an http: address. Here is some of the text the pattern will match:

<IMG SRC="http://blah blah blah" border=0>
<img border=0 lowsrc="http://blah blah blah">
<img src='http://blah blah blah'>

That's right, my friend - they are all capable of rendering images in your HTML-enabled Mail.app! You'll be amazed by how much creativity spammers have these days to hide the real things they want to say. But with regular expressions in our arsenal, we shall be afraid no more!

Some of the things required to understand the 3rd pattern are already covered, so let's concentrate on the new stuff here. First, `\s' is another special sequence, representing any whitespace character (such as spaces, tabs, returns, etc.), and the trailing `*' repeats the preceding item zero or more times. All in all, `\s*' will match zero or more whitespace characters, and thus allows us to skip an arbitrary number of them.

The second new thing here is `[^>]+'. The `[]' construct lists all the possible characters, while `[^]' lists all the invalid characters (for matching). So `[^>]' can match anything other than `>'. How about the trailing `+'? That's very similar to `*', but it matches one or more times (i.e., at least one time). So here you go: `[^>]+' matches one or more non-`>' characters - because we want to skip anything other than the `src' part of the string, but we don't want to wander out of the img tag yet (which ends with `>').

The last thing you'll need to know in order to understand the pattern, is the meaning of `(?:'|")'. The construct `(?:)' is to group multiple patterns into one. But what about this vertical bar? It simply signals alternatives - so pattern `(?:'|")' can match either ' or ".
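The same kind of experiment works for the 3rd pattern. This is a small sketch using Python's re module; the first three samples are the ones listed above, and the last two are made-up counter-examples.

```python
import re

# Triple quotes let the pattern contain both ' and " without extra escaping.
img_pattern = re.compile(r"""(?i)<\s*img[^>]+(?:low)?src\s*=\s*(?:'|")\s*http:""")

samples = [
    '<IMG SRC="http://blah blah blah" border=0>',
    '<img border=0 lowsrc="http://blah blah blah">',
    "<img src='http://blah blah blah'>",
    '<img src="local.gif">',          # no http: address, so no match
    '<img alt="x"> src="http://x"',   # src lies outside the tag, so no match
]
for text in samples:
    print(bool(img_pattern.search(text)), text)
```

The last sample shows why `[^>]+' is used instead of something looser: the match is not allowed to run past the `>' that closes the tag.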

There is not much left to learn in order to understand the last pattern: this one matches any URL pointing to a .biz site; for example, `http://www.spammer.biz' (for some reason, most of the spam I've seen has something to do with a .biz site). One of the new things here is how we match a single `.' (dot) character - this deserves special mention because `.' alone means an arbitrary character in regular expressions, namely, it can match any character. To use the literal meaning of `.', we need to escape the special meaning by putting a `\' before `.', so `\.' will match a dot, which is what we need here (this is called escaping). The last new thing in this pattern is `\S' - yet another special sequence. Note this is a capital S, and it represents the opposite of what `\s' (a small s) means: any non-whitespace character.
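To see what the escaped dot buys us, the sketch below compares the last pattern with a variant where the backslash before the dot was left out. The variant and the sample strings are made up for this comparison.

```python
import re

biz = re.compile(r"(?i)http://\S*\.biz")
biz_unescaped_dot = re.compile(r"(?i)http://\S*.biz")  # dot left unescaped

print(bool(biz.search("Visit http://www.spammer.biz now")))  # True
print(bool(biz.search("HTTP://WWW.SPAMMER.BIZ")))            # True
print(bool(biz.search("http://www.example.com")))            # False
print(bool(biz.search("http://www.spammer biz")))            # False

# Without the backslash, `.' accepts any character in front of "biz".
print(bool(biz.search("http://www.spamxbiz")))                # False
print(bool(biz_unescaped_dot.search("http://www.spamxbiz")))  # True
```

Note that the pattern has nothing after "biz", so it would also match an address such as http://x.bizarre.com; whether that matters is up to you.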

Finally we'll learn the meaning of `^' and `$': the former matches the beginning of a line, and the latter matches the opposite - the end of a line. This is useful in specifying a safe IP pattern so that matching IPs won't be sent off to a blacklist for examination (which saves time and bandwidth):

^127\.0

which makes any IP address starting with `127.0' safe (127.0.0.1 is the default address of your Mac); and this is also useful in specifying a filename to match a junk attachment:

(?i)\.pif$

which matches any name ending with the string `.pif' (case-insensitive).
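Both anchored patterns can be tried the same way. In this sketch the IP addresses and filenames are made-up single-line strings, which is the situation the two patterns are meant for.

```python
import re

safe_ip = re.compile(r"^127\.0")
pif_attachment = re.compile(r"(?i)\.pif$")

print(bool(safe_ip.search("127.0.0.1")))   # True
print(bool(safe_ip.search("10.127.0.1")))  # False: 127.0 is not at the start
print(bool(safe_ip.search("127.1.0.1")))   # False

print(bool(pif_attachment.search("photo.PIF")))      # True
print(bool(pif_attachment.search("photo.pif.txt")))  # False: .pif is not at the end
```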

Where can I learn more about Python regular expressions?

Here is the section in the Python documentation (version 2.3.x) describing the regular expressions usage.

What are the meta patterns for?

In JunkMatcher I extended Python's regular expressions a little bit with the addition of meta patterns: these are patterns you can use in writing your patterns (a bit like macros). By now you have certainly noticed that spammers like to substitute characters of similar shape for the original ones; for example, they write "v1@gra" for "viagra". This is meant to confuse statistical junk filters, since those count word frequencies to figure out what is junk. Without meta patterns you would have to write patterns like "v[i1][a@]gra" to match the variations, and this becomes tedious considering how many other keywords are waiting to be obfuscated ("c@$h", anyone?). Enter meta patterns: you can define a meta pattern "[i1]" and name it "i", and similarly define a meta pattern "[a@]" and name it "a". Now you can use "v(?#i)(?#a)gra" to match many variations of this popular word. Note the connection between meta pattern names and their usage: for a meta pattern with name "name", you use it in this form: "(?#name)" (that's right - "(?#...)" is used for comments in Python regular expressions, and I abused it for convenience). Managing your meta patterns can be easily done in JunkMatcher.app. Only one restriction: you can't use a meta pattern in defining another meta pattern (no recursion, that is) - at least for now.
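One way to picture meta patterns is as a text substitution that happens before the regular expression is compiled. The sketch below is only an illustration under that assumption: the function expand_meta, the META_PATTERNS table and the single-pass substitution are made up for the example and are not taken from JunkMatcher's code.

```python
import re

# Illustrative definitions: name -> the regular expression it stands for.
META_PATTERNS = {"i": "[i1]", "a": "[a@]"}

def expand_meta(pattern, metas):
    """Replace each (?#name) with its definition; one pass, no recursion."""
    def substitute(match):
        name = match.group(1)
        # Unknown names are left alone, so they stay ordinary regex comments.
        return metas.get(name, match.group(0))
    return re.sub(r"\(\?#([^)]*)\)", substitute, pattern)

expanded = expand_meta("v(?#i)(?#a)gra", META_PATTERNS)
print(expanded)                             # v[i1][a@]gra
print(bool(re.search(expanded, "v1@gra")))  # True
print(bool(re.search(expanded, "viagra")))  # True
```

Since the substitution runs only once, a definition that itself contained "(?#...)" would not be expanded again, which mirrors the no-recursion restriction mentioned above.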

What are the reserved meta patterns then?

A small set of meta patterns are special in that they cannot be altered. These patterns are called reserved meta patterns, and are built directly from your settings in Mail.app. For example, myEmails means a pattern matching any one of your email addresses specified in the Account Setting of Mail.app. If you have multiple accounts, individual reserved meta patterns are also created, such as myEmail1, myEmail2, etc. These are useful since spammers like to make their messages more "personalized", but they only know your email addresses (and don't know your full name). You can take a look at the available reserved meta patterns in the Meta Patterns drawer in JunkMatcher.app.

Help! I can't match `\', `.', `?', `+', `*', `|', `[', `(', `)', '^', '$' etc. characters!

Yes, you've been bitten by one of the thorniest problems in writing regular expressions: escaping. You see, some characters carry special meaning in regular expressions; for example, `.' is used to match a single character (any character), and `?' signals that a particular pattern is optional (can appear zero or one time), etc. To use their original, literal meaning, you have to precede these characters with a `\' (backslash) - to `escape' from their special meaning.

Here is a concrete example: to match the latest crop of mutant "\/iagra" (that's right, the `v' is written as a backslash plus a forward slash), you need to use this pattern: "\\/iagra" - note the extra backslash for escaping? Better yet, create a meta pattern for the mutant `v': "(?:v|\\/)" - this should cover both the normal and the mutant `v'.

Unfortunately there is one exception to the escaping rule above: inside "[]" most special characters lose their power, so you don't need to escape them. For example: "[s$]" matches either `s' or `$' - note you don't add a backslash before `$'. Programmers are a weird bunch, aren't they? Details can be found, again, here.
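Here is a short sketch of these escaping rules in Python's re module. The sample strings are made up; re.escape is a helper available in Python itself for producing an escaped pattern from literal text, and it is shown only as a convenience when preparing patterns outside JunkMatcher.

```python
import re

# A literal backslash needs its own backslash in the pattern.
print(bool(re.search(r"\\/iagra", r"buy \/iagra now")))  # True

# The "mutant v" alternative covers both spellings.
mutant_v = r"(?:v|\\/)"
print(bool(re.search(mutant_v + "iagra", "viagra")))     # True
print(bool(re.search(mutant_v + "iagra", r"\/iagra")))   # True

# Inside [] the dollar sign is just a dollar sign.
print(re.findall(r"[s$]", "ca$h cash"))                  # ['$', 's']

# re.escape adds the backslashes for you.
literal = re.escape("1+1=2?")
print(bool(re.search(literal, "is 1+1=2? yes")))         # True
```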

How do I write more efficient regular expressions?

JunkMatcher uses the regular expression module provided by Python, which already does some optimization when compiling each expression into its internal representation. But there are still some general rules of thumb for crafting more efficient patterns:

  • When grouping multiple patterns into one, use `(?:)' instead of `()'. For example, to say "apple or banana", write (?:apple|banana) instead of (apple|banana). This is because the latter grouping instructs Python to remember the content of the match (if any), which is not used in JunkMatcher anyway.
  • Instead of writing multiple related patterns, merge them into one using the alternative delimiter `|'. For example, you can have two patterns to match either apple or banana, but you can also write (?:apple|banana). But sometimes you might value clarity more than efficiency and decide not to merge patterns - it's up to you.
  • Try to merge the same part of alternative patterns. For example, you can write (?:refinance|refinancing), but refinanc(?:e|ing) can give you better efficiency (Python probably already did this automatically, but it never hurts to be a bit considerate). Again use your judgment between readability and efficiency.
  • When listing multiple single-character alternatives, use `[]' instead of `(?:)' plus `|'. For example, spammers like to replace characters with some other characters with similar "shape" - "interested" becomes "1nterested", etc. You can write [1i]nterested instead of (?:1|i)nterested for better efficiency.
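The rewrites suggested above change how a pattern is written, not what it matches. The sketch below checks that on a handful of made-up strings and also shows what "remembering the content" means for `()' versus `(?:)'. It makes no attempt to measure speed, and the helper same_matches is invented for the example.

```python
import re

def same_matches(pattern_a, pattern_b, texts):
    """True if both patterns agree (match or no match) on every text."""
    return all(
        bool(re.search(pattern_a, t)) == bool(re.search(pattern_b, t))
        for t in texts
    )

texts = ["apple pie", "banana split", "cherry", "refinance now",
         "refinancing", "refinanced", "1nterested?", "interested", "nterested"]

print(same_matches(r"(apple|banana)", r"(?:apple|banana)", texts))             # True
print(same_matches(r"(?:refinance|refinancing)", r"refinanc(?:e|ing)", texts)) # True
print(same_matches(r"(?:1|i)nterested", r"[1i]nterested", texts))              # True

# A capturing group remembers what it matched; a non-capturing one does not.
print(re.search(r"(apple|banana)", "banana split").groups())    # ('banana',)
print(re.search(r"(?:apple|banana)", "banana split").groups())  # ()
```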

How do I match words longer than n characters?

This is useful in identifying spam since spammers tend to use really long strings of characters as garbage to mislead statistically trained filters. According to this site, the longest English word contains 28 characters (letters), but in normal usage we might just give ourselves a smaller number, say 20. We can use this pattern:

\w{20,}

to match any word consisting of 20 or more alphanumeric characters. You can certainly juice this up based on your intuition.
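A quick check of where the threshold falls, using Python's re module and made-up strings:

```python
import re

long_word = re.compile(r"\w{20,}")

print(bool(long_word.search("x" * 19)))  # False: one character short
print(bool(long_word.search("x" * 20)))  # True
print(long_word.findall("hello qwertyuiopasdfghjklzxcvbnm world"))
# ['qwertyuiopasdfghjklzxcvbnm']
```

Keep in mind that `\w' also accepts digits and the underscore, so a long run such as an order number or an identifier with underscores counts as a "word" here too.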

How do I match vowels and consonants?

Vowels are easy: "[aeiou]", and consonants are not difficult either: "[^aeiou]". The `^' character inside a pair of square brackets means "not", so what we're saying in the consonants pattern is "any single character that is not a vowel". Note however, once outside the "[]", `^' means the start of a line.
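The sketch below tries both patterns in Python. It also includes a stricter consonant class that lists letter ranges explicitly; that third pattern is an addition made for illustration, for cases where "anything that is not a vowel" (which includes digits, punctuation and capital letters) is too broad.

```python
import re

print(re.findall(r"[aeiou]", "spam"))    # ['a']
print(re.findall(r"[^aeiou]", "spam!"))  # ['s', 'p', 'm', '!']

# Stricter alternative (an assumption, not from the text above): letters only.
consonant_letters = r"(?i)[b-df-hj-np-tv-z]"
print(re.findall(consonant_letters, "Spam!"))  # ['S', 'p', 'm']
```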

How do I match patterns spanning over multiple lines? (example: filtering 419 scams)

(thanks to Jonathan Cardozo for asking this question)

OK, say we want to filter junk mail that contains certain keywords, but the words do not necessarily show up on the same line. For example, one way to filter the (in-)famous 419 scam emails is to detect words like: Nigeria, fund, transfer, etc. It should be easy, right? Our first try is:

(?i)Nigeria.+fund.+transfer

This often won't work, since these 3 words might show up on separate lines, and by default `.' will not match the newline characters that break lines! The solution is simple: we use an additional flag `s' to indicate we want `.' to match newline characters as well (detailed here: see the explanation for flag `S' - NOTE we use the lowercase `s' inside a regexp instead of the capital `S' used when writing the expression inside Python programs):

(?is)Nigeria.+fund.+transfer
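The effect of the `s' flag is easy to see in Python with a made-up three-line message:

```python
import re

text = "Greetings from Nigeria.\nI need to move a fund\nvia bank transfer."

without_s = re.search(r"(?i)Nigeria.+fund.+transfer", text)
with_s = re.search(r"(?is)Nigeria.+fund.+transfer", text)

print(without_s is not None)  # False: `.' stops at the end of each line
print(with_s is not None)     # True:  `.' now crosses line breaks
```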

There is still a small problem though: these 3 words can show up in a different ordering! For 3 words we can have 6 different orderings, so we'll just exhaust them all:

(?is)(?:Nigeria.+fund.+transfer|Nigeria.+transfer.+fund|...)

(Yeah I got lazy - the "..." above is left to you to fill in the rest of the combinations)

I know what you're thinking: this is brain-dead! Well, I'm afraid this is how things have to be done using regular expressions. Most of the time you don't have to enumerate all possible orderings - language is not just an arbitrary juxtaposition of symbols after all. Also, choose your keywords wisely so they are both general enough to get the bad guys, and precise enough to avoid catching the good guys.
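If typing out every ordering by hand feels too tedious, a few lines of Python can generate the pattern text, which you can then paste into JunkMatcher. This is just a convenience sketch; the function name any_order_pattern is made up, and the keywords are the three used above.

```python
import re
from itertools import permutations

def any_order_pattern(keywords, flags="is"):
    """Build one pattern listing every ordering of the keywords."""
    alternatives = [
        ".+".join(re.escape(word) for word in order)
        for order in permutations(keywords)
    ]
    return "(?%s)(?:%s)" % (flags, "|".join(alternatives))

pattern = any_order_pattern(["Nigeria", "fund", "transfer"])
print(pattern)

print(bool(re.search(pattern, "Please transfer the fund\nto my bank in Nigeria")))  # True
print(bool(re.search(pattern, "fund transfer only")))                               # False
```

The number of alternatives grows quickly (6 for three words, 24 for four), which is one more reason to keep the keyword list short.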