PCRE and NewsSrv

This document describes how the strings identifying a post are computed from the subject line in NewsSrv, starting from version 0.3.0. It assumes that you are familiar with regular expressions, specifically Perl-compatible regular expressions as implemented by libPCRE.

In the configuration pages of NewsSrv, you'll find two new configuration groups: "Post rules" and "Post rules groups". "Post rules" lets you configure arbitrary regular expressions, and "Post rules groups" lets you gather these rules into so-called post rules groups, which can then be associated with GlobalGroups.

In addition to the regular PCRE syntax, these expressions can contain two special tokens: %s and %t. Both are equivalent to ([0-9]+), with the additional meaning that if %s matches, the matched number is taken as the sequence number of the message in the post, and if %t matches, the matched number is taken as the total number of messages in the post. You can obtain a literal "%" by doubling it.
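The token expansion described above can be sketched in a few lines of Python. This is only an illustration, not the NewsSrv code: the function name expand_tokens is hypothetical, Python's re module stands in for libPCRE, and the named groups "seq" and "total" are a convenience of this sketch (the text only says the tokens behave like ([0-9]+)). With named groups, each token can appear at most once per expression.

```python
import re


def expand_tokens(expression: str) -> str:
    """Turn %s, %t and %% into plain regular expression syntax.

    Assumption of this sketch: %s and %t become named groups so that the
    sequence and total numbers are easy to read back from a match.
    """
    table = {
        "s": "(?P<seq>[0-9]+)",    # sequence number of the message
        "t": "(?P<total>[0-9]+)",  # total number of messages in the post
        "%": "%",                  # doubled percent sign gives a literal %
    }
    # Any other "%x" pair is left untouched.
    return re.sub(r"%(.)", lambda m: table.get(m.group(1), m.group(0)), expression)


match = re.search(expand_tokens(r"\[%s/%t\]"), "some file [01/38]")
print(match.group("seq"), match.group("total"))  # prints: 01 38
```

Scanning the expression once from left to right means that "%%s" yields a literal "%s" rather than a sequence-number group.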

First, the subject line is matched against the regular expressions contained in the post rules group associated with the global group in question. The expressions are tried in descending order of their "priority" value. As soon as a match is found (which means that both %s and %t matched), the search for sequence/total stops, but the string is still matched against the expressions marked "always".

Then, all substrings matched by capturing subpatterns, either in the first matching expression or in one of the "always" expressions, are removed, and the resulting string is used to identify the message's post.

Some examples, now...
The canonical form of a post is, let's say, something like
<string identifying the post>[seq/total]<string identifying the file>
To extract the postID, we can use the following rules:

Always	Expression	Priority
no	.*(\[%s/%t\].*)	5
yes	(\.part[0-9]{2})	4
yes	(\.[rp](ar|[0-9]{2}))	3
yes	(\.(nfo|md5|htm|txt))	2

Let's see what happens in two cases. First, with a subject line that follows the canonical form, let's say
[Dream-Anime]GenoCyber episode 4(SBC) [01/38] [D-A]GenoCyber - Stage 4(SBC)(DFE4FC57).part01.P01
This string matches rule 1, and the capturing subpattern matches the substring
[01/38] [D-A]GenoCyber - Stage 4(SBC)(DFE4FC57).part01.P01
Then, none of the "always" rules match. Finally, the sequence number will be 1, the total number of files for the post 38, and the postID will be
[Dream-Anime]GenoCyber episode 4(SBC)
after removing the captured substrings. Which is fine... Every message having the same postID will be considered part of this post.
Now, an example to show the utility of the "always" rules. Let's say someone posts
#Rice-Box@irc.enterthegame.com presents - yEnc "R-B__Yaiba - 06__XVID.part01.p02 [2/36]"
This matches rule 1, but the resulting string is not the same for every message of the post, because of the ".part01.p02" part. Luckily, the substring ".part01" matches rule 2, and thus will be removed from the resulting string. The remaining ".p02" matches rule 3 (as would ".par", ".rar", ".r17", etc.) and will also be removed. Finally, the postID will be
#Rice-Box@irc.enterthegame.com presents - yEnc "R-B__Yaiba - 06__XVID"
which, once more, is fine :)
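The whole procedure can be replayed with a small Python sketch. It is a hedged illustration built on several assumptions, not the actual NewsSrv implementation: Rule, expand_tokens, remove_captures and identify are hypothetical names, Python's re module stands in for libPCRE, rules are applied one after the other to what is left of the subject, each rule is applied at most once, and the result is stripped of surrounding whitespace. The rule set is the example table as this sketch reads it: rule 1 written as .*(\[%s/%t\].*), rule 3 written as (\.[rp](ar|[0-9]{2})), and a plain | for alternation.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    always: bool
    expression: str
    priority: int


def expand_tokens(expression: str) -> str:
    """Expand %s, %t and %% (named groups are a choice of this sketch)."""
    table = {"s": "(?P<seq>[0-9]+)", "t": "(?P<total>[0-9]+)", "%": "%"}
    return re.sub(r"%(.)", lambda m: table.get(m.group(1), m.group(0)), expression)


def remove_captures(text: str, match) -> str:
    """Remove every substring captured by a group of the match."""
    spans = sorted(
        match.span(i)
        for i in range(1, match.re.groups + 1)
        if match.span(i) != (-1, -1)  # skip groups that did not take part
    )
    pieces, pos = [], 0
    for start, end in spans:
        if start > pos:
            pieces.append(text[pos:start])
        pos = max(pos, end)  # nested groups lie inside an already removed span
    pieces.append(text[pos:])
    return "".join(pieces)


def identify(subject: str, rules: List[Rule]) -> Tuple[str, Optional[int], Optional[int]]:
    """Return (postID, sequence number, total) for a subject line."""
    seq = total = None
    remaining = subject
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if seq is not None and not rule.always:
            continue  # sequence/total already found: only "always" rules still apply
        match = re.search(expand_tokens(rule.expression), remaining)
        if match is None:
            continue
        found = match.groupdict()
        has_both = found.get("seq") is not None and found.get("total") is not None
        if not rule.always and not has_both:
            continue  # a match only counts when both %s and %t matched
        if seq is None and has_both:
            seq, total = int(found["seq"]), int(found["total"])
        remaining = remove_captures(remaining, match)
    return remaining.strip(), seq, total


RULES = [
    Rule(False, r".*(\[%s/%t\].*)", 5),
    Rule(True, r"(\.part[0-9]{2})", 4),
    Rule(True, r"(\.[rp](ar|[0-9]{2}))", 3),
    Rule(True, r"(\.(nfo|md5|htm|txt))", 2),
]

subject = ("[Dream-Anime]GenoCyber episode 4(SBC) [01/38] "
           "[D-A]GenoCyber - Stage 4(SBC)(DFE4FC57).part01.P01")
print(identify(subject, RULES))
# prints: ('[Dream-Anime]GenoCyber episode 4(SBC)', 1, 38)
```

Applied to the second subject line, this sketch returns sequence number 2 and total 36. Note that under this reading the closing double quote of that subject is part of the substring captured by rule 1, so it is removed together with "[2/36]".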

That's all for the doc. You'll have to experiment a bit to find rules that suit the newsgroups you read, or just use the default set!

Jérôme Laheurte, 10 Nov 2002