

I was wondering about the separate BooleanQuery, too, as it is almost simply a copy (of an old version of it). The question is more: why do we need the BM25 classes at all? Why should it not be possible to use normal term queries and other query types together with BM25 by just changing some scoring defaults? So replace Similarity and maybe have a switch inside the Scorers. TermQuery could then be switched to BM25 mode and use another Scorer or something like that.

That was just my first impression; these additional classes do not look like a good public API to me. Query classes should be abstract wrappers for Weights and Scorers. The internal implementation, like BM25 or conventional scoring, should be hidden from the user (and maybe selected through properties, e.g. on the IndexSearcher, to use BM25 scoring). This way, it could also be used for other query types (not only TermQ/BQ), e.g. for function queries (to further change the score) or FuzzyQuery and so on.

If what I said is complete nonsense, don't hurt me. I do not know much about BM25, but for me it is an implementation detail and not part of a public API.

Joaquin Perez-Iglesias (migrated from JIRA)

I'm going to try to answer some of your questions. When I started to develop this library I didn't want to modify the Lucene code; moreover, I tried to create a jar that could be added straight to the official Lucene distribution. That is the main reason why there are some duplicated classes. So yes, a tighter integration would be better, and I believe we would get more support for different query types.

As for BM25 versus BM25F, they are equivalent; BM25F is the version for more than one field, so yes, go for BM25F. What is really important is the way boost factors are applied: as you can see in the equation, these must be applied to raw frequencies and not to normalized or saturated frequencies. (Currently Lucene applies them after normalization and saturation of frequencies, which in my opinion is not the best approach.) A more detailed explanation of BM25F and this issue can be found in this paper

In the BM25 family of equations, IDF is always computed at document level (that is why I recommend, as a heuristic, using the field with the most terms, or a special field that contains all the terms). As far as I know that is a problem, because Lucene doesn't store the document frequency per document but per field.

Otis is right: as far as I know, just changing Similarity is not enough. Some data is not available to either TermScorer or Similarity, and TermScorer applies the values obtained from Similarity in a way that makes it incompatible with BM25.

It is really important to follow the steps in the order they appear in my explanation:

1. Normalize frequencies with document/field length and the b factor.
2. Saturate the effect of frequency with k1.

I really believe that this can be done (not sure how), so maybe we will need the suggestions of some 'scorer guru'.

Basically, what we are trying to do is constrain the effect of the raw frequency (saturate the frequency). In Lucene this is done with the square root of the frequency; another classical approach is to use the logarithm. With both approaches we avoid giving a linear 'importance' to the frequency. BM25 is a bit tricky: it parametrises the 'saturation' of the frequency with a parameter k1, with the equation weight(t)/(weight(t)+k1).
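To make the point about saturating the raw frequency a bit more concrete, here is a minimal sketch that compares the square-root damping attributed to Lucene in this thread with the k1-based form weight(t)/(weight(t)+k1). The class and method names are made up for illustration and are not Lucene API. The logarithmic variant (1 + ln(tf)) is only my assumption of what the "other classical approach" looks like, and k1 = 1.2 is a commonly used value, not one taken from the comments.

```java
/** Illustrative only: three ways of damping a raw term frequency. Not Lucene API. */
final class SaturationSketch {

    /** Square-root damping, the approach this thread attributes to Lucene. */
    static double sqrtDamping(double tf) {
        return Math.sqrt(tf);
    }

    /** Logarithmic damping (assumed variant: 1 + ln(tf), and 0 when tf is 0). */
    static double logDamping(double tf) {
        return tf > 0.0 ? 1.0 + Math.log(tf) : 0.0;
    }

    /** BM25-style saturation: weight / (weight + k1). Assumes k1 > 0. */
    static double bm25Saturation(double weight, double k1) {
        return weight / (weight + k1);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // hypothetical choice, not taken from the thread
        for (double tf : new double[] {1, 2, 4, 16, 256}) {
            System.out.printf("tf=%.0f sqrt=%.3f log=%.3f bm25=%.3f%n",
                    tf, sqrtDamping(tf), logDamping(tf), bm25Saturation(tf, k1));
        }
    }
}
```

The square root and the logarithm keep growing with the frequency, only more slowly than linearly, while the k1 form stays below 1 no matter how large the frequency gets. A smaller k1 makes it level off sooner.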

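The ordering Joaquin insists on (boosts on raw frequencies, then length normalization with b, then saturation with k1) could be sketched roughly as below. This is not the code of the contributed library and not Lucene API: `Bm25fSketch`, `FieldStat` and all method names are hypothetical. The final step (multiplying by IDF and summing over query terms) follows the usual BM25F formulation and is my assumption, since the thread does not spell it out.

```java
import java.util.List;

/** Hypothetical BM25F-style sketch; none of these names exist in Lucene. */
final class Bm25fSketch {

    /** Statistics of one term in one field of one document (illustrative type). */
    static final class FieldStat {
        final double rawFreq, boost, length, avgLength, b;

        FieldStat(double rawFreq, double boost, double length, double avgLength, double b) {
            this.rawFreq = rawFreq;
            this.boost = boost;
            this.length = length;
            this.avgLength = avgLength;
            this.b = b;
        }
    }

    /** Boost the RAW frequency of each field, then normalize by field length and b. */
    static double normalizedWeight(List<FieldStat> fields) {
        double weight = 0.0;
        for (FieldStat f : fields) {
            double lengthNorm = (1.0 - f.b) + f.b * (f.length / f.avgLength);
            weight += (f.rawFreq * f.boost) / lengthNorm;
        }
        return weight;
    }

    /** Saturate the combined weight with k1, only after normalization. Assumes k1 > 0. */
    static double saturate(double weight, double k1) {
        return weight / (weight + k1);
    }

    /** Assumed final step: sum the saturated weights of the query terms, scaled by IDF. */
    static double score(List<List<FieldStat>> queryTerms, double[] idf, double k1) {
        double total = 0.0;
        for (int i = 0; i < queryTerms.size(); i++) {
            total += idf[i] * saturate(normalizedWeight(queryTerms.get(i)), k1);
        }
        return total;
    }

    public static void main(String[] args) {
        // One query term found in two fields; all numbers are invented for the example.
        List<FieldStat> term = List.of(
                new FieldStat(2.0, 3.0, 10.0, 8.0, 0.75),     // e.g. a boosted title field
                new FieldStat(5.0, 1.0, 400.0, 300.0, 0.75)); // e.g. a body field
        System.out.println(score(List.of(term), new double[] {1.5}, 1.2));
    }
}
```

Because the boost multiplies the raw frequency before saturation, a boosted field cannot push a term's contribution past the bound set by k1, which seems to be the point of applying boosts first.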