Monospace Font Selector

This website was created for the needs of ClickHouse developers.
In fact, it was created because I was curious about monospace fonts.

It lets you quickly check how fonts look without installing them, directly in the web browser.
Fonts may take significant time to load.

Source code: https://github.com/alexey-milovidov/font-selector
If you find the information on this website inaccurate,
or if you want to extend this website, please open a pull request.

Font Rendering

How fonts will look depends on multiple factors:

1. Monitor.

On low-DPI monitors you can see how individual pixels are laid out and how well text is aligned to the pixel grid.
To improve the quality of font rendering, anti-aliasing and hinting may be required.
(They are not needed on CRT monitors, which you are unlikely to use.)

Anti-aliasing can be "grayscale" or "subpixel".
Subpixel anti-aliasing produces sharper text with a slightly noticeable distortion of colors.
Which type of anti-aliasing you prefer is a matter of habit.

Hinting is used to align the lines of glyphs precisely to the pixel grid.
Text appears much sharper when hinting is used, but kerning and letter shapes may be slightly worse.
Hinting is not needed at large font sizes.

It is often said that subpixel anti-aliasing and hinting are not required on high-DPI ("retina") displays.
But this depends on the actual DPI.

For example, a 13" laptop with 1920x1080 resolution has a medium DPI, and you may find a perfectly hinted font better.
A 30" monitor with 2560x1600 resolution has a lower DPI, and if you prefer crisp-looking fonts, you will definitely need hinting.
A 32" 4K monitor with 3840x2160 resolution has a DPI only moderately higher than the previous example, and font rendering may still benefit from hinting.
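
The DPI behind these examples can be checked with a small calculation. Below is a minimal sketch in C++, assuming square pixels and that the quoted size is the screen diagonal in inches. The function name `pixelsPerInch` is made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

/// Pixel density (pixels per inch) of a display, assuming square pixels.
/// width_px, height_px - resolution; diagonal_inches - nominal diagonal size.
/// (Hypothetical helper for illustration; it is not part of this website.)
inline double pixelsPerInch(double width_px, double height_px, double diagonal_inches)
{
    assert(diagonal_inches > 0);
    /// Length of the diagonal in pixels divided by its length in inches.
    return std::hypot(width_px, height_px) / diagonal_inches;
}
```

For the three displays above, `pixelsPerInch(1920, 1080, 13)` gives about 169, `pixelsPerInch(2560, 1600, 30)` about 101, and `pixelsPerInch(3840, 2160, 32)` about 138.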

There are also bitmap (old-school) fonts that do not require any anti-aliasing or hinting
(they are already perfectly laid out in pixels). But they work only at certain sizes.
On retina monitors their use is mostly impractical.
They will not display correctly on a MacBook because it uses a non-integer scaling factor for pixels.
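
Why a non-integer scaling factor hurts bitmap fonts can be seen with a little arithmetic. The sketch below is an illustration only: it assumes that column edges are simply rounded to the device pixel grid, which is not necessarily what any particular OS does, and the name `scaledPixelWidths` is hypothetical. It computes how many device pixels each source pixel column occupies after scaling.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

/// For each of `count` source pixel columns, returns its width in device pixels
/// after scaling by `scale`, when column edges are rounded to the device grid.
/// (Hypothetical helper; real scalers may filter or blur instead of rounding.)
inline std::vector<int> scaledPixelWidths(int count, double scale)
{
    assert(count >= 0 && scale > 0);
    std::vector<int> widths;
    widths.reserve(count);
    for (int i = 0; i < count; ++i)
    {
        /// Left and right edges of the source column on the device grid.
        const long left = std::lround(i * scale);
        const long right = std::lround((i + 1) * scale);
        widths.push_back(static_cast<int>(right - left));
    }
    return widths;
}
```

With a scale of 2 every column becomes exactly 2 device pixels wide, but with a scale of 1.5 the widths alternate between 2 and 1, so strokes that were equally thick in the bitmap come out uneven.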

2. Operating system.

Different operating systems implement different mechanisms for font rendering.

Mac OS X uses neither subpixel anti-aliasing nor hinting.
Text appears slightly blurred on non-"retina" displays and just fine on "retina" displays.

Windows uses the "ClearType" rendering method, which supports both subpixel anti-aliasing and hinting.

Linux uses the FreeType library, which supports subpixel anti-aliasing and hinting,
but this support is broken in various Linux distributions in numerous different ways.
The default settings in Linux may give unsatisfactory results.

3. The settings of operating system.

For example, on a fresh installation of Kubuntu, only grayscale anti-aliasing is enabled (similar to Mac OS X);
subpixel anti-aliasing and hinting are not enabled.

You can enable subpixel anti-aliasing in the system settings; it takes effect after a restart.
But hinting is totally broken. To enable hinting, add this line to /etc/environment:

FREETYPE_PROPERTIES="truetype:interpreter-version=35 cff:no-stem-darkening=1 autofitter:warping=1"

and restart the system. On some versions, recompiling FreeType from source may be necessary.
More information: https://wiki.archlinux.org/index.php/Font_configuration
https://mrandri19.github.io/2019/07/24/modern-text-rendering-linux-overview.html

Windows allows installation of alternative font rendering libraries instead of ClearType.

4. Application and how the application was packaged.

Some applications don't support subpixel antialiasing and/or hinting even if it is enabled in the OS.
For example, Konsole terminal works perfectly but Kitty does not:
https://github.com/kovidgoyal/kitty/issues/214

Font rendering may differ in your IDE, in terminal and in web browser.
For example, Firefox supports hinting, but Chromium does not if it is installed from a "snap" package.
Firefox supports bitmap fonts perfectly but Chromium does not.

The rendered font size may depend on the application.
For example, Liberation Mono 9pt is rendered in the web browser one pt smaller than the same font in KDevelop.
Sometimes the size is different in Firefox and Chromium even with the same settings.

KDevelop supports subpixel AA and hinting if it was installed from apt,
but not if you run it from an AppImage.

Telegram Desktop supports subpixel AA and hinting if it was installed from apt,
but not if you install it from the official website.

Java applications (JetBrains products like CLion, IDEA, DataGrip) do not support hinting.
You can fix this by installing a different JVM and tuning some settings:
https://superuser.com/questions/614960/how-to-fix-font-anti-aliasing-in-intellij-idea-when-using-high-dpi

Terminal applications may require box drawing characters in the font to render their UI.
But not always. For example, Konsole will render box drawing characters by itself (by default)
and it will work perfectly even if your font does not have box drawing characters.
But if you copy-paste terminal output to some website, it may break.

5. How the font was packaged.

Some websites (like Google Fonts) may provide web-optimized, incomplete versions of fonts.
For example, if you use Fira Code font from Google Fonts, it will not render console UI correctly
despite the fact that full version of this font has all required characters.

Needless to say, hinting information can be lost in repackaged versions of fonts.

Bitmap fonts cannot be embedded/downloaded for a web page with CSS, but they can be used if present in the OS.
This works only in Firefox, not in Chromium.
If you want to use bitmap fonts on a web page without installation, they have to be converted to vector versions.
This can be done in numerous ways, but most of them give terrible results.

So, chances are low that you will see the Terminus font rendered perfectly in your web browser.
It will not work on Mac OS X. It will not work in Chromium.

We deliberately removed all bitmap fonts from automatic download via CSS because they cannot display correctly.
To look at bitmap fonts, run:

sudo apt install xfonts-terminus
sudo rm /etc/fonts/conf.d/70-no-bitmaps.conf
sudo ln -s /etc/fonts/conf.avail/70-yes-bitmaps.conf /etc/fonts/conf.d/70-yes-bitmaps.conf
sudo rm /etc/fonts/conf.d/10-scale-bitmap-fonts.conf

How to use this website

You can see the font selector at the top. Click on the font you want to look at.
There are different categories of fonts.

Most of the fonts have a license that allows them to be freely used and redistributed on the web.
They are instantly available for selection.

Some fonts are free to use but cannot be redistributed. Examples: Input Mono, Envy Code R.
They can be selected only if you have installed them in your OS.
It is easy. Step 1: go to the official website of the font. Step 2: download the font.
Step 3: if you are using Linux, open "Font Management", then open the font from filesystem, install it.
Step 4: if you are using Firefox, reload the page; if you are using Chromium, restart the web browser.

Some fonts come with your operating system and cannot be redistributed.
Examples for Mac OS X: Monaco, Menlo, SF-Mono.
Examples for Windows: Lucida Console, Consolas.
These fonts can be selected only if available in your OS.

Some fonts are non-free and must be purchased separately.
Examples: Cartograph (trial download available), Pragmata Pro, MonoLisa.
These fonts are not included on the website.

Selectors for fonts that cannot be downloaded automatically and are not available in your OS will be disabled (gray).

Fonts differ in completeness and maturity.
We check that a font supports Cyrillic letters (а-я) and box drawing characters.
If it does not, its selector is shown at 50% opacity. The check is not precise.
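
As an illustration of what such a completeness check looks for, here is a minimal sketch in C++. It assumes the set of code points a font supports is already known (how to obtain it is outside this sketch), and it is not the check this website actually performs; `coversRange` and `looksComplete` are hypothetical names. The ranges are the Cyrillic lowercase letters а-я (U+0430 to U+044F) and the Unicode Box Drawing block (U+2500 to U+257F).

```cpp
#include <cassert>
#include <set>

/// Returns true if every code point in [first, last] is present in `supported`.
inline bool coversRange(const std::set<char32_t> & supported, char32_t first, char32_t last)
{
    assert(first <= last);
    for (char32_t cp = first; cp <= last; ++cp)
        if (supported.count(cp) == 0)
            return false;
    return true;
}

/// A font "looks complete" here if it has all of а-я and the whole Box Drawing block.
/// (An assumption made for this sketch; the real criteria may be looser.)
inline bool looksComplete(const std::set<char32_t> & supported)
{
    return coversRange(supported, 0x0430, 0x044F)
        && coversRange(supported, 0x2500, 0x257F);
}
```

A font that has the Cyrillic letters but lacks any box drawing character fails this check, which matches the case of web-optimized font subsets described earlier.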

The information about font availability is updated every second, because some fonts may take time to load.

Some fonts have multiple variants. For example, Fira and JetBrains may have ligatures enabled or disabled.
As another example, Iosevka lets you choose among stylistic variations for many letters to accommodate everyone's taste.
Unfortunately, we list only one variation for most of the fonts.

Another example is the cursive italic variant with a handwritten style in Victor Mono and Cartograph.
It will look cool and hip when you share screenshots on social networks!
Unfortunately, the default style of all the programs that I use for code or as a terminal does not use italics,
and this website does not provide relevant examples.

Below you will find font size selector and examples selector.

The sizes are selected in points (pt). This is done to better match the size selection in other applications.
But for different fonts, the same pt value may correspond to a different visual size.
Note that bitmap fonts are not available in every size.
If the selected size does not match an available one, they will either look terrible or not render at all.
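
For reference, web browsers convert points to CSS pixels at a fixed ratio, because CSS defines 1in = 96px = 72pt. A minimal sketch of that conversion follows; the function name `pointsToCssPixels` is made up for this example. Desktop applications may use the system DPI instead, which is one possible reason why the same pt size looks different across applications.

```cpp
#include <cassert>

/// Converts a font size in points to CSS pixels.
/// CSS fixes 1 inch = 96 px and 1 inch = 72 pt, so 1 pt = 96 / 72 px.
/// (Hypothetical helper for illustration only.)
inline double pointsToCssPixels(double pt)
{
    assert(pt >= 0);
    return pt * 96.0 / 72.0;
}
```

For example, 9pt corresponds to 12 CSS pixels and 12pt to 16 CSS pixels.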

To the right of these selectors you will find the "add to favorites" (❤) and "trash" (🗑) buttons.
"Add to favorites" moves the current font to the front of the list, so you can compare your favorite fonts more quickly.
The trash button deletes the current font from the list, so you can quickly choose from the remaining fonts.
If you change your mind and want to restore a deleted font, reload the page by pressing F5 or Ctrl+R.

The KDevelop example is partially based on the output of Woboq Code Browser.
The source file example is from https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/Volnitsky.h

The Konsole example is based on the output of clickhouse-client on the test dataset available here:
https://clickhouse.tech/docs/en/getting-started/example-datasets/metrica/

What fonts are the best?

There are so many differences in font rendering and perception that answering this question is impossible.
I refuse to even try to answer it.

After creating this website I found that I'm not ready to change the font that I used before.
It means that this website did not meet its goal.

But I hope that this website will let you find a font that at least does not look like trash.

Similar Projects

1. https://www.programmingfonts.org/

Upsides:
- Provides a comprehensive list of fonts with info about each of them.
- Lets you test-drive fonts interactively in the web browser.

Downsides:
- None of the bitmap fonts on this website display correctly, so they give a misleading look.
- Some fonts on this website were not packaged correctly and have no hinting info. This also gives a misleading look.
- It is very difficult to select a "light" theme instead of "dark" for the examples.

2. http://www.s9w.io/font_compare/
This website shows how fonts render in Windows, using screenshots.

3. https://github.com/chrissimpkins/codeface
Provides a comprehensive list of fonts in the form of TTF files and screenshots.
The list of fonts on my website is partially derived from this project.

#pragma once

#include <algorithm>
#include <limits>
#include <vector>
#include <stdint.h>
#include <string.h>
#include <Core/Types.h>
#include <Poco/Unicode.h>
#include <Common/StringSearcher.h>
#include <Common/StringUtils/StringUtils.h>
#include <Common/UTF8Helpers.h>
#include <common/StringRef.h>
#include <common/unaligned.h>

/** Search for a substring in a string by Volnitsky's algorithm
  * http://volnitsky.com/project/str_search/
  *
  * `haystack` and `needle` can contain zero bytes.
  *
  * Algorithm:
  * - if the `needle` is too small or too large, or the `haystack` is too small, use std::search or memchr;
  * - when initializing, fill in an open-addressing linear probing hash table of the form
  *    hash from the bigram of needle -> the position of this bigram in needle + 1.
  *    (one is added only to distinguish zero offset from an empty cell)
  * - the keys are not stored in the hash table, only the values are stored;
  * - bigrams can be inserted several times if they occur in the needle several times;
  * - when searching, take the bigram from haystack that should correspond to the last bigram of needle (comparing from the end);
  * - look for it in the hash table; if found, get the offset from the hash table and compare the string bytewise;
  * - if it did not match, check the next cell of the hash table from the collision resolution chain;
  * - if not found, skip forward in haystack by almost the size of the needle in bytes;
  *
  * MultiVolnitsky - search for multiple substrings in a string:
  * - Add bigrams to hash table with string index. Then the usual Volnitsky search is used.
  * - We are adding while searching, limiting the number of fallback searchers and the total number of added bigrams
  */


namespace DB
{
namespace VolnitskyTraits
{
    using Offset = UInt8; /// Offset in the needle. For the basic algorithm, the length of the needle must not be greater than 255.
    using Id = UInt8; /// Index of the string (within the array of multiple needles), must not be greater than 255.
    using Ngram = UInt16; /// n-gram (2 bytes).

    /** Fits into the L2 cache (of common Intel CPUs).
     * This number is extremely good for compilers as it is numeric_limits<UInt16>::max() + 1 and there are optimizations with movzwl and other instructions with 2 bytes
     */
    static constexpr size_t hash_size = 64 * 1024;

    /// min haystack size to use main algorithm instead of fallback
    static constexpr size_t min_haystack_size_for_algorithm = 20000;

    static inline bool isFallbackNeedle(const size_t needle_size, size_t haystack_size_hint = 0)
    {
        return needle_size < 2 * sizeof(Ngram) || needle_size >= std::numeric_limits<Offset>::max()
            || (haystack_size_hint && haystack_size_hint < min_haystack_size_for_algorithm);
    }

    static inline Ngram toNGram(const UInt8 * const pos) { return unalignedLoad<Ngram>(pos); }

    template <typename Callback>
    static inline void putNGramASCIICaseInsensitive(const UInt8 * const pos, const int offset, const Callback & putNGramBase)
    {
        struct Chars
        {
            UInt8 c0;
            UInt8 c1;
        };

        union
        {
            Ngram n;
            Chars chars;
        };

        n = toNGram(pos);

        const auto c0_al = isAlphaASCII(chars.c0);
        const auto c1_al = isAlphaASCII(chars.c1);

        if (c0_al && c1_al)
        {
            /// 4 combinations: AB, aB, Ab, ab
            putNGramBase(n, offset);
            chars.c0 = alternateCaseIfAlphaASCII(chars.c0);
            putNGramBase(n, offset);
            chars.c1 = alternateCaseIfAlphaASCII(chars.c1);
            putNGramBase(n, offset);
            chars.c0 = alternateCaseIfAlphaASCII(chars.c0);
            putNGramBase(n, offset);
        }
        else if (c0_al)
        {
            /// 2 combinations: A1, a1
            putNGramBase(n, offset);
            chars.c0 = alternateCaseIfAlphaASCII(chars.c0);
            putNGramBase(n, offset);
        }
        else if (c1_al)
        {
            /// 2 combinations: 0B, 0b
            putNGramBase(n, offset);
            chars.c1 = alternateCaseIfAlphaASCII(chars.c1);
            putNGramBase(n, offset);
        }
        else
            /// 1 combination: 01
            putNGramBase(n, offset);
    }

    template <bool CaseSensitive, bool ASCII, typename Callback>
    static inline void putNGram(const UInt8 * const pos, const int offset, [[maybe_unused]] const UInt8 * const begin, const Callback & putNGramBase)
    {
        if constexpr (CaseSensitive)
        {
            putNGramBase(toNGram(pos), offset);
        }
        else
        {
            if constexpr (ASCII)
            {
                putNGramASCIICaseInsensitive(pos, offset, putNGramBase);
            }
            else
            {
                struct Chars
                {
                    UInt8 c0;
                    UInt8 c1;
                };

                union
                {
                    VolnitskyTraits::Ngram n;
                    Chars chars;
                };

                n = toNGram(pos);

                if (isascii(chars.c0) && isascii(chars.c1))
                    putNGramASCIICaseInsensitive(pos, offset, putNGramBase);
                else
                {
                    /** n-gram (in the case of n = 2)
                      *  can be entirely located within one code point,
                      *  or intersect with two code points.
                      *
                      * In the first case, you need to consider up to two alternatives - this code point in upper and lower case,
                      *  and in the second case - up to four alternatives - fragments of two code points in all combinations of cases.
                      *
                      * It does not take into account the dependence of the case-transformation from the locale (for example - Turkish `Ii`)
                      *  as well as composition / decomposition and other features.
                      *
                      * It also does not work if characters with lower and upper cases are represented by different number of bytes or code points.
                      */

                    using Seq = UInt8[6];

                    if (UTF8::isContinuationOctet(chars.c1))
                    {
                        /// ngram is inside a sequence
                        auto seq_pos = pos;
                        UTF8::syncBackward(seq_pos, begin);

                        const auto u32 = UTF8::convert(seq_pos);
                        const auto l_u32 = Poco::Unicode::toLower(u32);
                        const auto u_u32 = Poco::Unicode::toUpper(u32);

                        /// symbol is case-independent
                        if (l_u32 == u_u32)
                            putNGramBase(n, offset);
                        else
                        {
                            /// where is the given ngram in respect to the start of UTF-8 sequence?
                            const auto seq_ngram_offset = pos - seq_pos;

                            Seq seq;

                            /// put ngram for lowercase
                            UTF8::convert(l_u32, seq, sizeof(seq));
                            chars.c0 = seq[seq_ngram_offset];
                            chars.c1 = seq[seq_ngram_offset + 1];
                            putNGramBase(n, offset);

                            /// put ngram for uppercase
                            UTF8::convert(u_u32, seq, sizeof(seq));
                            chars.c0 = seq[seq_ngram_offset]; //-V519
                            chars.c1 = seq[seq_ngram_offset + 1]; //-V519
                            putNGramBase(n, offset);
                        }
                    }
                    else
                    {
                        /// ngram is on the boundary of two sequences
                        /// first sequence may start before pos if it is not ASCII
                        auto first_seq_pos = pos;
                        UTF8::syncBackward(first_seq_pos, begin);
                        /// where is the given ngram in respect to the start of first UTF-8 sequence?
                        const auto seq_ngram_offset = pos - first_seq_pos;

                        const auto first_u32 = UTF8::convert(first_seq_pos);
                        const auto first_l_u32 = Poco::Unicode::toLower(first_u32);
                        const auto first_u_u32 = Poco::Unicode::toUpper(first_u32);

                        /// second sequence always starts immediately after pos
                        auto second_seq_pos = pos + 1;

                        const auto second_u32 = UTF8::convert(second_seq_pos); /// TODO This assumes valid UTF-8 or zero byte after needle.
                        const auto second_l_u32 = Poco::Unicode::toLower(second_u32);
                        const auto second_u_u32 = Poco::Unicode::toUpper(second_u32);

                        /// both symbols are case-independent
                        if (first_l_u32 == first_u_u32 && second_l_u32 == second_u_u32)
                        {
                            putNGramBase(n, offset);
                        }
                        else if (first_l_u32 == first_u_u32)
                        {
                            /// first symbol is case-independent
                            Seq seq;

                            /// put ngram for lowercase
                            UTF8::convert(second_l_u32, seq, sizeof(seq));
                            chars.c1 = seq[0];
                            putNGramBase(n, offset);

                            /// put ngram from uppercase, if it is different
                            UTF8::convert(second_u_u32, seq, sizeof(seq));
                            if (chars.c1 != seq[0])
                            {
                                chars.c1 = seq[0];
                                putNGramBase(n, offset);
                            }
                        }
                        else if (second_l_u32 == second_u_u32)
                        {
                            /// second symbol is case-independent
                            Seq seq;

                            /// put ngram for lowercase
                            UTF8::convert(first_l_u32, seq, sizeof(seq));
                            chars.c0 = seq[seq_ngram_offset];
                            putNGramBase(n, offset);

                            /// put ngram for uppercase, if it is different
                            UTF8::convert(first_u_u32, seq, sizeof(seq));
                            if (chars.c0 != seq[seq_ngram_offset])
                            {
                                chars.c0 = seq[seq_ngram_offset];
                                putNGramBase(n, offset);
                            }
                        }
                        else
                        {
                            Seq first_l_seq;
                            Seq first_u_seq;
                            Seq second_l_seq;
                            Seq second_u_seq;

                            UTF8::convert(first_l_u32, first_l_seq, sizeof(first_l_seq));
                            UTF8::convert(first_u_u32, first_u_seq, sizeof(first_u_seq));
                            UTF8::convert(second_l_u32, second_l_seq, sizeof(second_l_seq));
                            UTF8::convert(second_u_u32, second_u_seq, sizeof(second_u_seq));

                            auto c0l = first_l_seq[seq_ngram_offset];
                            auto c0u = first_u_seq[seq_ngram_offset];
                            auto c1l = second_l_seq[0];
                            auto c1u = second_u_seq[0];

                            /// ngram for ll
                            chars.c0 = c0l;
                            chars.c1 = c1l;
                            putNGramBase(n, offset);

                            if (c0l != c0u)
                            {
                                /// ngram for Ul
                                chars.c0 = c0u;
                                chars.c1 = c1l;
                                putNGramBase(n, offset);
                            }

                            if (c1l != c1u)
                            {
                                /// ngram for lU
                                chars.c0 = c0l;
                                chars.c1 = c1u;
                                putNGramBase(n, offset);
                            }

                            if (c0l != c0u && c1l != c1u)
                            {
                                /// ngram for UU
                                chars.c0 = c0u;
                                chars.c1 = c1u;
                                putNGramBase(n, offset);
                            }
                        }
                    }
                }
            }
        }
    }
}


/// @todo store lowercase needle to speed up in case there are numerous occurrences of bigrams from needle in haystack
template <bool CaseSensitive, bool ASCII, typename FallbackSearcher>
class VolnitskyBase
{
protected:
    const UInt8 * const needle;
    const size_t needle_size;
    const UInt8 * const needle_end = needle + needle_size;
    /// For how long we move, if the n-gram from haystack is not found in the hash table.
    const size_t step = needle_size - sizeof(VolnitskyTraits::Ngram) + 1;

    /** max needle length is 255, max distinct ngrams for case-sensitive is (255 - 1), case-insensitive is 4 * (255 - 1)
      *  storage of 64K ngrams (n = 2, 128 KB) should be large enough for both cases */
    VolnitskyTraits::Offset hash[VolnitskyTraits::hash_size]; /// Hash table.

    const bool fallback; /// Do we need to use the fallback algorithm.

    FallbackSearcher fallback_searcher;

public:
    using Searcher = FallbackSearcher;

    /** haystack_size_hint - the expected total size of the haystack for `search` calls. Optional (zero means unspecified).
      * If you specify it small enough, the fallback algorithm will be used,
      *  since it is considered that it's useless to waste time initializing the hash table.
      */
    VolnitskyBase(const char * const needle_, const size_t needle_size_, size_t haystack_size_hint = 0)
        : needle{reinterpret_cast<const UInt8 *>(needle_)}
        , needle_size{needle_size_}
        , fallback{VolnitskyTraits::isFallbackNeedle(needle_size, haystack_size_hint)}
        , fallback_searcher{needle_, needle_size}
    {
        if (fallback)
            return;

        memset(hash, 0, sizeof(hash));

        auto callback = [this](const VolnitskyTraits::Ngram ngram, const int offset) { return this->putNGramBase(ngram, offset); };
        /// ssize_t is used here because unsigned can't be used with condition like `i >= 0`, unsigned always >= 0
        /// And also adding from the end guarantees that we will find the first occurrence because we will look up bigger offsets first.
        for (auto i = static_cast<ssize_t>(needle_size - sizeof(VolnitskyTraits::Ngram)); i >= 0; --i)
            VolnitskyTraits::putNGram<CaseSensitive, ASCII>(this->needle + i, i + 1, this->needle, callback);
    }


    /// If not found, the end of the haystack is returned.
    const UInt8 * search(const UInt8 * const haystack, const size_t haystack_size) const
    {
        if (needle_size == 0)
            return haystack;

        const auto haystack_end = haystack + haystack_size;

        if (fallback || haystack_size <= needle_size)
            return fallback_searcher.search(haystack, haystack_end);

        /// Let's "apply" the needle to the haystack and compare the n-gram from the end of the needle.
        const auto * pos = haystack + needle_size - sizeof(VolnitskyTraits::Ngram);
        for (; pos <= haystack_end - needle_size; pos += step)
        {
            /// We look at all the cells of the hash table that can correspond to the n-gram from haystack.
            for (size_t cell_num = VolnitskyTraits::toNGram(pos) % VolnitskyTraits::hash_size; hash[cell_num];
                 cell_num = (cell_num + 1) % VolnitskyTraits::hash_size)
            {
                /// When found - compare bytewise, using the offset from the hash table.
                const auto res = pos - (hash[cell_num] - 1);

                /// pointer in the code is always padded array so we can use pagesafe semantics
                if (fallback_searcher.compare(haystack, haystack_end, res))
                    return res;
            }
        }

        return fallback_searcher.search(pos - step + 1, haystack_end);
    }

    const char * search(const char * haystack, size_t haystack_size) const
    {
        return reinterpret_cast<const char *>(search(reinterpret_cast<const UInt8 *>(haystack), haystack_size));
    }

protected:
    void putNGramBase(const VolnitskyTraits::Ngram ngram, const int offset)
    {
        /// Put the offset for the n-gram in the corresponding cell or the nearest free cell.
        size_t cell_num = ngram % VolnitskyTraits::hash_size;

        while (hash[cell_num])
            cell_num = (cell_num + 1) % VolnitskyTraits::hash_size; /// Search for the next free cell.

        hash[cell_num] = offset;
    }
};


template <bool CaseSensitive, bool ASCII, typename FallbackSearcher>
class MultiVolnitskyBase
{
private:
    /// needles and their offsets
    const std::vector<StringRef> & needles;


    /// fallback searchers
    std::vector<size_t> fallback_needles;
    std::vector<FallbackSearcher> fallback_searchers;

    /// because std::pair<> is not POD
    struct OffsetId
    {
        VolnitskyTraits::Id id;
        VolnitskyTraits::Offset off;
    };

    OffsetId hash[VolnitskyTraits::hash_size];

    /// step for each bunch of strings
    size_t step;

    /// last index of offsets that was not processed
    size_t last;

    /// limit for adding to hashtable. In the worst case with case-insensitive search, the table will be at most half full
    static constexpr size_t small_limit = VolnitskyTraits::hash_size / 8;

public:
    MultiVolnitskyBase(const std::vector<StringRef> & needles_) : needles{needles_}, step{0}, last{0}
    {
        fallback_searchers.reserve(needles.size());
    }

    /**
     * This function is needed to initialize hash table
     * Returns `false` if there is nothing left to initialize
     * and `true` if there are needles left to process, in which case it initializes the hash table with them.
     * This function is a kind of fallback if there are many needles.
     * We actually destroy the hash table and initialize it with uninitialized needles
     * and search through the haystack again.
     * The actual usage of this function is like this:
     * while (hasMoreToSearch())
     * {
     *     search inside the haystack with the known needles
     * }
     */
    bool hasMoreToSearch()
    {
        if (last == needles.size())
            return false;

        memset(hash, 0, sizeof(hash));
        fallback_needles.clear();
        step = std::numeric_limits<size_t>::max();

        size_t buf = 0;
        size_t size = needles.size();

        for (; last < size; ++last)
        {
            const char * cur_needle_data = needles[last].data;
            const size_t cur_needle_size = needles[last].size;

            /// save the indices of fallback searchers
            if (VolnitskyTraits::isFallbackNeedle(cur_needle_size))
            {
                fallback_needles.push_back(last);
            }
            else
            {
                /// put all bigrams
                auto callback = [this](const VolnitskyTraits::Ngram ngram, const int offset)
                {
                    return this->putNGramBase(ngram, offset, this->last);
                };

                buf += cur_needle_size - sizeof(VolnitskyTraits::Ngram) + 1;

                /// this is the condition when we actually need to stop and start searching with known needles
                if (buf > small_limit)
                    break;

                step = std::min(step, cur_needle_size - sizeof(VolnitskyTraits::Ngram) + 1);
                for (auto i = static_cast<int>(cur_needle_size - sizeof(VolnitskyTraits::Ngram)); i >= 0; --i)
                {
                    VolnitskyTraits::putNGram<CaseSensitive, ASCII>(
                        reinterpret_cast<const UInt8 *>(cur_needle_data) + i,
                        i + 1,
                        reinterpret_cast<const UInt8 *>(cur_needle_data),
                        callback);
                }
            }
            fallback_searchers.emplace_back(cur_needle_data, cur_needle_size);
        }
        return true;
    }

    inline bool searchOne(const UInt8 * haystack, const UInt8 * haystack_end) const
    {
        const size_t fallback_size = fallback_needles.size();
        for (size_t i = 0; i < fallback_size; ++i)
            if (fallback_searchers[fallback_needles[i]].search(haystack, haystack_end) != haystack_end)
                return true;

        /// check whether at least one needle was added to the Volnitsky hash table
        if (step != std::numeric_limits<size_t>::max())
        {
            const auto * pos = haystack + step - sizeof(VolnitskyTraits::Ngram);
            for (; pos <= haystack_end - sizeof(VolnitskyTraits::Ngram); pos += step)
            {
                for (size_t cell_num = VolnitskyTraits::toNGram(pos) % VolnitskyTraits::hash_size; hash[cell_num].off;
                     cell_num = (cell_num + 1) % VolnitskyTraits::hash_size)
                {
                    if (pos >= haystack + hash[cell_num].off - 1)
                    {
                        const auto res = pos - (hash[cell_num].off - 1);
                        const size_t ind = hash[cell_num].id;
                        if (res + needles[ind].size <= haystack_end && fallback_searchers[ind].compare(haystack, haystack_end, res))
                            return true;
                    }
                }
            }
        }
        return false;
    }

    inline size_t searchOneFirstIndex(const UInt8 * haystack, const UInt8 * haystack_end) const
    {
        const size_t fallback_size = fallback_needles.size();

        size_t ans = std::numeric_limits<size_t>::max();

        for (size_t i = 0; i < fallback_size; ++i)
            if (fallback_searchers[fallback_needles[i]].search(haystack, haystack_end) != haystack_end)
                ans = std::min(ans, fallback_needles[i]);

        /// check whether at least one needle was added to the Volnitsky hash table
        if (step != std::numeric_limits<size_t>::max())
        {
            const auto * pos = haystack + step - sizeof(VolnitskyTraits::Ngram);
            for (; pos <= haystack_end - sizeof(VolnitskyTraits::Ngram); pos += step)
            {
                for (size_t cell_num = VolnitskyTraits::toNGram(pos) % VolnitskyTraits::hash_size; hash[cell_num].off;
                     cell_num = (cell_num + 1) % VolnitskyTraits::hash_size)
                {
                    if (pos >= haystack + hash[cell_num].off - 1)
                    {
                        const auto res = pos - (hash[cell_num].off - 1);
                        const size_t ind = hash[cell_num].id;
                        if (res + needles[ind].size <= haystack_end && fallback_searchers[ind].compare(haystack, haystack_end, res))
                            ans = std::min(ans, ind);
                    }
                }
            }
        }

        /*
        * If nothing was found, ans is still the maximum size_t value, so ans + 1 wraps around to zero.
        * Otherwise ans + 1 is the one-based index of the first matching needle.
        */
        return ans + 1;
    }

    template <typename CountCharsCallback>
    inline UInt64 searchOneFirstPosition(const UInt8 * haystack, const UInt8 * haystack_end, const CountCharsCallback & count_chars) const
    {
        const size_t fallback_size = fallback_needles.size();

        UInt64 ans = std::numeric_limits<UInt64>::max();

        for (size_t i = 0; i < fallback_size; ++i)
            if (auto pos = fallback_searchers[fallback_needles[i]].search(haystack, haystack_end); pos != haystack_end)
                ans = std::min<UInt64>(ans, pos - haystack);

        /// check whether at least one needle was added to the Volnitsky hash table
        if (step != std::numeric_limits<size_t>::max())
        {
            const auto * pos = haystack + step - sizeof(VolnitskyTraits::Ngram);
            for (; pos <= haystack_end - sizeof(VolnitskyTraits::Ngram); pos += step)
            {
                for (size_t cell_num = VolnitskyTraits::toNGram(pos) % VolnitskyTraits::hash_size; hash[cell_num].off;
                     cell_num = (cell_num + 1) % VolnitskyTraits::hash_size)
                {
                    if (pos >= haystack + hash[cell_num].off - 1)
                    {
                        const auto res = pos - (hash[cell_num].off - 1);
                        const size_t ind = hash[cell_num].id;
                        if (res + needles[ind].size <= haystack_end && fallback_searchers[ind].compare(haystack, haystack_end, res))
                            ans = std::min<UInt64>(ans, res - haystack);
                    }
                }
            }
        }
        if (ans == std::numeric_limits<UInt64>::max())
            return 0;
        return count_chars(haystack, haystack + ans);
    }

    template <typename CountCharsCallback, typename AnsType>
    inline void searchOneAll(const UInt8 * haystack, const UInt8 * haystack_end, AnsType * ans, const CountCharsCallback & count_chars) const
    {
        const size_t fallback_size = fallback_needles.size();
        for (size_t i = 0; i < fallback_size; ++i)
        {
            const UInt8 * ptr = fallback_searchers[fallback_needles[i]].search(haystack, haystack_end);
            if (ptr != haystack_end)
                ans[fallback_needles[i]] = count_chars(haystack, ptr);
        }

        /// check whether at least one needle was added to the Volnitsky hash table
        if (step != std::numeric_limits<size_t>::max())
        {
            const auto * pos = haystack + step - sizeof(VolnitskyTraits::Ngram);
            for (; pos <= haystack_end - sizeof(VolnitskyTraits::Ngram); pos += step)
            {
                for (size_t cell_num = VolnitskyTraits::toNGram(pos) % VolnitskyTraits::hash_size; hash[cell_num].off;
                     cell_num = (cell_num + 1) % VolnitskyTraits::hash_size)
                {
                    if (pos >= haystack + hash[cell_num].off - 1)
                    {
                        const auto * res = pos - (hash[cell_num].off - 1);
                        const size_t ind = hash[cell_num].id;
                        if (ans[ind] == 0 && res + needles[ind].size <= haystack_end && fallback_searchers[ind].compare(haystack, haystack_end, res))
                            ans[ind] = count_chars(haystack, res);
                    }
                }
            }
        }
    }

    void putNGramBase(const VolnitskyTraits::Ngram ngram, const int offset, const size_t num)
    {
        size_t cell_num = ngram % VolnitskyTraits::hash_size;

        while (hash[cell_num].off)
            cell_num = (cell_num + 1) % VolnitskyTraits::hash_size;

        hash[cell_num] = {static_cast<VolnitskyTraits::Id>(num), static_cast<VolnitskyTraits::Offset>(offset)};
    }
};


using Volnitsky = VolnitskyBase<true, true, ASCIICaseSensitiveStringSearcher>;
using VolnitskyUTF8 = VolnitskyBase<true, false, ASCIICaseSensitiveStringSearcher>; /// exactly the same as Volnitsky
using VolnitskyCaseInsensitive = VolnitskyBase<false, true, ASCIICaseInsensitiveStringSearcher>; /// ignores non-ASCII bytes
using VolnitskyCaseInsensitiveUTF8 = VolnitskyBase<false, false, UTF8CaseInsensitiveStringSearcher>;

using VolnitskyCaseSensitiveToken = VolnitskyBase<true, true, ASCIICaseSensitiveTokenSearcher>;
using VolnitskyCaseInsensitiveToken = VolnitskyBase<false, true, ASCIICaseInsensitiveTokenSearcher>;

using MultiVolnitsky = MultiVolnitskyBase<true, true, ASCIICaseSensitiveStringSearcher>;
using MultiVolnitskyUTF8 = MultiVolnitskyBase<true, false, ASCIICaseSensitiveStringSearcher>;
using MultiVolnitskyCaseInsensitive = MultiVolnitskyBase<false, true, ASCIICaseInsensitiveStringSearcher>;
using MultiVolnitskyCaseInsensitiveUTF8 = MultiVolnitskyBase<false, false, UTF8CaseInsensitiveStringSearcher>;


}
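
The hash table filled by putNGramBase and scanned by the search loops above uses open addressing with linear probing, where a zero offset marks an empty cell. Below is a minimal sketch of that scheme in isolation. The names Cell, Table, kHashSize, putNGram and forEachCandidate, the field widths, and the table size of 16 are assumptions made for illustration; they are not the definitions from VolnitskyTraits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical table size, kept tiny so that collisions are easy to provoke.
constexpr std::size_t kHashSize = 16;

// Simplified stand-in for one hash cell.
struct Cell
{
    std::uint8_t id;   // index of the needle that owns this n-gram
    std::uint8_t off;  // 1-based offset of the n-gram in the needle; 0 marks an empty cell
};

// Value-initialize the table (Table table{};) so that every cell starts empty.
using Table = std::array<Cell, kHashSize>;

// Insert with linear probing: start at the home cell and walk forward until an empty cell is found.
// The caller must keep the table from filling up, otherwise this loop would never terminate.
inline void putNGram(Table & table, std::uint16_t ngram, std::uint8_t offset, std::uint8_t id)
{
    assert(offset != 0);  // zero is reserved as the "empty cell" marker
    std::size_t cell = ngram % kHashSize;
    while (table[cell].off)
        cell = (cell + 1) % kHashSize;
    table[cell] = Cell{id, offset};
}

// Visit every occupied cell in the probe chain that starts at the n-gram's home cell.
// The chain can contain entries inserted for other n-grams, so each candidate still
// has to be verified by the caller (the code above does this with compare()).
template <typename Callback>
inline void forEachCandidate(const Table & table, std::uint16_t ngram, Callback && callback)
{
    for (std::size_t cell = ngram % kHashSize; table[cell].off; cell = (cell + 1) % kHashSize)
        callback(table[cell]);
}
```

With a table of 16 cells, the n-gram values 3 and 19 share home cell 3, so the second insert lands in cell 4 and a lookup for either value visits both cells. This is why the search loops always confirm a candidate against the haystack before reporting a match.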

milovidov@milovidov-desktop:~/work/ClickHouse$ clickhouse-client
ClickHouse client version 20.7.1.1.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.7.1 revision 54437.

milovidov-desktop :) SELECT SearchPhrase, count() FROM test.hits GROUP BY SearchPhrase ORDER BY count() DESC LIMIT 10

SELECT
    SearchPhrase,
    count()
FROM test.hits
GROUP BY SearchPhrase
ORDER BY count() DESC
LIMIT 10

┌─SearchPhrase───────────────┬─count()─┐
│                            │ 8267016 │
│ ст 12.168.0.1              │    3567 │
│ orton                      │    2402 │
│ игры лица и гым чан дизайн │    2166 │
│ imgsrc                     │    1848 │
│ брызговик                  │    1659 │
│ индийский афтозный         │    1549 │
│ ооооотводка и              │    1480 │
│ выступная мужчин           │    1247 │
│ юность                     │    1112 │
└────────────────────────────┴─────────┘

10 rows in set. Elapsed: 0.054 sec. Processed 8.87 million rows, 112.70 MB (165.22 million rows/s., 2.10 GB/s.)

milovidov-desktop :)

oO08 iIlL1| g9qCGQ ~-+=>

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Pack my box with five dozen liquor jugs.

абвгдеёжзийклмнопрстуфхцчшщъыьэюя
АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
В чащах юга жил бы цитрус? Да, но фальшивый экземпляр.