Proper parsing of urls in C/C++ using the public suffix list

Parsing URLs properly is not a trivial task, especially extracting the full TLD (which Mozilla more accurately calls the public suffix).

The usual way to get a TLD is to take the string after the last dot in the host name. That gives you no validation, and by the same rule co.uk would be a domain. But co.uk can’t be publicly registered, so we call it the public suffix, as it acts like a TLD. And now that new TLDs can be registered by anyone with the infrastructure to run them, maintaining your own list of TLDs is even harder.
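
To make the difference concrete, here is a minimal sketch comparing the naive last-dot rule with a longest-match lookup against a suffix set. The three-entry set and the function names `naive_tld` and `public_suffix` are illustrative assumptions, not part of faup; the real list on publicsuffix.org has thousands of entries plus wildcard and exception rules that this sketch ignores.

```cpp
#include <set>
#include <string>

// Naive rule: everything after the last dot in the host name.
std::string naive_tld(const std::string& host) {
    const std::string::size_type dot = host.rfind('.');
    return dot == std::string::npos ? std::string() : host.substr(dot + 1);
}

// Longest-match lookup against a tiny, hand-written suffix set (illustrative only).
std::string public_suffix(const std::string& host) {
    static const std::set<std::string> suffixes = {"uk", "co.uk", "com"};
    // Try the whole host first, then strip one leading label at a time,
    // so the first hit is the longest matching suffix.
    std::string::size_type start = 0;
    while (true) {
        const std::string candidate = host.substr(start);
        if (suffixes.count(candidate)) return candidate;
        const std::string::size_type dot = host.find('.', start);
        if (dot == std::string::npos) return naive_tld(host); // unknown: fall back
        start = dot + 1;
    }
}
```

For the host domain.co.uk the naive rule yields uk, while the lookup yields co.uk, which is the answer you need in order to work out the registrable domain.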

I found faup to be a good implementation of Mozilla’s public suffix list, especially for applications where URL parsing performance is crucial.

It’s a little tricky to use in a multithreaded environment, as it’s not fully documented yet, so I decided to write an article to help people like me out :).

Getting faup:

This was tested on Ubuntu 12.10, but it should work on other Unix-like systems.

#https://github.com/stricaud/faup/blob/master/README.md
git clone https://github.com/stricaud/faup.git
cd faup
mkdir build
cd build
cmake .. && make
sudo make install

#for ubuntu, configure shared file locations
echo '/usr/local/lib' | sudo tee -a /etc/ld.so.conf.d/faup.conf
sudo ldconfig
ldd /usr/local/bin/faup

Test via the faup command

$ faup -v
$ faup -f tld domain.co.uk

C++ sample using the library:

#include <cstdio>
#include <iostream>
#include <string>

#include <faup/faup.h>
#include <faup/decode.h>
#include <faup/options.h>
#include <faup/output.h>

using namespace std;

int main()
{
 // faup_options_new() is not thread safe and should only be run once per program; it is also the call that loads the cached publicsuffix.org file
 // faup_tld_update(); // updates the cached publicsuffix.org file
 faup_options_t *faup_opts;
 faup_opts = faup_options_new();

 // check that the options were allocated before using them
 if (!faup_opts) {
  fprintf(stderr, "Error: cannot allocate faup options!\n");
  return 1;
 }

 // no need for the default csv output
 // faup_opts->output = FAUP_OUTPUT_NONE; // 20141025 no longer needed as this option is now the default
 // modules slow down faup_init(), 10s vs 0.394s, so 25x slower for the loop
 // faup_opts->exec_modules = FAUP_MODULES_NOEXEC; // 20141025 no longer needed as this option is now the default

 for(int i=0; i<1000000; i++)
 {
  // fh is a pointer to a c struct that has, among others, all the positions where the uri splits in host, tld, etc.
  faup_handler_t *fh;

  // init the faup handler
  fh = faup_init(faup_opts);

  // the url to parse
  string url = "https://domain.co.uk";

  // this is the only call you need to parse a url (if you are not running in multithreaded mode)
  faup_decode(fh, url.c_str(), url.size());

  // equivalent to url.substr(fh->faup.features.tld.pos, fh->faup.features.tld.size);
  string tld = url.substr(faup_get_tld_pos(fh), faup_get_tld_size(fh));

  // cleanup
  faup_terminate(fh);
 }
 return 0;
}
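
The `substr` call in the sample assumes a suffix was found. A guarded version could look like the sketch below. The helper name `extract_feature` is hypothetical, and the assumptions that faup reports a missing feature through a negative position and uses 32-bit position and size types should be checked against the faup headers.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: returns the slice [pos, pos + size) of url,
// or an empty string when the position is unusable.
// Assumption: a missing feature is reported as a negative position.
std::string extract_feature(const std::string& url, std::int32_t pos, std::uint32_t size) {
    if (pos < 0) return std::string();
    const std::string::size_type start = static_cast<std::string::size_type>(pos);
    if (start >= url.size()) return std::string();
    return url.substr(start, size); // substr clamps size to the end of the string
}
```

In the sample this would replace the direct call, for example `string tld = extract_feature(url, faup_get_tld_pos(fh), faup_get_tld_size(fh));`.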

Update 2014-10-24: the sample was added to the main faup branch, so check there for updates.

Compile with:

g++ -W -Wall "%f" -o "%e" -lfaupl -g

Note: replace %f with the source file and %e with the output binary file.

On my system, parsing one million URLs took about 0.394s. Add another 200ms if you also want to update the list from Mozilla.
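
A figure like this can be reproduced with a small timing wrapper around the loop. The sketch below uses `std::chrono::steady_clock`; the helper name `time_seconds` is an assumption, and the callable you pass in stands in for the faup loop.

```cpp
#include <chrono>

// Runs work() once and returns the elapsed wall-clock time in seconds.
template <typename Work>
double time_seconds(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the one-million-iteration loop in a lambda and passing it to `time_seconds` gives a number comparable to the one quoted here, although results will vary with hardware and compiler flags.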

Debugging C and C++ memory problems

Memory management in C/C++ can be tough, especially for long-running applications.

Valgrind is an open source instrumentation framework for dynamic analysis tools. In other words it’s a programming toolkit, most often used for memory debugging and profiling.

The default (and most used) tool is Memcheck. It is a great tool, but its output is sometimes hard to read, and false positives can get in the way of the real problems, especially when working with code you didn’t write.

The massif tool is a heap profiler. It performs detailed heap profiling by taking regular snapshots of a program’s heap. The output is a graph showing heap usage over time, including information about which parts of the program are responsible for the most memory allocations.

    MB
3.952^                                                                    # 
     |                                                          @#:
     |                                                        :@@#:
     |                                                   @@::::@@#: 
     |                                                   @ :: :@@#::
     |                                                 @@@ :: :@@#::
     |                                              @@:@@@ :: :@@#::
     |                                           :::@ :@@@ :: :@@#::
     |                                           : :@ :@@@ :: :@@#::
     |                                         :@: :@ :@@@ :: :@@#:: 
     |                                       @@:@: :@ :@@@ :: :@@#:::
     |                  :       ::         ::@@:@: :@ :@@@ :: :@@#:::
     |               :@@:    ::::: ::::@@@:::@@:@: :@ :@@@ :: :@@#:::
     |            ::::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |           @: ::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |           @: ::@@:  ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |           @: ::@@:::::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |       ::@@@: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |    :::::@ @: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
     |  @@:::::@ @: ::@@:: ::: ::::::: @  :::@@:@: :@ :@@@ :: :@@#:::
   0 +----------------------------------------------------------->Mi
     0                                                           626.4

Number of snapshots: 63
 Detailed snapshots: [3, 4, 10, 11, 15, 16, 29, 33, 34, 36, 39, 41,
                      42, 43, 44, 49, 50, 51, 53, 55, 56, 57 (peak)]

As you can see, it is a little hard to tell what is what, especially for a new user. Don’t worry: this is where the massif-visualizer app comes in handy, as it takes the output from massif and converts it into a beautiful interactive chart:

massif-visualizer Screenshot

Requirements

Install instructions for Ubuntu 14.10 (they probably also work on older versions and on other Debian-based distributions).

  • valgrind

    sudo apt-get install valgrind
    
  • massif-visualizer

    sudo add-apt-repository ppa:kubuntu-ppa/backports 
    sudo apt-get update
    sudo apt-get install massif-visualizer
    

For other operating systems, please consult the Valgrind website.

Usage

$ valgrind --tool=massif --massif-out-file=${outFile} ${command} ${args}
$ massif-visualizer ${outFile}

That’s it. Now inspect the chart by clicking the parts that are of interest and/or hiding functions that get in the way.
If you’re checking for a leak, look at each function’s share of the heap and compare a snapshot from the start of the program, where everything should be in check, with one from the end. If a function’s share has increased and it has no real reason to be accumulating data, then you have found yourself a leak.
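
For a first experiment, a program with a deliberately growing heap makes the chart easy to read. The sketch below is a toy, not taken from a real application: `handle_request`, `run_requests` and the never-pruned cache are hypothetical names that simulate data accumulating for no good reason.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical request handler that "forgets" to prune its cache,
// so heap usage grows by roughly 1 KiB with every call.
std::size_t handle_request(std::vector<std::string>& cache, int id) {
    cache.push_back("payload-" + std::to_string(id) + std::string(1024, 'x'));
    return cache.size();
}

// Simulates a long-running loop; returns the number of entries held at the end.
std::size_t run_requests(int count) {
    std::vector<std::string> cache;
    std::size_t held = 0;
    for (int id = 0; id < count; ++id) {
        held = handle_request(cache, id);
    }
    return held;
}
```

Calling `run_requests` with a large count from `main`, building with `-g`, and running the binary under `valgrind --tool=massif` should show allocations made from `handle_request` taking a steadily larger share of the heap, which is the pattern to look for in real code.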

If you have any questions, advice or just want to show appreciation, use the comment section below :).