BGP Data for Tor

Welcome to the BGP data analytics page for the Tor network.

On this page, we explain our data collection and provide tutorials on running our analytic script.

To learn more details, please refer to our paper Counter-RAPTOR: Safeguarding Tor Against Active Routing Attacks.

BGP Data Collection#

We collected and processed BGP data from June 2016 - August 2017 through six RIS collectors: rrc00, rrc01, rrc03, rrc04, rrc11, rrc14. For more information on RIS collectors, refer to the RIS Raw Data page.

We filtered the BGP data to include only the updates whose subnets cover any Tor relay IP address. We used hourly-updated Tor consensus data from CollecTor to extract the Tor relay IPs.
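The filtering step above can be sketched as follows. This is an illustrative snippet, not the actual processing pipeline: the relay IPs and announced prefixes are made-up placeholders, and the real pipeline reads consensus data from CollecTor and MRT dumps from the RIS collectors.

```python
# Sketch of the relay-prefix filter: keep only BGP updates whose announced
# prefix covers at least one Tor relay IP. All values below are illustrative.
import ipaddress

# In the real pipeline these come from hourly Tor consensus data.
relay_ips = [ipaddress.ip_address(ip) for ip in ["198.51.100.7", "203.0.113.42"]]

def covers_relay(prefix: str) -> bool:
    """Return True if the announced prefix contains any Tor relay IP."""
    net = ipaddress.ip_network(prefix, strict=False)
    return any(ip in net for ip in relay_ips)

# Announced prefixes extracted from BGP updates (placeholders).
updates = ["198.51.100.0/24", "192.0.2.0/24", "203.0.113.0/25"]
kept = [p for p in updates if covers_relay(p)]
print(kept)  # only the prefixes covering a relay IP survive
```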

Run Data Analytics#

We provide an analytic script that we developed for detecting routing anomalies involving Tor relays.

Our script offers two options to run the analytics: a time analytic and a frequency analytic.

First, we need a time window over which we compute the time or frequency of an announcement. This time window is currently hard-coded as 30 days. For instance, when an announcement is analyzed for anomalies, we compute the ratio of (1) the amount of time (or number of times) that this prefix has been announced by the announcing AS in the past 30 days, over (2) the total amount of time (or number of times) that this prefix has been announced by any AS in the past 30 days. If the resulting ratio falls below a detection threshold, the announcement is logged as suspicious.
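The frequency version of this ratio can be sketched in a few lines. This is a simplified stand-in, not the logic of `detection.py` itself: the per-prefix announcement counts and AS numbers are assumed placeholder data, and only the frequency (count-based) variant is shown.

```python
# Simplified frequency analytic: for one prefix, compare how often the
# announcing AS originated it in the past 30-day window against all origins.
from collections import Counter

FREQ_THRESHOLD = 0.0025  # default frequency threshold from the text

def is_suspicious(history: Counter, origin_as: int,
                  threshold: float = FREQ_THRESHOLD) -> bool:
    """history maps origin AS -> number of times it announced this prefix
    in the past 30-day window (placeholder data shape)."""
    total = sum(history.values())
    if total == 0:
        return True  # never-before-seen prefix: flag for inspection
    ratio = history[origin_as] / total
    return ratio < threshold

# 30-day history for one prefix: AS 65001 announced it 400 times, AS 65002 once.
history = Counter({65001: 400, 65002: 1})
print(is_suspicious(history, 65002))  # True: rare origin, logged as suspicious
print(is_suspicious(history, 65001))  # False: dominant origin
```

The time analytic follows the same shape, with accumulated announcement durations in place of counts.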

The script takes two data files as input: the processed BGP update data and the list of Tor relay IPs.

Other optional inputs are threshold values for the time and frequency analytics. The default values are 0.065 and 0.0025, respectively.

Run `python detection.py` to see help messages for the input options.

The script will output a log of the announcements flagged as suspicious.

Approximate runtime (OS X 10.10, 2.5 GHz Intel Core i7, 16 GB 1600 MHz DDR3): ~10 seconds for the time analytic, ~4 minutes for the frequency analytic.

Develop Your Own Analytics#

First of all, take a look at the log: which announcements are marked as suspicious? Is there any pattern? Could they be true positives?

Then, you can start by trying different threshold values for the frequency and time analytics. Or, you can change the time window (currently 30 days) to see if it makes any difference.
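A threshold sweep is an easy first experiment. A minimal sketch, assuming you have collected the per-announcement ratios described earlier (the ratio values below are made-up placeholders, not real script output):

```python
# Count how many announcements each candidate threshold would flag.
# The ratios are illustrative placeholders for the values the analytic
# computes per announcement over its 30-day window.
observed_ratios = [0.0001, 0.0009, 0.003, 0.02, 0.3, 0.98]

for threshold in (0.001, 0.0025, 0.01):
    flagged = sum(1 for r in observed_ratios if r < threshold)
    print(f"threshold={threshold}: {flagged} announcements flagged")
```

Raising the threshold flags more announcements (more potential true positives, but also more false positives); lowering it does the reverse.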

Keep in mind, though, that a low false-positive rate may come at the cost of more false negatives.

Finally, the data is available; feel free to use it!