Skip to main content

The SymPy/HackerRank DMCA Incident

Thank you to Ondřej Čertík, Oscar Benjamin, Travis Oliphant, and Pamela Chestek for reviewing this blog post. Any errors in the post are my own.

On April 19, 2022, for approximately 11.5 hours, the documentation website for the open source SymPy project and the corresponding GitHub repository were taken down as a result of a DMCA takedown notice submitted by WorthIT Solutions on behalf of HackerRank. In this post, I’d like to explain what the DMCA is and how it works, detail exactly what happened with this incident, and go over my views on the incident and some of the lessons we’ve learned as a result.

What is the DMCA?

The information in this section has been drawn from various sources, including:

The Digital Millennium Copyright Act (DMCA) is a US law that was passed in 1998. It has laid the groundwork for how copyright is enforced in the age of the internet. One of the DMCA provisions that is important to understand is the so-called “safe harbor” provision. The safe harbor provision, roughly speaking, makes it so that websites can avoid liability because they are simply hosting or transmitting copyrighted content on behalf of their end-users.

The safe harbor provision has been absolutely essential to the modern internet. Without it, a website like GitHub could not exist. Imagine if every time someone found their copyrighted content had been uploaded to GitHub without their permission, that they could sue GitHub for damages. If this were the case, GitHub could never operate. The way GitHub works today, it cannot possibly know if every single thing that is uploaded to it is done so legally. The safe harbor provision also makes modern social media websites like Facebook, Twitter, and YouTube possible.

However, in order for a website to have and maintain safe harbor status, it must follow certain practices as outlined by the DMCA. In particular, sites must manage the “notice-and-take-down process”. GitHub’s notice-and-take-down process is described in its DMCA takedown policy.

The notice-and-take-down process works like this: a provider (like GitHub) has a means for a copyright holder to submit a notice. In GitHub’s case, they provide an online form. This notice must include certain information, including the name and address of the submitter, identification of the copyrighted work, the material on the site that is infringing copyright, and a statement of good faith, signed under penalty of perjury, that the person submitting the notice owns the copyright and believes their rights have been infringed upon. GitHub’s form also requires the submitter to assert that they have taken fair use into consideration.

Once the notice is submitted and once the provider confirms that the above things are done, they must “expeditiously” remove public access to the content. The DMCA does not define “expeditiously”.

The owner of the accused content then has the opportunity to file a counter-notice. GitHub’s counter notice procedure is described in its guide to submitting a DMCA counter notice. This gives the person whose content has been removed recourse if they believe the original notice was invalid. A counter notice contains a statement by the user that the content was removed “as a result of mistake or misidentification”, signed under penalty of perjury. When a provider receives a counter notice, they must forward it to the original complainant, and restore the challenged material after 10-14 business days. At this point, if the complainant wishes to keep the material down, they must file a lawsuit against the alleged infringer.

By following these procedures as outlined by the DMCA, GitHub avoids liability to either party.

GitHub goes beyond what the DMCA law requires. All notices and counter notices that have ever been submitted to GitHub are published publicly in the github/dmca repository (with personal information redacted). GitHub has also historically sided with developers in high profile infringement cases, such as the youtube-dl incident in 2020 (this incident involved a takedown relating to copyright circumvention, which is a different part of the DMCA law that I haven’t discussed here because it isn’t relevant to the notice that was sent to SymPy).

It’s important to understand, however, that GitHub’s hands are tied in many ways here. If they do not follow the notice and counter notice procedures exactly as outlined in the DMCA law, they risk losing their safe harbor status with the US Copyright Office. Were this to happen, it would be completely disastrous to GitHub as a business. Without safe harbor protection, GitHub could be liable for any copyrighted content that someone uploaded to its platform.

Finally, a technical note: when someone submits a takedown notice on a single part of a GitHub repository, GitHub’s response is to take down the entire repository. This is because it is technically impossible to remove only part of a repository due to the way git works. Completely scrubbing data from git history is possible, but it’s a destructive process that’s very disruptive to developers. Instead, GitHub requires the repository owners to scrub the data themselves, if they choose to, within one business day. Otherwise, they will take the entire repository down.

What happened

Below I provide a quick timeline of what happened. Times are in Mountain Daylight Time, which is where I live.

First, for background, the SymPy documentation website https://docs.sympy.org/ is hosted on GitHub Pages and mirrors the repository https://github.com/sympy/sympy_doc. It is common for websites to be hosted on GitHub Pages because it’s completely free and easy to set up for someone who is already used to using git and GitHub. This blog itself is hosted on GitHub Pages. The sympy_doc repository only contains the pre-built HTML of SymPy’s documentation. The actual source code lives in the main SymPy repository. However, the DMCA claims were only ever made against the website, so the main SymPy source code repository was unaffected.

  • Friday, Apr 15, 4:26 PM: Admins of the sympy/sympy_doc GitHub repository (myself, Ondřej Čertík, and Oscar Benjamin) received an automated email from support@githubsupport.com informing us that a DMCA takedown notice was filed against the SymPy documentation. Specifically, the claim, which can now be found on the github/dmca repository, claimed that “Someone has illegally copied our client's technical exams questions and answers from their official website https://www.hackerrank.com/ and uploaded them on their platform without permission.” It specifically referenced the page https://docs.sympy.org/latest/modules/solvers/solvers.html. The notice was made on behalf of HackerRank by an individual from “WorthIT Solutions Pvt. Ltd.” The notice also stated “the infringing website is not willing to remove our client's work”, which came as a surprise to us since, at no point in time prior to receiving this notice had we received any communications from HackerRank or WorthIT Solutions.

    The support email stated that if we do not remove the offending content within one business day, the repository, and consequently the docs website, will be taken down.

    While it is not relevant to the legality of the situation, it is worth noting that this was on Good Friday and the upcoming Sunday, April 17, was Easter Sunday, which limited our ability to effectively respond over the weekend.

  • Friday, April 15 - Tuesday, April 19: At first, we were not sure if the support email was legitimate or if it was just convincing spam. One thing that confused us was that the support email came from a domain, githubsupport.com, which did not appear to exist. Another issue is that although GitHub’s policy was to post all DMCA claims to the github/dmca repository, this particular claim had not yet been uploaded there. We reached out to GitHub support to ask if the email was legitimate, and they responded Monday, April 18, 3:18 PM that the email was indeed a legitimate email from GitHub Trust & Safety.

    I had already reached out on Friday to NumFOCUS for assistance. NumFOCUS is a 501(c)(3) non-profit organization that represents many open source scientific projects including SymPy. However, due to the limited time frame of the notice and the fact that it occurred over a holiday weekend, NumFOCUS was not able to connect us with legal counsel until Tuesday.

  • Tuesday, April 19, 12:21 PM: We received notice from GitHub that the sympy/sympy_doc repository has been taken offline. The sympy/sympy_doc repository was replaced with a notice that the documentation was taken down by the DMCA notice with a link to the notice, which had been posted to the public github/dmca repository, and the documentation website itself started to 404. Since this was now public knowledge, we publicly announced this on the SymPy mailing list and Twitter.

    We also worked with NumFOCUS and the legal counsel to send a counter notice to GitHub, since we believed that the notice was mistaken and that the claimed infringement, the code and mathematical examples on the docs page, were likely not even copyrightable.

    Not long after the takedown occurred, someone posted it to Hacker News. The Hacker News posting eventually made its way to the front page and by the end of the day it had made its way to rank 3 on the site, where it received hundreds of upvotes and comments. The story also generated a lot of buzz on Twitter at this time.

  • Tuesday, April 19, 6:40 PM: Vivek Ravisankar, the CEO of HackerRank, posted a public apology on the Hacker News thread:

    Hello, I'm Vivek, founder/CEO of HackerRank. Our intention with this initiative is to takedown plagiarized code snippets or solutions to company assessments. This was definitely an unintended consequence. We are looking into it ASAP and also going to do an RCA to ensure this doesn't happen again.
    
    Sorry, everyone!
    
  • Tuesday, April 19, 8:26 PM: Vivek posted a followup comment on Hacker News:

    Hello again, Vivek, founder/CEO here. In the interest of moving swiftly, here are the actions we are going to take:
    (1) We have withdrawn the DMCA notice for sympy; Sent a note to senior leadership in Github to act on this quickly.
    
    (2) We have stopped the whole DMCA process for now and working on internal guidelines of what constitutes a real violation so that these kind of incidents don't happen. We are going to do this in-house
    
    (3) We are going to donate $25k to the sympy project.
    
    As a company we take a lot of pride in helping developers and it sucks to see this. I'm extremely sorry for what happened here.
    

    HackerRank did follow through with the $25k donation to SymPy.

  • Wednesday, April 20, 12:00 AM (approximately): The SymPy documentation website and the sympy/sympy_doc repository went back online.

  • Wednesday, April 20, 2022, 10:11 AM: We received an email notice from GitHub that the DMCA claim against our repository has been retracted. The retraction is made available online on the github/dmca repository.

My Views

Now that I’ve stated the facts, let me take a moment to give my own thoughts on this whole thing. My belief is that the DMCA claim that was made against the SymPy repository was completely baseless and without merit. Not only is no part of the SymPy documentation, to my knowledge, taken illegally from HackerRank or any other copyrighted source, but the claim itself was completely unspecific about which parts of the documentation were supposedly infringing. This made it impossible for us to comply even if the complaint was legitimate. In addition, it is questionable whether the supposed parts of the documentation that were taken from HackerRank, the mathematical and code examples, are even copyrightable. While this may not have been an actively targeted attack against SymPy, as a community run open source project, it served as one in practice.

We have never learned any more about which parts of the SymPy documentation were supposedly infringing on HackerRank’s copyright, and I expect we never will. Some people online have speculated that HackerRank actually took examples from the SymPy documentation, not the other way around, but I do not know if this is the case or not. It seems likely, given the large number of other takedown notices WorthIT has made to GitHub on HackerRank’s behalf, that they were using some sort of automated or semi-automated detection system, which somehow detected what it thought was infringing content on our website. I have personally never used the HackerRank platform, so I only know what sorts of things are on it from second-hand accounts.

Firstly let me say that if anyone or anything is to be blamed for this, my personal belief is that the blame should primarily lie on the DMCA law itself. While the “safe harbor” provisions of the law have been critical to the development of the modern internet, allowing social websites such as GitHub to exist without the burden of legal liability for user contributed content, other provisions are hostile to those same users, even those who are operating in a completely legal manner. The DMCA works in a “guilty until proven innocent” way, and the very operation of the takedown notice policy implicitly assumes that those making claims are not abusing the system as bad actors. It also incentivizes platforms such as GitHub to step aside and not take sides in claims, even claims such as this one, which refer to likely uncopyrightable material or are too unspecific to reasonably address even if they are legitimate.

The DMCA also has had further chilling effects due to its anti-circumvention provisions. The EFF has written extensively about these. While something as radical as abolishing copyright may be considered to be practically untenable, I believe that laws like the DMCA can be rewritten so that they still serve their intended function of protecting copyright owners and providing “safe harbor” to websites like GitHub, while also protecting the rights of those accused of infringing copyright.

With that being said, I think some blame does need to lie on HackerRank and WorthIT Solutions for abusing the DMCA system and filing claims like this. They also lied in their claim when they said “the infringing website is not willing to remove our client's work.” At no point were we contacted by HackerRank or WorthIT Solutions about any copyright infringement. If we were, we would have been happy to work with them to determine whether any content in SymPy is infringing on copyright and remove it if it was. Even if it were the case that SymPy’s documentation infringed on HackerRank’s copyright, the immediate escalation to a DMCA takedown notice, which legally forced GitHub to completely disable the entire SymPy documentation, was completely inappropriate. Even submitting a counter notice is risky, because by doing so it increases the likelihood that the complaining party will bring an infringement lawsuit. Many DMCA’d repositories do not submit a counter notice for this reason, and we were only able to do so comfortably because we were able to do so via NumFOCUS. Had HackerRank not retracted the notice, we would have had to wait for our counter notice to take effect, meaning our docs website would have remained unavailable for 10-14 days. And had they instead taken down the main SymPy repository instead of the sympy_doc repo where the docs are hosted, this would have been a major disruption for our entire development workflow.

I do want to thank Vivek for quickly retracting the notice once this came to his attention, and I hope that he will be rethinking their DMCA policies at HackerRank, and for their donation to support SymPy and NumFOCUS.

Finally, while many have blamed GitHub, I believe that for the most part, they have acted as they are effectively required to by law. It is important to understand that any similar website based in the US would have likely treated this claim in a similar way. For instance, here is GitLab’s DMCA policy, which you can see is very similar to GitHub’s. By all accounts, GitHub is an industry leader here, by posting all notices and counter notices online, which they are not required to do, and by being very open about their DMCA policies ([1], [2], [3]). With that being said, there are ways that GitHub’s processes could be improved here. The one business day that we were given to respond is actually standard in the industry, and ultimately comes from the “expeditiously” wording of the DMCA. But even so, if we were given more time than this, it would have made things much easier. It may have even been possible for us to reach out to HackerRank and resolve the situation before any actual takedown occurred. It also would be helpful if GitHub, while staying within the DMCA provisions, had a higher standard of legitimacy for takedown notices such as the one we received. As I have noted, the request we received was not nearly specific enough for us to effectively act on it, which should have been apparent to GitHub, since the only information WorthIT provided was a link to the HackerRank home page. Finally, I have two small technical requests from GitHub: first is to make githubsupport.com a real website so that support emails appear more legitimate, and second is to post DMCA notices to the DMCA repository right away rather than only after the repository is taken down, to make it clearer that the notices are in fact legitimate.

Open source communities such as SymPy are under-resourced and typically ill equipped to handle situations like these, which inherently puts them in an inferior position. This is despite the fact that open source software serves a global commons that benefits all. I have estimated that SymPy itself is used by hundreds of thousands of people, and the broader scientific Python ecosystem that it is part of is used by millions of people. Abuses of copyright law against these communities are harmful to society as a whole.

Lessons Learned

I’d like to end with a few lessons that I’ve learned from this whole process.

  1. Take DMCA takedown notices seriously. If you ever receive a DMCA takedown notice from GitHub or any other website, you need to take it seriously. The website is required by law to remove your content after they receive such a notice. You can submit a counter notice, but this increases the likelihood of being sued.
  2. NumFOCUS fiscal sponsorship is invaluable. Our NumFOCUS fiscal sponsorship was invaluable here. Thanks to NumFOCUS, we were able to immediately access high quality legal advice on how to proceed here. Had HackerRank not retracted their notice, or even, heaven forbid, decided to sue, we would have had significant assistance and support from NumFOCUS. If you are part of an open source project that doesn’t have fiscal sponsorship, I would recommend looking into NumFOCUS, or other similar organizations. And if you want to support NumFOCUS, I encourage you to do so.
  3. Make backups of your online data. While the source code of any GitHub repository is effectively backed up onto every computer that has a git clone. Other data such as issues and pull request comments are not. We were lucky that the DMCA notice was sent against our documentation repository instead of our main repository. If our main repository were taken down instead, we would have lost access to all GitHub issue and pull request history. I have been looking into effective ways to backup this data. If anyone has any suggestions here, please let me know in the comments.

I do feel that being on a site like GitHub is still preferable to something like self-hosting. If you self-host, you become responsible for a ton of things which GitHub does for you, like making sure everything stays online, handling servers, and managing spam. Also, self-hosting does not magically shield you from legal threats. If you self-host content that infringes on someone’s copyright, you are still legally liable for hosting that content.

Finally, I want to thank everyone who supported SymPy during this incident. Even if you just talked about this on social media or upvoted the Hacker News story, that helped us get this into the public eye, which led to a much faster resolution than we expected. I especially want to thank Leah Silen and Arliss Collins from NumFOCUS; Ondřej Čertík, Oscar Benjamin, and other SymPy community members; Pamela Chestek, the NumFOCUS legal counsel; and Travis Oliphant for the help they provided to us. Thank you to Thomas Dohmke, GitHub’s CEO, who we have reached out to privately, and who has promised to improve GitHub’s DMCA policies. I also want to thank Vivek Ravisankar from HackerRank for retracting the DMCA claim, and for the generous donation to SymPy.

Now that this incident is over, I’m hopeful we in the SymPy community can all go back to building software.

If you wish to donate to support SymPy, you may do so here.

Switching to Utterances Comments

I've ditched Disqus as the comment system on this blog. I am now using Utterances. Utterances is an open source comment system that is backed by GitHub issues. Basically, every post has a corresponding issue opened on GitHub, and the comments on the post are the comments on the issue. Utterances automatically places the comments at the bottom of the post. For example, here is the issue corresponding to this post.

I didn't like Disqus mostly because it serves ads and tracking. Even though I had opted out from as much of it as I could in the Disqus settings, it still loads tracking scripts on every page. I run uBlock Origin always, and it's a bit hypocritical if my own side has things that are blocked by it. In some cases I can't avoid it (as far as I know), like when I embed a YouTube video, but it definitely shouldn't be on every post.

Utterances is a very nice alternative. I has lots of advantages:

  • Comments are not self-hosted. GitHub hosts them. Since you need a GitHub account to comment, this should make comment spam a non-issue.
  • Comments support full Markdown.
  • Users can edit their comments.
  • I can edit and fully moderate all comments.
  • Users log in with a federated system that proves their identity.
  • Email subscription to posts.
  • No ads or tracking.
  • It's completely open source and free to use.

If you use Nikola like I do, it natively supports Utterances (a feature which I added). Otherwise, go to the Utterances and paste the script tag generated at the bottom into your blog template. Then install the Utterances app in your repo, and you are done.

Exporting Disqus Comments

Some of my old posts had Disqus comments, which I wanted to preserve somehow. Here is guide on how I did that, since it wasn't as straightforward as I would have hoped.

The first step is to export your Disqus comments. It's very difficult to actually find the place in the Disqus site where you do this, but I finally found the URL. The export takes some time to complete (for me it took about half an hour). When it finished, Disqus will email you an XML file with all your comments. Note that the file contains all comments for all sites you have ever set up with Disqus. For me, it also included all the comments on my old Wordpress blog, as well as posts for draft blog posts that I never ended up publishing. It also contained all comments that were marked as spam, so you will need to remember to filter those.

I decided that since I only have a handful of posts with Disqus comments, I would just write a script to process them all and manually print them out, which I will then manually enter in to the Utterances comment system for those posts.

I wrote a script to process the comments, which you can find here. Disqus does provides an XML schema for the XML. I used a library called xsData, which lets you take an XML scheme and generate Python dataclasses corresponding to it, which make manipulating the parsed XML much easier than the standard library xml library. The script outputs text like

========== Comments from https://asmeurer.github.io/blog/posts/what-happens-when-you-mess-with-hashing-in-python/ ==========

These are the original comments on this post that were made when this blog used the [Disqus blog system](https://www.asmeurer.com/blog/posts/switching-to-utterances-comments/).

>**Comment from bjd2385 on 2016-08-28 12:33:12+00:00:**

><p>Very interesting post. I was just looking into hash functions (I've never formally learned what the topic entails), and since I'm most familiar with Python this post explained quite a bit, especially your early mathematical points.</p>

>**Comment from Mark Lawrence on 2016-10-03 20:26:54+00:00:**

><p>At what point does Python 3 force the override of __hash__ if you've defined __eq__?  E.g when would your</p><p>AlwaysEqual class fail?</p>

>**Replies:**

>>**Comment from asmeurer on 2016-10-03 20:38:13+00:00:**

>><p>That's a good catch. I originally wrote this post in Python 2. The example does indeed fail in Python 3. More specifically, if you override __eq__, Python 3 automatically sets __hash__ to None. I'll update the post to make this more clear.</p>

>**Comment from Erick Mendonça on 2017-07-30 03:23:55+00:00:**

><p>Great article! We must really pay attention to these details when implementing custom hashes.</p>

>**Comment from Ignacio on 2017-10-07 22:31:56+00:00:**

><p>Thanks a lot for this post! Clarified a lot of concepts.</p>

which I then manually copied to each post's Utterances page on GitHub.

Feel free to adapt my script if you find yourself in a similar situation.

Utterances Comments

Feel free to use the comments on this page to play around with the commenting system.

Note that to comment, there are two options. You can log in on this page, which will let you type your comment in the box below. This requires giving the Utterances bot access to your GitHub account. Alternately, if you don't want to give a bot access, you can just go directly to the GitHub issue page and comment there. I am currently in the process of figuring out how to add some boilerplate to each page that makes this clear (see this Utterances issue). If anyone has any suggestions on how to do this, let me know. For now, I am just going to manually add a statement about this as the first comment on each post.

Verifying the Riemann Hypothesis with SymPy and mpmath

Like most people, I've had a lot of free time recently, and I've spent some of it watching various YouTube videos about the Riemann Hypothesis. I've collected the videos I've watched into YouTube playlist. The playlist is sorted with the most mathematically approachable videos first, so even if you haven't studied complex analysis before, you can watch the first few. If you have studied complex analysis, all the videos will be within your reach (none of them are highly technical with proofs). Each video contains parts that aren't in any of the other videos, so you will get something out of watching each of them.

One of the videos near the end of the playlist is a lecture by Keith Conrad. In it, he mentioned a method by which one could go about verifying the Riemann Hypothesis with a computer. I wanted to see if I could do this with SymPy and mpmath. It turns out you can.

Background Mathematics

Euler's Product Formula

Before we get to the computations, let's go over some mathematical background. As you may know, the Riemann Hypothesis is one of the 7 Millennium Prize Problems outlined by the Clay Mathematics Institute in 2000. The problems have gained some fame because each problem comes with a $1,000,000 prize if solved. One problem, the Poincaré conjecture, has already been solved (Grigori Perelman who solved it turned down the 1 million dollar prize). The remainder remain unsolved.

The Riemann Hypothesis is one of the most famous of these problems. The reason for this is that the problem is central many open questions in number theory. There are hundreds of theorems which are only known to be true contingent on the Riemann Hypothesis, meaning that if the Riemann Hypothesis were proven, immediately hundreds of theorems would be proven as well. Also, unlike some other Millennium Prize problems, like P=NP, the Riemann Hypothesis is almost universally believed to be true by mathematicians. So it's not a question of whether or not it is true, just one of how to actually prove it. The problem has been open for over 160 years, and while many advances have been made, no one has yet come up with a proof of it (crackpot proofs aside).

To understand the statement of the hypothesis, we must first define the zeta function. Let

$$\zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s}$$

(that squiggle $\zeta$ is the lowercase Greek letter zeta). This expression makes sense if $s$ is an integer greater than or equal to 2, $s=2, 3, 4, \ldots$, since we know from simple arguments from calculus that the summation converges in those cases (it isn't important for us what those values are, only that the summation converges). The story begins with Euler, who in 1740 considered the following infinite product:

$$\prod_{\text{$p$ prime}}\frac{1}{1 - \frac{1}{p^s}}.$$

The product ranges over all prime numbers, i.e., it is $$\left(\frac{1}{1 - \frac{1}{2^s}}\right)\cdot\left(\frac{1}{1 - \frac{1}{3^s}}\right)\cdot\left(\frac{1}{1 - \frac{1}{5^s}}\right)\cdots.$$ The fraction $\frac{1}{1 - \frac{1}{p}}$ may seem odd at first, but consider the famous geometric series formula, $$\sum_{k=0}^\infty r^k = \frac{1}{1 - r},$$ which is true for $|r| < 1$. Our fraction is exactly of this form, with $r = \frac{1}{p^s}$. So substituting, we have

$$\prod_{\text{$p$ prime}}\frac{1}{1 - \frac{1}{p^s}} = \prod_{\text{$p$ prime}}\sum_{k=0}^\infty \left(\frac{1}{p^s}\right)^k = \prod_{\text{$p$ prime}}\sum_{k=0}^\infty \left(\frac{1}{p^k}\right)^s.$$

Let's take a closer look at what this is. It is

$$\left(1 + \frac{1}{p_1^s} + \frac{1}{p_1^{2s}} + \frac{1}{p_1^{3s}} + \cdots\right)\cdot\left(1 + \frac{1}{p_2^s} + \frac{1}{p_2^{2s}} + \frac{1}{p_2^{3s}} + \cdots\right)\cdot\left(1 + \frac{1}{p_3^s} + \frac{1}{p_3^{2s}} + \frac{1}{p_3^{3s}} + \cdots\right)\cdots,$$

where $p_1$ is the first prime, $p_2$ is the second prime, and so on. Now think about how to expand finite products of finite sums, for instance, $$(x_1 + x_2 + x_3)(y_1 + y_2 + y_3)(z_1 + z_2 + z_3).$$ To expand the above, you would take a sum of every combination where you pick one $x$ term, one $y$ term, and one $z$ term, giving

$$x_1y_1z_1 + x_1y_1z_2 + \cdots + x_2y_1z_3 + \cdots + x_3y_2z_1 + \cdots + x_3y_3z_3.$$

So to expand the infinite product, we do the same thing. We take every combination of picking $1/p_i^{ks}$, with one $k$ for each $i$. If we pick infinitely many non-$1$ powers, the product will be zero, so we only need to consider terms where there are finitely many primes. The resulting sum will be something like

$$\frac{1}{1^s} + \frac{1}{p_1^s} + \frac{1}{p_2^s} + \frac{1}{\left(p_1^2\right)^s} + \frac{1}{p_3^s} + \frac{1}{\left(p_1p_2\right)^s} + \cdots,$$

where each prime power combination is picked exactly once. However, we know by the Fundamental Theorem of Arithmetic that when you take all combinations of products of primes that you get each positive integer exactly once. So the above sum is just

$$\frac{1}{1^s} + \frac{1}{2^s} + \frac{1}{3^s} + \cdots,$$ which is just $\zeta(s)$ as we defined it above.

In other words,

$$\zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s} = \prod_{\text{$p$ prime}}\frac{1}{1 - \frac{1}{p^s}},$$ for $s = 2, 3, 4, \ldots$. This is known as Euler's product formula for the zeta function. Euler's product formula gives us our first clue as to why the zeta function can give us insights into prime numbers.

Analytic Continuation

In 1859, Bernhard Riemann wrote a short 9 page paper on number theory and the zeta function. It was the only paper Riemann ever wrote on the subject of number theory, but it is undoubtedly one of the most important papers every written on the subject.

In the paper, Riemann considered that the zeta function summation,

$$\zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s},$$

makes sense not just for integers $s = 2, 3, 4, \ldots$, but for any real number $s > 1$ (if $s = 1$, the summation is the harmonic series, which famously diverges). In fact, it is not hard to see that for complex $s$, the summation makes sense so long as $\mathrm{Re}(s) > 1$ (for more about what it even means for $s$ to be complex in that formula, and the basic ideas of analytic continuation, I recommend 3Blue1Brown's video from my YouTube playlist).

Riemann wanted to extend this function to the entire complex plane, not just $\mathrm{Re}(s) > 1$. The process of doing this is called analytic continuation. The theory of complex analysis tells us that if we can find an extension of $\zeta(s)$ to the whole complex plan that remains differentiable, then that extension is unique, and we can reasonably say that that is the definition of the function everywhere.

Riemann used the following approach. Consider what we might call the "completed zeta function"

$$Z(s) = \pi^{-\frac{s}{2}}\Gamma\left(\frac{s}{2}\right)\zeta(s).$$

Using Fourier analysis, Riemann gave a formula for $Z(s)$ that is defined everywhere, allowing us to use it to define $\zeta(s)$ to the left of 1. I won't repeat Riemann's formula for $Z(s)$ as the exact formula isn't important, but from it one could also see

  1. $Z(s)$ is defined everywhere in the complex plane, except for simple poles at 0 and 1.

  2. $Z(s) = Z(1 - s).$ This means if we have a value for $s$ that is right of the line $\mathrm{Re}(z) = \frac{1}{2},$ we can get a value to the left of it by reflecting it over the real-axis and the line at $\frac{1}{2}$ (to see this, note that the average of $s$ and $1 - s$ is $1/2$, so the midpoint of a line connecting the two should always go through the point $1/2$).

Reflection of s and 1 - s

(Reflection of $s$ and $1 - s$. Created with Geogebra)

Zeros

Looking at $Z(s)$, it is a product of three parts. So the zeros and poles of $Z(s)$ correspond to the zeros and poles of these parts, unless they cancel. $\pi^{-\frac{s}{2}}$ is the easiest: it has no zeros and no poles. The second part is the gamma function. $\Gamma(z)$ has no zeros and has simple poles at nonpositive integers $z=0, -1, -2, \ldots$.

So taking this, along with the fact that $Z(s)$ is entire except for simple poles at 0 and 1, we get from $$\zeta(s) = \frac{Z(s)}{\pi^{-\frac{s}{2}}\Gamma\left(\frac{s}{2}\right)}$$

  1. $Z(s)$ has a simple pole at 1, which means that $\zeta(s)$ does as well. This is not surprising, since we already know the summation formula from above diverges as $s$ approaches 1.
  2. $Z(s)$ has a simple pole at 0. Since $\Gamma\left(\frac{s}{2}\right)$ also has a simple pole at 0, they must cancel and $\zeta(s)$ must have neither a zero nor a pole at 0 (in fact, $\zeta(0) = -1/2$).
  3. Since $\Gamma\left(\frac{s}{2}\right)$ has no zeros, there are no further poles of $\zeta(s)$. Thus, $\zeta(s)$ is entire everywhere except for a simple pole at $s=1$.
  4. $\Gamma\left(\frac{s}{2}\right)$ has poles at the remaining negative even integers. Since $Z(s)$ has no poles there, these must correspond to zeros of $\zeta(s)$. These are the so-called "trivial" zeros of the zeta function, at $s=-2, -4, -6, \ldots$. The term "trivial" here is a relative one. They are trivial to see from the above formula, whereas other zeros of $\zeta(s)$ are much harder to find.
  5. $\zeta(s) \neq 0$ if $\mathrm{Re}(s) > 1$. One way to see this is from the Euler product formula. Since each term in the product is not zero, the function itself cannot be zero (this is a bit hand-wavy, but it can be made rigorous). This implies that $Z(s) \neq 0$ in this region as well. We can reflect $\mathrm{Re}(s) > 1$ over the line at $\frac{1}{2}$ by considering $\zeta(1 - s)$. Using the above formula and the fact that $Z(s) = Z(1 - s)$, we see that $\zeta(s)$ cannot be zero for $\mathrm{Re}(s) < 0$ either, with the exception of the aforementioned trivial zeros at $s=-2, -4, -6, \ldots$.

Thus, any non-trivial zeros of $\zeta(s)$ must have real part between 0 and 1. This is the so-called "critical strip". Riemann hypothesized that these zeros are not only between 0 and 1, but are in fact on the line dividing the strip at real part equal to $1/2$. This line is called the "critical line". This is Riemann's famous hypothesis: that all the non-trivial zeros of $\zeta(s)$ have real part equal to $1/2$.

Computational Verification

Whenever you have a mathematical hypothesis, it is good to check if it is true numerically. Riemann himself used some methods (not the same ones we use here) to numerically estimate the first few non-trivial zeros of $\zeta(s)$, and found that they lied on the critical line, hence the motivation for his hypothesis. Here is an English translation of his original paper if you are interested.

If we verified that all the zeros in the critical strip from, say, $\mathrm{Im}(s) = 0$ to $\mathrm{Im}(s) = N$ are in fact on the critical line for some large $N$, then it would give evidence that the Riemann Hypothesis is true. However, to be sure, this would not constitute a proof. Hardy showed in 1914 that $\zeta(s)$ has infinitely many zeros on the critical strip, so only finding finitely many of them would not suffice as a proof. (Although if we were to find a counter-example, a zero not on the critical line, that WOULD constitute a proof that the Hypothesis is false. However, there are strong reasons to believe that the hypothesis is not false, so this would be unlikely to happen.)

How would we verify that the zeros are all on the line $1/2$. We can find zeros of $\zeta(s)$ numerically, but how would we know if the real part is really exactly 0.5 and not 0.500000000000000000000000000000000001? And more importantly, just because we find some zeros, it doesn't mean that we have all of them. Maybe we can find a bunch of zeros on the critical line, but how would we be sure that there aren't other zeros lurking around elsewhere on the critical strip?

We want to find rigorous answers to these two questions:

  1. How can we count the number of zeros between $\mathrm{Im}(s) = 0$ and $\mathrm{Im}(s) = N$ of $\zeta(s)$?

  2. How can we verify that all those zeros lie on the critical line, that is, they have real part equal to exactly $1/2$?

Counting Zeros Part 1

To answer the first question, we can make use of a powerful theorem from complex analysis called the argument principle. The argument principle says that if $f$ is a meromorphic function on some closed contour $C$, and does not have any zeros or poles on $C$ itself, then

$$\frac{1}{2\pi i}\oint_C \frac{f'(z)}{f(z)}\,dz = \#\left\{\text{zeros of $f$ inside of C}\right\} - \#\left\{\text{poles of $f$ inside of C}\right\},$$ where all zeros and poles are counted with multiplicity.

In other words, the integral on the left-hand side counts the number of zeros of $f$ minus the number of poles of $f$ in a region. The argument principle is quite easy to show given the Cauchy residue theorem (see the above linked Wikipedia article for a proof). The expression $f'(z)/f(z)$ is called the "logarithmic derivative of $f$", because it equals $\frac{d}{dz} \log(f(z))$ (although it makes sense even without defining what "$\log$" means).

One should take a moment to appreciate the beauty of this result. The left-hand side is an integral, something we generally think of as being a continuous quantity. But it is always exactly equal to an integer. Results such as these give us a further glimpse at how analytic functions and complex analysis can produce theorems about number theory, a field which one would naively think can only be studied via discrete means. In fact, these methods are far more powerful than discrete methods. For many results in number theory, we only know how to prove them using complex analytic means. So-called "elementary" proofs for these results, or proofs that only use discrete methods and do not use complex analysis, have not yet been found.

Practically speaking, the fact that the above integral is exactly an integer means that if we compute it numerically and it comes out to something like 0.9999999, we know that it must in fact equal exactly 1. So as long as we get a result that is near an integer, we can round it to the exact answer.

We can integrate a contour along the critical strip up to some $\mathrm{Im}(s) = N$ to count the number of zeros up to $N$ (we have to make sure to account for the poles. I go into more details about this when I actually compute the integral below).

Counting Zeros Part 2

So using the argument principle, we can count the number of zeros in a region. Now how can we verify that they all lie on the critical line? The answer lies in the $Z(s)$ function defined above. By the points outlined in the previous section, we can see that $Z(s)$ is zero exactly where $\zeta(s)$ is zero on the critical strip, and it is not zero anywhere else. In other words,

the zeros of $Z(s)$ are exactly the non-trivial zeros of $\zeta(s)$.

This helps us because $Z(s)$ has a nice property on the critical line. First we note that $Z(s)$ commutes with conjugation, that is $\overline{Z(s)} = Z(\overline{s})$ (this isn't obvious from what I have shown, but it is true). On the critical line $\frac{1}{2} + it$, we have

$$\overline{Z\left(\frac{1}{2} + it\right)} = Z\left(\overline{\frac{1}{2} + it}\right) = Z\left(\frac{1}{2} - it\right).$$

However, $Z(s) = Z(1 - s)$, and $1 - \left(\frac{1}{2} - it\right) = \frac{1}{2} + it$, so

$$\overline{Z\left(\frac{1}{2} + it\right)} = Z\left(\frac{1}{2} + it\right),$$

which means that $Z\left(\frac{1}{2} + it\right)$ is real valued for real $t$.

This simplifies things a lot, because it is much easier to find zeros of a real function. In fact, we don't even care about finding the zeros, only counting them. Since $Z(s)$ is continuous, we can use a simple method: counting sign changes. If a continuous real function changes signs from negative to positive or from positive to negative n times in an interval, then it must have at least n zeros in that interval. It may have more, for instance, if some zeros are clustered close together, or if a zero has a multiplicity greater than 1, but we know that there must be at least n.

So our approach to verifying the Riemann Hypothesis is as such:

  1. Integrate $\frac{1}{2\pi i}\oint_C Z'(s)/Z(s)\,ds$ along a contour $C$ that runs along the critical strip up to some $\mathrm{Im}(s) = N$. The integral will tell us there are exactly $n$ zeros in the contour, counting multiplicity.

  2. Try to find $n$ sign changes of $Z(1/2 + it)$ for $t\in [0, N]$. If we can find $n$ of them, we are done. We have confirmed all the zeros are on the critical line.

Step 2 would fail if the Riemann Hypothesis is false, in which case a zero wouldn't be on the critical line. But it would also fail if a zero has a multiplicity greater than 1, since the integral would count it more times than the sign changes. Fortunately, as it turns out, the Riemann Hypothesis has been verified up to N = 10000000000000, and no one has yet found a zero of the zeta function yet that has a multiplicity greater than 1, so we should not expect that to happen (no one has yet found a counterexample to the Riemann Hypothesis either).

Verification with SymPy and mpmath

We now use SymPy and mpmath to compute the above quantities. We use SymPy to do symbolic manipulation for us, but the heavy work is done by mpmath. mpmath is a pure Python library for arbitrary precision numerics. It is used by SymPy under the hood, but it will be easier for us to use it directly. It can do, among other things, numeric integration. When I first tried to do this, I tried using the scipy.special zeta function, but unfortunately, it does not support complex arguments.

First we do some basic imports

>>> from sympy import *
>>> import mpmath
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> s = symbols('s')

Define the completed zeta function $Z = \pi^{-s/2}\Gamma(s/2)\zeta(s)$.

>>> Z = pi**(-s/2)*gamma(s/2)*zeta(s)

We can verify that Z is indeed real for $s = \frac{1}{2} + it.$

>>> Z.subs(s, 1/2 + 0.5j).evalf()
-1.97702795164031 + 5.49690501450151e-17*I

We get a small imaginary part due to the way floating point arithmetic works. Since it is below 1e-15, we can safely ignore it.

D will be the logarithmic derivative of Z.

>>> D = simplify(Z.diff(s)/Z)
>>> D
polygamma(0, s/2)/2 - log(pi)/2 + Derivative(zeta(s), s)/zeta(s)

This is $$\frac{\operatorname{polygamma}{\left(0,\frac{s}{2} \right)}}{2} - \frac{\log{\left(\pi \right)}}{2} + \frac{ \zeta'\left(s\right)}{\zeta\left(s\right)}.$$

Note that logarithmic derivatives behave similar to logarithms. The logarithmic derivative of a product is the sum of logarithmic derivatives (the $\operatorname{polygamma}$ function is the derivative of $\Gamma$).

We now use lambdify to convert the SymPy expressions Z and D into functions that are evaluated using mpmath. A technical difficulty here is that the derivative of the zeta function $\zeta'(s)$ does not have a closed-form expression. mpmath's zeta can evaluate $\zeta'$ but it doesn't yet work with sympy.lambdify (see SymPy issue 11802). So we have to manually define "Derivative" in lambdify, knowing that it will be the derivative of zeta when it is called. Beware that this is only correct for this specific expression where we know that Derivative will be Derivative(zeta(s), s).

>>> Z_func = lambdify(s, Z, 'mpmath')
>>> D_func = lambdify(s, D, modules=['mpmath',
...     {'Derivative': lambda expr, z: mpmath.zeta(z, derivative=1)}])

Now define a function to use the argument principle to count the number of zeros up to $Ni$. Due to the symmetry $Z(s) = Z(1 - s)$, it is only necessary to count zeros in the top half-plane.

We have to be careful about the poles of $Z(s)$ at 0 and 1. We can either integrate right above them, or expand the contour to include them. I chose to do the former, starting at $0.1i$. It is known that there $\zeta(s)$ has no zeros near the real axis on the critical strip. I could have also expanded the contour to go around 0 and 1, and offset the result by 2 to account for the integral counting those points as poles.

It has also been shown that there are no zeros on the lines $\mathrm{Re}(s) = 0$ or $\mathrm{Re}(s) = 1$, so we do not need to worry about that. If the upper point of our contour happens to have zeros exactly on it, we would be very unlucky, but even if this were to happen we could just adjust it up a little bit.

Our contour

(Our contour with $N=10$. Created with Geogebra)

mpmath.quad can take a list of points to compute a contour. The maxdegree parameter allows us to increase the degree of the quadrature if it becomes necessary to get an accurate result.

>>> def argument_count(func, N, maxdegree=6):
...     return 1/(2*mpmath.pi*1j)*(mpmath.quad(func,
...         [1 + 0.1j, 1 + N*1j, 0 + N*1j, 0 + 0.1j,  1 + 0.1j],
...         maxdegree=maxdegree))

Now let's test it. Lets count the zeros of $$s^2 - s + 1/2$$ in the box bounded by the above rectangle ($N = 10$).

>>> expr = s**2 - s + S(1)/2
>>> argument_count(lambdify(s, expr.diff(s)/expr), 10)
mpc(real='1.0', imag='3.4287545414000525e-24')

The integral is 1. We can confirm there is indeed one zero in this box, at $\frac{1}{2} + \frac{i}{2}$.

>>> solve(s**2 - s + S(1)/2)
[1/2 - I/2, 1/2 + I/2]

Now compute points of $Z$ along the critical line so we can count the sign changes. We also make provisions in case we have to increase the precision of mpmath to get correct results here. dps is the number of digits of precision the values are computed to. The default is 15, but mpmath can compute values to any number of digits. mpmath.chop zeros out values that are close to 0, which removes any numerically insignificant imaginary parts that arise from the floating point evaluation.

>>> def compute_points(Z_func, N, npoints=10000, dps=15):
...     import warnings
...     old_dps = mpmath.mp.dps
...     points = np.linspace(0, N, npoints)
...     try:
...         mpmath.mp.dps = dps
...         L = [mpmath.chop(Z_func(i)) for i in 1/2 + points*1j]
...     finally:
...         mpmath.mp.dps = old_dps
...     if L[-1] == 0:
...         # mpmath will give 0 if the precision is not high enough, since Z
...         # decays rapidly on the critical line.
...         warnings.warn("You may need to increase the precision")
...     return L

Next define a function to count the number of sign changes in a list of real values.

>>> def sign_changes(L):
...     """
...     Count the number of sign changes in L
...
...     Values of L should all be real.
...     """
...     changes = 0
...     assert im(L[0]) == 0, L[0]
...     s = sign(L[0])
...     for i in L[1:]:
...         assert im(i) == 0, i
...         s_ = sign(i)
...         if s_ == 0:
...             # Assume these got chopped to 0
...             continue
...         if s_ != s:
...             changes += 1
...         s = s_
...     return changes

For example, for $\sin(s)$ from -10 to 10, there are 7 zeros ($3\pi\approx 9.42$).

>>> sign_changes(lambdify(s, sin(s))(np.linspace(-10, 10)))
7

Now we can check how many zeros of $Z(s)$ (and hence non-trivial zeros of $\zeta(s)$) we can find. According to Wikipedia, the first few non-trivial zeros of $\zeta(s)$ in the upper half-plane are 14.135, 21.022, and 25.011.

First try up to $N=20$.

>>> argument_count(D_func, 20)
mpc(real='0.99999931531867581', imag='-3.2332902529067346e-24')

Mathematically, the above value must be an integer, so we know it is 1.

Now check the number of sign changes of $Z(s)$ from $\frac{1}{2} + 0i$ to $\frac{1}{2} + 20i$.

>>> L = compute_points(Z_func, 20)
>>> sign_changes(L)
1

So it checks out. There is one zero between $0$ and $20i$ on the critical strip, and it is in fact on the critical line, as expected!

Now let's verify the other two zeros from Wikipedia.

>>> argument_count(D_func, 25)
mpc(real='1.9961479945577916', imag='-3.2332902529067346e-24')
>>> L = compute_points(Z_func, 25)
>>> sign_changes(L)
2
>>> argument_count(D_func, 30)
mpc(real='2.9997317058520916', imag='-3.2332902529067346e-24')
>>> L = compute_points(Z_func, 30)
>>> sign_changes(L)
3

Both check out as well.

Since we are computing the points, we can go ahead and make a plot as well. However, there is a technical difficulty. If you naively try to plot $Z(1/2 + it)$, you will find that it decays rapidly, so fast that you cannot really tell where it crosses 0:

>>> def plot_points_bad(L, N):
...     npoints = len(L)
...     points = np.linspace(0, N, npoints)
...     plt.figure()
...     plt.plot(points, L)
...     plt.plot(points, [0]*npoints, linestyle=':')
>>> plot_points_bad(L, 30)

So instead of plotting $Z(1/2 + it)$, we plot $\log(|Z(1/2 + it)|)$. The logarithm will make the zeros go to $-\infty$, but these will be easy to see.

>>> def plot_points(L, N):
...     npoints = len(L)
...     points = np.linspace(0, N, npoints)
...     p = [mpmath.log(abs(i)) for i in L]
...     plt.figure()
...     plt.plot(points, p)
...     plt.plot(points, [0]*npoints, linestyle=':')
>>> plot_points(L, 30)

The spikes downward are the zeros.

Finally, let's check up to N=100. OEIS A072080 gives the number of zeros of $\zeta(s)$ in upper half-plane up to $10^ni$. According to it, we should get 29 zeros between $0$ and $100i$.

>>> argument_count(D_func, 100)
mpc(real='28.248036536895913', imag='-3.2332902529067346e-24')

This is not near an integer. This means we need to increase the precision of the quadrature (the maxdegree argument).

>>> argument_count(D_func, 100, maxdegree=9)
mpc(real='29.000000005970151', imag='-3.2332902529067346e-24')

And the sign changes...

>>> L = compute_points(Z_func, 100)
__main__:11: UserWarning: You may need to increase the precision

Our guard against the precision being too low was triggered. Try raising it (the default dps is 15).

>>> L = compute_points(Z_func, 100, dps=50)
>>> sign_changes(L)
29

They both give 29. So we have verified the Riemann Hypothesis up to $100i$!

Here is a plot of these 29 zeros.

>>> plot_points(L, 100)

(remember that the spikes downward are the zeros)

Conclusion

$N=100$ takes a few minutes to compute, and I imagine larger and larger values would require increasing the precision more, slowing it down even further, so I didn't go higher than this. But it is clear that this method works.

This was just me playing around with SymPy and mpmath, but if I wanted to actually verify the Riemann Hypothesis, I would try to find a more efficient method of computing the above quantities. For the sake of simplicity, I used $Z(s)$ for both the argument principle and sign changes computations, but it would have been more efficient to use $\zeta(s)$ for the argument principle integral, since it has a simpler formula. It would also be useful if there were a formula with similar properties to $Z(s)$ (real on the critical line with the same zeros as $\zeta(s)$), but that did not decay as rapidly.

Furthermore, for the argument principle integral, I would like to see precise error estimates for the integral. We saw above with $N=100$ with the default quadrature that we got a value of 28.248, which is not close to an integer. This tipped us off that we should increase the quadrature, which ended up giving us the right answer, but if the original number happened to be close to an integer, we might have been fooled. Ideally, one would like know the exact quadrature degree needed. If you can get error estimates guaranteeing the error for the integral will be less than 0.5, you can always round the answer to the nearest integer. For the sign changes, you don't need to be as rigorous, because simply seeing as many sign changes as you have zeros is sufficient. However, one could certainly be more efficient in computing the values along the interval, rather than just naively computing 10000 points and raising the precision until it works, as I have done.

One would also probably want to use a faster integrator than mpmath (like one written in C), and perhaps also find a faster to evaluate expression than the one I used for $Z(s)$. It is also possible that one could special-case the quadrature algorithm knowing that it will be computed on $\zeta'(s)/\zeta(s)$.

In this post I described the Riemann zeta function and the Riemann Hypothesis, and showed how to computationally verify it. But I didn't really go over the details of why the Riemann Hypothesis matters. I encourage you to watch the videos in my YouTube playlist if you want to know this. Among other things, the truth of the Riemann Hypothesis would give a very precise bound on the distribution of prime numbers. Also, the non-trivial zeros of $\zeta(s)$ are, in some sense, the "spectrum" of the prime numbers, meaning they exactly encode the position of every prime on the number line.