Get Root and Sub Domain From URL in Python: A Complete Guide

Have you ever wondered how websites are organized? Just like a postal address helps you find a house, a URL helps your browser locate a website. But what if you only need the neighborhood (root domain) or the building (subdomain)? In this article, we’ll explore all possible ways to extract the root domain and subdomain from a URL using Python

Understanding URLs: The Postal Address of the Internet

Understanding URLs in Python

Think of a URL as a postal address for a website. For example:

https://syntaxscenarios.com/category/python

Here’s what each part means:

  • https: This is the protocol. Think of it as the mode of transportation (like a car or bike).
  • syntaxscenarios.com: This is the domain. It’s like the full address.
  • /category/python: This is the path. It’s like the specific room in the building.

Our goal is to extract the neighborhood (root domain) and the building (subdomain) from this address.

What Are Root Domain and Subdomain?

Domain and Subdomain in URL

Let’s break it down:

Root Domain

This is the main part of the address. For example, in

www.syntaxscenarios.com

 , the root domain is syntaxscenarios.com. Think of it as the neighborhood.

Subdomain

This is an optional part that comes before the root domain. For example, in

www.syntaxscenarios.com

, the subdomain is www. And if the URL was:

https://blog.syntaxscenarios.com

, the subdomain would be blog. Think of it as the specific building in the neighborhood.

Why Extract Root and Sub Domains?

Why Extract Root and Sub Domains in Python?

Imagine you’re a delivery person. You need to know the neighborhood (root domain) to find the general area and the building (subdomain) to deliver the package to the right place. Similarly, in programming, extracting these parts helps in:

  • Analyzing website traffic.
  • Building web scrapers.
  • Organizing data for SEO tools.

Methods to Get Root and Sub Domain From URL in Python 

Below are the possible ways to get root and subdomain from URL in Python. Let’s go step by step.

Method 1: Using urlparse (The Basic Street Sign)

The urlparse module is part of Python’s standard library. It’s like a basic street sign that tells you the main address, but you have to figure out the subdomain yourself.

Syntax:

parsed_url = urlparse('url')
  • urlparse('url'): Parses the given URL and returns its components.

Example:

from urllib.parse import urlparse

url = "https://www.syntaxscenarios.com/category/python"
parsed_url = urlparse(url)
host = parsed_url.hostname  # 'www.syntaxscenarios.com'
parts = host.split('.')
subdomain = ".".join(parts[:-2])  # 'www'
root_domain = ".".join(parts[-2:])  # 'syntaxscenarios.com'

print("Subdomain:", subdomain)
print("Root Domain:", root_domain)

Output:

Subdomain: www  
Root Domain: syntaxscenarios.com

Limitations:

  • It may struggle with complex domains like .co.uk.
  • Requires manual adjustments for multi-level subdomains.

Method 2: Using tldextract (The Smart Address Book)

Think of tldextract as a smart address book. It knows exactly which part of the website is the root domain and which is the subdomain.

Syntax:

ext = extract('url')
  • extract('url'): Breaks the URL into subdomain, domain, and suffix (TLD).

Example:

  1. Install the library:
pip install tldextract
  1. Use the library in your code:
import tldextract

url = "https://www.syntaxscenarios.com/category/python"
extracted = tldextract.extract(url)
subdomain = extracted.subdomain  # 'www'
root_domain = f"{extracted.domain}.{extracted.suffix}"  # 'syntaxscenarios.com'
print("Subdomain:", subdomain)
print("Root Domain:", root_domain)

Output:

Subdomain: www  
Root Domain: syntaxscenarios.com

Why Use tldextract?

  • It’s accurate and handles complex domains like .co.uk or .gov.in.
  • It’s beginner-friendly and requires minimal code.

Method 3: Using publicsuffixlist (The Detailed Map)

This method uses the Public Suffix List to correctly identify TLDs. Think of it as a detailed map that includes every tiny detail.

Example:

  1. Install the library:
pip install publicsuffixlist

2. Use the library in your code:

from urllib.parse import urlparse
from publicsuffixlist import PublicSuffixList

psl = PublicSuffixList()
url = "https://www.syntaxscenarios.com/category/python"

parsed_url = urlparse(url)
host = parsed_url.hostname  # 'www.syntaxscenarios.com'
tld = psl.publicsuffix(host)  # 'com'
parts = host.rsplit("." + tld, 1)[0].split('.')
subdomain = ".".join(parts[:-1])  # 'www'
root_domain = parts[-1]  # 'syntaxscenarios'

print("Subdomain:", subdomain)
print("Root Domain:", root_domain)
print("TLD:", tld)

Output:

Subdomain: www  
Root Domain: syntaxscenarios
TLD: com

Why Use publicsuffixlist?

  • It handles multi-part TLDs correctly.
  • Requires an updated suffix list for accuracy.

Method 4: Using Regex (The Detective Approach)

Think of regex as a detective with a magnifying glass. It looks for patterns in the URL and extracts the root and subdomains.

Syntax:

match = re.match(r"regex_pattern", url)
  • re.match(r"regex_pattern", url): Tries to match the url string from the beginning using the specified regular expression pattern.
  • r"regex_pattern": A raw string containing the regex pattern to search for.
  • match: Stores the matched object if a match is found; otherwise, it returns None.

Example:

import re

url = "https://www.syntaxscenarios.com/category/python"
match = re.match(r"https?://(?:www\.)?(([^.]+)\.([^.]+\.[a-z]+))", url)

if match:
    subdomain = match.group(2)  # 'www'
    root_domain = match.group(3).split('.')[0]  # 'syntaxscenarios'
    print("Subdomain:", subdomain)
    print("Root Domain:", root_domain)

Output:

Subdomain: www  
Root Domain: syntaxscenarios

Limitations:

  • It may break on complex URLs.
  • Not always accurate for multi-level subdomains.

Handling Complex Addresses

Now the real challenge lies in handling multi-level subdomains (e.g., blog.syntaxscenarios.com) or country-code domains (e.g., syntaxscenarios.co.uk) which are not just simple TLDs. Here is a beginner-friendly approach to get the root and the sub domain from complex URL:

Steps:

  1. Use tldextract for accurate extraction.
  2. Add custom logic to handle multi-level subdomains.

Example:

import tldextract

def extract_domain_info(url):
    extracted = tldextract.extract(url)
    root_domain = f"{extracted.domain}.{extracted.suffix}"  # Neighborhood
    subdomain = extracted.subdomain  # Building

    # Handle multi-level subdomains
    if subdomain.count('.') > 0:
        subdomain_parts = subdomain.split('.')
        subdomain = '.'.join(subdomain_parts)  # Keep all subdomain levels
    return root_domain, subdomain

url = "https://syntaxscenarios.co.uk/"
root_domain, subdomain = extract_domain_info(url)

print("Root Domain:", root_domain) 
print("Subdomain:", subdomain)      # Output: (empty, as there is no subdomain)

Output:

Root Domain: syntaxscenarios.co.uk  
Subdomain:

Why This Approach?

  • It ensures accuracy for complex addresses.
  • It retains all levels of subdomains for further analysis.

Which Method Should You Use?

MethodBest ForAccuracyHandles Multi-level TLDs?Requires External Library?
tldextractReal-world domains✅ HighYesYes (tldextract)
urllib.parseQuick, simple domains⚠️ MediumNoNo
publicsuffixlistMulti-level TLDs👍 Very GoodYesYes (publicsuffixlist)
RegexFast, simple domains⚠️ LowNoNo

Final Thoughts

Extracting the root domain and subdomain from a URL is important for tasks like web scraping and SEO. While there are many ways to do this like urllib.parse, publicsuffixlist, and regex, tldextract is the easiest and most accurate, especially for tricky domains like .co.uk. For more complex cases, you can add custom logic to tldextract to handle multi-level subdomains. These methods will help you work with URLs easily whether you’re new to coding or an expert.

For more engaging and informative articles on Python, check out our Python tutorial series at Syntax Scenarios.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top