Knowing which software someone runs can be used against them. Usually, an attacker needs sophisticated techniques to guess which software is behind a website: requests that trigger a 404, or malformed queries that expose a detailed stack trace or a characteristic page organisation. It turns out that the number of requests required to detect AEM is one. In this blog post, I leverage that fact and present a technique for discovering Adobe Experience Manager using a regular web crawler that visits random pages.
Adobe Experience Manager Recognised Easily
AEM is a CMS with a strict path organisation, covering how assets, pages, and scripts are stored. In the default configuration, all pages are stored under
/etc/designs... for older versions) and assets under
/content/dam. Building on this fact, I set out to find which websites on the public Internet use this CMS.
Value of Information
How can this knowledge be used? Organisations that use AEM tend to have high credibility, visibility, and wealth. Their AEM instances usually serve informational websites rather than the organisation’s core activity. Even then, disruptions such as increased page load time may become part of a multi-dimensional operation against the target. If the attacker knows upfront that the target runs AEM, it is certainly easier to prepare a set of attacks for the website. And if this is extremely easy to find out, the attacker can pick targets based on that fact alone.
From another perspective, AEM vendors compete for every corner of the AEM world, which is very natural in this market. With a list of AEM websites, they can discover potential customers. I consider this a slight market advantage. The race never ends.
AEM Hacker is a project for automatically discovering AEM vulnerabilities, given a website name. Developers and testers often use the tool to verify a website’s security. One of its scripts can detect AEM by making special HTTP requests.
In this article, however, I would like to perform a similar discovery by reading only the responses the website is designed to serve. The attacker makes a small number of entirely legitimate requests and therefore cannot easily be detected at this stage.
I use AEM Hacker to check the state of an existing website.
The hypothesis is:
Paths to assets, scripts or websites in HTML disclose the fact that AEM serves the tested page
To test it, the program has to crawl the Internet, fetch random pages that return HTTP 200, inspect their bodies, and mark those that contain the specific paths.
Therefore, the highest-level algorithm is as follows:
- read the website
- check if it has any links to
- if the above is true, save the website’s hostname
- move to another page (jump to the beginning if possible)
- return list of saved hostnames - here’s the list of AEM instances
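The steps above can be sketched in plain Python as a breadth-first walk over links. This is only an illustration of the loop, not the program used in this post: the names `crawl`, `fetch`, `is_aem` and `max_pages` are hypothetical, and both the page fetcher and the AEM check are passed in as functions so the loop itself stays independent of any HTTP library.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Naive link extraction, good enough for a sketch (a real crawler would parse HTML).
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def crawl(start_url, fetch, is_aem, max_pages=100):
    """Walk links breadth-first and return hostnames whose pages pass is_aem.

    fetch(url) returns the HTML body or None; is_aem(html) returns a bool.
    Both are supplied by the caller (hypothetical helpers).
    """
    queue = deque([start_url])
    visited = set()
    hosts = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)                      # read the website
        if html is None:
            continue
        host = urlparse(url).hostname
        if is_aem(html) and host not in hosts:
            hosts.append(host)                 # save the hostname once
        for href in HREF.findall(html):        # move to other pages
            next_url = urljoin(url, href)
            if next_url.startswith(("http://", "https://")) and next_url not in visited:
                queue.append(next_url)
    return hosts                               # the list of suspected AEM instances
```

Because the targets are discovered while crawling, the only inputs are a starting page and a page limit.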
Scraping websites is a form of graph exploration: the targets are unknown upfront and are discovered during crawling. To crawl the Internet, I decided to use Scrapy, which shares a tech stack with AEM Hacker (a Python executable).
Testing The Hypothesis
This simple program checks whether a website runs AEM or not. For my test list of websites, the accuracy was 100%.
This example shows that the check relies only on a regular expression that searches for the link attribute across the entire HTML.
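A minimal sketch of such a regex-based check is shown here, using only the standard library. It is not the original program: the pattern is an assumption built from the two paths named earlier (`/content/dam` and `/etc/designs`), and the names `AEM_LINK`, `find_aem_paths`, `looks_like_aem` and `check_url` are hypothetical.

```python
import re
import urllib.request

# Assumed pattern: an href/src attribute whose value contains an AEM-style path.
AEM_LINK = re.compile(
    r"""(?:href|src)\s*=\s*["']([^"']*/(?:content/dam|etc/designs)/[^"']*)["']""",
    re.IGNORECASE,
)


def find_aem_paths(html):
    """Return every attribute value in the HTML that contains an AEM-style path."""
    return AEM_LINK.findall(html)


def looks_like_aem(html):
    """True when at least one link in the HTML points at an AEM-style path."""
    return AEM_LINK.search(html) is not None


def check_url(url, timeout=10.0):
    """Fetch one page with a single ordinary GET request and run the check."""
    request = urllib.request.Request(url, headers={"User-Agent": "aem-check-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return looks_like_aem(html)
```

Calling `check_url` on a site's home page costs exactly one request, which is the point of the hypothesis. A pattern this simple can produce false positives, for example on pages that merely link to assets hosted on someone else's AEM instance.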
Implementing The Actual Program
The implementation follows Scrapy’s architecture, which requires the code to run as a class derived from Spider.
Execute this code using Scrapy’s environment by:
result.jl contains newline-separated JSON objects with the crawling results. Example:
I manually validated the entries before moving any further. They were valid links, either to scripts or to assets, with AEM-specific paths. Then I wrote an additional script that combines them into a list of hostnames, using a single map to remove duplicates:
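A possible shape for such a deduplication script is sketched below. It assumes each line of result.jl is a JSON object with a `url` field holding the matched link; that field name, like the function name `collect_hostnames`, is an assumption and not taken from the original results.

```python
import json
import sys
from urllib.parse import urlparse


def collect_hostnames(lines, field="url"):
    """Map each hostname to the first URL seen for it, skipping blank lines."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        url = record.get(field, "")
        host = urlparse(url).hostname  # lower-cased, without port or path
        if host and host not in seen:
            seen[host] = url
    return seen


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python hostnames.py result.jl  (hostnames.py is a hypothetical name)
    with open(sys.argv[1], encoding="utf-8") as handle:
        for hostname in collect_hostnames(handle):
            print(hostname)
```

A dictionary keeps the hostnames in the order they were first seen and retains one example URL per host, which makes a later manual review easier.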
After running the script for 30 minutes, I had discovered the following list of websites:
These websites are pretty solid, AEM 6.5 and earlier versions. Can you find those you have been developing or maintaining?
This script is rudimentary, but it reveals a minor vector: with little effort, anyone can trace and list AEM instances almost instantly, and exploit that knowledge once it is worth doing so.
To discover more instances, one can:
- start from a better website than The Economist (like some Fortune 500s summary with links to these companies)
- raise the limit of pages visited
- add more accurate checks - I would think of using AEM-specific libraries, like Granite
- use a proper infrastructure to crawl the whole Internet
Dozens of AEM instances create potential for further research into the practices used. A researcher can find out which components are popular, what approaches a development team has taken, etc.
I would love to see proper quantitative studies created to improve overall AEM development, technical excellence and the current state of the market, to enhance the workforce and focus on the fields that matter the most. Independent conclusions will improve the state of knowledge and the future development of a CMS that seems to conquer more market share every year.