A team of scientists from South Korea developed an AI model trained exclusively with data from the eerie deep recesses of the internet, the Dark Web. Dubbed DarkBERT, the model can be used to identify and flag cybersecurity threats, including ransomware and data leaks.
The researchers demonstrated that DarkBERT can also be used to crawl through multiple dark web forums and monitor them for any exchange of harmful content, according to a preliminary version of the study published on arXiv.
Unlike multi-purpose chatbots such as ChatGPT or Bard, the new AI is designed to analyze and produce answers based on a specific dataset, according to the Korea Advanced Institute of Science and Technology team, which worked with data intelligence organization S2W.
🎉 Exciting news! Our talented AI team researchers have just had their paper accepted at ACL 2023, a top conference in the field! 📚🤖
— S2W (@S2W_Official) May 18, 2023
DarkBERT crawls dark parts of the internet
The dark web is a hidden part of the internet that is often used for illegal activities, such as drug trafficking, weapons sales, and human trafficking. The dark web is usually not indexed by search engines like Google and can only be accessed using special software, such as Tor.
Researchers used the Tor network to let their large language model DarkBERT comb through vast amounts of raw data on the dark web. The data included material from sites dealing with cryptocurrency, hacking, and pornography. Over 1,000 pages of the dataset consisted of pornography.
DarkBERT was trained on this data after it had been filtered for sensitive material such as illicit images, the names of victim organizations, and details of leaked user information. The AI is built upon the BERT framework developed by Google and later refined by Facebook into RoBERTa.
According to the Korean researchers, DarkBERT was meant to show whether using the dark web as a dataset would help AI tools better understand the kind of language used in those environments. It did, they said, and it performed better than Google’s or Facebook’s versions.
“Our evaluation results show that DarkBERT-based classification model outperforms that of known pretrained language models,” the researchers wrote in their paper.
“…our automated web crawler takes the approach of removing any non-text media and only stores raw text data. By doing so, we do not expose ourselves to any sensitive media that is potentially illegal,” added the study.
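The text-only approach the researchers describe can be illustrated with a minimal sketch. This is not the team's crawler, which the study excerpt above does not detail. It assumes a page's HTML has already been fetched, and the names `TextOnlyExtractor`, `extract_raw_text`, and `SKIPPED_TAGS` are hypothetical, chosen only for this example. It uses Python's standard `html.parser` module to keep visible text and discard everything else.

```python
from html.parser import HTMLParser

# Container tags whose contents are never stored. This list is an assumption
# made for illustration, not taken from the study.
SKIPPED_TAGS = {"script", "style", "svg", "video", "audio", "canvas", "object"}


class TextOnlyExtractor(HTMLParser):
    """Collects visible text and ignores media, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped container tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Attributes (image sources, alt text, links) are never stored.
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is outside every skipped container.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        return " ".join(self._chunks)


def extract_raw_text(html):
    """Return only the raw text of an HTML page (hypothetical helper)."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<p>Leaked database for sale</p>"
        '<img src="x.jpg" alt="photo">'
        '<video><source src="a.mp4">fallback</video>'
        "<script>alert(1)</script>"
        "<p>Contact via forum</p>"
        "</body></html>"
    )
    print(extract_raw_text(page))
```

Because only text nodes are collected, image and video files are never downloaded or written to disk in this sketch. A real crawler would also need to handle fetching over Tor, character encodings, and malformed markup.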
No public access
Despite its curious name, the team says DarkBERT could be used to detect websites that sell ransomware or leaked private data. It could also make it easier for security researchers and law enforcement to identify and track down criminals who operate on the dark web.
DarkBERT will not be made available to the public anytime soon because of the potentially dangerous nature of dark web materials. But the researchers said those looking to use the AI model for academic purposes can request access.