Need web data? Here’s how to harvest them

When Ensheng Dong co-created Johns Hopkins University’s COVID-19 dashboard in January 2020, it was a labor of love. Dong, a systems engineer at the university in Baltimore, Maryland, had friends and family in China, including some in Wuhan, the site of the first outbreak. “I really wanted to see what was happening in their environment,” he says. So Dong started collecting public health data from the cities known to be affected.

Initially, the work was manual. But as the outbreak turned into a pandemic and the COVID-19 dashboard became the go-to resource for governments and scientists seeking information about the spread of the disease, Dong and his colleagues struggled to keep up. In the United States alone, the team tracked medical reports from more than 3,000 counties, he says. “We were updating at least three to four times a day,” he recalls, and there was no way the team could manually keep up with that relentless pace. Fortunately, he and his graduate advisor, systems engineer Lauren Gardner, found a more scalable solution: web scraping.

Scraping algorithms extract relevant information from websites and report it in a spreadsheet or other user-friendly format. Dong and his colleagues developed a system that could capture COVID-19 data from around the world and update the numbers without human intervention. “For the first time in human history, we can monitor in real time what is going on with a global pandemic,” he says.

Similar tools collect data in different disciplines. Alex Luscombe, a criminologist at the University of Toronto in Canada, uses scraping to track Canadian law enforcement practices; Phill Cassey, a conservation biologist at the University of Adelaide, Australia, follows the global wildlife trade on internet forums; and Georgia Richards, an epidemiologist at the University of Oxford, UK, scans coroner’s reports for preventable causes of death. The technical proficiency required is not trivial, nor is it overwhelming – and the benefits can be enormous, allowing researchers to quickly collect large amounts of data without the errors inherent in manual transcription. “There are so many resources and so much information available online,” Richards says. “It just sits there waiting for someone to come and take advantage of it.”

Get the goods

Modern web browsers are so polished that it’s easy to forget their underlying complexities. Websites combine code written in languages such as HTML and JavaScript to determine where different text and visual elements will appear on the page, including both “static” (fixed) content and “dynamic” content that changes in response to user actions.

Some scientific databases, such as PubMed, and social networks, such as Twitter, offer Application Programming Interfaces (APIs) that provide controlled access to this data. But for other sites, what you see is what you get, and the only way to turn website data into something you can work with is to laboriously copy the visible text, images, and embedded files. Even if an API exists, websites can limit what data can be obtained and how often.

Scrapers offer an efficient alternative. After being “trained” to focus on certain elements on the page, these programs can collect data manually or automatically, and even on a schedule. Commercial tools and services often include user-friendly interfaces that simplify the selection of web page elements to target. Some, such as the Web Scraper or Data Miner web browser extensions, allow manual or automated scraping of small numbers of pages for free. But scaling can get pricey: services like Mozenda and ScrapeSimple charge a minimum of US$250 per month for scraping-based projects. These tools may also lack the flexibility needed to tackle different websites.

As a result, many academics prefer open-source alternatives. The Beautiful Soup package, which extracts information from HTML and XML files, and Selenium, which can also handle dynamic JavaScript content, are compatible with the Python programming language; rvest and RSelenium provide analogous functionality for the R language. But these software libraries typically provide only the building blocks; researchers have to adapt their code for each website. “We worked with some of the tools already in existence and then adapted them,” Cassey says of the scrapers he developed. “They have become more and more custom-made over time.”
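To give a feel for what those building blocks look like, here is a minimal sketch of pulling a table out of raw HTML. It is not code from any of the projects described here. To stay dependency-free it uses Python's built-in html.parser module as a stand-in for Beautiful Soup, and the page fragment, the TableScraper class and the scrape_table function are all hypothetical names invented for the illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table id="cases">
  <tr><th>Region</th><th>Cases</th></tr>
  <tr><td>North</td><td>120</td></tr>
  <tr><td>South</td><td>85</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows, each a list of cell strings
        self._row = None       # row currently being read, or None
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
print(records)
```

The site-specific part is the choice of tags to watch, which is exactly what has to be adapted for each new website.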

Crack the code

Simple web scraping projects require relatively modest coding skills. Richards says her team solves most problems “by Googling how to fix an error.” But understanding the basics of web design and coding gives a valuable advantage, she adds.

“I mostly use developer mode now,” says Luscombe, referring to the browser setting that lets users strip away a website’s familiar facade to get at the raw HTML and other programming code underneath. But there are tools that can help, including the SelectorGadget browser extension, which provides a user-friendly interface to identify the ‘tags’ associated with specific website elements.

The complexity of a scraping project is largely determined by the target site. Forums usually have fairly standard layouts, and a scraper that works on one can be easily modified for another. But other sites are more problematic. Cassey and his colleagues monitor sales of plants and animals that are either illegal or potentially harmful to the environment, and forums hosting such transactions may appear and disappear without warning, or change their design. “They’re usually much more fickle to try to limit the ease with which off-the-shelf web scrapers can just get through and collect information,” Cassey says. Other websites may contain encrypted HTML elements or complex dynamic functions that are difficult to decipher. Even sloppy web design can sabotage a scraping project, a problem Luscombe often struggles with when scraping government-run websites.

The desired data may not be available as HTML-encoded text. Chaowei Yang, a geospatial researcher at George Mason University in Fairfax, Virginia, oversaw the development of the COVID-Scraper tool, which pulls data on pandemic cases and deaths from around the world. He notes that in some jurisdictions these data are locked up in PDF documents and JPEG image files, which cannot be mined with conventional scraping tools. “We had to find the tools that can read the datasets, and also find local volunteers to help us,” Yang says.

Data due diligence

Once you know how to scrape your target site, you should think about how to do it ethically.

Websites typically specify terms of service that contain rules for collecting and reusing data. These are often permissive, but not always: Luscombe thinks some sites use their terms to block good-faith research. “I work against countless powerful criminal justice agencies that really have no interest in having data on the race of the people they arrest,” he says.

Many websites also provide ‘robots.txt’ files, which specify acceptable operating conditions for scrapers. These are designed in part to prevent automated queries from overwhelming the servers, but generally leave room for routine data collection. Respecting these rules is considered best practice, even if it lengthens the scraping process, for example by building in delays between page requests. “We don’t extract things faster than a user would,” Cassey says. Researchers can also minimize server traffic by scheduling scraping tasks during off-peak hours, such as the middle of the night.
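As a rough sketch of what respecting those rules can look like in practice, the snippet below uses Python's standard urllib.robotparser module to check which pages a scraper may visit and how long to pause between requests. The robots.txt content, the example.org URLs, the "research-bot" agent name and the crawl function are all made up for illustration. The download step is passed in as a placeholder function, so nothing here touches a real server.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from
# the target site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_text):
    """Parse robots.txt text into a rule object we can query."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def crawl(rules, urls, fetch, agent="research-bot",
          default_delay=1.0, sleep=time.sleep):
    """Fetch only the permitted URLs, pausing after each request.

    `fetch` is whatever function downloads a page; `sleep` can be swapped
    out so the pacing logic can be checked without actually waiting.
    """
    # Use the site's requested delay if it states one, else our own default.
    delay = rules.crawl_delay(agent) or default_delay
    results = {}
    for url in urls:
        if not rules.can_fetch(agent, url):
            continue  # the site asked robots to stay out of this path
        results[url] = fetch(url)
        sleep(delay)
    return results


rules = build_rules(ROBOTS_TXT)
urls = [
    "https://example.org/data/page1",
    "https://example.org/private/records",
    "https://example.org/data/page2",
]
waits = []  # records each pause instead of really sleeping
pages = crawl(rules, urls, fetch=lambda url: f"<html>{url}</html>",
              sleep=waits.append)
print(list(pages))
```

In a real run the default time.sleep would be left in place, and the job could be scheduled for off-peak hours with the operating system's task scheduler.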

If personally identifiable information is collected, additional precautions may be required. Researchers led by Cedric Bousquet of the University Hospital of Saint-Étienne in France developed a tool called Vigi4Med, which scrapes medical forums to identify drug-related side effects that might have escaped attention during clinical trials. “We anonymized the user IDs and separated them from the other data,” says Bissan Audeh, who helped develop the tool as a postdoctoral researcher in Bousquet’s lab. “The team working on data annotation didn’t have access to those usernames.” But context cues from online messages can still allow anonymized users to be re-identified, she says. “No anonymization is perfect.”
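One common way to achieve that kind of separation is to replace each username with a keyed hash and keep the lookup table apart from the data that annotators see. The sketch below illustrates that general idea only; it is not a description of how Vigi4Med works, and the pseudonymize function, the field names and the sample posts are invented for the example.

```python
import hashlib
import hmac
import secrets


def pseudonymize(posts, key):
    """Split posts into annotation rows and a separate username lookup.

    Each username is replaced by a keyed hash (HMAC-SHA256), so the same
    user always gets the same pseudonym, but the pseudonym cannot be
    reproduced by someone who does not hold the key.
    """
    annotation_rows = []  # safe to hand to the annotation team
    lookup = {}           # pseudonym -> real username; store separately
    for post in posts:
        digest = hmac.new(key, post["user"].encode("utf-8"),
                          hashlib.sha256).hexdigest()
        pseudonym = "user_" + digest[:12]
        lookup[pseudonym] = post["user"]
        annotation_rows.append({"author": pseudonym, "text": post["text"]})
    return annotation_rows, lookup


# The key is generated once and kept with the lookup table, not the data.
SECRET_KEY = secrets.token_bytes(32)

# Invented forum posts for illustration.
posts = [
    {"user": "night_owl_42", "text": "Felt dizzy after the second dose."},
    {"user": "gardener_sam", "text": "No problems so far."},
    {"user": "night_owl_42", "text": "Dizziness gone after a week."},
]

rows, lookup = pseudonymize(posts, SECRET_KEY)
for row in rows:
    print(row["author"], "|", row["text"])
```

This protects only the username field. As Audeh points out, the message text itself can still give a person away.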

Order out of chaos

Scraping projects do not end when harvesting is complete. “All of a sudden you’re dealing with huge amounts of unstructured data,” says Cassey. “It’s becoming more of a data processing problem than a data acquisition problem.”

For example, the Johns Hopkins COVID dashboard requires careful fact-checking to ensure accuracy. The team eventually developed an anomaly detection system that signals unlikely shifts in numbers. “Suppose a small province that used to report 100 cases a day may be reporting 10,000 cases,” Dong says. “It could happen, but it’s very unlikely.” Such cases lead to a closer inspection of the underlying data – a task that relies on a small army of multilingual volunteers who can decipher each country’s COVID-19 reports. Even something as simple as a typo or a change in date formatting can blow up a data analysis pipeline.
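The article does not say how the Johns Hopkins anomaly detector works, so the following is only a toy version of the idea Dong describes: compare each day's count with the recent norm and flag anything wildly out of line for a human to check. The flag_anomalies function, the seven-day window and the tenfold threshold are assumptions chosen for the illustration.

```python
from statistics import median


def flag_anomalies(daily_counts, window=7, max_ratio=10.0):
    """Return the indices of days whose count looks implausible.

    A day is flagged if its count is negative, or more than `max_ratio`
    times the median of the previous `window` days. The median is used
    so that one earlier spike does not distort the baseline.
    """
    flagged = []
    for i in range(window, len(daily_counts)):
        baseline = median(daily_counts[i - window:i])
        today = daily_counts[i]
        # max(baseline, 1) avoids flagging every day after a run of zeros.
        if today < 0 or today > max_ratio * max(baseline, 1):
            flagged.append(i)
    return flagged


# Invented series: about 100 cases a day, then a sudden report of 10,000.
counts = [100, 95, 110, 105, 98, 102, 97, 10000, 104]
print(flag_anomalies(counts))  # [7]
```

A flag here means only "look closer". Whether the jump is a typo, a backlog dump or a real surge still has to be settled by someone reading the source report.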

For Cassey’s wildlife-tracking application, the challenge is determining which species are actually being sold and whether those transactions are legal. Sellers who know they are breaking the law will often disguise transactions with deliberately misleading names or street names for plants and animals, much as online drug dealers do. For one particular parrot species, for example, the team found 28 “trade names,” he says. “A lot of fuzzy data matching and natural language processing tools are required.”
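A very small taste of fuzzy matching can be had with Python's standard difflib module, which scores how similar two strings are. The sketch below is not the team's pipeline: the TRADE_NAMES table, the normalize and match_species helpers and the 0.75 similarity cutoff are assumptions made for illustration, and a real system would need a far larger name list and more sophisticated language tools.

```python
from difflib import get_close_matches

# Tiny illustrative lookup from known trade names to scientific names.
TRADE_NAMES = {
    "african grey": "Psittacus erithacus",
    "congo grey": "Psittacus erithacus",
    "sulphur crested cockatoo": "Cacatua galerita",
}


def normalize(text):
    """Lower-case, turn hyphens into spaces and collapse repeated spaces."""
    return " ".join(text.lower().replace("-", " ").split())


def match_species(listing_name, cutoff=0.75):
    """Return the scientific name for the closest known trade name.

    Returns None when nothing is similar enough, so that unknown names
    can be passed to a person rather than guessed at.
    """
    hits = get_close_matches(normalize(listing_name), list(TRADE_NAMES),
                             n=1, cutoff=cutoff)
    return TRADE_NAMES[hits[0]] if hits else None


for name in ["African-Gray", "Sulphur-Crested  Cockatoo", "garden hose"]:
    print(name, "->", match_species(name))
```

The cutoff is a trade-off: lower values catch more misspellings and variants but also produce more false matches that need manual review.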

Still, Richards says would-be scrapers shouldn’t be afraid to explore. Start by repurposing an existing web scraper. Richards’s team built its software for analyzing coroners’ reports by adapting a colleague’s tool for clinical-trial data. “There are so many platforms and there are so many online resources,” she says. “Just because you don’t have a coworker who’s done web scraping before, don’t let that stop you from giving it a try.”
