Our blog

16.05.2013 22:21

World's largest events database could predict conflict

Source: www.newscientist.com
Tags future, unstructured data
by Douglas Heaven

A database of over 200 million global events could help understand and forecast how conflicts will play out

WHEN will the civil war in Syria subside? Will there be fighting on the Korean peninsula? The answers defy the best human minds, but there may be a tool to help them: the world's largest database of geopolitical events has been released. And as it is refined, it could make forecasting events, and conflicts in particular, as common as predicting the weather.

The Global Data on Events, Location and Tone data set (GDELT) contains nearly a quarter of a billion events going back to 1979 and hoovers up 100,000 new events every day. Its software scours media sources such as the Associated Press, Agence France-Presse and Xinhua, the main news agency in China. Collectively, the sources monitored cover every country in the world.

The software automatically extracts information from these news reports, then uses natural-language processing to turn it into data points. For example, if a report contains the line "Sudanese students and police fought in the Egyptian capital", it codes the event as "SUDEDU fought COP". Next, the system finds the nearest mention of a city or locality in the text (in this case, Cairo) and adds its latitude and longitude to the event data. The system can recognise different phrasings of who did what to whom and where. This helps it avoid duplicating events when they are mentioned in several news reports.
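To make the coding step concrete, here is a toy sketch in Python. It is not GDELT's software: the lookup tables, the `code_event` function and the rule of treating the first two actors mentioned as source and target are all assumptions made up for this illustration, and the coordinates for Cairo are approximate. It only shows the general idea of mapping phrases to codes, attaching the nearest place mention, and letting differently worded reports collapse into one event.

```python
from typing import NamedTuple, Optional

# Hypothetical lookup tables, invented for this illustration only.
ACTOR_CODES = {"sudanese students": "SUDEDU", "police": "COP"}
VERBS = {"fought": "fought", "battled": "fought", "clashed with": "fought"}
GAZETTEER = {  # place phrase -> approximate (latitude, longitude)
    "cairo": (30.04, 31.24),
    "egyptian capital": (30.04, 31.24),
}


class Event(NamedTuple):
    source: str
    action: str
    target: str
    lat: Optional[float]
    lon: Optional[float]


def code_event(sentence: str) -> Optional[Event]:
    """Turn one sentence into a coded event, or None if it cannot be coded."""
    text = sentence.lower()
    # Actors in order of appearance: the first is the source, the second the target.
    actors = sorted((text.find(n), c) for n, c in ACTOR_CODES.items() if n in text)
    verbs = sorted((text.find(p), v) for p, v in VERBS.items() if p in text)
    if len(actors) < 2 or not verbs:
        return None
    verb_pos, action = verbs[0]
    # Pick the place mention closest to the verb, if there is one.
    places = [(abs(text.find(p) - verb_pos), xy) for p, xy in GAZETTEER.items() if p in text]
    lat, lon = min(places)[1] if places else (None, None)
    return Event(actors[0][1], action, actors[1][1], lat, lon)


reports = [
    "Sudanese students and police fought in the Egyptian capital",
    "Sudanese students battled police in Cairo",
]
# Both phrasings code to the same tuple, so a set keeps a single event.
unique_events = {code_event(r) for r in reports}
print(unique_events)
```

Because both reports reduce to the same coded tuple, storing events in a set is enough to drop the duplicate. A real system would need far larger dictionaries and proper parsing to handle the variety of phrasing found in news text.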

"The size and scope make this data set unique," says Kalev Leetaru at the University of Illinois at Urbana-Champaign. "Nobody had ever constructed a global event database over a long time frame." Leetaru and co-developer Paul Schrodt at Pennsylvania State University have plans to extend the data set back to 1800.

Jay Yonamine, who worked on an analysis of GDELT data at Penn State, calls it a "breakthrough data set". As part of his doctoral thesis, Yonamine used a version of a machine learning algorithm more commonly used for financial forecasting to predict the course of the Afghanistan conflict, which has raged since 2001.

Yonamine fed data on the conflict up until 2008 into the algorithm. It works by applying a statistical model to a series of data points over a period of time and extrapolating that pattern. He found that it accurately tracked the spread of violence across the country's 317 districts month by month between 2008 and 2012. His system made successful predictions for which districts would witness violent events in 47 out of the 48 months.
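The article does not name the model Yonamine used, so the following Python sketch is only a generic illustration of "fit a pattern to past data points, then extrapolate it". It fits a straight-line trend by ordinary least squares to a series of monthly event counts and extends that line forward. The function names and the sample counts are hypothetical, and a straight line is a deliberately simple stand-in for whatever statistical model was actually applied.

```python
from typing import List, Tuple


def fit_trend(series: List[float]) -> Tuple[float, float]:
    """Least-squares line through (0, y0), (1, y1), ...; returns (slope, intercept)."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def forecast(series: List[float], steps: int) -> List[float]:
    """Extend the fitted line `steps` periods beyond the end of the series."""
    slope, intercept = fit_trend(series)
    n = len(series)
    return [intercept + slope * (n + k) for k in range(steps)]


# Hypothetical monthly counts of violent events in one district.
monthly_counts = [3, 5, 4, 6, 8, 7]
print(forecast(monthly_counts, steps=3))
```

Running one such forecast per district and per month, then comparing it with what actually happened, is one way a month-by-month evaluation like the one described above could be organised.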

Better statistical models should improve the results further. Yonamine says that a predictive model updated daily could be used by Afghan businesses to choose the safest route to transport goods, for example.

Extracting information automatically is essential, says Leetaru. "A protest is a very human thing," he says. "But if you want to look at a pattern, you have to quantify that." The system could also let you capture trends, such as the mood darkening in a region before the situation boils over. "There is simply no way for a human to take in everything that happened in Egypt and make sense of it," he says, referring to the escalating protests that ultimately led to the ousting of President Hosni Mubarak in 2011.

"I'm very optimistic about big data," says Nils Weidmann at the University of Konstanz in Germany. "But the strongest predictor of violence is previous violence. The real challenge is predicting new violence."

Weidmann thinks we will not have truly useful event forecasters until it is possible to mine data from social networks and other informal sources. Mainstream media tends to cover events only after they have happened. "Big data needs to go deeper," he says.

That's not likely to be easy. On-the-ground information is often sparse in unstable situations. However, in the case of the Syrian civil war, GDELT's combination of geographical data and a diverse collection of news sources allowed New Scientist to give a broad look at how the conflict has swept through the country since its inception in 2011 (see "Charting Syria's civil war").

As GDELT is refined and the time period it covers expands, its value is likely to go beyond questions of international policy. The financial world, for example, is increasingly relying on analyses of huge tranches of information.

That information can come from seemingly unlikely sources, such as those throwaway terms users put into search engines. Google recently opened up its records on what people are searching for and how it changes over time. For example, between 2004 and 2011 there was an increase in finance-related terms such as "debt", "dow jones" and "unemployment". Tobias Preis at the University of Warwick, UK, and colleagues analysed this data and identified patterns that they believe could be used as early warning signs of a future financial crisis.