Data Collection and Prediction of Urban Transport Flow using Neural Networks

Abstract: Smart cities can use artificial neural networks to provide more accurate information about public transportation schedules and thus help the population plan their day-to-day activities. In this context, this paper describes the essential steps for acquiring and processing data and for creating a neural network model capable of predicting possible delays or early arrivals on bus lines in the city of Curitiba, Paraná. The neural network considers traffic, weather, time and the history of a public transport line. The article details all phases of data collection and treatment, how the information is fed into the network, and the results obtained.

I. INTRODUCTION
Urban public transport plays an important role in the current configuration of urban mobility, as a means of transport that interconnects the various regions of a city. It is an alternative that helps reduce serious problems found in cities, such as congestion, traffic accidents and environmental impacts [1].
Forecasting public transport delays can provide a useful tool that drivers and passengers could use to plan their daily tasks. This prediction can be obtained by analyzing data directly or indirectly linked to the punctuality of a line. Data collection is an important aspect of urban computing and a determining factor in building smart cities [2]. The collected information can be used to train an artificial neural network that analyzes the data and looks for connections among them in order to predict delays or early arrivals, becoming a tool that helps data analysis professionals as well as users of the bus network.
For public transport bus lines, one of the known issues is compliance with the established schedules. Because the problem is often caused by factors that cannot be controlled, it is not always possible to prevent it. Predicting these delays gives interested parties the information in advance, so that they can decide how to work around the situation [3].
As the population grows, so do the challenges for government, business and academia [4]. The analysis of data to create resources for smart cities has been the subject of several studies in both academic and business environments. Collecting and processing data in this way can be of great value to companies and users, who could use the large amount of information to plan and improve their activities, and also to the government, which could benefit from the improvement in the service provided. Research shows that the greatest causes of dissatisfaction among the Brazilian population with public transportation are problems with capillarity and frequency, slowness and frequent delays, which, according to [5], lead the population to use public transportation less.
According to [6], congestion concerns everyone. Brazilian metropolitan areas live with a nightmare that is difficult to measure: urban congestion. The feeling of wasted time in a large traffic jam is worrying, and few people know how to live with this reality naturally. In recent years, millions of people have lost money and time because of congestion [7], and the cost of car trips increases considerably during congestion [8,9]. Given this reality, this article proposes a public utility solution that informs the population of the expected delays of a bus line. The solution can reduce the problems involved and can benefit transport users, service provider companies and government entities.
This article is organized as follows. Section II discusses related work. Section III describes the data collection considered in this paper and examines the data model and how the data may or may not directly influence the final result. Section IV describes the experiments performed, how the collected data were fed to the neural network and which tool was used, and presents the results obtained and the network responses. Section V concludes the paper and presents future work.

II. RELATED WORK
The prediction of possible delays of collective public transport vehicles has already been considered in other works. Some of these studies involve computational intelligence, but the vast majority use only historical data combined with some other technique. Other studies are closer to this work and also use weather and traffic data.
The work of Maciel [3] evaluates the performance of regression algorithms on historical data for forecasting the start and end times of trips during the day. For the start and end times of the trip, the median of the errors was approximately 28 and -167 seconds, respectively. The work shows that the quality of the forecasts also changes over the course of the week: the worst results were obtained on Monday and the best on Wednesday and Thursday. The same behavior was observed across the hours of the day, where the start and end times of the usual Brazilian working day produced the most inconsistent results. The authors also considered some climate data and its influence.
The work of Moraes Filho [10] presents a project called CittaMobi, a set of solutions that aims to make public transport information available to bus users. The application provides real-time predictions of bus arrivals, the locations of the closest stops together with the lines that pass through them, and some details about each bus, for example whether or not the bus is adapted for passengers with special needs.
The work of Serafim [11], in turn, consists of an experiment in public transportation with data obtained by an observer through direct observation of bus arrivals. Data were collected for 23 days. To evaluate the punctuality of the bus, it was assumed that the process generating the random sequence of delays, early arrivals and on-time arrivals on a given day is influenced by what happened on the previous days and can therefore be described by a Markov chain. Besides the estimation, some simulated samples of these chains were also used. However, it was found that samples of up to 50 days do not contain sufficient information to detect a dependency structure, even though the practicality of modeling the variable with a Markov chain was demonstrated.
The idea of improving the information offered to public transport users, based on information provided by other individuals, was explored by Lucio [12]. That work uses collective intelligence, described as a form of distributed intelligence that is constantly improved by its users and coordinated in real time, resulting in the creation of knowledge through collaboration. It shows how the resources provided by mobility, in conjunction with collective intelligence, can be used to create Intelligent Transport Systems (ITS). In this scenario, the data required to create these intelligent systems are provided by public transportation users through their mobile devices, which builds a large collection of information about the transportation system from the users' contributions.

III. EXPERIMENTAL ANALYSIS
To predict the travel times of a public transport line in Curitiba, it is necessary to collect data about the line and additional data that may influence its route. These lines have routes that wind through the city and are heavily used by passengers, passing through terminals and streets with a large flow of people and vehicles. The additional data consist of weather information and traffic incidents, treated generically and without any specific category, both of which have a great impact on the flow of vehicles. The acquisition process is to: identify the lines and collect the real-time data and timetables of the line and of each vehicle; identify and collect weather data for the region of Curitiba, focusing on temperature, humidity, wind speed and weather description; and, finally, collect traffic data based on the line locations.
After the data are collected, the stored information is analyzed. Before it is submitted to the neural network, the information must be identified and translated. Some of the collected data are expressed in natural language. A more advanced classification based on Natural Language Processing (NLP) is not necessary, since the terms used are very limited. This classification is needed only for the weather and traffic data, to simplify data entry into the neural network model.

Data Collection
To perform the prediction, historical data are needed, especially data that may be connected with bus delays. These data were collected in several ways and formats: weather data, the history of delays of the chosen line, and data on traffic and traffic flow in the region traveled by the bus at the time of collection, among others.

Climatic Data Collection
Adverse weather conditions are reported to cause significant changes in travel decisions [13]. The relationship between weather conditions and traffic flow is addressed in [14], which shows a relationship between weather conditions and traffic speed, as well as a link between these conditions and the number of accidents. Together these change the traffic flow, as shown in Fig. 1. It is also shown in [14] that on rainy days the number of bus passengers decreases and the number of cars on the streets increases. The authors point out that the weather influences not only the number of passengers but also the time needed to complete the route and the time spent waiting for the public transport vehicle. In conclusion, precipitation, cloudiness, wind speed, high temperatures and hail can alter the intensity of traffic, which underlines the need to incorporate meteorological conditions into research directly or indirectly linked to traffic.
Weather conditions, along with brightness and visibility, and their link to traffic flow and the number of accidents are examined in a study of Orange County, California [15]. That work presents tables with data on the possible influence of weather conditions, which point to some links between traffic flow speed and weather conditions.
With this in mind, data on the weather conditions in the region where the bus circulates are necessary. Weather information was found in the form of a web service. The web service chosen for this research was the HG Weather - Weather Forecasting API [16], a project designed to disseminate information for free. In this web service, data can be obtained using the city code (WOEID), Geo IP, geolocation or the name of the city.
The mode chosen was the WOEID code which, according to [17], stands for Where On Earth Identifier and identifies each city with a specific code. For the city of Curitiba the WOEID is 455822. This way of obtaining data was chosen because it does not require an access key. The data provided by this web service are: temperature; date and time of the data update (a refresh occurs approximately every 30 minutes); a code for the current weather condition; a full description of the current weather; a short description of the current weather; whether it is currently 'day' or 'night'; the name of the city for the code entered; air humidity; wind speed; sunrise and sunset times; and the general forecast for the next few days.
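The WOEID-based request described above can be sketched as follows. This is a minimal illustration rather than the collection code used in the study: the endpoint URL and the response key names are assumptions that should be checked against the documentation of the web service [16], and no network call is made.

```python
from urllib.parse import urlencode

# Assumed base URL for the weather web service; verify against the provider's documentation.
WEATHER_ENDPOINT = "https://api.hgbrasil.com/weather"
CURITIBA_WOEID = 455822  # WOEID of Curitiba, as stated in the text


def build_weather_url(woeid: int, endpoint: str = WEATHER_ENDPOINT) -> str:
    """Return the request URL for a city identified by its WOEID."""
    return f"{endpoint}?{urlencode({'woeid': woeid})}"


def select_weather_fields(results: dict) -> dict:
    """Keep only the fields used as network inputs.

    The key names are hypothetical placeholders for the fields listed
    in the text; keys missing from the response are returned as None.
    """
    wanted = ("temp", "humidity", "wind_speed", "description", "condition_code")
    return {key: results.get(key) for key in wanted}


url = build_weather_url(CURITIBA_WOEID)
sample = {"temp": 29, "humidity": 40, "description": "Sunny"}  # illustrative values
fields = select_weather_fields(sample)
```

Keeping the field selection separate from the request makes it possible to store the raw response and decide later which values are fed to the network.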

Data Collection from the Bus Line
Bus punctuality data from different days and times should be collected for forecasting. The collection should be periodic and last long enough for results to be observed. The authors suggest two years of data, collected daily every 3 minutes so that each sample differs from the previous one and the database is large enough for useful information to be extracted. One year of data could already bring satisfactory results. This is possible because the Curitiba City Hall makes documents and government information available through web services in an initiative called the Open Data Portal [18]. The data are available in an open format that users may use and edit without restriction; they are therefore in the public domain and free to use, and are intended to support new information and digital applications for society.
The service is in its first version and provides databases from the various bodies of the Municipal Government of Curitiba. These databases are available for download through the web site or through web services with direct access. The information available for download is updated every month and can be accessed without signing a term or providing personal identification, with or without commercial purpose. Access to the web service is granted through the delivery of a document containing the user's login and password. To request the data of a certain line, it is necessary to provide the line code, which has 3 characters and can be found in the service itself. For a given line code, the following data are available: the vehicle prefix, which is the specific code of each vehicle in the network; the time of the update; latitude and longitude as floating-point values; the line prefix, which is the code entered in the request; whether the vehicle is adapted for wheelchair users (1 for yes, 0 for no); the type of bus; the timetable the vehicle is following (normal or Sundays and holidays); the status of the vehicle relative to its timetable (late, early, on time); and a counter of cycles without an update of the vehicle information. The information is updated every two minutes, and for each two-minute cycle without an update this counter is increased by 1 (up-to-date information has code 1). The line chosen for this work was 022 - Inter 2.

Traffic Data Collection
Congestion in the city slows everyone down and increases the time spent in traffic. For this reason, data on traffic and traffic flow in the region traveled by the bus are also important and should be considered.
One way to collect this data is the Bing Transit [19] web service, which also responds with a JSON file containing information about accidents or obstructions in a rectangular area bounded by two latitudes and two longitudes that represent its four sides. The area is specified in the following order: a south latitude, a west longitude, a north latitude and an east longitude. The information returned for the specified region is the time and type of each accident or obstruction (closed street, road construction, vehicle collision, fallen tree). In this work, only the number of events in the area was used.
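The rectangular area and the event count can be illustrated with the sketch below. It assumes the route is available as a list of (latitude, longitude) pairs; the coordinates and the JSON key name are hypothetical and do not reproduce the actual schema of the traffic web service.

```python
import json


def bounding_box(points):
    """Return (south, west, north, east) enclosing a list of (lat, lon) points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return min(lats), min(lons), max(lats), max(lons)


def format_area(box):
    """Serialize the box in the order south, west, north, east."""
    return ",".join(f"{value:.5f}" for value in box)


def count_events(payload: str, key: str = "incidents") -> int:
    """Count the events in a JSON payload.

    The top-level key name is a hypothetical placeholder, not the
    real response format of the web service.
    """
    return len(json.loads(payload).get(key, []))


# Illustrative coordinates (assumed, not taken from the source).
route = [(-25.45, -49.30), (-25.40, -49.25), (-25.48, -49.22)]
area = format_area(bounding_box(route))
events = count_events('{"incidents": [{"type": "collision"}, {"type": "construction"}]}')
```

Since only the number of events is used as an input, the event types and times can be discarded after counting.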
Web scraping, which was used in this research, is a way of requesting data, collecting it and analyzing it to extract the desired information by writing simple code to perform the task [20]. According to [21], web services are the de facto standard for data collection. However, there are scenarios where data are not available through web services and web scraping becomes necessary.
Web scraping can be applied to real-time map and traffic sites, since many of them show the current traffic flow at a particular location, which could be recorded at the time of collection. An example is the Google Maps tool [22], which reports the travel time between two points in real time and whether traffic is flowing slowly or quickly. In this work it was decided to use only the current travel time between the departure and arrival points of the Inter 2 line: when the time is higher than average the flow is slow, and when it is lower the flow is faster and therefore so is the bus ride.
The current travel time between the starting point and the endpoint of the line was initially acquired using Requests, a Python HTTP library that aims to make HTTP requests simpler and more human-friendly, according to its documentation [23]. One use of the library is to retrieve the HTML code of a chosen web page. Within that code the travel time appears as text, and the piece of text containing the number of minutes between the two points is extracted. This value can then be used to describe the current traffic flow in the region.
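The extraction step can be sketched as below. The HTML fragment and the text pattern are hypothetical, since the real page markup is not described in this paper; in the collection script the page text would come from an HTTP request (for instance `requests.get(url).text`), whereas a literal string is used here so that the sketch runs offline.

```python
import re
from typing import Optional

# Hypothetical pattern: a number followed by "min", as in "29 min".
MINUTES_PATTERN = re.compile(r"(\d+)\s*min")


def extract_minutes(html: str) -> Optional[int]:
    """Return the first travel time in minutes found in the page text, if any."""
    match = MINUTES_PATTERN.search(html)
    return int(match.group(1)) if match else None


def is_slow(current_minutes: int, average_minutes: float) -> bool:
    """Treat the flow as slow when the current time exceeds the average."""
    return current_minutes > average_minutes


page = '<div class="trip"><span>29 min</span></div>'  # assumed markup
minutes = extract_minutes(page)
```

Returning `None` when no time is found lets the collector skip a sample instead of storing a wrong value when the page layout changes.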

Additional Data Collection
An important input to the network is the day of the week on which the collection was performed, because on the Friday before a holiday, for example, the traffic flow is very different from that of an ordinary Tuesday. The day of the week can be obtained in Python using the calendar library. Python was chosen because it is one of the languages supported by Keras, which is used for the neural network, and because it supports all the services used, so that only one language is needed. To get the day of the week, the date is passed to a function called "weekday", which returns the day of the week for the given date. Information about special dates and holidays was obtained from a web service called "Rest-API with Holidays from all cities of Brazil" [24]. The service receives the IBGE (Brazilian Institute of Geography and Statistics) code of the chosen city and returns the national, state and municipal holidays of that city. The code of Curitiba is 4106902. When the data are collected, the current day is compared with the holidays to check three situations, and the answer is encoded in 3 bits: the first bit for the eve of a holiday, the second for the holiday itself and the last for the day after a holiday, with 1 for true and 0 for false.
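A minimal sketch of these two steps follows. It assumes the holidays have already been retrieved and stored as a set of dates; the holiday and the year used here are illustrative. The weekday value is shifted to the numbering adopted in the pre-processing step, where Sunday is 1.

```python
import calendar
from datetime import date, timedelta


def weekday_number(day: date) -> int:
    """Map a date to 1 (Sunday) through 7 (Saturday).

    calendar.weekday returns 0 for Monday through 6 for Sunday, so the
    value is shifted to a Sunday-first numbering.
    """
    return (calendar.weekday(day.year, day.month, day.day) + 1) % 7 + 1


def holiday_bits(day: date, holidays: set) -> tuple:
    """Return the (eve, holiday, day_after) flags as 0/1 integers."""
    one_day = timedelta(days=1)
    return (
        int((day + one_day) in holidays),  # the next day is a holiday
        int(day in holidays),              # the day itself is a holiday
        int((day - one_day) in holidays),  # the previous day was a holiday
    )


# Assumed example: 7 September (Brazilian Independence Day); the year is illustrative.
holidays = {date(2018, 9, 7)}
bits_on_eve = holiday_bits(date(2018, 9, 6), holidays)
```

With consecutive holidays, more than one of the three bits can be 1 for the same day, which this representation handles without special cases.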
Other data could also have an influence, such as the occurrence of large events in the region and other aspects of social life. Event information could be collected from digital newspapers in the region or from news websites.

Pre-processing of data
Some data, such as the day of the week, the weather description and the bus status, are in text format and must be converted to numbers, since the neural network model takes numbers as input. For the day of the week, Sunday became 1, Monday 2, Tuesday 3 and so on. For the weather description, all possible answers were listed and a number was assigned to each; for example, "Cloudy weather" became 1 and "Sunny" became 4. The same technique was used for the bus status, but with 3 values, each being 0 or 1 depending on the situation: late became 100, early 010 and on time 001.
An example of a collected sample is shown in Table 1, which contains the following data: the day of the week, 6 in the example, which corresponds to Friday; the day and month of collection, in this case August 24; the hour and minute of collection, 12:11; the temperature in the city at the time of collection, 29 degrees Celsius; the description of the current weather, 4, which is "sunny"; the air humidity, 40%; the condition slug, 1, which corresponds to "clear day"; a code for the weather condition (generated by the web service itself); the number of traffic events collected in the region; the holiday bits; and, lastly, the current travel time between the start and end of the line, which at that moment was 29 minutes.

After Holiday  0
Time           29
Table 1: A collection of input data taken on August 24 at 12:11.
The output consists of three bits: the first indicates whether or not the bus is late, the second whether or not it is early, and the last whether or not it is on time. No two of these bits can be 1 at once, since the bus cannot be late and early at the same time, for example. Table 2 shows an example of output data in which the condition is 100, that is, the bus was late at the time of collection.

Output Data  Value
Late         1
Early        0
On Time      0

During the first tests it was observed that it would be better to convert the qualitative data to binary form as well, since the neural network treats numbers as magnitudes to be weighted. The quantitative data were kept in decimal form. Left as integer codes, the qualitative data could suggest to the network that Monday is less than Saturday, for example, or that description 4 is greater than description 1, which is not true. The idea that should be passed to the network is different, something like "it is Monday", yes or no. The day-of-week, description, short description and condition code fields were therefore converted to a binary form with one yes/no value (1 or 0, respectively) for each possible case. The day of the week, for example, became 7 values, each corresponding to one day of the week: Monday became 1000000 and Tuesday 0100000. This format was used for all the fields mentioned.
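The binary form described above is commonly called one-hot encoding. The sketch below illustrates it for the day of the week and for the bus status; the category lists are written out by hand and follow the ordering given in the text (Monday first, and late, early, on time).

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
STATUSES = ["late", "early", "on time"]


def one_hot(value, categories):
    """Return a list with 1 at the position of value and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [int(value == category) for category in categories]


def as_bits(vector):
    """Join a one-hot vector into a string such as '1000000'."""
    return "".join(str(bit) for bit in vector)


monday = as_bits(one_hot("Monday", DAYS))
late = as_bits(one_hot("late", STATUSES))
```

Each category receives its own input, so the network learns a separate weight per category instead of interpreting the codes as ordered magnitudes.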
A first data acquisition was carried out for 3 months, only to verify the operation before continuing the collection, and resulted in 3000 samples. The data from this collection were used to create the neural network.

International Journal of Advanced Engineering Research and Science (IJAERS)

IV. EXPERIMENTS EXECUTION
To achieve the objective of predicting delays, it is necessary to predict events. Prediction means making statements about something that will happen, usually based on information about the past and the current state. Neural networks can be used for prediction and have advantages such as learning dependencies automatically from measured data alone, without the need for additional information. Moreover, the network can be trained from historical data and does not need to be represented by an explicitly given model.

Neural Network
According to [25], neural networks, or artificial neural networks, find applications in very diverse fields by virtue of their ability to learn from input data, with or without a teacher, and because they represent a technology rooted in various disciplines (such as neuroscience, mathematics, statistics, physics, computer science and engineering).
Some examples of these fields are modeling, time series analysis, pattern recognition, signal processing and control. As stated in [26], artificial neural networks can be considered a methodology for solving problems characteristic of artificial intelligence.
Neural networks are massively parallel systems composed of simple processing units that compute certain mathematical functions [27]. From a set of examples presented to them, the networks are able to generalize the assimilated knowledge to a set of unknown data. They are also able to extract non-explicit characteristics from a set of information provided to them as examples [28].

Experiment setup
Keras is described in its documentation [29] as an open source neural network library written in Python. It can run on top of tools such as Google TensorFlow [30]. Designed to enable rapid experimentation with deep neural networks, it focuses on being easy to use, modular and extensible.
TensorFlow is an open source library for numerical computation and machine learning [31], and it was used to run the neural network of this work.
To make use of these tools, a program was developed in Python, with data input and output in comma-separated values (CSV) files, a format that stores tables with values separated by commas. The number of training epochs was defined, a counter of correct and incorrect classifications was created, and an interface shows the response of the system to an input (late, early or on time).
The best result was obtained without changing the optimizer and with 10 training epochs. The model uses only two layers: the first receives the input data and has 15 neurons, and the second produces the output and has 3 neurons.
The final configuration used two layers: the first with "normal" as the kernel initializer and "relu" as the activation, and the second with the same kernel initializer and "softmax" as the activation. The model was compiled with "categorical_crossentropy" as the loss and "sgd" as the optimizer. The training settings were 10 epochs, a batch_size of 100 and verbose set to 0.
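To make the role of each setting concrete, the sketch below reproduces in plain Python the forward computation that such a configuration describes: a 15-unit layer with relu activation followed by a 3-unit layer with softmax activation, with weights drawn from a normal distribution, plus the categorical cross-entropy loss. It is not the Keras code used in the experiments; the input size, the standard deviation of the initializer and the sample values are assumptions, and training with SGD is omitted.

```python
import math
import random


def init_layer(n_inputs, n_units, rng, stddev=0.05):
    """Draw weights from a normal distribution; biases start at zero."""
    weights = [[rng.gauss(0.0, stddev) for _ in range(n_inputs)] for _ in range(n_units)]
    return weights, [0.0] * n_units


def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus a bias for each unit of the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn scores into probabilities that sum to 1."""
    peak = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(target, predicted):
    """Loss for a one-hot target and a predicted probability vector."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t)


def forward(inputs, hidden, output):
    """Apply the 15-unit relu layer, then the 3-unit softmax layer."""
    return softmax(dense(relu(dense(inputs, *hidden)), *output))


rng = random.Random(0)
N_INPUTS = 8  # assumed input size; the actual number of input columns is not stated
hidden = init_layer(N_INPUTS, 15, rng)
output = init_layer(15, 3, rng)
probabilities = forward([0.5] * N_INPUTS, hidden, output)
loss = categorical_crossentropy([1, 0, 0], probabilities)  # target "late"
```

The softmax output is what allows the three values to be read as the chances of late, early and on time, and the cross-entropy loss penalizes a low probability assigned to the observed class.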

Results obtained
After training, the network has so far reached 84.59% accuracy on the validation data and 92% on the training data. The collected data were split so that 90% were used for training and the rest for validation. When the predictions were compared with the collected results, the network gave the correct answer in 84.59% of the cases, the answer being the class with the highest percentage.
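The split and the accuracy measure can be sketched as follows. The paper does not state whether the samples were shuffled before splitting, so the ordered split used here is an assumption, and the sample values are illustrative.

```python
def split_samples(samples, train_fraction=0.9):
    """Split a list into training and validation parts, preserving order."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predicted, expected):
    """Fraction of samples whose most probable class matches the one-hot target."""
    hits = sum(argmax(p) == argmax(e) for p, e in zip(predicted, expected))
    return hits / len(expected)


train, validation = split_samples(list(range(3000)))
score = accuracy([[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]], [[0, 0, 1], [0, 1, 0]])
```

With the 3000 samples mentioned earlier, a 90% split leaves 300 samples for validation.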
The outputs are presented as probabilities. For example, a forecast was made for June 20 at 12:00 with rain, and the following results were obtained: a 23.74% chance that the bus is late, a 9.81% chance that it is early and a 66.43% chance that it is on time. The final response from the network is therefore that the bus will probably be on time on June 20 at noon. The result can be verified on the day and time in question, provided the weather conditions were predicted correctly.
In view of the results presented here, the network gives a reasonable response, with positive results in some field tests, and a larger data collection may lead to an even better response.
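Turning the three percentages into a final answer amounts to picking the class with the highest value. A minimal sketch, using the figures from the example above and assuming the output order late, early, on time:

```python
LABELS = ("late", "early", "on time")


def most_likely(probabilities, labels=LABELS):
    """Return the label with the highest probability together with that probability."""
    index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[index], probabilities[index]


label, chance = most_likely([23.74, 9.81, 66.43])
```

Returning the probability alongside the label lets an interface show how confident the forecast is rather than only the winning class.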

V. CONCLUSION
The population's satisfaction with public services is of great importance for improving the quality of life, facilitating day-to-day living and raising the level of satisfaction with the government. Public transportation has a serious problem with delays and requires methods that achieve good accuracy in their predictions. Considering this need, this work proposed an approach for the city of Curitiba focused on collecting information that may be directly related to delays. The proposed approach is based on collecting data from various moments and sources in a way that makes it possible to use neural networks for prediction. The results achieved so far are satisfactory; they can serve as the basis for more in-depth research, after which the predictions can be distributed to users as a way to improve the level of satisfaction with public transportation.
This paper presented a methodology for collecting data linked to possible changes in the flow of buses in the city of Curitiba and showed how these data can be used with neural networks to predict these variations in advance and notify users and other interested parties. The next objective is to develop a platform in which users identify the bus they will use and the time at which they want to arrive; the platform could then notify the user of the ideal time to board the bus or show a table with the departure and arrival schedules of the chosen bus.