USASpending.gov Predictions with Machine Learning in Python

Machine learning turns information into knowledge by automatically finding valuable patterns in complex data that we would otherwise struggle to discover. These hidden patterns can be used to predict future events and to support complex decisions, helping achieve results that would have been very difficult to obtain in the past.

Supervised machine learning includes algorithms such as linear and logistic regression, multi-class classification, and support vector machines. Supervised learning requires that the algorithm’s possible outputs are already known and that the data used to train the algorithm is already labeled with correct answers. Once trained, the model is given new, unseen data and uses what it learned from the labeled training data to predict the correct outcome.
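
As a minimal sketch of the supervised workflow described above, the example below fits a straight line to a handful of labeled points and then predicts the label for an input it has not seen. The numbers are invented for illustration, and the hand-written least-squares fit (`fit_line`, a hypothetical helper) stands in for whatever regression library a real project would use.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Labeled training data: every input x comes with a known correct answer y.
# These values are made up for illustration (they follow y = 2x + 1).
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.0, 5.0, 7.0, 9.0, 11.0]

# "Training": estimate the model parameters from the labeled examples.
slope, intercept = fit_line(train_x, train_y)

def predict(x):
    """Predict the label for a new, unlabeled input."""
    return slope * x + intercept

# Apply the fitted model to an input that was not in the training data.
new_x = 6.0
print(f"Predicted label for x={new_x}: {predict(new_x):.2f}")
```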

Unsupervised learning trains algorithms on information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. The algorithm must therefore find the hidden structure in the data by itself. It is given data that it groups according to similarities, patterns, and differences.
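
To make the idea of grouping unlabeled data concrete, here is a small sketch of k-means clustering on one-dimensional values. It is one common unsupervised technique among many, chosen here only as an illustration. The data points and the starting centers are made up, and `kmeans_1d` is a hypothetical helper name.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center
    to the average of its group. No labels are used at any point."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign the value to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its group is empty.
        centers = [
            sum(group) / len(group) if group else centers[i]
            for i, group in enumerate(clusters)
        ]
    return centers, clusters

# Unlabeled, made-up measurements that happen to form two groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]

centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print("Centers:", centers)
print("Groups:", clusters)
```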

Deep learning models use a neural network architecture that loosely resembles the networked structure of neurons in the brain, with layers of connected nodes. On some tasks, deep learning can achieve state-of-the-art accuracy, at times exceeding human-level performance, when recognizing patterns, classifying data, and forecasting future events.

Neural networks such as Long Short-Term Memory (LSTM) process a sequence one element at a time while retaining a memory of what has come previously in the sequence.
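
The sketch below shows what "retaining a memory" means in practice, using a single LSTM cell with scalar input and state. It uses the standard LSTM gate equations, but the weights are arbitrary and untrained, chosen only so the example runs. A real model would learn them from data and would normally be built with a deep learning library rather than by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for a scalar input, hidden state and cell state."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add new
    h = o * math.tanh(c)     # hidden state: what the cell exposes at this step
    return h, c

def run_sequence(sequence, w):
    """Feed the sequence one element at a time, carrying (h, c) forward."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
    return h

# Arbitrary, untrained weights (an assumption for illustration only).
WEIGHTS = {name: 0.5 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
WEIGHTS.update({"bf": 0.0, "bi": 0.0, "bo": 0.0, "bg": 0.0})

# Both sequences end with the same value but have a different history.
h_a = run_sequence([0.0, 0.9], WEIGHTS)
h_b = run_sequence([1.0, 0.9], WEIGHTS)
print(h_a, h_b)
```

The two outputs differ even though the final input is identical, because the cell state carried over from the earlier step influences the result.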

In the following example we will use Python, an LSTM network, and Department of Homeland Security (DHS) contract awards data to forecast government spending and determine whether there are any seasonal patterns. The source of the data is USASpending.gov, which offers downloads in CSV format as well as live data through an API connection.

Power BI Data Visualization

We first use a traditional business intelligence tool to produce a baseline visualization of the awards data.

The following report shows live awards data collected from the USASpending.gov website from 2008 to 2020. The data is collected in real time through an API for the Department of Homeland Security. The report is displayed in Microsoft Power BI using Power Query.

Python Data Visualization

A similar report can be visualized in Python using the Matplotlib library with the following code:
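
The original listing is not reproduced here, so what follows is only a minimal sketch of how such a chart could be produced with pandas and Matplotlib. The column names `action_date` and `federal_action_obligation` are assumptions about the CSV layout, the rows in `SAMPLE_CSV` are placeholder values rather than real awards, and `monthly_totals` and `plot_awards` are hypothetical helper names.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder rows standing in for a USASpending.gov CSV download.
# Column names and amounts are assumptions for illustration, not real data.
SAMPLE_CSV = """action_date,federal_action_obligation
2019-01-15,1200000
2019-01-28,800000
2019-02-10,950000
2019-03-05,1500000
2019-03-22,400000
"""

def monthly_totals(csv_source):
    """Read award rows and sum the obligated amount per calendar month."""
    df = pd.read_csv(csv_source, parse_dates=["action_date"])
    amounts = df.set_index("action_date")["federal_action_obligation"]
    return amounts.resample("MS").sum()  # "MS" = one bucket per month start

def plot_awards(monthly):
    """Draw the monthly totals as a line chart and return the figure."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(monthly.index, monthly.values, marker="o")
    ax.set_title("Contract awards by month (placeholder data)")
    ax.set_xlabel("Month")
    ax.set_ylabel("Obligated amount (USD)")
    fig.autofmt_xdate()
    return fig

monthly = monthly_totals(io.StringIO(SAMPLE_CSV))
fig = plot_awards(monthly)
```

To use it on a real download, pass the file path to `monthly_totals` instead of the in-memory sample, then call `plt.show()` or `fig.savefig(...)` to view or save the chart.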

Now let us analyze the results further and determine what additional insights can be gained from the Python-based data visualization.

Seasonal Patterns

Python libraries such as statsmodels provide time-series decomposition, which splits a time series into three distinct components: trend, seasonality, and noise.

Original report:

Trend:

Seasonal Pattern:

Residual noise used for Machine Learning:
Plotted over time, the residual errors provide another source of information that we can model in Python.

Example Code for time-series decomposition:

The following shows how Python determines seasonal patterns in the time series by breaking it into sections.
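
The original listing is not reproduced here. In practice a library routine such as `seasonal_decompose` from statsmodels performs this step in one call. The sketch below instead spells out a classical additive decomposition by hand, to show what happens underneath. It assumes the series covers at least two full seasonal cycles, and it uses a synthetic quarterly series (period 4) built from a known trend and seasonal pattern. Monthly awards data would use a period of 12.

```python
def centered_moving_average(series, period):
    """Trend estimate; None where the averaging window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: give the two end points half weight each.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend

def decompose_additive(series, period):
    """Split a series into trend + seasonal + residual (additive model)."""
    trend = centered_moving_average(series, period)
    # Collect the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t, (y, tr) in enumerate(zip(series, trend)):
        if tr is not None:
            buckets[t % period].append(y - tr)
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # center so seasonal effects sum to zero
    seasonal = [means[t % period] - offset for t in range(len(series))]
    residual = [
        None if tr is None else y - tr - s
        for y, tr, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual

# Synthetic series for illustration: linear trend plus a repeating pattern.
pattern = [3.0, -1.0, -4.0, 2.0]
series = [10.0 + 2.0 * t + pattern[t % 4] for t in range(12)]

trend, seasonal, residual = decompose_additive(series, period=4)
print("Seasonal component for one cycle:", seasonal[:4])
```

Because the synthetic series contains no noise, the recovered seasonal component matches `pattern` and the residuals are close to zero. Real data will leave a non-trivial residual.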

Split the Data into a Training set and a Test set

When you separate a data set into a training set and testing set, most of the data is used for training, and a smaller portion of the data is used for testing. By using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model.

After a model has been processed by using the training set, you test the model by making predictions against the test set. Because the data in the testing set already contains known values for the attribute that you want to predict, it is easy to determine whether the model’s guesses are correct.
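
A minimal sketch of this step for time series data is shown below. It assumes the split is made chronologically (the most recent values become the test set) so that the model is never trained on data from the future, and it adds the sliding-window step that turns a series into input/target pairs for a sequence model. The monthly values are made up, and `chronological_split` and `make_windows` are hypothetical helper names.

```python
def chronological_split(series, test_fraction=0.2):
    """Split a time-ordered list into (train, test) without shuffling."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = max(1, int(len(series) * test_fraction))
    return series[:-n_test], series[-n_test:]

def make_windows(series, lookback):
    """Turn a series into (inputs, targets) pairs for a sequence model.

    Each input is `lookback` consecutive values; the target is the value
    that immediately follows them.
    """
    inputs, targets = [], []
    for start in range(len(series) - lookback):
        inputs.append(series[start : start + lookback])
        targets.append(series[start + lookback])
    return inputs, targets

# Hypothetical monthly totals (made-up numbers, oldest first).
monthly = [12.0, 15.0, 14.0, 18.0, 21.0, 19.0, 23.0, 26.0, 24.0, 28.0]

train, test = chronological_split(monthly, test_fraction=0.2)
X_train, y_train = make_windows(train, lookback=3)

print("Train size:", len(train), "Test size:", len(test))
print("First window:", X_train[0], "->", y_train[0])
```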

Predicting and forecasting

Using the imported data set from USASpending.gov, the following graph shows how closely machine learning and neural networks can predict time series events. Tuning the training code and supplying more data can improve accuracy further.
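
Accuracy is easier to judge with a number than by eye. The sketch below computes two common error measures, root mean squared error (RMSE) and mean absolute error (MAE), and compares a model's predictions with a naive "repeat the previous value" baseline. All values are invented for illustration. They are not results from the DHS data or from the model described in this article.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up numbers for illustration only.
actual = [24.0, 28.0, 27.0, 31.0]      # held-out test values
model_pred = [25.0, 27.0, 28.0, 30.0]  # hypothetical model output
naive_pred = [23.0, 24.0, 28.0, 27.0]  # each value repeats the one before it

print(f"Model RMSE: {rmse(actual, model_pred):.3f}  MAE: {mae(actual, model_pred):.3f}")
print(f"Naive RMSE: {rmse(actual, naive_pred):.3f}  MAE: {mae(actual, naive_pred):.3f}")
```

A model is only worth keeping if it beats a simple baseline like this one on the test set.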

Forecasting

Based on the trained model and the seasonal patterns identified earlier, we can forecast DHS spending past 2020. The following graph shows an uptrend and a seasonal pattern:
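
To illustrate how a trend and a seasonal pattern combine into a forecast, the sketch below extends a fitted straight-line trend into the future and adds the average seasonal effect for each position in the cycle. This is a deliberately simple stand-in, not the LSTM model used for the graph. The series is synthetic, and `fit_linear_trend` and `forecast` are hypothetical helper names.

```python
from statistics import mean

def fit_linear_trend(series):
    """Least-squares line through (0, y0), (1, y1), ...: (slope, intercept)."""
    ts = list(range(len(series)))
    t_bar, y_bar = mean(ts), mean(series)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, series))
    sxx = sum((t - t_bar) ** 2 for t in ts)
    slope = sxy / sxx
    return slope, y_bar - slope * t_bar

def forecast(series, period, steps):
    """Project trend plus average seasonal effect `steps` points ahead."""
    slope, intercept = fit_linear_trend(series)
    # What is left after removing the trend line.
    detrended = [y - (intercept + slope * t) for t, y in enumerate(series)]
    # Average leftover for each position in the seasonal cycle.
    seasonal = [mean(detrended[k::period]) for k in range(period)]
    n = len(series)
    return [
        intercept + slope * t + seasonal[t % period]
        for t in range(n, n + steps)
    ]

# Synthetic quarterly series: upward trend plus a repeating pattern.
pattern = [1.0, -1.0, -1.0, 1.0]
history = [10.0 + 2.0 * t + pattern[t % 4] for t in range(12)]

future = forecast(history, period=4, steps=4)
print("Next four periods:", future)
```

On real data a straight-line fit can be biased by the seasonal pattern itself, so this should be treated as a rough baseline to compare a trained model against.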

Python can be used to forecast and predict time series events, and it plays a major role in understanding how specific factors change with respect to time. Programming languages like Python and R are very powerful and provide a much wider range of features than traditional business intelligence platforms. They can cover the same ground as those platforms and also apply machine learning algorithms to produce rich data analysis, visualizations, and reports that reveal major patterns such as trends, seasonality, cyclicity, and irregularity. Time series analysis is used in applications such as stock market analysis, pattern recognition, earthquake prediction, economic forecasting, and census analysis.

Conclusion

Machine learning has emerged as a critical component of automation. Business organizations rely on accurate information to make the right decisions at the right time. Machine learning allows organizations to transform large data sets into knowledge and actionable intelligence with the help of tools like Python, which has a host of readily available libraries for machine learning automation. The advantages of these technologies can be applied to a variety of use cases, especially when data is at the core of the service offering. The technology is quickly replacing manual operations and helping businesses run successfully. Machine learning tools such as Python are very effective in solving some of the toughest data challenges of the day.