Stumbling Through Data Science

1 vote, 1 answer, 781 views

Spark Will Not Load Large MySql Table: Java Communications link failure - Timing Out

I'm trying to get a pretty large table from MySQL so I can manipulate it using Spark/Databricks. I can't get it to load into Spark. I have tried taking smaller subsets, but even at the smallest reasonable unit, it still fails to load. I have tried playing with the wait_timeout and interactive_timeout...

1 vote, 1 answer, 209 views

PySFTP and get_r using Python - “No such file or directory”

So I have a 'simple' process that needs to go out and grab data from another server and then copy the directory (and all sub-directories) to my server. The code is as follows: import pysftp dbfs_path = '/dbfs/mnt/aaa/bbb/output/{}/'.format(dbutils.widgets.get('run_name')) remote_path = '/mst_bbb/{}/o...

1 vote, 1 answer, 668 views

Getting Large Dataset Out of MySQL into Pandas Dataframe Keeps Failing, Even With Chunksize

I am trying to pull ~700k rows out of MySQL into a Pandas dataframe. I kept getting the same error over and over again: Traceback (most recent call last): File 'C:\Anaconda3\lib\site-packages\mysql\connector\network.py', line 245, in recv_plain read = self.sock.recv_into(packet_view, rest) Connecti...

1 vote, 1 answer, 731 views

Pandas - Aggregating and Plotting Results

I have what I think should be a fairly simple question, but I have been fighting with it for hours. I want to do an aggregation on a pandas dataframe and then plot it using matplotlib. I start with a huge table of years and models of cars. I then want to calculate the aggregate sales price and a per...

1 vote, 1 answer, 54 views

Pandas: Multiple Grand Totals in a Summary Dataframe

Apologies for the noob question as I try to learn Python. Looking forward to getting up to speed and giving back. Assuming I have the following data, YEAR SECTOR PROFIT STARTMVYEAR TOTALPROFIT STARTMV IBM TECHNOLOGY -500 2500 500 1500 APPLE TECHNOLOGY...

1 vote, 1 answer, 117 views

Pandas Python - Finding Time Series Not Covered

Hoping someone can help me out with this one because I don't even know where to start. Given a data frame that contains a series of start and end times, such as:

Order  Start Time               End Time
1      2016-08-18 09:30:00.000  2016-08-18 09:30:05.000
1      2016-08-18 09:30:00.005  2016-08-1...

1 vote, 1 answer, 715 views

Seaborn/Matplotlib - Only Showing Certain X Values in FacetGrid

I am trying to create a facet of charts that show total scores over time, in seconds. The x-axis is the time in seconds and the y-axis is the total score. As you can see, I am restricting the output to 2 1/2 minutes via xlim. What I would like to do is to only show values on the x-axis for every 30 secon...

1 vote, 1 answer, 387 views

Pandas Grouping - Values as Percent of Grouped Totals Based on Another Column

This question is an extension of a question I asked yesterday, but I will rephrase. Using a data frame and pandas, I am trying to figure out what the tip percentage is for each category in a group by. So, using the tips database, I want to see, for each sex/smoker, what the tip percentage is for f...

1 vote, 1 answer, 609 views

Rounding Microseconds to Milliseconds MySQL

What is the best way to round microseconds to milliseconds in MySQL? For example, in a datetime(6) column, how do we round 2016-12-01 12:30:01.122456 to 2016-12-01 12:30:01.122 and 2016-12-01 12:30:01.122501 to 2016-12-01 12:30:01.123? Thanks
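
To pin down the rounding the question expects, here is a minimal Python sketch of the arithmetic. It is not a MySQL solution, and the function name round_to_millisecond is made up for illustration. It assumes that a value exactly halfway between two milliseconds should round up.

```python
from datetime import datetime, timedelta


def round_to_millisecond(ts: datetime) -> datetime:
    """Round a datetime to the nearest millisecond (halves round up)."""
    # Integer arithmetic on the microsecond field avoids float rounding surprises.
    rounded_us = (ts.microsecond + 500) // 1000 * 1000
    # Adding a timedelta lets a value such as .999600 carry into the next second.
    return ts.replace(microsecond=0) + timedelta(microseconds=rounded_us)


# The two values from the question.
print(round_to_millisecond(datetime(2016, 12, 1, 12, 30, 1, 122456)))
print(round_to_millisecond(datetime(2016, 12, 1, 12, 30, 1, 122501)))
```

The first value should end in .122 and the second in .123, which matches the results asked for above.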

1 vote, 2 answers, 872 views

Splitting a Pandas Data Frame into Rows by Count

I need to output data from Pandas into CSV files to interact with a 3rd party developed process. The process requires that I pass it no more than 100,000 records in a file, or it will cause issues (slowness, perhaps a crash). That said, how can I write something that takes a dataframe in Pandas and...
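
One way this kind of split could be sketched is by slicing the dataframe by position and writing each slice to its own CSV. The function name write_in_chunks, the file naming pattern, and the demo data are assumptions for illustration, not part of the original question. The demo uses a tiny row limit so it runs quickly; the 100,000 limit from the question is the default.

```python
import os
import tempfile

import pandas as pd


def write_in_chunks(df, out_dir, max_rows=100_000, prefix="part"):
    """Write df to numbered CSV files holding at most max_rows rows each."""
    paths = []
    # Step through the frame in blocks of max_rows; the last block may be shorter.
    for i, start in enumerate(range(0, len(df), max_rows)):
        chunk = df.iloc[start:start + max_rows]
        path = os.path.join(out_dir, f"{prefix}_{i:04d}.csv")
        chunk.to_csv(path, index=False)
        paths.append(path)
    return paths


# Hypothetical demo data: 250 rows split into files of at most 100 rows.
demo = pd.DataFrame({"id": range(250)})
with tempfile.TemporaryDirectory() as tmp:
    files = write_in_chunks(demo, tmp, max_rows=100)
    print([os.path.basename(p) for p in files])
```

Slicing with iloc keeps the original row order, so the files can be fed to the downstream process in sequence.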

1 vote, 1 answer, 837 views

Recurring Custom Color Palette for Seaborn/Python/Matplotlib

I am looking to see if there is a way to set up a color palette to make sure that any time I graph, the colors of the bars will be consistent with the values on the x axis. I am using Seaborn. I'm sure this is somewhere online, but have been searching for over an hour with no luck. For example, I woul...

1 vote, 2 answers, 93 views

Pandas Merge of Dataframes

I am looking to compare a group of data to the rolled up aggregate of that data. In the example below, I want to know how much money each restaurant makes as compared to the total for all restaurants. I want to know this by day. If a restaurant is closed that day, I still want to return the name o...

5 votes, 2 answers, 3.2k views

Pandas and Rolling_Mean with Offset (Average Daily Volume Calculation)

When I pull stock data into a dataframe from Yahoo, I want to be able to calculate the 5 day average of volume, excluding the current date. Is there a way to use rolling mean with an offset? For example, a 5 day mean that excludes current day and is based on the prior 5 days. When I run the follow...
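
A mean over the prior 5 days that excludes the current day can be sketched by shifting the series one row before applying the rolling window. This assumes a recent pandas where Series.rolling is available (the older rolling_mean function named in the title has been removed). The function name prior_window_mean and the sample volumes are made up for illustration.

```python
import pandas as pd


def prior_window_mean(series, window=5):
    """Mean of the previous `window` rows, excluding the current row."""
    # shift(1) moves every value down one row, so the window ending at
    # row t only sees rows t-window .. t-1.
    return series.shift(1).rolling(window=window).mean()


# Hypothetical daily volumes.
volume = pd.Series([100, 200, 300, 400, 500, 600, 700], name="Volume")
print(prior_window_mean(volume))
```

The first 5 rows come out as NaN because fewer than 5 prior days exist for them.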

2 votes, 1 answer, 2.8k views

Pandas Grouping - Values as Percent of Grouped Totals Not Working

Using a data frame and pandas, I am trying to figure out what each value is as a percentage of the grand total for the 'group by' category. So, using the tips database, I want to see, for each sex/smoker, what the proportion of the total bill is for female smoker / all female and for female non smoke...

2 votes, 1 answer, 1.3k views

Using a Custom Color Palette in Stacked Bar Chart (Python)

So I am trying to create a stacked bar chart where all of the slices of the chart will remain constant throughout the program, but I cannot figure out how to get df.plot to use a custom palette. I want to make sure that if I do 20 different reports out of this program, Freeze will always be, for examp...

3 votes, 3 answers, 16.7k views

Seaborn Barplot - Displaying Values

I'm looking to see how to do two things in Seaborn when using a bar chart to display values that are in the dataframe, but not in the graph. 1) I'm looking to display the values of one field in a dataframe while graphing another. For example, below, I'm graphing 'tip', but I would like to place the...

3 votes, 1 answer, 515 views

Cumulative Ranking of Values in Pandas with Ties

I am trying to find a way to do a cumulative total that accounts for ties in Pandas. Let's take hypothetical data from a track meet, where I have people, races, heats, and time. Each person's placement is according to the following: For a given race/heat combination: The person with the lowest...

2 votes, 0 answers, 288 views

XLSX Writer - Saving Tables as Picture Files - Python

This is a bit of a conceptual problem and I have to believe others have the same issue. We use XLSXWriter to create very heavily formatted tables using Excel - we use fonts, data bars, conditional formatting, etc. In short, we use XLSX writer to create presentation quality output. We create a doze...

5 votes, 2 answers, 1.2k views

xlsxwriter not applying format to header row of dataframe - Python Pandas

I am trying to take a dataframe and create a spreadsheet from that dataframe using xlsxwriter. I am trying to do some formatting to the header row, but the only formatting that seems to be working on that row is for the row height. The exact same formatting options work on the other rows of the d...

3 votes, 2 answers, 732 views

Pandas Scatterplot Using Data Frame Fields to Derive Colors and Legend

I want to create a scatterplot which shows two columns mapped against each other in pandas, a third for size, and then the color of the point based on the label (in the case below, last_name). I then want a legend that shows a dot for the color and then the last_name value. Each last name should be a...

3 votes, 1 answer, 554 views

Pandas - Conditional Calculation Based on Shift Values from Two Other Columns

I'm sure this question is easy, but it has been stumping me for too long, so I would REALLY appreciate some direction. I'm looking to add a column to a dataframe based on the results of two other columns. I want to identify if the stock is equal to the stock in the prior row and the date is equal to the...

1 vote, 2 answers, 1.2k views

XLSX Writer Python- 3 Color Scale with Number as Midpoint

I'm trying to do conditional formatting in XLSX writer with a 3 color scale with a midpoint value of 0. I want all negative values to scale from red (lowest number) to yellow (when the value is zero) and all positive numbers to scale from yellow (at zero) to green (at the highest). The sc...

1 vote, 2 answers, 861 views

Pandas and Python Dataframes and Conditional Shift Function

Is there a conditional 'shift' parameter in data frames? For example, assume I own a used car lot and I have data as follows:

SaleDate    Car
12/1/2016   Wrangler
12/2/2016   Camry
12/3/2016   Wrangler
12/7/2016   Prius
12/10/2016  Prius
12/12/2016  Wrangler

I want to find two things out from this li...

4 votes, 1 answer, 743 views

MySQL Stored Procedures, Pandas, and “Use multi=True when executing multiple statements”

Note - as MaxU suggested below, the problem is specific to mysql.connector and does not occur if you use pymysql. Hope this saves someone else some headaches. I am using Python, Pandas, and MySQL and cannot get a stored procedure to return results at all, let alone into a data frame. I keep receiving er...