horatio1701d · 1 vote · 2 answers · 612 views

Pushing down filter predicate in Spark JDBC Properties

How can I set up my spark jdbc options to make sure I push down a filter predicate to the database and not load everything first? I'm using spark 2.1. I can't get the right syntax to use, and I know I can add a where clause after the load(), but that would obviously load everything first. I'm trying the...
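
A minimal sketch of one way to push the predicate to the database on Spark 2.1, assuming a MySQL source (the URL, table name, and predicate are placeholders): wrap the filter in a subquery passed as the dbtable option, so the database executes it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The database runs the subquery, so only matching rows cross the wire.
    pushdown_query = "(SELECT * FROM my_table WHERE created_at > '2017-01-01') AS t"
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://host:3306/mydb")
          .option("dbtable", pushdown_query)
          .option("user", "user")
          .option("password", "password")
          .load())

For simple comparison predicates, a .filter() applied after load() should also be pushed down by the JDBC source; the PushedFilters entry in df.explain() output is the place to confirm.
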
horatio1701d · 1 vote · 0 answers · 135 views

Filtering Dataframe with predicate pushdown from another dataframe

How can I push down a filter to a dataframe read based on another dataframe I have? Basically I want to avoid reading the second dataframe entirely and then doing an inner join. Instead I would like to just submit a filter on the read so the filtering happens at the source. Even if I use an inner join wrapped up...
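
One approach, sketched under the assumption that the filtering dataframe is small and the large table is read over JDBC: collect the keys first and inline them into a pushdown subquery (all names are placeholders).

    # small_df defines the filter; "id" stands in for the real join key.
    ids = [str(r["id"]) for r in small_df.select("id").distinct().collect()]
    query = "(SELECT * FROM big_table WHERE id IN ({})) AS t".format(",".join(ids))

    big_df = (spark.read
              .format("jdbc")
              .option("url", jdbc_url)   # assumed to be defined elsewhere
              .option("dbtable", query)
              .load())

This only works when the collected key set is small enough to inline into a SQL statement.
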
horatio1701d · 0 votes · 0 answers · 3 views

io.fabric8.kubernetes.client.KubernetesClientException Message: unable to parse requirement: invalid label value

I'm not able to figure out why I get an invalid label value error when deploying my spark job to kubernetes using spark-submit. The error log below states that it sees my class name with a dollar sign appended, but I don't have anything that isn't alphanumeric in my class name. io.fabric8.kuber...
horatio1701d · 1 vote · 1 answer · 1.5k views

Combining Python Pandas Dataframe Outputs from a Program into One Dataframe

After several weeks of refining this I have the following code, thanks to awesome folks on SO, which produces the dataframes I need, but I'm not sure how to concat the dataframes the program produces into one final dataframe object variable. I just assign the concat statement to a variable then I end...
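
The usual pattern, as a sketch with hypothetical names: collect each per-iteration dataframe in a list and concatenate once at the end, assigning the result to the final variable.

    import pandas as pd

    frames = []
    for item in inputs:               # whatever the program iterates over
        frames.append(process(item))  # process() stands in for the existing logic

    final_df = pd.concat(frames, ignore_index=True)
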
horatio1701d · 1 vote · 2 answers · 972 views

Iterating through Pandas Groupby and Merging DataFrames

This seems like it should be straightforward but it is stumping me. I really love being able to iterate through the groups of a groupby operation, and I am getting the result I want from the groupby, but I am unable to merge the final result into one dataframe. So essentially I have the below code which ca...
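
A sketch of the same list-and-concat pattern applied to groupby, with transform() standing in for the per-group work:

    import pandas as pd

    pieces = []
    for key, group in df.groupby("col1"):  # "col1" is a placeholder
        pieces.append(transform(group))    # hypothetical per-group step

    merged = pd.concat(pieces).sort_index()
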
horatio1701d · 1 vote · 1 answer · 160 views

Combine Python Pandas Dataframe more Efficiently than List Append Method

I keep having to do the below to build up dataframes from a small pipeline processing individual JSON lines. Is there a more efficient way to do this instead of relying on appending them to a list and then concatenating? Also, I don't even need the column labels below represented as 'key' but wasn't...
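
For JSON lines specifically, one alternative worth sketching: parse every line into a dict first and build a single DataFrame at the end, skipping the per-line DataFrames entirely (the file name is a placeholder).

    import json
    import pandas as pd

    with open("data.jsonl") as fh:
        records = [json.loads(line) for line in fh]

    df = pd.DataFrame.from_records(records)
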
horatio1701d · 1 vote · 1 answer · 414 views

Maven not recognizing JAVA_HOME being set in bash_profile

Going on nearly 2 days wasted trying to get maven properly installed. These are my outputs. Why is maven defaulting to some jdk I don't even have? I have tried many different solutions proposed by Google searches and nothing works. I have not made any changes to any config file...
horatio1701d · 1 vote · 1 answer · 347 views

Pandas Multiple Conditions Function based on Column

Just trying to find the most elegant way to apply a really simple transformation to values in different columns, with each column having its own condition. So given a dataframe like this: A B C D E F 0 1 2013-01-02 1 3 test foo 1 1 2013-01-02 1 3 train foo 2 1 2013-01-0...
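
One tidy option is numpy.where applied per column, each with its own condition; a sketch with made-up conditions for the frame above:

    import numpy as np

    df["A"] = np.where(df["A"] > 0, "pos", "neg")    # condition for column A
    df["D"] = np.where(df["D"] >= 3, "high", "low")  # condition for column D
    df["E"] = np.where(df["E"] == "test", 1, 0)      # condition for column E
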
horatio1701d · 1 vote · 2 answers · 2.9k views

Compiling Spark Scala Program into jar file using installed spark and maven

Still trying to get familiar with maven and compiling my source code into jar files for spark-submit. I know how to use IntelliJ for this but would like to understand how this actually works. I have an EC2 server with all of the latest software such as spark and scala already installed and have the...
horatio1701d · 1 vote · 1 answer · 709 views

Much more Efficient way to Parse and Process Large files with Json Objects

This is by far the craziest question I have asked on SO but I am going to give it a shot in the hope of getting some advice about whether or not I am leveraging the right tools and methods for processing large amounts of data efficiently. I'm not necessarily looking for help on optimizing my code un...
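
For line-delimited JSON, the memory-safe baseline is to stream the file and handle one record at a time instead of accumulating; a sketch with hypothetical names:

    import json

    def handle(record):
        pass  # stands in for the per-record processing

    with open("big_file.jsonl") as fh:
        for line in fh:  # only one line is ever held in memory
            handle(json.loads(line))
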
horatio1701d · 1 vote · 2 answers · 1.5k views

Trying to change directory for UltiSnips snippets

What is the correct way to change where UltiSnips searches for snippets? I tried the below with no success: let g:UltiSnipsSnippetsDir = '/newfolder/snippets/' let g:UltiSnipsSnippetDirectories=['UltiSnipsNewDir']
horatio1701d · 1 vote · 2 answers · 388 views

spark 2.x write to parquet by partition compared to hdfs extremely slow

Can't figure out where to start troubleshooting why a simple write to parquet by partition from spark/scala takes a matter of a few seconds into hdfs versus a few minutes when I write to s3 instead. def saveDF(df: org.apache.spark.sql.DataFrame): Unit = { df.write .mode("overwrite") .option("comp...
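
One frequently cited factor, offered as an assumption rather than a confirmed diagnosis: S3 has no atomic rename, so the default rename-based output commit is slow there. Switching the file output committer to algorithm version 2 often helps; a sketch:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.hadoop.mapreduce.fileoutputcommitter"
                     ".algorithm.version", "2")
             .getOrCreate())
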
horatio1701d · 1 vote · 2 answers · 108 views

Pandas DataFrame to Dict Format with new Keys

What would be the best way to convert this: deviceid devicetype 0 b569dcb7-4498-4cb4-81be-333a7f89e65f Google 1 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f Android 2 cf7391c5-a82f-4889-8d9e-0a423f132026 Android into this: 0 {'deviceid':'b569dcb7-4498-4cb4-81be-333a7f89e65f','devicetype':['Goog...
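
Assuming the index labels should become the outer keys, DataFrame.to_dict with orient='index' produces that shape in one call:

    # {0: {'deviceid': ..., 'devicetype': ...}, 1: {...}, ...}
    records = df.to_dict(orient="index")

If devicetype must end up as a list, wrap it first, e.g. df['devicetype'] = df['devicetype'].apply(lambda v: [v]).
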
horatio1701d · 20 votes · 2 answers · 32.1k views

Search for String in all Pandas DataFrame columns and filter

Thought this would be straightforward but had some trouble tracking down an elegant way to search all columns in a dataframe at the same time for a partial string match. Basically, how would I apply df['col1'].str.contains('^') to an entire dataframe at once and filter down to any rows that have records...
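
A sketch of applying the same str.contains test to every column and keeping rows where any column matches. Note that a bare '^' is a regex anchor, so regex=False is needed to match a literal caret:

    mask = df.apply(lambda col: col.astype(str)
                    .str.contains("^", regex=False)).any(axis=1)
    filtered = df[mask]
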
horatio1701d · 1 vote · 1 answer · 670 views

Combining Rows of Lists in Column containing Integers with Python

What would be the most Pythonic way to work with lists containing integers within a pandas dataframe as below? My first objective is to just retrieve a list of all unique values in all of the lists across all of the rows. index col1...
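
For the first objective, flattening the lists and deduplicating, a minimal sketch:

    from itertools import chain

    unique_values = set(chain.from_iterable(df["col1"]))
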
horatio1701d · 4 votes · 1 answer · 2.6k views

AWS s3api json formatting error: Error parsing parameter 'cli-input-json': Invalid JSON: Expecting value: line 1 column 1 (char 0)

Not sure what I'm getting wrong with my json format. Just trying to test out aws cli and run aws s3api list-objects --cli-input-json .json --profile where is below but getting: Error parsing parameter 'cli-input-json': Invalid JSON: Expecting value: line 1 column 1 (char 0) JSON received: {'Bucke...
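
Two things usually cause this: the option expects a file:// prefix (aws s3api list-objects --cli-input-json file://params.json), and the file must be strict JSON with double quotes, not the single quotes shown in the error. A quick Python check of the file (params.json is a placeholder):

    import json

    with open("params.json") as fh:
        json.load(fh)  # raises ValueError if the file is not valid JSON
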
horatio1701d · 4 votes · 2 answers · 682 views

Using aws credentials profiles with spark scala app

I would like to be able to use the ~/.aws/credentials file I maintain with different profiles with my spark scala application if that is possible. I know how to set hadoop configurations for s3a inside my app but I don't want to keep using different keys hardcoded and would rather just use my creden...
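
One possibility, offered as an assumption rather than a verified setup: s3a can be pointed at the AWS SDK's profile-based provider, which reads ~/.aws/credentials and honors AWS_PROFILE. A PySpark sketch of the idea:

    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.profile.ProfileCredentialsProvider",
    )
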
horatio1701d · 2 votes · 2 answers · 5.5k views

Spark-sql: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

After quite some time, I can't figure out how to identify the root cause of the below error when running the spark-sql binary: 15/12/08 14:48:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread 'main' java.lang.Runtime...
horatio1701d · 6 votes · 0 answers · 594 views

Vim Crashing with Python 3: “Caught deadly signal ABRT glibc detected…”

Something is very wrong with my vim to python 3 setup where when I type :py3 import sys; print(sys.version) I just get a complete crash with something along the lines of: Vim: Caught deadly signal ABRT... *** glibc detected *** /usr/local/bin/vim: corrupted double-linked list: 0x00000000015f9a50 *...
horatio1701d · 4 votes · 1 answer · 1.9k views

Ipython Notebook: Open & Edit Files

I am scratching my head over what the standard workflow is for opening, editing, and executing scripts directly from within the ipython notebook. I know that you can use %edit from the ipython terminal but this doesn't seem to work from the notebook. Thank you.
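
A sketch of the built-in magics that cover most of this workflow (myscript.py is a placeholder):

    # In a cell: pull a script's source into the cell for editing.
    %load myscript.py

    # As the first line of a cell: write the cell's contents back to disk.
    %%writefile myscript.py

    # Execute a script in the notebook's namespace.
    %run myscript.py
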
horatio1701d · 23 votes · 2 answers · 23.3k views

Python: Trying to Deserialize Multiple JSON objects in a file with each object spanning multiple but consistently spaced number of lines

Ok, after nearly a week of research I'm going to give SO a shot. I have a text file that looks as follows (showing 3 separate json objects as an example, but the file has 50K of these): { 'zipcode':'00544', 'current':{'canwc':null,'cig':7000,'class':'observation'}, 'triggers':[178,30,176,103,179,112,21,2...
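
Since the objects span a consistent number of lines, one sketch is to read the file in fixed-size line groups and decode each group. The 5 below is a placeholder for the real span, and the content is assumed to be valid JSON (double quotes):

    import json
    from itertools import islice

    records = []
    with open("data.txt") as fh:
        while True:
            chunk = list(islice(fh, 5))  # lines per object
            if not chunk:
                break
            records.append(json.loads("".join(chunk)))
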
horatio1701d · 2 votes · 1 answer · 770 views

Iterating through large file and appending chunks on disk taking up all memory

I have a gzip-compressed file of approximately 5GB (32GB uncompressed) with approximately 200+ million rows. I have the below pandas code going through chunks, applying some processing, and saving to one csv iteratively. I don't understand, however, why this gradually uses up all of my RAM before it c...
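
A sketch of the pattern that keeps memory flat: write each processed chunk straight to the open output file and let it go out of scope, rather than accumulating chunks (transform() is hypothetical):

    import pandas as pd

    reader = pd.read_csv("big_file.gz", compression="gzip", chunksize=500000)
    with open("out.csv", "w") as out:
        for i, chunk in enumerate(reader):
            transform(chunk).to_csv(out, header=(i == 0), index=False)
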
horatio1701d · 5 votes · 4 answers · 967 views

Pandas Time Series Holiday Rule Offset

Trying to define a set of rules using the pandas.tseries.holiday module, but I can't figure out how to create a rule based on another rule. I have the below rule but then want to create another rule that offsets the original rule by one business day: Thanksgiving: Holiday('Thanksgiving Day', month=11...
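
Holiday accepts a list of offsets applied in sequence, so a second rule can reuse the Thanksgiving anchor and add one business day; a sketch (treat the list-of-offsets behavior as something to verify against your pandas version):

    from dateutil.relativedelta import TH
    from pandas.tseries.holiday import Holiday
    from pandas.tseries.offsets import BDay, DateOffset

    thanksgiving = Holiday("Thanksgiving Day", month=11, day=1,
                           offset=DateOffset(weekday=TH(4)))
    day_after = Holiday("Day After Thanksgiving", month=11, day=1,
                        offset=[DateOffset(weekday=TH(4)), BDay(1)])
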
horatio1701d · 7 votes · 1 answer · 13.1k views

Pandas Dataframe add header without replacing current header

How can I add a header to a DF without replacing the current one? In other words, I just want to shift the current header down and add it to the dataframe as another record. *secondary question: How do I add tables (example dataframes) to a stackoverflow question? I have this (Note header and how...
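
One sketch: turn the existing header into a one-row frame, concatenate it on top, then assign the new names (new_header is a hypothetical list). Mixing the old header into the data will force the affected columns to object dtype.

    import pandas as pd

    header_row = pd.DataFrame([df.columns.tolist()], columns=df.columns)
    df = pd.concat([header_row, df], ignore_index=True)
    df.columns = new_header  # replacement column names
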
horatio1701d · 2 votes · 2 answers · 2.2k views

Python Pandas Reading CSV file with Specific Line Terminators

I am trying to create a dataframe from the below sample csv I've been given, but I am getting Error tokenizing data. C error: EOF inside string starting at line 0. I haven't had very much practice with handling bad lines but would really like to learn the best way to deal with something like this. I hav...
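
"EOF inside string" usually points to an unbalanced quote character; telling the parser not to treat quotes specially is often enough to get the file loaded for inspection. A sketch (sample.csv is a placeholder):

    import csv
    import pandas as pd

    df = pd.read_csv("sample.csv", quoting=csv.QUOTE_NONE, engine="python")
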
horatio1701d · 5 votes · 1 answer · 15.1k views

Create pandas dataframe from json objects

I finally have output of data I need from a file with many json objects but I need some help with converting the below output into a single dataframe as it loops through the data. Here is the code to produce the output including a sample of what the output looks like: original data: { 'zipcode':'089...
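
Assuming the loop yields one parsed dict per object, json_normalize can flatten each and a single concat builds the frame (the import path matches older pandas releases):

    import pandas as pd
    from pandas.io.json import json_normalize

    frames = [json_normalize(obj) for obj in parsed_objects]  # the loop's output
    df = pd.concat(frames, ignore_index=True)
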
horatio1701d · 2 votes · 1 answer · 414 views

Reading .aws/credentials file with Scala for hadoop conf setting from spark

How can I just read my different aws profiles I have located in my credentials file within .aws directory? Just want to have my app read in the access key and secret such as below but not sure how to make this point to the credentials file. object S3KeyStore extends Serializable { private val keyMa...
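
The credentials file is plain INI, so for illustration here is how the profiles parse in Python with configparser; a Scala app would typically lean on the AWS SDK's ProfileCredentialsProvider instead. The "dev" profile name is a placeholder.

    import configparser
    import os

    config = configparser.ConfigParser()
    config.read(os.path.expanduser("~/.aws/credentials"))

    access_key = config["dev"]["aws_access_key_id"]
    secret_key = config["dev"]["aws_secret_access_key"]
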
horatio1701d · 2 votes · 1 answer · 105 views

Upsource Error: Github Url not Detected when using github enterprise server

Can't figure out how to resolve an error when trying to set up upsource on EC2 where everything is fine except when trying to integrate with GitHub. I simply use ssh for authentication and then input my GitHub url as [email protected]:/.git But I get the message: GitHub URL is not detected Also tried http...
horatio1701d · 2 votes · 3 answers · 85 views

Caret Z symbol in text file

Does anyone know what a ^Z symbol in a text file represents and how I might go about cleaning it (regex/python?). example string: 5411 Grocery Stores,www.sentryfoods.com,WI,6am ^Z 11pm 7 days a week How can I find this with regex in vim, or adapt the below python command to handle it? df['col1'].appl...
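
^Z is the ASCII substitute control character, 0x1A; in vim it can be targeted in a substitute command by typing Ctrl-V Ctrl-Z. A sketch of the pandas side:

    # Strip the SUB (0x1A) control character from the column.
    df["col1"] = df["col1"].str.replace("\x1a", " ")
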
horatio1701d · 2 votes · 1 answer · 5.7k views

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

I'm trying to run a simple spark to s3 app from a server but I keep getting the below error because the server has hadoop 2.7.3 installed and it looks like it doesn't include the GlobalStorageStatistics class. http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/api/org/apache/hado...
horatio1701d · 2 votes · 1 answer · 278 views

Excluding spark from uber jar with maven

How can I compile just my spark code to put onto a remote server without also packaging the spark dependencies I have defined in maven? I don't need any of the spark dependencies since I obviously already have spark installed on the server where I run spark-submit. Still pretty new to maven. So far I'm thinking of pos...
horatio1701d · 2 votes · 2 answers · 245 views

Intellij and Maven provided scope

I need to define my dependencies with provided scope but I also need my applications to be able to run and be tested within Intellij. How can I set up my workflow so that I can download all of the dependencies for Intellij to use, both for auto-complete and for being able to run them?
horatio1701d · 2 votes · 3 answers · 8.4k views

Pandas Very Simple Percent of total size from Group by

I'm having trouble with a seemingly incredibly easy operation. What is the most succinct way to get a percent of total from a group by operation such as df.groupby('col1').size()? My DF after grouping looks like this and I just want a percent of total. I remember using a variation of this stateme...
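
A sketch of the one-liner: divide the group sizes by their sum.

    counts = df.groupby("col1").size()
    percent = counts / counts.sum() * 100
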
horatio1701d · 2 votes · 1 answer · 1k views

Custom Data Types for DataFrame columns when using Spark JDBC

I know I can use a custom dialect for having a correct mapping between my db and spark but how can I create a custom table schema with specific field data types and lengths when I use spark's jdbc.write options? I would like to have granular control over my table schemas when I load a table from spa...
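
On Spark versions that support it, the JDBC writer exposes a createTableColumnTypes option for exactly this; a PySpark sketch with placeholder column names and connection details:

    (df.write
       .format("jdbc")
       .option("url", "jdbc:mysql://host:3306/mydb")
       .option("dbtable", "my_table")
       .option("createTableColumnTypes",
               "name VARCHAR(128), comments VARCHAR(1024)")
       .option("user", "user")
       .option("password", "password")
       .save())
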
horatio1701d · 3 votes · 1 answer · 712 views

Python Data Pipelines and Streaming

My work involves a lot of data processing and streaming, often with large volumes of data from various sources. I use Python for everything and was wondering what area of Python I should be researching in order to optimize and build batch processing pipelines? I know there are some open...
horatio1701d · 3 votes · 1 answer · 974 views

How to configure hadoop to use non-default port: “0.0.0.0: ssh: connect to host 0.0.0.0 port 22: Connection refused”

When I run start-dfs I get the below error and it looks like I need to tell hadoop to use a different port since that is what I require when I ssh into localhost. In other words the following works successfully: ssh -p 2020 localhost. [Wed Jan 06 16:57:34 [email protected]~]# start-dfs.sh 16/01/06 16:57:53 WARN...
horatio1701d · 2 votes · 1 answer · 57 views

Iterating through min and max dates in bash by month

What is the easiest way to generate an iteration that outputs %Y-%m* strings from min and max date values in the form '%Y-%m-%d'? I am able to get min and max dates from my file system with ST_DT=${6-`hdfs dfs -ls /filepath/key=* | head -2 | tail -1 | cut -d '/' -f6 | cut -d '=' -f2`} EN_DT...
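
The bash loop is the ask, but the underlying logic is short enough to sketch in Python with pandas period_range, given min/max strings:

    import pandas as pd

    months = pd.period_range(start="2017-01-05", end="2017-06-20", freq="M")
    for m in months:
        print(m.strftime("%Y-%m"))
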
horatio1701d · 2 votes · 2 answers · 1.1k views

Scala Typesafe Config Printing file path, keys and values

Trying to find a very simple and clean way to just print the filepath and keys along with values that are present inside my application.conf file when using typesafe config library in scala. I have found many examples that almost do what I need but can't figure out how to properly filter down to the...
horatio1701d · 78 votes · 4 answers · 66.8k views

Format / Suppress Scientific Notation from Python Pandas Aggregation Results

How can one modify the format of the output from a groupby operation in pandas that produces scientific notation for very large numbers? I know how to do string formatting in python but I'm at a loss when it comes to applying it here. df1.groupby('dept')['data1'].sum() dept value1 1.192433e+...
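
A sketch of the display-level fix, which leaves the underlying floats untouched:

    import pandas as pd

    pd.set_option("display.float_format", "{:,.2f}".format)

    # Or format just this one result as strings:
    df1.groupby("dept")["data1"].sum().apply("{:,.2f}".format)
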
horatio1701d · 2 votes · 1 answer · 426 views

Spark createTableColumnTypes Not Resulting in user supplied schema

Not sure why this isn't working, but I'm just trying to apply the below and am still getting spark's default schema for the (mysql) table, containing text instead of the varchar(128) I'm trying to specify. I'm trying to create custom datatypes for my columns with the jdbc write. Trying with spark 2.1.0:...
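
A likely explanation, flagged as an assumption to verify: createTableColumnTypes only appears in the JDBC writer documentation from Spark 2.2 onward, so on 2.1.0 the option would simply be ignored. Checking the running version is the first step:

    print(spark.version)  # below 2.2.0, createTableColumnTypes is not honored
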
horatio1701d

View additional