Questions tagged [bigdata]

0 votes · 0 answers · 5 views

How to handle big textual data to create a word cloud?

I have a huge amount of textual data from which I need to create a word cloud. I am using a Python library named word_cloud to create the word cloud, and it is quite configurable. The problem is that my textual data is really huge, so even a high-end computer is not able to complete the task, even after long hours...
talha06
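One common workaround for questions like this (a sketch, not the asker's actual code) is to avoid handing the raw text to the library at all: stream the file once, reduce it to a word-frequency dictionary, and pass only the top few hundred entries to `WordCloud().generate_from_frequencies(...)`, which the word_cloud library supports. The `word_frequencies` helper below is hypothetical:

```python
from collections import Counter

def word_frequencies(lines, top_n=200):
    """Stream over lines and keep only aggregate counts in memory."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # Keep only the most common words -- a word cloud cannot display more anyway.
    return dict(counts.most_common(top_n))

# Example on a tiny in-memory corpus; for the real case, pass an open file
# handle so lines are streamed rather than loaded into memory at once.
freqs = word_frequencies(["big data big cloud", "big text"])
print(freqs["big"])  # 3
# The result would then be fed to WordCloud().generate_from_frequencies(freqs).
```

This keeps memory proportional to the vocabulary rather than the corpus, which is usually what makes the difference on huge inputs.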
1 vote · 1 answer · 57 views

Why is a boolean field not working in Hive?

I have a column in my Hive table whose datatype is boolean. When I tried to import data from CSV, it was stored as NULL. This is my sample table: CREATE TABLE IF NOT EXISTS Engineanalysis( EngineModel String, EnginePartNo String, Location String, Position String, InspectionReq boolean) ROW FORMAT DELI...
0 votes · 0 answers · 17 views

Loop takes too long to replace values

I have a file of 9 million lines where I have to replace every ' . ' with the number in the line above. Meaning that if column 1 contains ' 7 ', the dot below it should be replaced by that. But if column 3 contains ' 44 ', the subsequent values have to be replaced by 44, and so on. Problem: at the moment...
Patrick Lombardo
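If the intended operation is a forward fill (each ' . ' inherits the most recent real value seen above it in the same column), a single pass over the file is linear and needs no nested loop; in pandas roughly the same thing is `df.replace('.', pd.NA).ffill()`. A pure-Python sketch on made-up data:

```python
def forward_fill(rows, missing="."):
    """Replace each missing marker with the last seen value in its column."""
    last = {}
    filled = []
    for row in rows:
        new_row = []
        for col, value in enumerate(row):
            if value == missing:
                value = last.get(col, missing)  # leading dots stay untouched
            else:
                last[col] = value
            new_row.append(value)
        filled.append(new_row)
    return filled

rows = [["7", ".", "44"],
        [".", "2", "."],
        [".", ".", "."]]
print(forward_fill(rows))
# [['7', '.', '44'], ['7', '2', '44'], ['7', '2', '44']]
```

Because each cell is visited exactly once, 9 million lines should take seconds rather than hours.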
1 vote · 2 answers · 65 views

Append multiple columns to an existing dataframe in Spark

I need to append multiple columns to an existing Spark dataframe, where the column names are given in a List. Assuming the values for the new columns are constant, for example, the given input columns and dataframe are val columnsNames=List('col1','col2') val data = Seq(('one', 1), ('two', 2), ('three', 3), ('four', 4...
nat
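In Spark itself this is usually a fold over `withColumn` with a `lit` constant, e.g. `columnsNames.foldLeft(df)((d, name) => d.withColumn(name, lit(0)))`. As a runnable stand-in that needs no Spark session, the same logic over a list of dicts (all names here are illustrative):

```python
def add_constant_columns(rows, column_names, value=0):
    """Append each named column with a constant value to every row."""
    return [{**row, **{name: value for name in column_names}} for row in rows]

data = [{"word": "one", "n": 1}, {"word": "two", "n": 2}]
print(add_constant_columns(data, ["col1", "col2"]))
```

The fold shape matters in Spark: each `withColumn` returns a new dataframe, so the accumulator threads the growing schema through the list of names.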
1 vote · 1 answer · 73 views

Create neighborhood list of large dataset / speed up

I want to create a weight matrix based on distance. My code for the moment looks as follows and works for a smaller sample of the data. However, with the large dataset (569424 individuals in 24077 locations) it doesn't go through. The problem arises at the nb2blocknb function. So my question would...
Kerstin
1 vote · 1 answer · 693 views

Time series classification using LSTM - How to approach?

I am working on an experiment with LSTMs for time series classification and I have been going through several HOWTOs, but I am still struggling with some very basic questions: Is the main idea for training the LSTM to take the same sample from every time series? E.g. if I have time series A (with sam...
Andre444
1 vote · 2 answers · 1.1k views

Hive View Not Opening

In the Ambari UI of the Hortonworks sandbox, I was trying to open Hive View through the account of maria_dev. However, I was getting the following error: Service Hive check failed: Cannot open a hive connection with connect string jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscovery...
Witty Counsel
1 vote · 1 answer · 505 views

Large data with pivot table using Pandas

I’m currently using Postgres database to store survey answers. My problem I’m facing is that I need to generate pivot table from Postgres database. When the dataset is small, it’s easy to just read whole data set and use Pandas to produce the pivot table. However, my current database now has a...
Dat Nguyen
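One common approach here (a sketch, assuming the pivot aggregates something like a count or sum) is to stream the rows in chunks, e.g. with `pandas.read_sql(query, conn, chunksize=50000)`, and fold each chunk into a running (row, column) -> value dictionary instead of materialising the whole table. The chunk format below is hypothetical:

```python
from collections import defaultdict

def streaming_pivot(chunks):
    """Fold (row_key, col_key, value) triples into a pivot-style dict of sums."""
    pivot = defaultdict(float)
    for chunk in chunks:  # each chunk could come from read_sql(..., chunksize=N)
        for row_key, col_key, value in chunk:
            pivot[(row_key, col_key)] += value
    return dict(pivot)

chunks = [[("u1", "q1", 1), ("u1", "q2", 2)],
          [("u1", "q1", 3), ("u2", "q1", 4)]]
print(streaming_pivot(chunks))
# {('u1', 'q1'): 4.0, ('u1', 'q2'): 2.0, ('u2', 'q1'): 4.0}
```

Memory then scales with the size of the pivot output, not with the number of survey answers; alternatively, Postgres itself can do the aggregation with GROUP BY before the data ever reaches pandas.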
1 vote · 0 answers · 22 views

Formal criterion for whether an algorithm can be implemented using the MapReduce paradigm

The computational model supported by MapReduce is expressive enough to compute almost any function on your data. If I understood correctly, there are algorithms which cannot be implemented using the MapReduce paradigm. Is that correct? Are there any criteria, formal or otherwise, to discern whether an algori...
demas
1 vote · 2 answers · 144 views

Nutch 1.14 deduplication failed

I have integrated Nutch 1.14 with Solr 6.6.0 on CentOS Linux release 7.3.1611. I had given about 10 URLs in the seed list, which is at /usr/local/apache-nutch-1.13/urls/seed.txt. I followed the tutorial: [[email protected] apache-nutch-1.14]# bin/nutch dedup http://ip:8983/solr/ DeduplicationJob: starting...
SMJ
1 vote · 2 answers · 84 views

How to remove rows after a particular observation is seen for the first time

I have a dataset in which I have an account number and 'days past due' with every observation. For every account number, as soon as the 'days past due' column hits a code like 'DLQ3', I want to remove the rest of the rows for that account (even if DLQ3 is the first observation for that account). My dataset...
Nupur Jain
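Assuming the rows are sorted by account and then by time, one linear pass per account suffices: keep rows up to and including the first 'DLQ3' and skip the rest. A sketch on made-up (account, status) tuples; in pandas the same idea maps to a groupby with a cumulative flag:

```python
from itertools import groupby
from operator import itemgetter

def truncate_after_flag(rows, flag="DLQ3"):
    """Within each account (rows sorted by account, then time), keep rows
    up to and including the first occurrence of the flag."""
    kept = []
    for account, group in groupby(rows, key=itemgetter(0)):
        for account_id, status in group:
            kept.append((account_id, status))
            if status == flag:
                break  # groupby skips the rest of this account's rows
    return kept

rows = [("A1", "DLQ1"), ("A1", "DLQ3"), ("A1", "DLQ1"),
        ("A2", "DLQ3"), ("A2", "DLQ2")]
print(truncate_after_flag(rows))
# [('A1', 'DLQ1'), ('A1', 'DLQ3'), ('A2', 'DLQ3')]
```

Whether the DLQ3 row itself is kept or dropped is ambiguous in the question; the sketch keeps it, and moving the `break` above the `append` would drop it.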
1 vote · 0 answers · 318 views

Reading a large PostgreSQL table into Python

I have data as below: ID date Net Total Class 11 201706 XN 0.607500 P 53 201709 M9 0.989722 V 68 201709 FM 3.736944 P 61 201701 ZK 1.121388 B 17 201705 F 1.969722 V This is a huge table (0.5 billion records) in PostgreSQL and I need to pull a subset of it in...
Shuvayan Das
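The standard trick for a table this size (a sketch; all names are made up) is a server-side cursor, so PostgreSQL streams rows instead of sending half a billion at once: in psycopg2 that means a named cursor, `conn.cursor(name='sub')`, consumed with `fetchmany`, or in pandas `read_sql(..., chunksize=...)`. The consuming side is just a generator over fixed-size batches:

```python
def batched(cursor_like, size=10000):
    """Yield lists of up to `size` rows from anything with fetchmany-style access."""
    while True:
        batch = cursor_like.fetchmany(size)
        if not batch:
            return
        yield batch

class FakeCursor:
    """Stand-in for a psycopg2 named (server-side) cursor, for illustration."""
    def __init__(self, rows):
        self._it = iter(rows)
    def fetchmany(self, size):
        return [row for _, row in zip(range(size), self._it)]

cur = FakeCursor(range(7))
print([len(b) for b in batched(cur, size=3)])  # [3, 3, 1]
```

With an unnamed (client-side) cursor, psycopg2 pulls the entire result set into memory before `fetchmany` ever runs, which is usually the source of the problem.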
1 vote · 2 answers · 107 views

Find couples of objects from a dataframe

How can I avoid the two for loops and optimize my code to be able to handle big data? import pandas as pd import numpy as np array = np.array([[1,'aaa','bbb'],[2,'ccc','bbb'],[3,'zzzz','bbb'],[4,'eee','zzzz'],[5,'ccc','bbb'],[6,'zzzz','bbb'],[7,'aaa','bbb']]) df= pd.DataFrame(array) l=[] for i in r...
c'est moi
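Whatever the exact pairing rule is (the excerpt is cut off), the usual way to drop the quadratic double loop is a single pass that groups row ids by the key you would otherwise compare pairwise, then reads the couples out of each group. In pandas this is roughly `df.groupby([1, 2]).groups`; in plain Python, on the sample data from the question:

```python
from collections import defaultdict

def find_couples(rows):
    """Group row ids by their (col_a, col_b) pair in a single O(n) pass."""
    groups = defaultdict(list)
    for row_id, a, b in rows:
        groups[(a, b)].append(row_id)
    # Couples (or larger matches) are the groups with more than one row.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

rows = [(1, "aaa", "bbb"), (2, "ccc", "bbb"), (3, "zzzz", "bbb"),
        (4, "eee", "zzzz"), (5, "ccc", "bbb"), (6, "zzzz", "bbb"), (7, "aaa", "bbb")]
print(find_couples(rows))
# {('aaa', 'bbb'): [1, 7], ('ccc', 'bbb'): [2, 5], ('zzzz', 'bbb'): [3, 6]}
```

Hashing the comparison key turns n² pairwise checks into one dictionary build, which is what makes this workable on big data.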
1 vote · 1 answer · 1.3k views

Spark : Job stuck on the last 2 tasks of 100

I am new to Spark and I must support an application that was written by our consultant. I have read and watched tons of information about Spark, but I'm still struggling with the little details needed to tune the job correctly. The scenario: a Java class that contains 5 cleansing rules that we apply on a R...
Geoff L.
1 vote · 1 answer · 70 views

How to aggregate events for denormalization?

A user clickstream is represented by events with type and event_timestamp properties. For example: userid type event_timestamp (yyyy-MM-ddThh:mm:ss.SSS) 01 install 2018-01-01T00:00:00.000 01 level_up 2018-01-15T00:00:00.000 01 new_item 2018-02-03T00:00:00.000 All inp...
Cherry
1 vote · 0 answers · 214 views

Caret training options for big data. Is there something like partial_fit in sklearn?

This seems like an obvious question but I couldn't find anything so far. I want to train a random forest but my data set is very big. It has only a few features but about 3 million rows. If I train with a smaller sample everything works nicely, but if I use the whole data set my system runs out of...
creyesk
1 vote · 0 answers · 134 views

(Spark Scheduler) What's the difference between Fair and FIFO inside a Spark job pool?

I know that for Spark we can set different pools as Fair or FIFO and the behavior can differ. However, inside fairscheduler.xml we can also set individual pools to be Fair or FIFO, and I tested several times and their behavior seems to be the same. Then I took a look at the Spark source c...
Yi Steven
1 vote · 0 answers · 54 views

tJdbcOutput component Delay for Table insert in Talend 6.4

I am using a one-to-one mapping for loading data into an MS SQL Server table. This mapping is simple and contains only two components: tFileInputDelimited for reading and tJdbcOutput for inserting into the table. We are using a tJdbcOutput component to load data into the table. We have created the mapping in...
TomG
1 vote · 0 answers · 33 views

Why does HBase need NodeManager when it uses coprocessors

NodeManager is used to start, execute and monitor containers on YARN (containers are assigned to execute map-reduce jobs). A coprocessor, on the other hand, is a framework which does distributed computation directly within the HBase server processes. I have tables in HBase which I query using Phoenix. My...
Aditya
1 vote · 0 answers · 68 views

No command builder registered for name extractHBaseCells

I am trying to index data from HBase to Solr, using HBase-Indexer for the same. But I am getting 'No command builder registered for name: extractHBaseCells'. Any suggestions? My morphline.conf is morphlines : [{ id : morphline1 importCommands : ['org.kitesdk.morphline.**', 'com.ngdata.**']...
Prayalankar Ashutosh
1 vote · 0 answers · 168 views

How to set yarn.nodemanager.resource.cpu-vcores (number of virtual cores)

I am a little confused about what the value of yarn.nodemanager.resource.cpu-vcores should be (based on the picture below...): the real total CPU cores on the worker machine? Or 80% of the real total CPU cores on the worker machine (as some sites recommend)? For now I set it to 8 (on each worker...
enodmilvado
1 vote · 0 answers · 112 views

Scalding Execution Monad - What is it & how to use it

I have been working on Big Data technologies using Java-based MapReduce, but recently my company has moved to the Scalding framework. I am not able to get my head around the Scalding Execution monad: what it is and how it works. I cannot find much material on it on the internet either. Would somebody please throw some lig...
learner4life
1 vote · 0 answers · 116 views

Run independently, in parallel or multithreaded to increase speed

I have 2 separate collections. Each has information about hotels in the world, each provided by a different company, but both contain the same information. Each collection has fields like GPS, name, country, city, email, fax and tel. The problem is that the name (GPS, info and ...) are changed. I wrote a s...
koorosh safeashrafi
1 vote · 0 answers · 273 views

Does hive.groupby.skewindata depend on hive.optimize.skewjoin?

According to hive template : hive.optimize.skewjoin : Whether to enable skew join optimization. The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce job, process those skew...
Ashish Doneriya
1 vote · 1 answer · 883 views

AWS Glue convert files from JSON to Parquet with same partitions as source table

We are using AWS glue to convert JSON files stored in our S3 datalake. Here are the steps that I followed, Created a crawler for generating table on Glue from our datalake bucket which has JSON data. The newly created tables have partitions as follows, Name, Year, Month, day, hour Created a glue...
Vishnu Prassad
1 vote · 0 answers · 112 views

Elasticsearch Bulk Query Takes 30 minutes, is this normal?

I am storing about 20 million documents every day into an Elasticsearch (6.x) index, with varying parameters for my primary shard and replica counts, ranging from 2 to 5 on both (this is running in a 5-node cluster with fast hardware). Every time I run a batch to pull all the documents from the index i...
Duncan Krebs
1 vote · 0 answers · 1.3k views

localhost: ERROR: Cannot set priority of resourcemanager process 9799

After using the command hdfs namenode -format, I get: localhost: ERROR: Cannot set priority of resourcemanager process 9799, and similarly for the datanode, resourcemanager and nodemanager. I am not able to remove these errors.
Jitender Singh
1 vote · 0 answers · 563 views

geopandas dataframe to json

I have a geodataframe called SchooolDistrictDf that has more than 19,814,822 rows and looks like the following: FIPS SrcName crate_code geohash ncessch sLevel schnam shape stAbbrev 0 13820.0 NaN birmingh djfjrrw 010000700091 1 TRACE CROSSINGS ELEM SCH {u'type': u'Point', u'coord...
M3105
1 vote · 1 answer · 195 views

Keras fit_generator only useful for data augmentation, or also for reading from disk(/network)

Data might not fit in GPU-memory (including activations and gradients), for which one uses mini-batches, and it might not fit in RAM, for which one uses fit_generator. Or at least, the latter is my hypothesis I would like to validate here. Is it true that Keras applies a producer-consumer strategy t...
Herbert
1 vote · 0 answers · 45 views

Constructing routing matrix based on given dataset with constraints

I am working on a data frame named dbus that describes the fixed-routes of a public bus over certain time periods. A small portion of the data frame is shown below. The meaning of this data frame is pretty simple: column Stop.Name labels all the bus stops a bus would visit, in the same order as show...
user177196
1 vote · 2 answers · 124 views

Parallel processing or optimization of latent class analysis

I am using the poLCA package to run latent class analysis (LCA) on data with 450,000 observations and 114 variables. As with most latent class analyses, I will need to run this multiple rounds for different numbers of classes. Each run takes about 12-20 hours depending on the number of classes selected...
tatami
1 vote · 0 answers · 45 views

Start Oozie coordinator job via API with start_time defaulting to current date

I am calling an Oozie coordinator job, in which I want the start_time to be set to today's date as soon as the coordinator job is called.
Sagan Pariyar
1 vote · 1 answer · 121 views

Solr search in all nested children of the parent element, nested query, block join

I am struggling with this question: how can I search in all nested children of the parent element in Solr version 7.2? I was able to search in a single field, but for all fields I found no solution. I have read all the documentation but an exact solution doesn't exist. Can anybody help me with this? My query is...
user1524615
1 vote · 1 answer · 1.6k views

Copy files from local to hdfs

I'm trying to copy a file from my local machine into HDFS. I'm using this command: hadoop fs -put Desktop/unsed cubes.txt /user/file And I'm getting this exception: -put: java.net.UnknownHostException: sandbox.hortonworks Usage: hadoop fs [generic options] -put [-f] [-p] [-l] ... I've tried using th...
Teddy
1 vote · 0 answers · 23 views

Resource usage by Spark Receivers

As per Spark Streaming Guide, A DStream is associated with a single receiver. For attaining read parallelism multiple receivers i.e. multiple DStreams need to be created. A receiver is run within an executor. It occupies one core. Ensure that there are enough cores for processing after receiver slo...
Saheb
1 vote · 2 answers · 562 views

Average over 2000 values with PySpark Dataframe

I have a PySpark dataframe with about a billion rows. I want to average over every 2000 values, like the average of rows with indices 0-1999, the average of rows with indices 2000-3999, and so on. How do I do this? Alternatively, I could also average 10 values for every 2000, like the average of rows with indec...
A. R.
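A sketch of the bucketing idea: give each row a bucket id of `index // 2000` and aggregate per bucket. In PySpark (assuming an index column already exists, which itself takes care on a billion rows) that is roughly `df.withColumn('bucket', floor(col('idx') / 2000)).groupBy('bucket').avg('value')`; the same logic in plain Python:

```python
def bucket_averages(values, bucket_size=2000):
    """Average consecutive runs of `bucket_size` values."""
    sums, counts = {}, {}
    for index, value in enumerate(values):
        bucket = index // bucket_size
        sums[bucket] = sums.get(bucket, 0.0) + value
        counts[bucket] = counts.get(bucket, 0) + 1
    return [sums[b] / counts[b] for b in sorted(sums)]

print(bucket_averages([1, 2, 3, 4, 5], bucket_size=2))  # [1.5, 3.5, 5.0]
```

The "10 values per 2000" variant only changes the grouping condition: keep a row when `index % 2000 < 10`, then bucket as above.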
1 vote · 1 answer · 125 views

Console Input for pyspark

Is there an input() function in PySpark through which I can take console input? If yes, can you please elaborate on it? How do I write the following code in PySpark: directory_change = input('Do you want to change your working directory ? (Y/N)') sc.input('Do you want to change your working directory...
kcvizer
1 vote · 0 answers · 260 views

Pros and cons of using relational databases (like SQL) for real-time time series analysis?

I know that the relational databases are not scalable to Big-Data. I also know that memcaching is a bit complicated in them. I also know that you may need to unnest and flatten data to prepare it for analysis. However, I still think that for structured data, for some scenarios, it is better to apply...
Shahin Vakilinia
1 vote · 1 answer · 153 views

Why is there no serving layer for the speed layer in the lambda architecture?

Nathan Marz uses the following picture for explaining the lambda architecture Lambda Architecture Visualization by Marz However, on the internet I often find the following architecture, in which the serving layer is not only the next step after the batch layer, but also the streaming layer, i.e. DZo...
Franz
1 vote · 1 answer · 43 views

Caching on big data

I have a website with many transaction records, about 2M rows in MySQL. I often need to erase the data because fetching it gets slower. Database: MySQL. Lang: PHP 5.4. OS: Ubuntu 16.04. The first user will place an order, which will then be saved in the database; the user will then be redir...
